Re: Filter data on row key in Cassandra Hadoop's Random Partitioner

2012-12-12 Thread Шамим
You can use Apache Pig to load the data and filter it by row key; filtering in Pig is
very fast.
Regards
  Shamim

11.12.2012, 20:46, Ayush V. ayushv...@gmail.com:
 I'm working on Cassandra-Hadoop integration (MapReduce). We used the RandomPartitioner
 when inserting data, to gain faster writes. Now we have to read that data back
 from Cassandra in MapReduce and perform some calculations on it.

 Out of all the data we have in Cassandra, we want to fetch data only for
 particular row keys, but we are unable to do so because of the RandomPartitioner -
 there is an assertion in the code that prevents it.

 Can anyone please guide me on how I should filter data based on the row key at the
 Cassandra level itself (I know the data is distributed across nodes using a hash
 of the row key)?

 Would using secondary indexes (I'm still trying to understand how they work)
 solve my problem, or is there some other way around this?

 I would really appreciate it if someone could answer my queries.

 Thanks
 AV



RE: cassandra vs couchbase benchmark

2012-12-12 Thread Viktor Jevdokimov
Pure marketing comparing apples to oranges.

Was the Cassandra usage optimized?
- What consistency level was used? (fastest reads are with ONE; see the sketch below)
- Was the Cassandra client token aware? (so requests go to the appropriate node)
- Was the dynamic snitch turned off? (to prevent forwarding a request to another replica 
when it can be processed locally)
- Was the Cassandra data model designed to mimic the Couchbase data model? (Couchbase 
has only 1 value for 1 row)
- What caching was used on Cassandra? (Couchbase uses built-in memcached)
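
For reference, a minimal Hector sketch of pinning reads (and writes) to consistency
level ONE; the cluster name, contact point and keyspace name below are placeholders,
not anything taken from the benchmark:

import me.prettyprint.cassandra.model.ConfigurableConsistencyLevel;
import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.HConsistencyLevel;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.factory.HFactory;

public class ReadAtOne {
    public static Keyspace keyspaceAtOne() {
        // Placeholder cluster name and contact point.
        Cluster cluster = HFactory.getOrCreateCluster("TestCluster", "localhost:9160");

        // Default both reads and writes to consistency level ONE.
        ConfigurableConsistencyLevel ccl = new ConfigurableConsistencyLevel();
        ccl.setDefaultReadConsistencyLevel(HConsistencyLevel.ONE);
        ccl.setDefaultWriteConsistencyLevel(HConsistencyLevel.ONE);

        return HFactory.createKeyspace("Benchmark", cluster, ccl);
    }
}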

For our use case we've seen much better results when testing on a single node.
Throughput grows almost linearly when adding nodes while growing the amount of data to the 
same level per node as for a single node.
Single node stats:
- A column family with 30 GiB compressed data per node (100 GiB uncompressed)
- 1 row with 10-30 columns, weighing 0.5-2 KiB uncompressed
- 1 node with 6 cores, 24 GiB RAM, 8 GiB heap, 1600 MiB new heap
- key cache only, 2M keys
- random reads 70%, random writes 30%
- Read latencies 10ms AVE100 with CPU 95%, 50k reads/s
- Read latencies 5ms AVE100 with CPU 60%, 20k reads/s

In reality, with many column families with different amounts of data and read/write 
rates, performance results may vary significantly.
You just need to know what and how to optimize in Cassandra to get the best results.


Couchbase is not for our use case because of its data model (requires reads for 
updates/inserts), so we can't compare it to Cassandra.



Best regards / Pagarbiai

Viktor Jevdokimov
Senior Developer

Email: viktor.jevdoki...@adform.com
Phone: +370 5 212 3063
Fax: +370 5 261 0453

J. Jasinskio 16C,
LT-01112 Vilnius,
Lithuania



-Original Message-
 From: Radim Kolar [mailto:h...@filez.com]
 Sent: Tuesday, December 11, 2012 17:42
 To: user@cassandra.apache.org
 Subject: cassandra vs couchbase benchmark

 http://www.slideshare.net/Couchbase/benchmarking-couchbase#btnNext


Re: Batch mutation streaming

2012-12-12 Thread Ben Hood
Hey Aaron,

That sounds sensible - thanks for the heads up.

Cheers,

Ben

On Dec 10, 2012, at 0:47, aaron morton aa...@thelastpickle.com wrote:

 (and if the message is being decoded on the server side as a complete 
 message, then presumably the same resident memory consumption applies there 
 too).
 Yerp. 
 And every row mutation in your batch becomes a task in the Mutation thread 
 pool. If one replica gets 500 row mutations from one client request it will 
 take a while for the (default) 32 threads to chew through them. While this is 
 going on, other client requests will be effectively blocked. 
 
 Depending on the number of clients, I would start with say 50 rows per 
 mutation and keep an eye on the *request* latency. 
 
 Hope that helps. 
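
As a minimal Hector sketch of that kind of chunking (the Events column family, the
payload column and String keys are assumptions for illustration, not from the original
mails):

import java.util.List;
import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.mutation.Mutator;

public class ChunkedWriter {
    // Send a batch every ~50 row mutations instead of batching the entire
    // stream into one huge Thrift message.
    public static void writeInChunks(Keyspace keyspace, List<String> rowKeys) {
        Mutator<String> mutator = HFactory.createMutator(keyspace, StringSerializer.get());
        int pending = 0;
        for (String key : rowKeys) {
            mutator.addInsertion(key, "Events",
                    HFactory.createStringColumn("payload", "..."));
            if (++pending == 50) {
                mutator.execute();   // one batch_mutate per chunk
                pending = 0;
            }
        }
        if (pending > 0) {
            mutator.execute();       // flush the remainder
        }
    }
}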
 
 
 -
 Aaron Morton
 Freelance Cassandra Developer
 New Zealand
 
 @aaronmorton
 http://www.thelastpickle.com
 
 On 9/12/2012, at 7:18 AM, Ben Hood 0x6e6...@gmail.com wrote:
 
 Thanks for the clarification Andrey. If that is the case, I had better 
 ensure that I don't put the entire contents of a very long input stream into 
 a single batch, since that is presumably going to cause a very large message 
 to accumulate on the client side (and if the message is being decoded on the 
 server side as a complete message, then presumably the same resident memory 
 consumption applies there too).
 
 Cheers,
 
 
 Ben
 
 On Dec 7, 2012, at 17:24, Andrey Ilinykh ailin...@gmail.com wrote:
 
 Cassandra uses Thrift messages to pass data to and from the server. A batch is 
 just a convenient way to create such a message. Nothing happens until you 
 send this message. Probably this is what you call closing the batch.
 
 Thank you,
   Andrey
 
 
 On Fri, Dec 7, 2012 at 5:34 AM, Ben Hood 0x6e6...@gmail.com wrote:
 Hi,
 
 I'd like my app to stream a large number of events into Cassandra that 
 originate from the same network input stream. If I create one batch 
 mutation, can I just keep appending events to the Cassandra batch until 
 I'm done, or are there some practical considerations about doing this 
 (e.g. too much stuff buffering up on the client or server side, visibility 
 of the data within the batch that hasn't been closed by the client yet)? 
 Barring any discussion about atomicity, if I were able to stream a largish 
 source into Cassandra, what would happen if the client crashed and didn't 
 close the batch? Or is this kind of thing just a normal occurrence that 
 Cassandra has to be aware of anyway?
 
 Cheers,
 
 Ben
 


Re: Why Secondary indexes is so slowly by my test?

2012-12-12 Thread Hiller, Dean
You could always try PlayOrm's query capability on top of cassandra ;)….it 
works for us.

Dean

From: Chengying Fang cyf...@ngnsoft.com
Reply-To: user@cassandra.apache.org
Date: Tuesday, December 11, 2012 8:22 PM
To: user user@cassandra.apache.org
Subject: Re: Why Secondary indexes is so slowly by my test?

Thanks, Low. We use composite columns as a substitute in queries with a single 
inequality and definite equalities. We will also give up Cassandra because of its weak 
query ability and instability. Many times we have found our data scrambled, without any 
definite cause, in our cluster. For example, with only two rows in one CF, 
row1-columnname1-columnvalue1 and row2-columnname2-columnvalue2, it sometimes becomes 
row1-columnname1-columnvalue2 and row2-columnname2-columnvalue1. Notice the 
wrong column values.


-- Original --
From: Richard Low r...@acunu.com
Date: Tue, Dec 11, 2012 07:44 PM
To: user user@cassandra.apache.org
Subject:  Re: Why Secondary indexes is so slowly by my test?

Hi,

Secondary index lookups are more complicated than normal queries so will be 
slower. Items have to first be queried in the index, then retrieved from their 
actual location. Also, inserting into indexed CFs will be slower (but will get 
substantially faster in 1.2 due to CASSANDRA-2897).

If you need to retrieve large amounts of data with your query, you would be 
better off changing your data model to not use secondary indexes.
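
A common alternative is to maintain the index yourself as one wide row per indexed
value. A minimal Hector sketch, assuming a hand-built index CF (here called
TestIndexByTk, a hypothetical name) whose row key is the tk value and whose column
names are the keys of the matching TestIndex rows:

import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.mutation.Mutator;

public class ManualIndex {
    // Record that the TestIndex row 'rowKey' has tk = 'tkValue'. Reading every
    // row with tk = 'X' then becomes a single wide-row column slice on "X".
    public static void indexRow(Keyspace keyspace, String tkValue, String rowKey) {
        Mutator<String> m = HFactory.createMutator(keyspace, StringSerializer.get());
        m.addInsertion(tkValue, "TestIndexByTk",
                HFactory.createStringColumn(rowKey, ""));
        m.execute();
    }
}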

Richard.


On 7 December 2012 03:08, Chengying Fang cyf...@ngnsoft.com wrote:
Hi guys,

I found secondary indexes too slow in my production system (Amazon large instance) with 
Cassandra, so I did the test again as described here. But the result is the same 
as in production. What's wrong with Cassandra, or with me?
Now my test:
a newly installed Ubuntu 12.04 LTS, apache-cassandra-1.1.6, default configuration, 
just one keyspace (test) and one CF (TestIndex):

 CREATE COLUMN FAMILY TestIndex
 WITH comparator = UTF8Type
 AND key_validation_class = UTF8Type
 AND default_validation_class = UTF8Type
 AND column_metadata = [
 {column_name: tk, validation_class: UTF8Type, index_type: KEYS},
 {column_name: from, validation_class: UTF8Type},
 {column_name: to, validation_class: UTF8Type},
 {column_name: tm, validation_class: UTF8Type}
 ];

and 'tk' just three value:'A'(1000row),'B'(1000row),'X'(increment by test)
The test query from cql:
1, without index: select count(*) from TestIndex limit 100;
2, with index: select count(*) from TestIndex where tk='X' limit 100;
When I insert 6 row 'X', the time:1s and 12s.
When 'X' up to 13,the time:2.3s and 33s.
When 'X' up to 25,the time:3.8s and 53s.

According to this, when 'X' goes up to a billion, what will the result be? Can secondary 
indexes be used in production? I hope it's my mistake in doing this test. Can 
anyone give some tips about it?
Thanks in advance.
fancy



--
Richard Low
Acunu | http://www.acunu.com | @acunu


Re: Vnode migration path

2012-12-12 Thread Eric Evans
On Tue, Dec 11, 2012 at 4:28 PM, Michael Kjellman
mkjell...@barracuda.com wrote:
 Awesome (and very welcomed news), what kind of failure conditions can we
 expect if a node goes down during the migration?

A shuffle is just a bunch of moves mapped out ahead of time, and
worked through by each node incrementally.  These mappings are
persisted, so a node that went down during a shuffle would simply pick
back up where it left off when it came back online.

A node irrecoverably failing mid-shuffle would likely require some
intervention to sort out.

Either way, there won't be any other impact to the cluster beyond what
is normally caused by a down node.


 From: Richard Low r...@acunu.com
 Reply-To: user@cassandra.apache.org user@cassandra.apache.org
 Date: Tuesday, December 11, 2012 3:33 AM
 To: user@cassandra.apache.org user@cassandra.apache.org
 Subject: Re: Vnode migration path

 Hi Mike,

 There's also the shuffle utility (in the bin directory) that can
 incrementally move ranges around to migrate to vnodes.

 Richard.


 On 11 December 2012 08:47, Michael Kjellman mkjell...@barracuda.com wrote:

 So I'm wondering if anyone has given thought to their migration path to
 Vnodes. Other than having a separate cluster and migrating the data from the
 old cluster to the vnode cluster what else can we do.

 One suggestion I've heard is start up a second Cassandra instance on each
 node on different ports and migrate between nodes that way.

 Best,
 mike








 --
 Richard Low
 Acunu | http://www.acunu.com | @acunu




--
Eric Evans
Acunu | http://www.acunu.com | @acunu


Null Error Running pig_cassandra

2012-12-12 Thread James Schappet
When trying to run the example-script.pig, I get the following null
error.




tsunami:pig schappetj$ bin/pig_cassandra -x local example-script.pig
Using /Library/pig-0.10.0/pig-0.10.0.jar.
2012-12-12 11:02:54,079 [main] INFO  org.apache.pig.Main - Apache Pig version 0.10.0 (r1328203) compiled Apr 20 2012, 00:33:25
2012-12-12 11:02:54,079 [main] INFO  org.apache.pig.Main - Logging error messages to: /Library/apache-cassandra-1.2.0-beta3-src/examples/pig/pig_1355331774074.log
2012-12-12 11:02:54.211 java[78283:1c03] Unable to load realm info from SCDynamicStore
2012-12-12 11:02:54,400 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: file:///
2012-12-12 11:02:54,890 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: null
Details at logfile: /Library/apache-cassandra-1.2.0-beta3-src/examples/pig/pig_1355331774074.log





PIG 0.10.0
Cassandra 1.2.0-Beta3 w/source

ENV Osx Lion:
$ java -version
java version 1.7.0_04
Java(TM) SE Runtime Environment (build 1.7.0_04-b21)
Java HotSpot(TM) 64-Bit Server VM (build 23.0-b21, mixed mode)

set | grep PIG
PIG_HOME=/Library/pig-0.10.0
PIG_INITIAL_ADDRESS=localhost
PIG_PARTITIONER=org.apache.cassandra.dht.RandomPartitioner
PIG_RPC_PORT=9160








Pig Stack Trace
---
ERROR 1200: null

org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1000: Error during parsing. null
at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1597)
at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1540)
at org.apache.pig.PigServer.registerQuery(PigServer.java:540)
at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:970)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:386)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:189)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:165)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84)
at org.apache.pig.Main.run(Main.java:555)
at org.apache.pig.Main.main(Main.java:111)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
Caused by: Failed to parse: null
at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:184)
at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1589)
... 14 more
Caused by: java.lang.NullPointerException
at org.apache.cassandra.utils.Hex.hexToBytes(Hex.java:51)
at org.apache.cassandra.hadoop.ConfigHelper.predicateFromString(ConfigHelper.java:206)
at org.apache.cassandra.hadoop.ConfigHelper.getInputSlicePredicate(ConfigHelper.java:176)
at org.apache.cassandra.hadoop.pig.CassandraStorage.setLocation(CassandraStorage.java:577)
at org.apache.cassandra.hadoop.pig.CassandraStorage.getSchema(CassandraStorage.java:610)
at org.apache.pig.newplan.logical.relational.LOLoad.getSchemaFromMetaData(LOLoad.java:151)
at org.apache.pig.newplan.logical.relational.LOLoad.getSchema(LOLoad.java:110)
at org.apache.pig.parser.LogicalPlanGenerator.alias_col_ref(LogicalPlanGenerator.java:15356)
at org.apache.pig.parser.LogicalPlanGenerator.col_ref(LogicalPlanGenerator.java:15203)
at org.apache.pig.parser.LogicalPlanGenerator.projectable_expr(LogicalPlanGenerator.java:8881)
at org.apache.pig.parser.LogicalPlanGenerator.var_expr(LogicalPlanGenerator.java:8632)
at org.apache.pig.parser.LogicalPlanGenerator.expr(LogicalPlanGenerator.java:7984)
at org.apache.pig.parser.LogicalPlanGenerator.flatten_clause(LogicalPlanGenerator.java:6103)
at org.apache.pig.parser.LogicalPlanGenerator.flatten_generated_item(LogicalPlanGenerator.java:5926)
at org.apache.pig.parser.LogicalPlanGenerator.generate_clause(LogicalPlanGenerator.java:14101)
at org.apache.pig.parser.LogicalPlanGenerator.foreach_plan(LogicalPlanGenerator.java:12493)
at org.apache.pig.parser.LogicalPlanGenerator.foreach_clause(LogicalPlanGenerator.java:12360)
at org.apache.pig.parser.LogicalPlanGenerator.op_clause(LogicalPlanGenerator.java:1577)
at org.apache.pig.parser.LogicalPlanGenerator.general_statement(LogicalPlanGenerator.java:789)
at org.apache.pig.parser.LogicalPlanGenerator.statement(LogicalPlanGenerator.java:507)
at org.apache.pig.parser.LogicalPlanGenerator.query(LogicalPlanGenerator.java:382)
at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:175)
... 15 more

Re: Cassandra on EC2 - describe_ring() is giving private IPs

2012-12-12 Thread santi kumar
When I configured rpc_address with the public IP, Cassandra did not start up;
it throws 'Unable to create thrift socket' on the public IP. When I changed
it to the private IP, it was fine.

java.lang.RuntimeException: Unable to create thrift socket to /
107.21.80.94:9160
at
org.apache.cassandra.thrift.CassandraDaemon$ThriftServer.init(CassandraDaemon.java:148)
at
org.apache.cassandra.thrift.CassandraDaemon.startServer(CassandraDaemon.java:76)
at
org.apache.cassandra.service.AbstractCassandraDaemon.startRPCServer(AbstractCassandraDaemon.java:300)
at
org.apache.cassandra.service.AbstractCassandraDaemon.start(AbstractCassandraDaemon.java:272)
at
org.apache.cassandra.service.AbstractCassandraDaemon.activate(AbstractCassandraDaemon.java:369)
at
org.apache.cassandra.thrift.CassandraDaemon.main(CassandraDaemon.java:107)
Caused by: org.apache.thrift.transport.TTransportException: Could not
create ServerSocket on address /107.21.80.94:9160.
at
org.apache.cassandra.thrift.TCustomServerSocket.init(TCustomServerSocket.java:80)
at
org.apache.cassandra.thrift.CassandraDaemon$ThriftServer.init(CassandraDaemon.java:141)


On Wed, Dec 12, 2012 at 3:34 AM, aaron morton aa...@thelastpickle.com wrote:

 Though I configured the listen_address with public dns, still I had the
 same issue.

 Internally the public DNS resolves to the private IP.

 looks like describe_ring() is the one which provides the details.

 The describe_ring() return includes the registered RPC addresses for the
 nodes. Try setting the rpc_address to the public IP.

 Cheers

 -
 Aaron Morton
 Freelance Cassandra Developer
 New Zealand

 @aaronmorton
 http://www.thelastpickle.com

 On 11/12/2012, at 11:32 PM, santi kumar santi.ku...@gmail.com wrote:

 We have a 4 node cluster in us-east region in two different AZ's. Clients
 connect to this cluster from our datacenter which is not on AWS.

 Hector clients are initialized with public DNS names; listen_address is set to
 the private IP and rpc_address to 0.0.0.0.

 We are having issues with node auto discovery by Hector. When it's trying to
 discover the ring, the endpoints are initialized with private IPs for all
 token ranges. It checks against the existing hosts (which were initialized with
 public DNS) and thinks that a new node has been added to the cluster.

 It looks like describe_ring() is the call that provides these details. Though I
 configured the listen_address with the public DNS, I still had the same issue.

 Any idea what the best way is to configure for EC2? I have gone through the
 link

 https://docs.google.com/document/d/175duUNIx7m5mCDa2sjXVI04ekyMa5bdiWdu-AFgisaY/edit?hl=en

 but I am not sure whether it's fixed in 1.1.4. When I run nodetool ring, it
 gives the private IPs. But in the above doc, it shows the public IPs as
 part of nodetool ring.

 Some insight into this is really helpful.

 Thanks
 Santi





Re: cassandra vs couchbase benchmark

2012-12-12 Thread Radim Kolar
If the dataset fits into memory, and the data used in the test almost fits into 
memory, then Cassandra is slow compared to other leading NoSQL databases; 
it can go up to a 10:1 ratio. Check the Infinispan benchmarks. A common usage 
pattern is to use memcached on top of Cassandra.
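
As a sketch of that read-through pattern (using spymemcached's MemcachedClient and a
Hector column query; the column family, column name and TTL here are placeholders):

import net.spy.memcached.MemcachedClient;
import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.beans.HColumn;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.query.ColumnQuery;

public class ReadThroughCache {
    // Check memcached first; on a miss, read the column from Cassandra
    // and populate the cache with a short TTL.
    public static String get(MemcachedClient cache, Keyspace ks, String rowKey) {
        String cached = (String) cache.get(rowKey);
        if (cached != null) {
            return cached;
        }
        StringSerializer se = StringSerializer.get();
        ColumnQuery<String, String, String> q = HFactory.createColumnQuery(ks, se, se, se);
        q.setColumnFamily("Data");               // placeholder CF name
        q.setKey(rowKey);
        q.setName("value");                      // placeholder column name
        HColumn<String, String> col = q.execute().get();
        if (col == null) {
            return null;
        }
        cache.set(rowKey, 300, col.getValue());  // cache for 5 minutes
        return col.getValue();
    }
}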


Cassandra is good if you have far more data than your RAM size. It beats 
SQL databases at a 10:1 ratio, at the cost of low flexibility for queries.


Re: Cassandra on EC2 - describe_ring() is giving private IPs

2012-12-12 Thread santi kumar
Yes, that worked - thanks for the pointer. Once broadcast_address points to
the public IP, the endpoints come back with the public IP, so Hector's
NodeAutoDiscoveryService matches them against the existing hosts and does not
treat them as new nodes.

On Wed, Dec 12, 2012 at 11:10 PM, Andrey Ilinykh ailin...@gmail.com wrote:

 That makes sense. rpc_address is the interface to listen on. Try setting the public
 IP as the broadcast_address.

 Andrey


 On Wed, Dec 12, 2012 at 9:33 AM, santi kumar santi.ku...@gmail.com wrote:

 When I configured rpc_address with public IP, cassandra is not starting
 up. It's trowing 'unable to create thrift socket on public IP. When I
 changed it to private IP, it was good.

 java.lang.RuntimeException: Unable to create thrift socket to /
 107.21.80.94:9160
 at
 org.apache.cassandra.thrift.CassandraDaemon$ThriftServer.init(CassandraDaemon.java:148)
 at
 org.apache.cassandra.thrift.CassandraDaemon.startServer(CassandraDaemon.java:76)
 at
 org.apache.cassandra.service.AbstractCassandraDaemon.startRPCServer(AbstractCassandraDaemon.java:300)
 at
 org.apache.cassandra.service.AbstractCassandraDaemon.start(AbstractCassandraDaemon.java:272)
 at
 org.apache.cassandra.service.AbstractCassandraDaemon.activate(AbstractCassandraDaemon.java:369)
 at
 org.apache.cassandra.thrift.CassandraDaemon.main(CassandraDaemon.java:107)
 Caused by: org.apache.thrift.transport.TTransportException: Could not
 create ServerSocket on address /107.21.80.94:9160.
 at
 org.apache.cassandra.thrift.TCustomServerSocket.init(TCustomServerSocket.java:80)
 at
 org.apache.cassandra.thrift.CassandraDaemon$ThriftServer.init(CassandraDaemon.java:141)



 On Wed, Dec 12, 2012 at 3:34 AM, aaron morton aa...@thelastpickle.com wrote:

 Though I configured the listen_address with public dns, still I had the
 same issue.

 Internally the public DNS resolves to the private IP.

 looks like describe_ring() is the one which provides the details.

 describe_ring() returns includes the registered RPC addresses for the
 nodes. Trying setting the rpc_address to the public IP.

 Cheers

-
 Aaron Morton
 Freelance Cassandra Developer
 New Zealand

 @aaronmorton
 http://www.thelastpickle.com

 On 11/12/2012, at 11:32 PM, santi kumar santi.ku...@gmail.com wrote:

 We have a 4 node cluster in us-east region in two different AZ's.
 Clients connect to this cluster from our datacenter which is not on AWS.

 Hector clients are initialized with public DNS names, then
 listern_address is with private ip and rpc_address is with 0.0.0.0.

 Having issues with Node Auto Discovery by Hector. When it's trying to
 discover the ring, the end points are initialized with private IP's for all
 Token Ranges. It checks with the existing hosts (which are initialized
 public DNS) and thinks that there is a new node got added to the cluster.

 looks like describe_ring() is the one which provides the details. Though
 I configured the listen_address with public dns, still I had the same
 issue.

 Any idea, what is the best way to configure for EC2. Have gone through
 the link

 https://docs.google.com/document/d/175duUNIx7m5mCDa2sjXVI04ekyMa5bdiWdu-AFgisaY/edit?hl=en

 But not sure whether it's fixed in 1.1.4. When I run the nodetool ring,
 it gives the private ips. But in the above doc, it shows the public IPs as
 part of nodetool ring.

 Some insight into this is really helpful.

 Thanks
 Santi







Re: bug with cqlsh for foreign character

2012-12-12 Thread aaron morton
Can you please put together a test case using CQL 3 to write and read the data  
and create a ticket at https://issues.apache.org/jira/browse/CASSANDRA ?

Thanks
Aaron


-
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 12/12/2012, at 11:12 AM, Wei Zhu wz1...@yahoo.com wrote:

 I have a column family with composite column
 
 CompositeType(UTF8Type, UTF8Type, LongType, UTF8Type)
 It stores (firstName, LastName, userID, meaningfulColumnName). If I insert 
 a record with foreign characters, 
 it looks like cqlsh -3 interprets the values incorrectly. I can get the values 
 back from Hector correctly. 
 
 Look at the highlighted row: column2 and column3 are reversed. 
 
 cqlsh:XXX select * from YY where key = 1;
  key   | column1   | column2   | column3   | column4 | value
 ---+---+---+---+-+
  1 |   |  friendlname4 | 10004 |   v |
  
 333030303034403440304030402d404040667269656e646c6e616d6534
  1 | afriendfname2 | afriendlname2 | 10002 |   v | 
 3330303030324032403040304061667269656e64666e616d65322d40404061667269656e646c6e616d6532
  1 |  friendfname1 |   | 10001 |   v |
  
 33303030303140314030403040667269656e64666e616d65312d404040
  1 |  friendfname4 |  friendlname4 | 10004 |   v |
  
 33303030303440344030403040667269656e64666e616d65342d404040667269656e646c6e616d6534
  1 |中国人 |   صباح الخير  | 10003 |   v | 
 33303030303340334030403040e4b8ade59bbde4baba2d40404020d8b5d8a8d8a7d8ad20d8a7d984d8aed98ad8b120
 
 Thanks.
 -Wei
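
For reference, a minimal Hector sketch of how one of those (UTF8, UTF8, Long, UTF8)
composite column names is typically built and written; the component values are taken
from the rows above, and the empty column value is just a placeholder:

import me.prettyprint.cassandra.serializers.CompositeSerializer;
import me.prettyprint.cassandra.serializers.LongSerializer;
import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.beans.Composite;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.mutation.Mutator;

public class CompositeInsert {
    // Build a (firstName, lastName, userID, columnName) composite and insert it
    // under row key "1" in the YY column family shown in the cqlsh output.
    public static void insert(Keyspace keyspace) {
        Composite name = new Composite();
        name.addComponent("friendfname4", StringSerializer.get());
        name.addComponent("friendlname4", StringSerializer.get());
        name.addComponent(10004L, LongSerializer.get());
        name.addComponent("v", StringSerializer.get());

        Mutator<String> m = HFactory.createMutator(keyspace, StringSerializer.get());
        m.addInsertion("1", "YY",
                HFactory.createColumn(name, "", new CompositeSerializer(), StringSerializer.get()));
        m.execute();
    }
}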



Re: Multiple Data Center shows very uneven load

2012-12-12 Thread aaron morton
 c:\SERVERS\apache-cassandra-1.1.6\binnodetool -h 11.111.111.1 ring
 Starting NodeTool
 Address DC  RackStatus State   Load   
 Effective-Ownership Token
 
 Token(bytes[6c03])
 11.111.111.1VA  SVA Up Normal  1.44 GB
 33.33%  Token(bytes[6101])
 22.222.22.2 LA  SLA Up Normal  361.25 MB  
 33.33%  Token(bytes[6103])
 33.333.333.3VA  SVA Up Normal  357.01 MB  
 33.33%  Token(bytes[6601])
 44.444.44.4 LA  SLA Up Normal  380.05 MB  
 33.33%  Token(bytes[6603])
 55.555.555.5VA  SNY Up Normal  498.43 MB  
 33.33%  Token(bytes[6c01])
 66.666.666.6LA  SNY Up Normal  512.21 MB  
 33.33%  Token(bytes[6c03])
 
 This already shows uneven load despite the fact that nodes actually hold the
 same amount of data (according to effective-ownership and application design
 / token values). 

Assuming you are using the Byte Ordered Partitioner: if so, it's up to you to 
ensure your rows are evenly distributed between the nodes. 

Node 11.111.111.1 is responsible for a very wide range of tokens. I would 
suggest using the Random Partitioner unless you have a *very good* reason not 
to. I would go as far as to say: if you are using Cassandra for the first time, avoid 
the BOP until you have a better understanding of how Cassandra works. 


 Why 11.111.111.1 still has 700+ MB of data? I tried nodetool repair,
 nodetool cleanup, and node restart (for 11.111.111.1) -- none of that helped
 (it still had those 700+ MB of data). 
Glancing at the logs it could be Hints or commit log replaying.
Use nodetool cfstats to see which CF's are on the node and what their size is. 

If this is still a test the easiest thing to do is nuke it and start again. 

Cheers


-
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 12/12/2012, at 12:23 PM, Sergey Olefir solf.li...@gmail.com wrote:

 aaron morton wrote
 Hi Sergey, I think you have forgotten to include some information in your
 email. 
 
 Ah, I used Nabble's markup and it seems to have eaten text somehow. Anyway,
 here it is without formatting (much harder to read though).
 
 I have a very similar issue myself and would love to know what (if anything)
 needs to be done (I'm using 1.1.6 at the moment). After stress-test (maximum
 load), I had the following on my nodes (every IP is changed to the
 corresponding fake number): 
 
 c:\SERVERS\apache-cassandra-1.1.6\binnodetool -h 11.111.111.1 ring
 Starting NodeTool
 Address DC  RackStatus State   Load   
 Effective-Ownership Token
 
 Token(bytes[6c03])
 11.111.111.1VA  SVA Up Normal  1.44 GB
 33.33%  Token(bytes[6101])
 22.222.22.2 LA  SLA Up Normal  361.25 MB  
 33.33%  Token(bytes[6103])
 33.333.333.3VA  SVA Up Normal  357.01 MB  
 33.33%  Token(bytes[6601])
 44.444.44.4 LA  SLA Up Normal  380.05 MB  
 33.33%  Token(bytes[6603])
 55.555.555.5VA  SNY Up Normal  498.43 MB  
 33.33%  Token(bytes[6c01])
 66.666.666.6LA  SNY Up Normal  512.21 MB  
 33.33%  Token(bytes[6c03])
 
 This already shows uneven load despite the fact that nodes actually hold the
 same amount of data (according to effective-ownership and application design
 / token values). 
 
 However I then proceeded to drop the single keyspace that was on this
 cluster and I got this: 
 
 c:\SERVERS\apache-cassandra-1.1.6\binnodetool -h 11.111.111.1 ring
 Starting NodeTool
 Note: Ownership information does not include topology, please specify a
 keyspace.
 Address DC  RackStatus State   LoadOwns   
 
 Token
 
 Token(bytes[6c03])
 11.111.111.1VA  SVA Up Normal  767.16 MB  
 16.67%  Token(bytes[6101])
 22.222.22.2 LA  SLA Up Normal  35.19 KB   
 16.67%  Token(bytes[6103])
 33.333.333.3VA  SVA Up Normal  52.91 KB   
 16.67%  Token(bytes[6601])
 44.444.44.4 LA  SLA Up Normal  50.09 KB   
 16.67%  Token(bytes[6603])
 55.555.555.5VA  SNY Up Normal  50.12 KB   
 16.67%  Token(bytes[6c01])
 66.666.666.6LA  SNY Up Normal  44.23 KB   
 16.67%  Token(bytes[6c03])
 
 
 Why 11.111.111.1 still has 700+ MB of data? I tried nodetool repair,
 nodetool cleanup, and node restart (for 11.111.111.1) -- none of that helped
 (it still had those 700+ MB of data). 
 
 Here's info from the filesystem on that node: 
 

Re: Consistency QUORUM does not work anymore (hector:Could not fullfill request on this host)

2012-12-12 Thread aaron morton
 sliceRangeQuery.setRange(Character.MIN_VALUE, Character.MAX_VALUE, false, Integer.MAX_VALUE);
Try selecting a smaller number of rows. 
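
As a rough Hector sketch of reading with bounded row and column counts and paging by
the last key seen (the 100 / 1,000 limits are arbitrary, and the serializers assume
String keys, names and values):

import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.beans.OrderedRows;
import me.prettyprint.hector.api.beans.Row;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.query.RangeSlicesQuery;

public class BoundedRangeRead {
    public static void readAll(Keyspace keyspace, String columnFamily) {
        StringSerializer se = StringSerializer.get();
        String startKey = "";
        while (true) {
            RangeSlicesQuery<String, String, String> q =
                    HFactory.createRangeSlicesQuery(keyspace, se, se, se);
            q.setColumnFamily(columnFamily);
            q.setKeys(startKey, "");           // open-ended key range from startKey
            q.setRowCount(100);                // at most 100 rows per call
            q.setRange("", "", false, 1000);   // at most 1,000 columns per row
            OrderedRows<String, String, String> rows = q.execute().get();
            for (Row<String, String, String> row : rows) {
                // process row (the first row of each page after the first repeats
                // the previous page's last key, so skip it when re-processing)
            }
            if (rows.getCount() < 100) {
                break;                         // last page reached
            }
            startKey = rows.peekLast().getKey();
        }
    }
}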

Cheers

-
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com


On 12/12/2012, at 8:12 PM, dong.yajun dongt...@gmail.com wrote:

 Hi Aaron. 
 
 There is no problem with get_slices, but get_range_slices fails. The way I 
 use this method is: 
 
 sliceRangeQuery.setFamily(family);
 sliceRangeQuery.setKeys(rowkey,rowkey);
 sliceRangeQuery.setRange(Character.MIN_VALUE, Character.MAX_VALUE, false, Integer.MAX_VALUE);
 sliceRangeQuery.setRowCount(1); 
 
 The current column family has only about 1 rows; each row has ~10 columns, and the data 
 is about 25k. 
 
 
 On Wed, Dec 12, 2012 at 11:13 AM, dong.yajun dongt...@gmail.com wrote:
 Thanks, Aaron. 
 
 More information: I can read data correctly using Aquiles with 
 LOCAL_QUORUM. 
 
 I just checked the system.log, which is normal on 172.16.74.31, and the 
 RPC timeout is 10s. The client exception occurred on 2012-12-05. 
 
 All the logs on the server for 2012-12-05 were:
 
  WARN [pool-1-thread-1] 2012-12-05 11:17:10,974 Memtable.java (line 169) 
 setting live ratio to minimum of 1.0 instead of 0.09080883864721297
  INFO [pool-1-thread-1] 2012-12-05 11:17:10,974 Memtable.java (line 179) 
 CFS(Keyspace='APIPortal', ColumnFamily='Log') liveRatio is 1.0 (just-counted 
 was 1.0).  calculation took 4ms for 256 columns
  INFO [pool-1-thread-1] 2012-12-05 17:48:25,988 Memtable.java (line 179) 
 CFS(Keyspace='APIPortal', ColumnFamily='WebSite') liveRatio is 
 2.780009341429239 (just-counted was 2.780009341429239).  calculation took 1ms 
 for 52 columns
  INFO [pool-1-thread-1] 2012-12-05 17:48:27,944 Memtable.java (line 179) 
 CFS(Keyspace='APIPortal', ColumnFamily='WebSite') liveRatio is 
 2.780009341429239 (just-counted was 2.3128462147190714).  calculation took 
 1ms for 75 columns
  INFO [CompactionExecutor:153] 2012-12-05 18:11:27,718 AutoSavingCache.java 
 (line 269) Saved OpsCenter-rollups60-KeyCache (18 items) in 2 ms
  INFO [COMMIT-LOG-WRITER] 2012-12-05 20:31:10,025 CommitLogSegment.java (line 
 60) Creating new commitlog segment 
 /data/cassandra/commitlog/CommitLog-1354768270025.log
  INFO [ScheduledTasks:1] 2012-12-05 21:47:38,352 GCInspector.java (line 123) 
 GC for ParNew: 437 ms for 1 collections, 1163185072 used; max is 8375238656
  INFO [CompactionExecutor:163] 2012-12-05 22:11:27,679 AutoSavingCache.java 
 (line 269) Saved APIPortal-WebSite-KeyCache (1 items) in 2 ms
 
 On Wed, Dec 12, 2012 at 5:45 AM, aaron morton aa...@thelastpickle.com wrote:
  Caused by: TimedOutException()
 This means the nodes involved in the request did not return to the coordinator 
 before the rpc_timeout expired.
 
 Check the logs on the servers to see if they are overloaded and dropping 
 messages.
 
 Also check that you are not asking for too much data.
 
 Cheers
 
 -
 Aaron Morton
 Freelance Cassandra Developer
 New Zealand
 
 @aaronmorton
 http://www.thelastpickle.com
 
 On 11/12/2012, at 10:13 PM, dong.yajun dongt...@gmail.com wrote:
 
  hi list,
  I am using Cassandra with 3 data centers, each DC has more than 10 nodes.
 
  the schema for a keyspace:
  {DC1:3, DC2:3, DC3:3}
 
  Now, I put some rows using Hector with CL LOCAL_QUORUM in DC1, and then I 
  get a row with the same CL LOCAL_QUORUM in DC1; some exceptions occurred:
 
  Cassandra with dsc-1.0.5, and Hector with 1.1-2.
 
  2012-12-05 21:26:49,667 - WARN [pool-1-thread-3:JCLLoggerAdapter@379] - 
  Could not fullfill request on this host CassandraClient172.16.74.31:9160-1
  2012-12-05 21:26:49,668 - WARN [pool-1-thread-3:JCLLoggerAdapter@437] - 
  Exception:
  me.prettyprint.hector.api.exceptions.HTimedOutException: TimedOutException()
  at 
  me.prettyprint.cassandra.service.ExceptionsTranslatorImpl.translate(ExceptionsTranslatorImpl.java:35)
  at 
  me.prettyprint.cassandra.service.KeyspaceServiceImpl$3.execute(KeyspaceServiceImpl.java:163)
  at 
  me.prettyprint.cassandra.service.KeyspaceServiceImpl$3.execute(KeyspaceServiceImpl.java:145)
  at 
  me.prettyprint.cassandra.service.Operation.executeAndSetResult(Operation.java:103)
  at 
  me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:258)
  at 
  me.prettyprint.cassandra.service.KeyspaceServiceImpl.operateWithFailover(KeyspaceServiceImpl.java:131)
  at 
  me.prettyprint.cassandra.service.KeyspaceServiceImpl.getRangeSlices(KeyspaceServiceImpl.java:167)
  at 
  me.prettyprint.cassandra.model.thrift.ThriftRangeSlicesQuery$1.doInKeyspace(ThriftRangeSlicesQuery.java:66)
  at 
  me.prettyprint.cassandra.model.thrift.ThriftRangeSlicesQuery$1.doInKeyspace(ThriftRangeSlicesQuery.java:62)
  at 
  

Re: Multiple Data Center shows very uneven load

2012-12-12 Thread Sergey Olefir
I do have a (good?) reason for ByteOrderedPartitioner - I need to be able to
do range queries. At the same time I'm aware of the need to balance the
cluster - so I'm hashing my keys and prefixing them with latin letters from
a to p (16 in total) - hence the tokens I'm using.
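
For illustration, a sketch of that kind of bucketing (the hash and the prefix layout
here are only an example, not necessarily the exact scheme used):

public class BucketedKey {
    // Prefix each natural key with one of 16 letters ('a'..'p') derived from a hash,
    // so rows spread across the token ranges while range scans still work per bucket.
    public static String bucketed(String naturalKey) {
        int bucket = naturalKey.hashCode() & 0x0f;   // 0..15
        char prefix = (char) ('a' + bucket);         // 'a'..'p'
        return prefix + naturalKey;
    }
}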

Based on my experiments I also think that effective ownership as reported
by nodetool is accurate - so it seems this cluster is perfectly balanced.

What I'm worrying about here is the fact that the load on the cluster is
very uneven even after a short test and it doesn't get better when cluster
is idle. Even more worryingly it doesn't fix itself after I dropped the only
keyspace that was ever on this cluster. So I'm afraid that some data is
getting orphaned - and it could become a real problem in production.

I'd love some kind of confirmation one way or another - is this something I
need to worry about or will it fix itself somehow over time?

P.S. I actually had a repeat performance under another test - hugely uneven
data load and data that won't go away even after keyspace drop.


aaron morton wrote
 c:\SERVERS\apache-cassandra-1.1.6\binnodetool -h 11.111.111.1 ring
 Starting NodeTool
 Address DC  RackStatus State   Load   
 Effective-Ownership Token
 
 Token(bytes[6c03])
 11.111.111.1VA  SVA Up Normal  1.44 GB
 33.33%  Token(bytes[6101])
 22.222.22.2 LA  SLA Up Normal  361.25 MB  
 33.33%  Token(bytes[6103])
 33.333.333.3VA  SVA Up Normal  357.01 MB  
 33.33%  Token(bytes[6601])
 44.444.44.4 LA  SLA Up Normal  380.05 MB  
 33.33%  Token(bytes[6603])
 55.555.555.5VA  SNY Up Normal  498.43 MB  
 33.33%  Token(bytes[6c01])
 66.666.666.6LA  SNY Up Normal  512.21 MB  
 33.33%  Token(bytes[6c03])
 
 This already shows uneven load despite the fact that nodes actually hold
 the
 same amount of data (according to effective-ownership and application
 design
 / token values). 
 
 Assuming you are using the Byte Ordered Partitioner. If so it's up to you
 to ensure your rows are evenly distributed between the nodes. 
 
 Node 11.111.111.1 is responsible for a very wide range of tokens. I would
 suggest using the Random Partitioner unless you have a *very good* reason
 not to. I would go as far to say if you are using Cassandra for the firs
 time avoid the BOP until you have a better understanding of how Cassandra
 works. 
 
 
 Why 11.111.111.1 still has 700+ MB of data? I tried nodetool repair,
 nodetool cleanup, and node restart (for 11.111.111.1) -- none of that
 helped
 (it still had those 700+ MB of data). 
 Glancing at the logs it could be Hints or commit log replaying.
 Use nodetool cfstats to see which CF's are on the node and what their size
 is. 
 
 If this is still a test the easiest thing to do is nuke it and start
 again. 
 
 Cheers
 
 
 -
 Aaron Morton
 Freelance Cassandra Developer
 New Zealand
 
 @aaronmorton
 http://www.thelastpickle.com
 
 On 12/12/2012, at 12:23 PM, Sergey Olefir <solf.lists@> wrote:
 
 aaron morton wrote
 Hi Sergey, I think you have forgotten to include some information in
 your
 email. 
 
 Ah, I used Nable's markup and it seems to have eaten text somehow.
 Anyway,
 here it is without formatting (much harder to read though)
 
 I have a very similar issue myself and would love to know what (if
 anything)
 needs to be done (I'm using 1.1.6 at the moment). After stress-test
 (maximum
 load), I had the following on my nodes (every IP is changed to the
 corresponding fake number): 
 
 c:\SERVERS\apache-cassandra-1.1.6\binnodetool -h 11.111.111.1 ring
 Starting NodeTool
 Address DC  RackStatus State   Load   
 Effective-Ownership Token
 
 Token(bytes[6c03])
 11.111.111.1VA  SVA Up Normal  1.44 GB
 33.33%  Token(bytes[6101])
 22.222.22.2 LA  SLA Up Normal  361.25 MB  
 33.33%  Token(bytes[6103])
 33.333.333.3VA  SVA Up Normal  357.01 MB  
 33.33%  Token(bytes[6601])
 44.444.44.4 LA  SLA Up Normal  380.05 MB  
 33.33%  Token(bytes[6603])
 55.555.555.5VA  SNY Up Normal  498.43 MB  
 33.33%  Token(bytes[6c01])
 66.666.666.6LA  SNY Up Normal  512.21 MB  
 33.33%  Token(bytes[6c03])
 
 This already shows uneven load despite the fact that nodes actually hold
 the
 same amount of data (according to effective-ownership and application
 design
 / token values). 
 
 However I then proceeded to drop the single keyspace that was on this
 cluster and I got this: 
 
 c:\SERVERS\apache-cassandra-1.1.6\binnodetool -h 11.111.111.1 ring
 Starting NodeTool
 Note: 

Re: Multiple Data Center shows very uneven load

2012-12-12 Thread Sergey Olefir
Nick Bailey-2 wrote
 Dropping a keyspace causes a snapshot to be taken of the keyspace before it
 is removed from the schema. So it won't actually delete any data. You can
 manually delete the data from /var/lib/cassandra/<ks>/<cf[s]>/snapshots

Indeed, it looks like the snapshot is on the file system. However, it looks like
it is not the only thing by a long shot, i.e.:
cassa1-1:/var/log/cassandra# du -k /spool1/cassandra/data/1.1/
375372 
/spool1/cassandra/data/1.1/rainmanLoadTestKeyspace/marquisColumnFamily/snapshots/1355222054452-marquisColumnFamily
375376 
/spool1/cassandra/data/1.1/rainmanLoadTestKeyspace/marquisColumnFamily/snapshots
375380 
/spool1/cassandra/data/1.1/rainmanLoadTestKeyspace/marquisColumnFamily
375384  /spool1/cassandra/data/1.1/rainmanLoadTestKeyspace
4   /spool1/cassandra/data/1.1/system/Versions
52  /spool1/cassandra/data/1.1/system/schema_columns
4   /spool1/cassandra/data/1.1/system/Schema
28  /spool1/cassandra/data/1.1/system/NodeIdInfo
4   /spool1/cassandra/data/1.1/system/Migrations
28  /spool1/cassandra/data/1.1/system/schema_keyspaces
28  /spool1/cassandra/data/1.1/system/schema_columnfamilies
786348  /spool1/cassandra/data/1.1/system/HintsColumnFamily
52  /spool1/cassandra/data/1.1/system/LocationInfo
4   /spool1/cassandra/data/1.1/system/IndexInfo
786556  /spool1/cassandra/data/1.1/system
1161944 /spool1/cassandra/data/1.1/


There is also 700+ MB in the commit log. Neither of these seemed to go away on
its own when idle, or even after running nodetool repair/cleanup and even
dropping the keyspace.

I suppose these hints and the commit log may be the reason behind the huge difference
in load between nodes -- but why does it happen, and more importantly, is it
harmful? Will it keep accumulating?





Re: Null Error Running pig_cassandra

2012-12-12 Thread aaron morton
There are about 3 checks that should have caught the null. 

 at 
 org.apache.cassandra.hadoop.ConfigHelper.getInputSlicePredicate(ConfigHelper.java:176)
This line does not match the source code for the 1.2.0-beta3 tag.

Can you try it with the 1.1.7 bin distro ? 

Cheers

-
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 13/12/2012, at 6:04 AM, James Schappet jschap...@gmail.com wrote:

 When trying to run the example-script.pig, I get the following error, null 
 error.
 
 
 
 
 tsunami:pig schappetj$ bin/pig_cassandra -x local example-script.pig 
 Using /Library/pig-0.10.0/pig-0.10.0.jar.
 2012-12-12 11:02:54,079 [main] INFO  org.apache.pig.Main - Apache Pig version 
 0.10.0 (r1328203) compiled Apr 20 2012, 00:33:25
 2012-12-12 11:02:54,079 [main] INFO  org.apache.pig.Main - Logging error 
 messages to: 
 /Library/apache-cassandra-1.2.0-beta3-src/examples/pig/pig_1355331774074.log
 2012-12-12 11:02:54.211 java[78283:1c03] Unable to load realm info from 
 SCDynamicStore
 2012-12-12 11:02:54,400 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting 
 to hadoop file system at: file:///
 2012-12-12 11:02:54,890 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
 1200: null
 Details at logfile: 
 /Library/apache-cassandra-1.2.0-beta3-src/examples/pig/pig_1355331774074.log
 
 
 
 
 
 PIG 0.10.0
 Cassandra 1.2.0-Beta3 w/source
 
 ENV Osx Lion:
 $ java -version
 java version 1.7.0_04
 Java(TM) SE Runtime Environment (build 1.7.0_04-b21)
 Java HotSpot(TM) 64-Bit Server VM (build 23.0-b21, mixed mode)
 
 set | grep PIG
 PIG_HOME=/Library/pig-0.10.0
 PIG_INITIAL_ADDRESS=localhost
 PIG_PARTITIONER=org.apache.cassandra.dht.RandomPartitioner
 PIG_RPC_PORT=9160
 
 
 
 
 
 
 
 
 Pig Stack Trace
 ---
 ERROR 1200: null
 
 org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1000: Error during 
 parsing. null
 at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1597)
 at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1540)
 at org.apache.pig.PigServer.registerQuery(PigServer.java:540)
 at 
 org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:970)
 at 
 org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:386)
 at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:189)
 at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:165)
 at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84)
 at org.apache.pig.Main.run(Main.java:555)
 at org.apache.pig.Main.main(Main.java:111)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:601)
 at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
 Caused by: Failed to parse: null
 at 
 org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:184)
 at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1589)
 ... 14 more
 Caused by: java.lang.NullPointerException
 at org.apache.cassandra.utils.Hex.hexToBytes(Hex.java:51)
 at 
 org.apache.cassandra.hadoop.ConfigHelper.predicateFromString(ConfigHelper.java:206)
 at 
 org.apache.cassandra.hadoop.ConfigHelper.getInputSlicePredicate(ConfigHelper.java:176)
 at 
 org.apache.cassandra.hadoop.pig.CassandraStorage.setLocation(CassandraStorage.java:577)
 at 
 org.apache.cassandra.hadoop.pig.CassandraStorage.getSchema(CassandraStorage.java:610)
 at 
 org.apache.pig.newplan.logical.relational.LOLoad.getSchemaFromMetaData(LOLoad.java:151)
 at 
 org.apache.pig.newplan.logical.relational.LOLoad.getSchema(LOLoad.java:110)
 at 
 org.apache.pig.parser.LogicalPlanGenerator.alias_col_ref(LogicalPlanGenerator.java:15356)
 at 
 org.apache.pig.parser.LogicalPlanGenerator.col_ref(LogicalPlanGenerator.java:15203)
 at 
 org.apache.pig.parser.LogicalPlanGenerator.projectable_expr(LogicalPlanGenerator.java:8881)
 at 
 org.apache.pig.parser.LogicalPlanGenerator.var_expr(LogicalPlanGenerator.java:8632)
 at 
 org.apache.pig.parser.LogicalPlanGenerator.expr(LogicalPlanGenerator.java:7984)
 at 
 org.apache.pig.parser.LogicalPlanGenerator.flatten_clause(LogicalPlanGenerator.java:6103)
 at 
 org.apache.pig.parser.LogicalPlanGenerator.flatten_generated_item(LogicalPlanGenerator.java:5926)
 at 
 org.apache.pig.parser.LogicalPlanGenerator.generate_clause(LogicalPlanGenerator.java:14101)
 at 
 org.apache.pig.parser.LogicalPlanGenerator.foreach_plan(LogicalPlanGenerator.java:12493)
 at 
 

Re: upgrade from 0.8.5 to 1.1.6, now it cannot find schema

2012-12-12 Thread aaron morton
 in-vm cassandra
Embedded ? 

The location of the SSTables has changed in 1.1; they are now in 
/var/lib/cassandra/data/KS_NAME/CF_NAME/SSTable.data. Is the data in the right 
place?

Cheers

-
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 13/12/2012, at 6:54 AM, Anand Somani meatfor...@gmail.com wrote:

 Hi,
 
 We have a service which uses in-vm Cassandra and creates the schema 
 programmatically if it does not exist; this has worked for us for some time (including 
 upgrades to the service) and we have been using 0.8.5. 
 
 Now we are testing the upgrade to 1.1.6 and noticed that on upgrade 
 Cassandra fails to find the old schema and wants to create a new one. Even 
 using the CLI it does not show the old schema.
 
 Has anybody come across this? FYI, we have another cluster where we run 
 Cassandra as a separate process and the schema is also created externally using the CLI; 
 the upgrade there went fine.
 
 Has anybody seen this behavior? Any clues?
 
 I am going to look at creating the schema from outside to see if that is the 
 culprit, but wanted to see if anybody had any suggestions/thoughts.
 
 Thanks
 Anand



Re: Multiple Data Center shows very uneven load

2012-12-12 Thread aaron morton
Try nodetool drain. It will flush everything to disk and the commit log will be 
truncated.

The hinted handoff (HH) data can be ignored. If you really want it gone, it can be purged using the 
JMX interface, or you can stop the node and delete the sstables. 


Cheers
 
-
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 13/12/2012, at 10:35 AM, Sergey Olefir solf.li...@gmail.com wrote:

 Nick Bailey-2 wrote
 Dropping a keyspace causes a snapshot to be taken of the keyspace before
 it
 is removed from the schema. So it won't actually delete any data. You can
  manually delete the data from /var/lib/cassandra/<ks>/<cf[s]>/snapshots
 
 Indeed, it looks like snapshot is on the file system. However it looks like
 it is not the only thing by a long shot, i.e.:
 cassa1-1:/var/log/cassandra# du -k /spool1/cassandra/data/1.1/
 375372 
 /spool1/cassandra/data/1.1/rainmanLoadTestKeyspace/marquisColumnFamily/snapshots/1355222054452-marquisColumnFamily
 375376 
 /spool1/cassandra/data/1.1/rainmanLoadTestKeyspace/marquisColumnFamily/snapshots
 375380 
 /spool1/cassandra/data/1.1/rainmanLoadTestKeyspace/marquisColumnFamily
 375384  /spool1/cassandra/data/1.1/rainmanLoadTestKeyspace
 4   /spool1/cassandra/data/1.1/system/Versions
 52  /spool1/cassandra/data/1.1/system/schema_columns
 4   /spool1/cassandra/data/1.1/system/Schema
 28  /spool1/cassandra/data/1.1/system/NodeIdInfo
 4   /spool1/cassandra/data/1.1/system/Migrations
 28  /spool1/cassandra/data/1.1/system/schema_keyspaces
 28  /spool1/cassandra/data/1.1/system/schema_columnfamilies
 786348  /spool1/cassandra/data/1.1/system/HintsColumnFamily
 52  /spool1/cassandra/data/1.1/system/LocationInfo
 4   /spool1/cassandra/data/1.1/system/IndexInfo
 786556  /spool1/cassandra/data/1.1/system
 1161944 /spool1/cassandra/data/1.1/
 
 
 And also 700+MB in the commitlog. Neither of which seemed to 'go away' on
 its own when idle or even after running nodetool repair/cleanup and even
 dropping keyspace.
 
 I suppose these hints and commitlog may be the reason behind huge difference
 in load on nodes -- but why does it happen and more importantly is it
 harmful? Will it keep accumulating?
 
 
 



Datastax C*ollege Credit Webinar Series : Create your first Java App w/ Cassandra

2012-12-12 Thread Brian O'Neill
FWIW --
I'm presenting tomorrow for the Datastax C*ollege Credit Webinar Series:
http://brianoneill.blogspot.com/2012/12/presenting-for-datastax-college-credit.html

I hope to make CQL part of the presentation and show how it integrates
with the Java APIs.
If you are interested, drop in.

-brian

-- 
Brian ONeill
Lead Architect, Health Market Science (http://healthmarketscience.com)
mobile:215.588.6024
blog: http://brianoneill.blogspot.com/
twitter: @boneill42


Re: Why Secondary indexes is so slowly by my test?

2012-12-12 Thread Chengying Fang
You are right, Dean. It's due to the heavy result set returned by the query, not the index 
itself. According to my test, if the result is fewer than 5000 rows, it's very 
quick. But how can I limit the result? A row limit seems like a good choice, but if I 
do so, some rows I want may be missed because the row order does not fulfill the query 
conditions.
For example: CF User{I1,C1} with an index on I1, and query conditions I1=foo, order by 
C1. If I1=foo returns 1 with limit 100, I can't get the right result ordered by C1. Also, 
we cannot always set a row range that fulfills the query conditions when querying. 
Maybe I should redesign the CF model to fix it.
 
-- Original --
From: Hiller, Dean dean.hil...@nrel.gov
Date: Wed, Dec 12, 2012 10:51 PM
To: user@cassandra.apache.org

Subject:  Re: Why Secondary indexes is so slowly by my test?

 
You could always try PlayOrm's query capability on top of cassandra ;) ... it 
works for us.

Dean

From: Chengying Fang cyf...@ngnsoft.com
Reply-To: user@cassandra.apache.org
Date: Tuesday, December 11, 2012 8:22 PM
To: user user@cassandra.apache.org
Subject: Re: Why Secondary indexes is so slowly by my test?

Thanks to Low. We use CompositeColumn to substitue it in single not-equality 
and definite equalitys query. And we will give up cassandra because of the weak 
query ability and unstability. Many times, we found our data in confusion 
without definite  cause in our cluster. For example, only two rows in one CF, 
row1-columnname1-columnvalue1,row2-columnname2-columnvalue2, but some times, it 
becomes row1-columnname1-columnvalue2,row2-columnname2-columnvalue1. Notice the 
wrong column value.


-- Original --
From: Richard Low r...@acunu.com
Date: Tue, Dec 11, 2012 07:44 PM
To: user user@cassandra.apache.org
Subject:  Re: Why Secondary indexes is so slowly by my test?

Hi,

Secondary index lookups are more complicated than normal queries so will be 
slower. Items have to first be queried in the index, then retrieved from their 
actual location. Also, inserting into indexed CFs will be slower (but will get 
substantially faster in 1.2 due to CASSANDRA-2897).

If you need to retrieve large amounts of data with your query, you would be 
better off changing your data model to not use secondary indexes.

Richard.


On 7 December 2012 03:08, Chengying Fang cyf...@ngnsoft.com wrote:
Hi guys,

I found Secondary indexes too slowly in my product(amazon large instance) with 
cassandra, then I did test again as describe here. But the result is the same 
as product. What's wrong with cassandra or me?
Now my test:
newly installed ubuntu-12.04 LTS , apache-cassandra-1.1.6, default configure, 
just one keyspace(test) and one CF(TestIndex):

 1.  CREATECOLUMN FAMILY TestIndex
 2.  WITH comparator = UTF8Type
 3.  AND key_validation_class=UTF8Type
 4.  AND default_validation_class = UTF8Type
 5.  AND column_metadata = [
 6.  {column_name: tk, validation_class: UTF8Type, index_type: KEYS}
 7.  {column_name: from, validation_class: UTF8Type}
 8.  {column_name: to, validation_class: UTF8Type}
 9.  {column_name: tm, validation_class: UTF8Type}
 10. ];

and 'tk' just three value:'A'(1000row),'B'(1000row),'X'(increment by test)
The test query from cql:
1,without index:selectcount(*) from TestIndex limit 100;
2,with index:selectcount(*) from TestIndex where tk='X' limit 100;
When I insert 6 row 'X', the time:1s and 12s.
When 'X' up to 13,the time:2.3s and 33s.
When 'X' up to 25,the time:3.8s and 53s.

According to this, when 'X' up to billon, what's the result? Can Secondary 
indexes be used in product? I hope it's my mistake in doing this test.Can 
anyone give some tips about it?
Thanks in advance.
fancy



--
Richard Low
Acunu | http://www.acunu.com | @acunu

Re: Why Secondary indexes is so slowly by my test?

2012-12-12 Thread aaron morton
The IndexClause for the get_indexed_slices takes a start key. You can page the 
results from your secondary index query by making multiple calls with a sane 
count and including a start key. 
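
For example, a minimal Hector sketch of such paging against the TestIndex CF discussed
earlier in this thread (the 100-row page size is arbitrary; String keys, names and
values are assumed):

import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.beans.OrderedRows;
import me.prettyprint.hector.api.beans.Row;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.query.IndexedSlicesQuery;

public class PagedIndexQuery {
    public static void pageTkEqualsX(Keyspace keyspace) {
        StringSerializer se = StringSerializer.get();
        String startKey = "";
        while (true) {
            IndexedSlicesQuery<String, String, String> q =
                    HFactory.createIndexedSlicesQuery(keyspace, se, se, se);
            q.setColumnFamily("TestIndex");
            q.addEqualsExpression("tk", "X");
            q.setRange("", "", false, 100);   // columns per row
            q.setRowCount(100);               // rows per page
            q.setStartKey(startKey);
            OrderedRows<String, String, String> rows = q.execute().get();
            for (Row<String, String, String> row : rows) {
                // process row (pages after the first start with the previous
                // page's last key, so skip that duplicate)
            }
            if (rows.getCount() < 100) {
                break;                        // last page
            }
            startKey = rows.peekLast().getKey();
        }
    }
}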

Cheers

-
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 13/12/2012, at 6:34 PM, Chengying Fang cyf...@ngnsoft.com wrote:

 You are right, Dean. It's due to the heavy result returned by query, not 
 index itself. According to my test, if the result  rows less than 5000, it's 
 very quick. But how to limit the result? It seems row limit is a good choice. 
 But if do so, some rows I wanted  maybe miss because the row order not 
 fulfill query conditions.
 For example: CF User{I1,C1} with Index I1. Query conditions:I1=foo, order by 
 C1. If I1=foo return 1 limit 100, I can't get the right result of C1. 
 Also we can not always set row range fulfill the query conditions when doing 
 query. Maybe I should redesign the CF model to fix it.
  
 -- Original --
 From:  Hiller, Deandean.hil...@nrel.gov;
 Date:  Wed, Dec 12, 2012 10:51 PM
 To:  user@cassandra.apache.orguser@cassandra.apache.org;
 Subject:  Re: Why Secondary indexes is so slowly by my test?
  
 You could always try PlayOrm's query capability on top of cassandra ;)??.it 
 works for us.
 
 Dean
 
 From: Chengying Fang cyf...@ngnsoft.commailto:cyf...@ngnsoft.com
 Reply-To: user@cassandra.apache.orgmailto:user@cassandra.apache.org 
 user@cassandra.apache.orgmailto:user@cassandra.apache.org
 Date: Tuesday, December 11, 2012 8:22 PM
 To: user user@cassandra.apache.orgmailto:user@cassandra.apache.org
 Subject: Re: Why Secondary indexes is so slowly by my test?
 
 Thanks to Low. We use CompositeColumn to substitue it in single not-equality 
 and definite equalitys query. And we will give up cassandra because of the 
 weak query ability and unstability. Many times, we found our data in 
 confusion without definite  cause in our cluster. For example, only two rows 
 in one CF, row1-columnname1-columnvalue1,row2-columnname2-columnvalue2, but 
 some times, it becomes 
 row1-columnname1-columnvalue2,row2-columnname2-columnvalue1. Notice the wrong 
 column value.
 
 
 -- Original --
 From:  Richard Lowr...@acunu.commailto:r...@acunu.com;
 Date:  Tue, Dec 11, 2012 07:44 PM
 To:  useruser@cassandra.apache.orgmailto:user@cassandra.apache.org;
 Subject:  Re: Why Secondary indexes is so slowly by my test?
 
 Hi,
 
 Secondary index lookups are more complicated than normal queries so will be 
 slower. Items have to first be queried in the index, then retrieved from 
 their actual location. Also, inserting into indexed CFs will be slower (but 
 will get substantially faster in 1.2 due to CASSANDRA-2897).
 
 If you need to retrieve large amounts of data with your query, you would be 
 better off changing your data model to not use secondary indexes.
 
 Richard.
 
 
 On 7 December 2012 03:08, Chengying Fang 
 cyf...@ngnsoft.commailto:cyf...@ngnsoft.com wrote:
 Hi guys,
 
 I found Secondary indexes too slowly in my product(amazon large instance) 
 with cassandra, then I did test again as describe here. But the result is the 
 same as product. What's wrong with cassandra or me?
 Now my test:
 newly installed ubuntu-12.04 LTS , apache-cassandra-1.1.6, default configure, 
 just one keyspace(test) and one CF(TestIndex):
 
  1.  CREATECOLUMN FAMILY TestIndex
  2.  WITH comparator = UTF8Type
  3.  AND key_validation_class=UTF8Type
  4.  AND default_validation_class = UTF8Type
  5.  AND column_metadata = [
  6.  {column_name: tk, validation_class: UTF8Type, index_type: KEYS}
  7.  {column_name: from, validation_class: UTF8Type}
  8.  {column_name: to, validation_class: UTF8Type}
  9.  {column_name: tm, validation_class: UTF8Type}
  10. ];
 
 and 'tk' just three value:'A'(1000row),'B'(1000row),'X'(increment by test)
 The test query from cql:
 1,without index:selectcount(*) from TestIndex limit 100;
 2,with index:selectcount(*) from TestIndex where tk='X' limit 100;
 When I insert 6 row 'X', the time:1s and 12s.
 When 'X' up to 13,the time:2.3s and 33s.
 When 'X' up to 25,the time:3.8s and 53s.
 
 According to this, when 'X' up to billon, what's the result? Can Secondary 
 indexes be used in product? I hope it's my mistake in doing this test.Can 
 anyone give some tips about it?
 Thanks in advance.
 fancy
 
 
 
 --
 Richard Low
 Acunu | http://www.acunu.com | @acunu