Re: Correct model

Hiller, Dean Mon, 24 Sep 2012 11:25:32 -0700

PlayOrm will automatically create a CF to index my CF?

It creates 3 CFs for all indices, IntegerIndice, DecimalIndice, and 
StringIndice, so that the ad-hoc tool that is in development can display the 
indices. The tool knows whether the prefix of the composite column name is an 
Integer, Decimal or String, and it knows the postfix type as well, so it can 
translate back from bytes to the types and display them properly in a GUI 
(i.e. on top of SELECT, the ad-hoc tool is adding a way to view the index rows 
so you can check whether they got corrupted or not).


Will it auto-manage it, like Cassandra's secondary indexes?

YES

Further detail…

You annotate fields with @NoSqlIndexed and PlayOrm adds to/removes from the 
index as you add/modify/remove the entity. A modify removes the old value from 
the index and inserts the new value into the index.
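
A minimal in-memory sketch of the add/modify/remove bookkeeping described 
above. It is only an illustration of the idea, not PlayOrm's internal code: 
the class IndexMaintenanceSketch and its methods are hypothetical names, and a 
TreeMap stands in for an index row.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration only; this is not PlayOrm's internal code.
public class IndexMaintenanceSketch {
    // indexed value -> row keys holding that value (stands in for one index row)
    private final TreeMap<Integer, Set<String>> index = new TreeMap<>();
    // row key -> current indexed value (stands in for the entity table)
    private final Map<String, Integer> rows = new TreeMap<>();

    // Insert or modify: drop the old index entry (if any), then add the new one.
    public void put(String rowKey, int value) {
        Integer old = rows.put(rowKey, value);
        if (old != null) {
            removeFromIndex(rowKey, old);
        }
        index.computeIfAbsent(value, v -> new TreeSet<>()).add(rowKey);
    }

    // Remove the entity and its index entry together.
    public void remove(String rowKey) {
        Integer old = rows.remove(rowKey);
        if (old != null) {
            removeFromIndex(rowKey, old);
        }
    }

    // Look up row keys by indexed value.
    public Set<String> find(int value) {
        return index.getOrDefault(value, Collections.<String>emptySet());
    }

    private void removeFromIndex(String rowKey, int value) {
        Set<String> keys = index.get(value);
        if (keys != null) {
            keys.remove(rowKey);
            if (keys.isEmpty()) {
                index.remove(value);
            }
        }
    }
}
```

Calling put("row1", 5) and then put("row1", 9) leaves "row1" findable under 9 
only, which is the remove-old/insert-new behaviour of a modify.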

An example would be that PlayOrm stores all long, int, short and byte values 
in a type that uses the least amount of space, so IF you have a long OR 
BigInteger between -128 and 127 it only ends up storing 1 byte in Cassandra 
(SAVING tons of space!!!). Then if you are indexing a field of one of those 
types, PlayOrm creates an IntegerIndice table.
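
One way to get this kind of compact storage in Java is the minimal 
two's-complement encoding that java.math.BigInteger already provides. The 
sketch below assumes that approach; whether PlayOrm's converters work exactly 
this way is not confirmed here, and CompactIntegerSketch is a hypothetical 
name.

```java
import java.math.BigInteger;

// Hypothetical helper, not PlayOrm's actual converter.
public class CompactIntegerSketch {
    // Encode using the fewest bytes that hold the value in two's complement.
    public static byte[] encode(long value) {
        return BigInteger.valueOf(value).toByteArray();
    }

    // Decode back to a long; throws ArithmeticException if it does not fit.
    public static long decode(byte[] bytes) {
        return new BigInteger(bytes).longValueExact();
    }
}
```

With this encoding, values from -128 to 127 take a single byte, 128 takes two, 
and only values near the limits of long need all eight.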

Right now, another guy is working on playorm-server, which is a web GUI that 
allows ad-hoc access to all your data, so you can run ad-hoc queries to see 
data. Instead of showing hex, it shows the real values by translating the 
bytes to Strings, for the portions of the schema that it is aware of, that is.

Later,
Dean

From: Marcelo Elias Del Valle <mvall...@gmail.com>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Monday, September 24, 2012 12:09 PM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: Re: Correct model

Dean,

    There is one last thing I would like to ask about playOrm on this list; 
the next questions will go to Stack Overflow. Because of the context, I 
prefer asking this here:
     When you say playOrm indexes a table (which would be a CF behind the 
scenes), what do you mean? PlayOrm will automatically create a CF to index my 
CF? Will it auto-manage it, like Cassandra's secondary indexes?
     In Cassandra, the application is responsible for maintaining the index, 
right? I might be wrong, but unless I am using secondary indexes I need to 
update index values manually, right?
     I got confused when you said "PlayOrm indexes the columns you choose". How 
do I choose, and what exactly does it mean?

Best regards,
Marcelo Valle.

2012/9/24 Hiller, Dean <dean.hil...@nrel.gov>
Oh, ok, you were talking about the wide row pattern, right?

yes

But playORM is compatible with Aaron's model, isn't it?

Not yet. PlayOrm supports partitioning one table multiple ways, as it indexes 
the columns (in your case, the userid FK column and the time column).

Can I map exactly this using playORM?

Not yet, but the plan is to map these typical Cassandra scenarios as well.

 Can I ask playOrm questions in this list?

The best place to ask PlayOrm questions is Stack Overflow, tagged with 
PlayOrm, though I monitor both this list and Stack Overflow for questions 
(there are already a few questions on Stack Overflow).

The examples directory is empty for now, I would like to see how to set up the 
connection with it.

Running build or build.bat is always kept working and all 62 tests pass (or we 
don't merge to master), so to see how to make a connection or run an example:

 1.  Run build.bat or build, which generates the parsing code.
 2.  Import into Eclipse (.classpath and .project are already there for you).
 3.  In FactorySingleton.java you can change IN_MEMORY to CASSANDRA (or leave 
it) and run any of the tests in-memory or against localhost (we also run the 
test suite against a 6 node cluster and it all passes).
 4.  FactorySingleton probably has the code you are looking for, plus you need 
a class called nosql.Persistence or it won't scan your jar file (a class file, 
not an XML file like JPA).

Do you mean I need to load all the keys in memory to do a multi get?

No, you batch.  I am not sure about CQL, but PlayOrm returns a Cursor, not the 
results, so you can loop through every key while behind the scenes it does 
batch requests: it loads up 100 keys, makes one multi get request for those 
100 keys, then loads up the next 100 keys, and so on.  I need to look more 
into the APIs and protocol of CQL to see if it allows this style of batching.  
PlayOrm does support this style of batching today.  Aaron would know if CQL 
does.
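
The batching idea can be sketched as an iterator that pulls keys in groups and 
issues one multi get per group. This is a hypothetical stand-alone 
illustration: BatchingCursorSketch and its multiGet parameter are assumed 
names, not PlayOrm's Cursor API.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.Function;

// Hypothetical sketch; the class and its parameters are not PlayOrm API.
public class BatchingCursorSketch<K, R> implements Iterator<R> {
    private final Iterator<K> keys;
    private final int batchSize;
    // One call to this function stands for one multi get request.
    private final Function<List<K>, List<R>> multiGet;
    private Iterator<R> current = new ArrayList<R>().iterator();

    public BatchingCursorSketch(Iterator<K> keys, int batchSize,
                                Function<List<K>, List<R>> multiGet) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.keys = keys;
        this.batchSize = batchSize;
        this.multiGet = multiGet;
    }

    @Override
    public boolean hasNext() {
        // Fetch the next batch only when the current one is exhausted,
        // so at most batchSize rows are held in memory at a time.
        while (!current.hasNext() && keys.hasNext()) {
            List<K> batch = new ArrayList<>(batchSize);
            while (batch.size() < batchSize && keys.hasNext()) {
                batch.add(keys.next());
            }
            current = multiGet.apply(batch).iterator();
        }
        return current.hasNext();
    }

    @Override
    public R next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return current.next();
    }
}
```

Nothing is fetched when the cursor is constructed; with 250 keys and a batch 
size of 100, looping to the end makes three multi get calls.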

Why did you move? Hector is being considered for being the "official" client 
for Cassandra, isn't it?

At the time, I wanted the file streaming feature.  Also, Hector seemed a bit 
cumbersome compared to Astyanax, at least if you were building a platform and 
had no use for typing the columns.  Just personal preference really.

I am not sure I understood this part. If I need to refactor, having the 
partition id in the key would be a bad thing? What would be the alternative? In 
my case, as I use userId : partitionId as row key, this might be a problem, 
right?

PlayOrm indexes the columns you choose (i.e. the ones you want to use in the 
where clause) and partitions by columns you choose, not based on the key, so 
in PlayOrm the key is typically a TimeUUID or something cluster unique, and 
any tables referencing that TimeUUID never have to change.  With Cassandra 
partitioning, if you repartition that table a different way or go for some 
kind of major change (usually done with map/reduce), all your foreign keys 
"may" have to change. It really depends on the situation though.  Maybe you 
get the design right and never have to change.
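
To make the point about keys concrete, here is a small sketch in which the row 
key is generated independently and the same row is registered in two partition 
views. It is a conceptual illustration only, not PlayOrm's storage layout: the 
class name, the "byUser:"/"byMonth:" naming and the month bucket are 
assumptions, and UUID.randomUUID() merely stands in for a TimeUUID.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.UUID;

// Hypothetical illustration; not PlayOrm's storage layout.
public class VirtualPartitionSketch {
    // partition name (e.g. "byUser:alice", "byMonth:2012-09") -> row keys in it
    private final Map<String, Set<UUID>> partitions = new HashMap<>();

    // The row key is generated independently of any partition id, so rows
    // that reference it need no change if the partitioning is redesigned.
    public UUID save(String userId, String month) {
        UUID rowKey = UUID.randomUUID(); // stand-in for a TimeUUID
        add("byUser:" + userId, rowKey);
        add("byMonth:" + month, rowKey);
        return rowKey;
    }

    public Set<UUID> rowsIn(String partition) {
        return partitions.getOrDefault(partition, Collections.<UUID>emptySet());
    }

    private void add(String partition, UUID rowKey) {
        partitions.computeIfAbsent(partition, p -> new HashSet<>()).add(rowKey);
    }
}
```

Repartitioning in this sketch only means rebuilding the partitions map; the 
row keys, and anything holding them as foreign keys, stay the same.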

@NoSqlQuery(name="findWithJoinQuery", query="PARTITIONS t(:partId) SELECT t FROM TABLE as t "+
"INNER JOIN t.activityTypeInfo as i WHERE i.type = :type and t.numShares < :shares"),

What would happen behind the scenes when I execute this query?

In this case, t or TABLE is a partitioned table since a partition is defined.  
And t.activityTypeInfo refers to the ActivityTypeInfo table which is not 
partitioned(AND ActivityTypeInfo won't scale to billions of rows because there 
is no partitioning but maybe you don't need it!!!).  Behind the scenes when you 
call getResult, it returns a cursor that has NOT done anything yet.  When you 
start looping through the cursor, behind the scenes it is batching requests 
asking for next 500 matches(configurable) so you never run out of memory….it is 
EXACTLY like a database cursor.  You can even use the cursor to show a user the 
first set of results and when user clicks next pick up right where the cursor 
left off (if you saved it to the HttpSession).

You can only use joins with partition keys, right?

Nope, joins work on anything.  You only need to specify the partitionId when 
you have a partitioned table in the list of join tables. That is what the 
PARTITIONS clause is for: to identify partitionId = what? It was put BEFORE 
the SQL instead of within it. CQL took the opposite approach, but PlayOrm can 
also join different partitions together ;)

In this case, is partId the row id of TABLE CF?

Nope, partId is one of the columns.  There is a test case on this class in 
PlayOrm …(notice the annotation NoSqlPartitionByThisField on the column/field 
in the entity)…

https://github.com/deanhiller/playorm/blob/master/input/javasrc/com/alvazan/test/db/PartitionedSingleTrade.java

PlayOrm allows partitioned tables AND non-partitioned tables (non-partitioned 
tables won't scale but maybe you will never have that many rows).  You can 
join any combination of the two (non-partitioned with partitioned, 
non-partitioned with non-partitioned, partition with another partition).

I only prefer Stack Overflow as I like referencing links/questions with their 
URLs.  Referencing this email later on is very hard as I have to find it, so 
in general I HATE email lists ;) but it seems Cassandra prefers them. You can 
put any questions on PlayOrm there; I am not sure how many on this list are 
interested, so it creates less noise on this list too.

Later,
Dean


From: Marcelo Elias Del Valle <mvall...@gmail.com>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Monday, September 24, 2012 11:07 AM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: Re: Correct model



2012/9/24 Hiller, Dean <dean.hil...@nrel.gov>
I am confused.  In this email you say you want "get all requests for a user" 
and in a previous one you said "Select all the users which has new requests, 
since date D" so let me answer both…

I have both needs. These are the two queries I need to perform on the model.

For latter, you make ONE query into the latest partition(ONE partition) of the 
GlobalRequestsCF which gives you the most recent requests ALONG with the user 
ids of those requests.  If you queried all partitions, you would most likely 
blow out your JVM memory.

For the former, you make ONE query to the UserRequestsCF with userid = <your 
user id> to get all the requests for that user
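
The two lookups can be sketched with in-memory maps standing in for the two 
column families. This is an illustration under assumptions: the class name, 
the per-day time bucket used as the partition key, and the choice to store 
only user ids in the global structure are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory stand-in for the two column families.
public class RequestModelSketch {
    // userId -> that user's requests (stands in for UserRequestsCF)
    private final Map<String, List<String>> userRequests = new TreeMap<>();
    // day bucket such as "2012-09-24" -> user ids with requests that day
    // (stands in for GlobalRequestsCF; ISO dates sort chronologically)
    private final TreeMap<String, List<String>> globalRequests = new TreeMap<>();

    // Each write goes to both structures.
    public void addRequest(String userId, String day, String request) {
        userRequests.computeIfAbsent(userId, k -> new ArrayList<>()).add(request);
        globalRequests.computeIfAbsent(day, k -> new ArrayList<>()).add(userId);
    }

    // "Get all requests for a user": one lookup by user id.
    public List<String> requestsForUser(String userId) {
        return userRequests.getOrDefault(userId, Collections.<String>emptyList());
    }

    // "Users with new requests": read only the newest bucket, never all of them.
    public List<String> usersInLatestPartition() {
        Map.Entry<String, List<String>> latest = globalRequests.lastEntry();
        return latest == null ? Collections.<String>emptyList() : latest.getValue();
    }
}
```

Both reads touch a single entry, which mirrors the "ONE query" in each case 
above.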

Now I think I got the main idea! This answered a lot!

Sorry, I was skipping some context.  A lot of the backing indexing is 
sometimes done as a long row, so in playOrm too many rows in a partition means 
too many columns in the indexing row for that partition.  I believe the same 
is true in Cassandra for its indexing.

Oh, ok, you were talking about the wide row pattern, right? But playORM is 
compatible with Aaron's model, isn't it? Can I map exactly this using playORM? 
The hardest thing for me to use playORM now is I don't know Cassandra well yet, 
and I know playORM even less. Can I ask playOrm questions in this list? I will 
try to create a POC here!
Only now I am starting to understand what it does ;-) The examples directory is 
empty for now, I would like to see how to set up the connection with it.

Cassandra spreads all your data out on all nodes with or without partitions.  A 
single partition does have its data co-located though.

Now I see. The main advantage of using partitions is keeping the indexes small 
enough. It has nothing to do with the nodes. Thanks!

If you are at 100k(and the requests are rather small), you could embed all the 
requests in the user or go with Aaron's below suggestion of a UserRequestsCF.  
If your requests are rather large, you probably don't want to embed them in the 
User.  Either way, it's one query or one row key lookup.

I see it now.

Multiget ignores partitions…you feed it a LIST of keys and it gets them.  It 
just so happens that partitionId had to be part of your row key.

Do you mean I need to load all the keys in memory to do a multiget?

I have used Hector and now use Astyanax, I don't worry much about that layer, 
but I feed astyanax 3 nodes and I believe it discovers some of the other ones.  
I believe the latter is true but am not 100% sure as I have not looked at that 
code.

Why did you move? Hector is being considered for being the "official" client 
for Cassandra, isn't it? I looked at the Astyanax api and it seemed much more 
high level though

As an analogy on the above, if you happen to have used PlayOrm, you would ONLY 
need one Requests table and you partition by user AND time (two views into the 
same data partitioned two different ways) and you can do exactly the same thing 
as Aaron's example.  PlayOrm doesn't embed the partition ids in the key, 
leaving it free to partition twice like in your case. With the partition id in 
the key, a refactor means you have to map/reduce A LOT more rows, because 
other rows hold the FK <partitionid><subrowkey>, whereas if you don't have the 
partition id in the key, you only map/reduce the partitioned table in a 
redesign/refactor.  That said, we will be adding support for CQL partitioning 
in addition to PlayOrm partitioning even though it can be a little less 
flexible sometimes.

I am not sure I understood this part. If I need to refactor, having the 
partition id in the key would be a bad thing? What would be the alternative? In 
my case, as I use userId : partitionId as row key, this might be a problem, 
right?

Also, CQL locates all the data on one node for a partition.  We have found it 
can be faster "sometimes" with the parallelized disks when the partitions are 
NOT all on one node, so PlayOrm partitions are virtual only and do not relate 
to where the rows are stored.  An example on our 6 nodes: a join query on a 
partition with 1,000,000 rows took 60ms (of course I can't compare to CQL here 
since it doesn't do joins).  It really depends on how much data is going to 
come back in the query though.  There are tradeoffs between parallel disks 
across nodes and having your data all on one node, of course.

I guess I am still not ready for this level of info. :D
In the playORM readme, we have the following:


@NoSqlQuery(name="findWithJoinQuery", query="PARTITIONS t(:partId) SELECT t FROM TABLE as t "+
"INNER JOIN t.activityTypeInfo as i WHERE i.type = :type and t.numShares < :shares"),

What would happen behind the scenes when I execute this query? You can only use 
joins with partition keys, right?
In this case, is partId the row id of TABLE CF?


Thanks a lot for the answers

--
Marcelo Elias Del Valle
http://mvalle.com - @mvallebr



