Don't forget SSDs for indexing joy and a reasonable amount of CPU, or those 
indexes will fall far behind.
If you size the hardware correctly and avoid very silly configuration, it works 
really well for this sort of purpose, especially when combined with Spark to do 
any hardcore analysis on the filtered dataset.
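Roughly, the Spark side could look like the sketch below. This assumes the 
spark-cassandra-connector is on the Spark classpath; the keyspace/table names, 
contact-point IP, and column names are all made up for illustration:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Hypothetical contact point in the analytics/Search DC.
spark = (SparkSession.builder
         .appName("adhoc-report")
         .config("spark.cassandra.connection.host", "10.10.0.1")
         .getOrCreate())

# Read the denormalized report table straight out of Cassandra.
df = (spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(keyspace="reports", table="orders_denorm")
      .load())

# Filter, project, group, aggregate -- the report shape discussed
# further down the thread.
(df.where((F.col("region") == "EMEA") & (F.col("status") != "CANCELLED"))
   .groupBy("customer_id")
   .agg(F.sum("total").alias("spend"))
   .show())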

- Ryan Svihla

On Sat, Oct 17, 2015 at 7:12 PM -0700, "Jack Krupansky" 
<jack.krupan...@gmail.com> wrote:

Yes, you can have all your normal data centers with DSE configured for 
real-time data access and then have a data center that shares the same data but 
has DSE Search (Solr indexing) enabled. Your Cassandra data will get replicated 
to the Search data center and indexed there, and only there. You do need more 
RAM on the DSE Search nodes for the indexing, and maybe more nodes as well, to 
ensure decent latency for complex queries.
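A rough sketch of that setup, using the DataStax Python driver. The DC names 
("DC1", "DC2" for OLTP, "Search" for the DSE Search DC) and replication factors 
are made up, and the query at the end assumes a Solr core has already been 
generated for the table:

from cassandra.cluster import Cluster

session = Cluster(["10.10.0.1"]).connect()

# Replicate the keyspace into the Search DC alongside the OLTP DCs;
# the Solr indexes get built only on the nodes running DSE Search.
session.execute("""
    ALTER KEYSPACE reports
    WITH replication = {
      'class': 'NetworkTopologyStrategy',
      'DC1': 3, 'DC2': 3, 'Search': 2
    }
""")

# DSE Search then exposes ad-hoc filtering via the solr_query column.
rows = session.execute("""
    SELECT customer_id, total FROM reports.orders_denorm
    WHERE solr_query = 'status:OPEN AND region:EMEA'
""")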
-- Jack Krupansky

On Sat, Oct 17, 2015 at 3:54 PM, Mark Lewis <m...@lewisworld.org> wrote:


I hadn't considered it because I didn't think it could be configured just for a 
single data center; can it?
On Oct 17, 2015 8:50 AM, "Jack Krupansky" <jack.krupan...@gmail.com> wrote:
Did you consider DSE Search in a DC?
-- Jack Krupansky

On Sat, Oct 17, 2015 at 11:30 AM, Mark Lewis <m...@lewisworld.org> wrote:
I've got an existing C* cluster spread across three data centers, and I'm 
wrestling with how to add some support for ad-hoc user reporting against 
(ideally) near real-time data.  
The type of reports I want to support basically boils down to allowing the user 
to select a single highly-denormalized "Table" from a predefined list, pick 
some filters (ideally with arbitrary Boolean logic), project out some columns, 
and apply some simple grouping and aggregation.  I've seen several 
companies expose reporting this way, and it seems like a good way to avoid the 
complexity of joins while still providing a good deal of flexibility.
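To make that concrete, one hypothetical shape for such a predefined table 
(all names made up, and it assumes the "reports" keyspace already exists):

from cassandra.cluster import Cluster

session = Cluster(["10.10.0.1"]).connect()

# One possible shape for a predefined, highly-denormalized report table;
# every column a report might filter or group on lives in the one row.
session.execute("""
    CREATE TABLE IF NOT EXISTS reports.orders_denorm (
      order_id     uuid PRIMARY KEY,
      customer_id  uuid,
      region       text,      -- denormalized from the customer record
      status       text,
      total        decimal,
      created_at   timestamp
    )
""")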
Has anybody done this or have any recommendations?
My current thinking is that I'd like to have the ad-hoc reporting 
infrastructure in separate data centers from our active production OLTP-type 
stuff, both to isolate its load from the OLTP infrastructure and because I'll 
likely need other stuff there (Spark?) to support ad-hoc reporting.
So I basically have two problems:
(1) Get an eventually-consistent view of the data into a data center I can 
query against relatively quickly (so no big batch imports).
(2) Be able to run ad-hoc user queries against it.
If I just think about query flexibility, I might consider dumping data into 
PostgreSQL nodes (practical because the data that any individual user can query 
will fit onto a single node).  But then I have the problem of getting the data 
there; I looked into an architecture using Kafka to pump data from the OLTP 
data centers to PostgreSQL mirrors, but down that road lies the need to 
manually deal with the eventual consistency.  Ugh.
If I just run C* nodes in my reporting data centers, that makes the problem of 
getting the data into the right place with eventual consistency easy to solve, 
and I like that idea quite a lot, but then I need to run reporting against C*.  
I could make the queries I need to run reasonably performant with enough 
secondary indexes or materialized views (we're upgrading to 3.0 soon), but I 
would need a lot of them, and I'd rather not pay to store them in all of my 
data centers.  I wish there were a way to define secondary indexes or 
materialized views that exist in only one DC of a cluster, but unless I've 
missed something it doesn't look possible.
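For reference, the 3.0-style view I have in mind would be something like the 
following (hypothetical names, keyed off the earlier table sketch), and as far 
as I can tell it gets built in every DC that replicates the base keyspace:

from cassandra.cluster import Cluster

session = Cluster(["10.10.0.1"]).connect()

# A view keyed by a filter column; Cassandra keeps it in sync with the
# base table, but builds and stores it in every replicating DC.
session.execute("""
    CREATE MATERIALIZED VIEW IF NOT EXISTS reports.orders_by_status AS
      SELECT * FROM reports.orders_denorm
      WHERE status IS NOT NULL AND order_id IS NOT NULL
      PRIMARY KEY (status, order_id)
""")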
Any advice or case studies in this area would be greatly appreciated.
-- Mark
