Don't forget SSDs for indexing joy, and a reasonable amount of CPU, or those indexes will fall very far behind. If you size the hardware correctly and avoid very silly configuration, it works really well for this sort of purpose, especially when combined with Spark to do any hardcore analysis on the filtered dataset.
- Ryan Svihla

On Sat, Oct 17, 2015 at 7:12 PM -0700, "Jack Krupansky" <jack.krupan...@gmail.com> wrote:

Yes, you can have all your normal data centers with DSE configured for real-time data access, and then have a data center that shares the same data but has DSE Search (Solr indexing) enabled. Your Cassandra data will get replicated to the Search data center and then indexed there, and only there. You do need to have more RAM on the DSE Search nodes for the indexing, and maybe more nodes as well, to assure decent latency for complex queries.

-- Jack Krupansky

On Sat, Oct 17, 2015 at 3:54 PM, Mark Lewis <m...@lewisworld.org> wrote:

I hadn't considered it because I didn't think it could be configured for just a single data center; can it?

On Oct 17, 2015 8:50 AM, "Jack Krupansky" <jack.krupan...@gmail.com> wrote:

Did you consider DSE Search in a DC?

-- Jack Krupansky

On Sat, Oct 17, 2015 at 11:30 AM, Mark Lewis <m...@lewisworld.org> wrote:

I've got an existing C* cluster spread across three data centers, and I'm wrestling with how to add some support for ad-hoc user reporting against (ideally) near real-time data.

The type of report I want to support basically boils down to allowing the user to select a single highly-denormalized "table" from a predefined list, pick some filters (ideally with arbitrary boolean logic), project out some columns, and allow for some simple grouping and aggregation. I've seen several companies expose reporting this way, and it seems like a good way to avoid the complexity of joins while still providing a good deal of flexibility. Has anybody done this or have any recommendations?

My current thinking is that I'd like to have the ad-hoc reporting infrastructure in separate data centers from our active production OLTP-type stuff, both to isolate any load away from the OLTP infrastructure and because I'll likely need other stuff there (Spark?) to support ad-hoc reporting.
So I basically have two problems:

(1) Get an eventually-consistent view of the data into a data center I can query against relatively quickly (so no big batch imports)
(2) Be able to run ad-hoc user queries against it

If I just think about query flexibility, I might consider dumping data into PostgreSQL nodes (practical because the data that any individual user can query will fit onto a single node). But then I have the problem of getting the data there; I looked into an architecture using Kafka to pump data from the OLTP data centers to PostgreSQL mirrors, but down that road lies the need to manually deal with the eventual consistency. Ugh.

If I just run C* nodes in my reporting cluster, that makes the problem of getting the data into the right place with eventual consistency easy to solve, and I like that idea quite a lot, but then I need to run reporting against C*. I could make the queries I need to run reasonably performant with enough secondary indexes or materialized views (we're upgrading to 3.0 soon), but I would need a lot of secondary indexes and materialized views, and I'd rather not pay to store them in all of my data centers. I wish there were a way to define secondary indexes or materialized views that exist in only one DC of a cluster, but unless I've missed something it doesn't look possible.

Any advice or case studies in this area would be greatly appreciated.

-- Mark
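[Editor's note] The dedicated-DC pattern Jack describes rests on per-data-center replication factors, which plain Cassandra already supports via NetworkTopologyStrategy. A minimal sketch, with made-up keyspace and DC names (your cluster's DC names come from the snitch configuration):

```sql
-- Hypothetical: three OLTP data centers plus one analytics/Search DC.
-- NetworkTopologyStrategy takes a replication factor per data center,
-- so the same data lands everywhere, but only the nodes in the
-- analytics DC need the extra RAM/CPU for indexing or Spark.
CREATE KEYSPACE reporting
  WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'dc_oltp_east': 3,
    'dc_oltp_west': 3,
    'dc_oltp_eu':   3,
    'dc_analytics': 2
  };
```

Clients for OLTP traffic would then use a DC-aware load-balancing policy pinned to their local OLTP DC, while reporting jobs connect only to `dc_analytics`, keeping the workloads isolated.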
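[Editor's note] For the Cassandra 3.0 materialized-view route Mark mentions, the view is part of the schema and, as he notes, is replicated to every DC in the keyspace; there is no per-DC option. A sketch of what one such view looks like, with a hypothetical schema:

```sql
-- Hypothetical denormalized base table: one row per order event.
CREATE TABLE reporting.orders (
  order_id    uuid,
  customer_id uuid,
  region      text,
  status      text,
  total       decimal,
  created_at  timestamp,
  PRIMARY KEY (order_id)
);

-- A materialized view keyed for a common filter (status). Cassandra keeps
-- it in sync with the base table, but stores it in every data center,
-- which is exactly the extra storage cost Mark is worried about.
CREATE MATERIALIZED VIEW reporting.orders_by_status AS
  SELECT order_id, customer_id, region, status, total, created_at
  FROM reporting.orders
  WHERE status IS NOT NULL AND order_id IS NOT NULL
  PRIMARY KEY (status, order_id);
```

One view per filterable column multiplies the write and storage cost across all DCs, which is why DSE Search (one Solr index per table, confined to the Search DC) or Spark over the raw table tends to scale better for truly ad-hoc filtering.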