Re: How to create an additional cluster in Cassandra exclusively for Analytics Purpose

Bhuvan Rawal Mon, 07 Mar 2016 10:26:54 -0800

Thanks for the correction, Jon. (At most 2000 queries *per cluster* for
serving 100 searches.)
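
As a quick check of the numbers in this exchange, here is a minimal Python
sketch of the fan-out arithmetic. It assumes the worst case discussed in the
thread, where every search is sent to every node exactly once; the function
name and its parameters are made up for illustration.

```python
def scatter_gather_load(nodes, searches_per_second):
    """Return (cluster-wide sub-requests per second, requests per node per second).

    Assumption: each search fans out to every node exactly once.
    """
    cluster_total = nodes * searches_per_second  # all sub-requests, summed over the cluster
    per_node = searches_per_second               # each node sees one sub-request per search
    return cluster_total, per_node


# The example from the thread: a 20 node cluster and 100 searches a second.
print(scatter_gather_load(20, 100))  # (2000, 100)
```

So the 2000 figure is a cluster-wide total, while each individual node serves
at most 100 requests per second.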


On Mon, Mar 7, 2016 at 11:47 PM, Jonathan Haddad <j...@jonhaddad.com> wrote:

> If you're doing 100 searches a second, each machine will be serving at most
> 100 requests per second, not 2000.
>
> On Mon, Mar 7, 2016 at 10:13 AM Bhuvan Rawal <bhu1ra...@gmail.com> wrote:
>
>> Well, that's certainly true. There are these points worth discussing here:
>>
>> 1. Scatter-gather queries, especially if the cluster size is large. Say
>> we have a 20-node cluster and we are searching 100 times a second. Then
>> effectively the coordinator would be hitting each node 2000 times
>> (20*100). That factor will only increase as the number of nodes goes
>> higher. I'm sure having a centralized index alleviates that problem.
>> 2. High Cardinality (For columns like email / phone number)
>> 3. Low Cardinality (Boolean column or any column with limited set of
>> available options).
>>
>> SASI seems to be a good solution for LIKE queries; this doc
>> <https://github.com/apache/cassandra/blob/trunk/doc/SASI.md> looks
>> really promising. But wouldn't it be better, from a design standpoint, to
>> tackle the search use cases separately from the data storage ones?
>>
>> On Sun, Mar 6, 2016 at 9:14 PM, Jack Krupansky <jack.krupan...@gmail.com>
>> wrote:
>>
>>> I don't have any direct personal experience with Stratio. It will all
>>> depend on your queries and your data cardinality - some queries are fine
>>> with secondary indexes while others are quite poor. Ditto for Lucene and
>>> Solr.
>>>
>>> It is also worth noting that the new SASI feature of Cassandra supports
>>> keyword and prefix/suffix search. But it doesn't support multi-column ad
>>> hoc queries, which is what people tend to use Lucene and Solr for. So,
>>> again, it all depends on your queries and your data cardinality.
>>>
>>> -- Jack Krupansky
>>>
>>> On Sun, Mar 6, 2016 at 1:29 AM, Bhuvan Rawal <bhu1ra...@gmail.com>
>>> wrote:
>>>
>>>> Yes Jack, we are rolling out with Stratio right now. We will assess the
>>>> performance benefit it yields and can go for ElasticSearch/Solr later.
>>>>
>>>> In your experience, how does Stratio perform vis-a-vis secondary
>>>> indexes?
>>>>
>>>> On Sun, Mar 6, 2016 at 11:15 AM, Jack Krupansky <
>>>> jack.krupan...@gmail.com> wrote:
>>>>
>>>>> You haven't been clear about how you intend to add Solr. You can also
>>>>> use Stratio or Stargate for basic Lucene search if you don't need full
>>>>> Solr support and want to stick to open source rather than go with DSE
>>>>> Search for Solr.
>>>>>
>>>>> -- Jack Krupansky
>>>>>
>>>>> On Sun, Mar 6, 2016 at 12:25 AM, Bhuvan Rawal <bhu1ra...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Thanks Sean and Nirmallaya.
>>>>>>
>>>>>> @Jack, we are going with DSC right now and plan to use Spark and
>>>>>> later Solr over the analytics DC. The use case is to keep OLAP and
>>>>>> OLTP workloads separated and not intertwine them, whether that is
>>>>>> achieved by creating a new DC or a new cluster altogether. From
>>>>>> Nirmallaya's and Sean's answers I understand that it's easily
>>>>>> achievable by creating a separate DC: the app client will need to be
>>>>>> made DC-aware, and it should not use a coordinator in dc3. The same
>>>>>> goes for the Spark configuration: it should read from the 3rd DC.
>>>>>> Correct me if I'm wrong.
>>>>>>
>>>>>> On Mar 4, 2016 7:55 PM, "Jack Krupansky" <jack.krupan...@gmail.com>
>>>>>> wrote:
>>>>>> >
>>>>>> > DataStax Enterprise (DSE) should be fine for three or even four
>>>>>> > data centers in the same cluster. Or are you talking about some
>>>>>> > custom Solr implementation?
>>>>>> >
>>>>>> > -- Jack Krupansky
>>>>>> >
>>>>>> >> On Fri, Mar 4, 2016 at 9:21 AM, <sean_r_dur...@homedepot.com> wrote:
>>>>>> >>
>>>>>> >> Sure. Just add a new DC. Alter your keyspaces with a new
>>>>>> >> replication factor for that DC. Run repairs on the new DC to get the
>>>>>> >> data streamed. Then make sure your clients only connect to the DC(s)
>>>>>> >> that they need.
>>>>>> >>
>>>>>> >> Separation of workloads is one of the key powers of a Cassandra
>>>>>> >> cluster.
>>>>>> >>
>>>>>> >> You may want to look at different configurations for the analytics
>>>>>> >> cluster: smaller replication factor, more memory per node, more disk
>>>>>> >> per node, perhaps fewer vnodes. Others may chime in with their
>>>>>> >> experience.
>>>>>> >>
>>>>>> >> Sean Durity
>>>>>> >>
>>>>>> >> From: Bhuvan Rawal [mailto:bhu1ra...@gmail.com]
>>>>>> >> Sent: Friday, March 04, 2016 3:27 AM
>>>>>> >> To: user@cassandra.apache.org
>>>>>> >> Subject: How to create an additional cluster in Cassandra
>>>>>> >> exclusively for Analytics Purpose
>>>>>> >>
>>>>>> >> Hi,
>>>>>> >>
>>>>>> >> We would like to create an additional C* data center for batch
>>>>>> >> processing using Spark on CFS. We would like to limit this DC
>>>>>> >> exclusively to Spark operations and would like the application
>>>>>> >> servers to continue fetching data from OLTP.
>>>>>> >>
>>>>>> >> Is there any way to configure this?
>>>>>> >>
>>>>>> >> Regards,
>>>>>> >>
>>>>>> >> Bhuvan
>>>>>> >>
>>>>>> >
>>>>>> >
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
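
Sean's first two steps in the quoted thread (add a DC, then alter the keyspace
replication for it) could look roughly like the following. This is only a
sketch: it builds the CQL text of an ALTER KEYSPACE statement and does not
talk to a cluster. It assumes the keyspace uses NetworkTopologyStrategy; the
helper name, the keyspace name, the dc1/dc2 names and all replication factors
are placeholders, not values from the thread.

```python
def alter_keyspace_cql(keyspace, dc_replication):
    """Build an ALTER KEYSPACE statement with per-DC replication factors.

    dc_replication maps data center name -> replication factor
    (hypothetical values; use the DC names your snitch reports).
    """
    options = ", ".join(f"'{dc}': {rf}" for dc, rf in dc_replication.items())
    return (
        f"ALTER KEYSPACE {keyspace} WITH replication = "
        f"{{'class': 'NetworkTopologyStrategy', {options}}};"
    )


# Two existing OLTP data centers plus a new analytics DC (dc3) with a smaller RF.
print(alter_keyspace_cql("app_data", {"dc1": 3, "dc2": 3, "dc3": 2}))
```

The resulting statement would be run once per keyspace that the analytics DC
needs, before streaming data to the new nodes as Sean describes.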
