Re: Get column family size

Ryan Svihla Fri, 12 Dec 2014 05:02:45 -0800

What version are you on? (I see the key estimate in cfstats on 1.2 and 2.0.)
What size is your heap (ideally 8GB; it can be lower, but that requires a lot
of tuning)? What kind of disk do you have (SANs are going to cause you
problems)? Assuming the answers to all of those are fine, you have the
following options for getting a count on a million+ rows:


1) Use a more recent version of Cassandra, which ships a better version of
cqlsh (anything from 2.1 on), and a large limit.
2) Use a recent native driver to query a recent version of Cassandra (2.0+)
so you get automatic paging support. For details, see "Automatic Paging"
here:
http://www.datastax.com/dev/blog/client-side-improvements-in-cassandra-2-0
3) Page through the token range by hand with smaller pages so you can get
an answer. For example, if you had a deterministic range for your partition
key, start with the first value and query only 1000 rows at a time, then take
the last value of the previous query and repeat. This is not for the faint
of heart and I do not recommend it.
4) Use an analytics engine like Spark to read your data (
https://github.com/datastax/spark-cassandra-connector/) and let it do the
hard stuff.
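
To make option 3 concrete, here is a minimal sketch of the paging loop in
Python. It is an illustration under assumptions, not a tested recipe: the
partition key column name `id` is hypothetical (the thread does not name it),
`count_partitions`, `fetch_page` and `make_fake_fetcher` are names invented
for this example, and the fake fetcher stands in for a real cluster so the
loop can be run anywhere. It counts distinct tokens, so two partitions whose
keys hash to the same token would be counted once.

```python
from typing import Callable, List, Optional

# Hypothetical CQL for one page; `id` stands in for the real partition key
# column of corpus.word_usage, which the thread does not name.
FIRST_PAGE_CQL = "SELECT token(id) FROM corpus.word_usage LIMIT %s"
NEXT_PAGE_CQL = (
    "SELECT token(id) FROM corpus.word_usage "
    "WHERE token(id) > %s LIMIT %s"
)

# A page fetcher takes (last_token_or_None, page_size) and returns the tokens
# of the next rows in token order. How it talks to Cassandra is up to you.
FetchPage = Callable[[Optional[int], int], List[int]]


def count_partitions(fetch_page: FetchPage, page_size: int = 1000) -> int:
    """Count partitions by walking the token ring one small page at a time."""
    total = 0
    last_token: Optional[int] = None  # None: start at the beginning of the ring
    while True:
        tokens = fetch_page(last_token, page_size)
        if not tokens:
            return total
        # Rows of one partition share a token, so count distinct tokens only.
        total += len(set(tokens))
        # Resume strictly after the last token seen on this page.
        last_token = tokens[-1]


def make_fake_fetcher(all_tokens: List[int]) -> FetchPage:
    """In-memory stand-in for the cluster, used only to exercise the loop."""
    ordered = sorted(all_tokens)

    def fetch_page(last_token: Optional[int], page_size: int) -> List[int]:
        rows = [t for t in ordered if last_token is None or t > last_token]
        return rows[:page_size]

    return fetch_page
```

Against a real cluster, `fetch_page` would be a thin wrapper that runs the
first query when `last_token` is None and the second one otherwise. That
wiring is left out here because it needs a live node to run.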



On Fri, Dec 12, 2014 at 12:22 AM, Chamila Wijayarathna <
cdwijayarat...@gmail.com> wrote:
>
> Hi Philip, Ryan,
>
> I checked the Cassandra system.log for any issues, but it showed no
> errors.
>
> I tried using cfstats and it gave me
> https://gist.github.com/cdwijayarathna/e6b4d3d7d8c272fcfd24. It doesn't
> seem to have any information like number of keys.
>
> I am running Cassandra on a single node and have 1 million+ rows.
>
> Thank You!
>
> On Fri, Dec 12, 2014 at 2:57 AM, Ryan Svihla <rsvi...@datastax.com> wrote:
>>
>> An estimated partition key count can be had from nodetool cfstats.
>> However, for analytics-style queries on large data sets (such as
>> verification of large data sets), I recommend Spark, Hive, Hadoop, and
>> even Solr for some use cases.
>>
>> On Thu, Dec 11, 2014 at 3:10 PM, Philip Thompson <
>> philip.thomp...@datastax.com> wrote:
>>>
>>> Chamila,
>>>
>>> You can find more detailed explanations in previous posts on this
>>> mailing list as to why, but a "Select count(*) from table;" query is
>>> inefficient in Cassandra for non-trivial datasets. You will need a better
>>> way to get the number of partition keys of a CF, which hopefully someone
>>> else in the user list can provide, as I have never needed to do that.
>>>
>>> On Thu, Dec 11, 2014 at 1:59 PM, Chamila Wijayarathna <
>>> cdwijayarat...@gmail.com> wrote:
>>>
>>>> Hi Philip,
>>>>
>>>> Yes, I'm using cqlsh. Is there any way I can solve this?
>>>>
>>>> Thank You!
>>>>
>>>> On Fri, Dec 12, 2014 at 12:26 AM, Philip Thompson <
>>>> philip.thomp...@datastax.com> wrote:
>>>>
>>>>> I assume the query you are sending is through cqlsh. You are actually
>>>>> getting a client-side timeout error, which is unclear in 2.1.2, but I
>>>>> believe the error message will be more helpful as of 2.1.3.
>>>>>
>>>>> On Thu, Dec 11, 2014 at 1:52 PM, Chamila Wijayarathna <
>>>>> cdwijayarat...@gmail.com> wrote:
>>>>>
>>>>>> Hello all,
>>>>>>
>>>>>> I am trying to get the number of key value pairs.
>>>>>>
>>>>>> I used the following query for this.
>>>>>>
>>>>>> select count(*) from corpus.word_usage ;
>>>>>>
>>>>>> This returns the number of key-value pairs when the CF is relatively
>>>>>> small. But when I insert more key-value pairs, I get an error saying
>>>>>> "errors={}, last_host=127.0.0.1".
>>>>>>
>>>>>> What is the reason for this? Is there any better way to get the size
>>>>>> (number of key value pairs) of a CF in CQL?
>>>>>>
>>>>>> Thank You!
>>>>>>
>>>>>> --
>>>>>> *Chamila Dilshan Wijayarathna,*
>>>>>> SMIEEE, SMIESL,
>>>>>> Undergraduate,
>>>>>> Department of Computer Science and Engineering,
>>>>>> University of Moratuwa.
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>> --
>>
>> DataStax <http://www.datastax.com/>
>>
>> Ryan Svihla
>>
>> Solution Architect
>>
>> <https://twitter.com/foundev>
>> <http://www.linkedin.com/pub/ryan-svihla/12/621/727/>
>>
>> DataStax is the fastest, most scalable distributed database technology,
>> delivering Apache Cassandra to the world’s most innovative enterprises.
>> Datastax is built to be agile, always-on, and predictably scalable to any
>> size. With more than 500 customers in 45 countries, DataStax is the
>> database technology and transactional backbone of choice for the worlds
>> most innovative companies such as Netflix, Adobe, Intuit, and eBay.
>>
>>
>
>


