Re: Large primary keys

2016-04-14 Thread Robert Wille
That would be a nice solution, but 3.4 is way too bleeding edge. I’ll just go 
with the digest for now. Thanks for pointing it out. I’ll have to consider a 
migration in the future when production is on 3.x.

On Apr 11, 2016, at 10:19 PM, Jack Krupansky 
<jack.krupan...@gmail.com> wrote:

Check out the text indexing support in the new SASI feature in Cassandra 3.4.
You could write a custom tokenizer to extract entities and then be able to
query for documents that contain those entities.

That said, using a SHA digest as the primary key has merit for direct
access to the document given the document text.

-- Jack Krupansky






Re: Large primary keys

2016-04-11 Thread Jack Krupansky
Check out the text indexing support in the new SASI feature in Cassandra
3.4. You could write a custom tokenizer to extract entities and then be
able to query for documents that contain those entities.

That said, using a SHA digest as the primary key has merit for direct
access to the document given the document text.

-- Jack Krupansky

On Mon, Apr 11, 2016 at 7:12 PM, James Carman <ja...@carmanconsulting.com>
wrote:

> S3 maybe?


Re: Large primary keys

2016-04-11 Thread James Carman
S3 maybe?
On Mon, Apr 11, 2016 at 7:05 PM Robert Wille <rwi...@fold3.com> wrote:

> I do realize it's kind of a weird use case, but it is legitimate. I have a
> collection of documents that I need to index, and I want to perform entity
> extraction on them and give the extracted entities special treatment in my
> full-text index. Because entity extraction costs money, and each document
> will end up being indexed multiple times, I want to cache them in
> Cassandra. The document text is the obvious key to retrieve entities from
> the cache. If I use the document ID, then I have to track timestamps. I
> know that sounds like a simple workaround, but I’m presenting a
> much-simplified view of my actual data model.
>
> The reason for needing the text in the table, and not just a digest, is
> that sometimes entity extraction has to be deferred due to license
> limitations. In those cases, the entity extraction occurs on a background
> process, and the entities will be included in the index the next time the
> document is indexed.
>
> I will use a digest as the key. I suspected that would be the answer, but
> it's good to get confirmation.
>
> Robert


Re: Large primary keys

2016-04-11 Thread Robert Wille
I do realize it's kind of a weird use case, but it is legitimate. I have a 
collection of documents that I need to index, and I want to perform entity 
extraction on them and give the extracted entities special treatment in my 
full-text index. Because entity extraction costs money, and each document will 
end up being indexed multiple times, I want to cache them in Cassandra. The 
document text is the obvious key to retrieve entities from the cache. If I use 
the document ID, then I have to track timestamps. I know that sounds like a 
simple workaround, but I’m presenting a much-simplified view of my actual data 
model.

The reason for needing the text in the table, and not just a digest, is that 
sometimes entity extraction has to be deferred due to license limitations. In 
those cases, the entity extraction occurs on a background process, and the 
entities will be included in the index the next time the document is indexed.

I will use a digest as the key. I suspected that would be the answer, but it's 
good to get confirmation.

Robert

On Apr 11, 2016, at 4:36 PM, Jan Kesten <j.kes...@enercast.de> wrote:

> Hi Robert,
> 
> Why do you need the actual text as a key? It sounds a bit unnatural, at least
> to me. Keep in mind that you cannot do "like" queries on keys in Cassandra.
> For performance, and to keep things more readable, I would prefer hashing your
> text and using the hash as the key.
> 
> You should also consider storing the keys (hashes) in a separate table per
> day / hour or something like that, so you can quickly get all keys for a time
> range. A query without the partition key may be very slow.
> 
> Jan
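
Note: the caching scheme described in this message (entities cached under a digest of the text, with the text kept alongside so that a background process can run any deferred extraction) could look roughly like the Python sketch below. It is only an illustration under stated assumptions: the dictionary stands in for the Cassandra table, SHA-256 is an assumed choice of digest, and EntityCache, fake_extract and the row layout are hypothetical names that do not come from the thread.

```python
import hashlib
from typing import Callable, Dict, List, Optional


class EntityCache:
    """In-memory stand-in for the Cassandra table, keyed by digest."""

    def __init__(self) -> None:
        # digest -> [document text, entities or None while extraction is pending]
        self._rows: Dict[bytes, list] = {}

    @staticmethod
    def _key(text: str) -> bytes:
        # Assumption: SHA-256; the thread only says "a digest".
        return hashlib.sha256(text.encode("utf-8")).digest()

    def lookup(self, text: str) -> Optional[List[str]]:
        """Return cached entities; register the text as pending on a miss."""
        row = self._rows.setdefault(self._key(text), [text, None])
        return row[1]

    def store(self, text: str, entities: List[str]) -> None:
        """Record entities that were extracted immediately."""
        self._rows[self._key(text)] = [text, entities]

    def process_pending(self, extract: Callable[[str], List[str]]) -> int:
        """Background pass: scan every row and extract where entities are missing."""
        done = 0
        for row in self._rows.values():
            if row[1] is None:
                row[1] = extract(row[0])
                done += 1
        return done


def fake_extract(text: str) -> List[str]:
    # Placeholder for the paid extraction service: returns capitalized words.
    return [w for w in text.split() if w[:1].isupper()]


cache = EntityCache()
doc = "Robert stores documents in Cassandra"
print(cache.lookup(doc))                    # miss: the text is stored as pending
print(cache.process_pending(fake_extract))  # the deferred extraction runs
print(cache.lookup(doc))                    # entities are now cached
```

Storing the text in the row is what lets the background pass work without the caller: the digest alone cannot be turned back into the document.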



Re: Large primary keys

2016-04-11 Thread Jan Kesten

Hi Robert,

Why do you need the actual text as a key? It sounds a bit unnatural, at 
least to me. Keep in mind that you cannot do "like" queries on keys in 
Cassandra. For performance, and to keep things more readable, I would 
prefer hashing your text and using the hash as the key.


You should also consider storing the keys (hashes) in a separate table 
per day / hour or something like that, so you can quickly get all keys 
for a time range. A query without the partition key may be very slow.


Jan

On 11.04.2016 at 23:43, Robert Wille wrote:

I have a need to be able to use the text of a document as the primary key in a 
table. These texts are usually less than 1 KB, but can sometimes be tens of KB 
in size. Would it be better to use a digest of the text as the key? I have a 
background process that will occasionally need to do a full table scan and 
retrieve all of the texts, so using the digest doesn’t eliminate the need to 
store the text. Anyway, is it better to keep primary keys small, or is C* okay 
with large primary keys?

Robert
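
Note: the suggestion above of keeping the keys (hashes) in a separate table per day or hour might be sketched as follows in Python. This is a rough illustration, not a definitive schema: the hourly bucket format, the SHA-256 digest, the in-memory dictionary standing in for the table, and the names keys_by_hour, hour_bucket, record_key and keys_in_hour are all assumptions made for the example.

```python
import hashlib
from collections import defaultdict
from datetime import datetime, timezone

# Hypothetical CQL for the lookup table; the table and column names are
# illustrative only and do not come from the thread:
#   CREATE TABLE keys_by_hour (
#       bucket text, digest blob, PRIMARY KEY (bucket, digest));


def hour_bucket(ts: datetime) -> str:
    """Partition key for the lookup table: one partition per UTC hour.

    Expects a timezone-aware datetime.
    """
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%dT%H")


# In-memory stand-in for the lookup table: bucket -> set of digests.
keys_by_hour = defaultdict(set)


def record_key(text: str, ts: datetime) -> bytes:
    """Hash the text and file the digest under the hour it was written."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    keys_by_hour[hour_bucket(ts)].add(digest)
    return digest


def keys_in_hour(ts: datetime) -> set:
    """All digests written during the hour that contains ts."""
    return keys_by_hour.get(hour_bucket(ts), set())


t1 = datetime(2016, 4, 11, 21, 43, tzinfo=timezone.utc)
t2 = datetime(2016, 4, 11, 22, 5, tzinfo=timezone.utc)
record_key("first document", t1)
record_key("second document", t2)
print(hour_bucket(t1), len(keys_in_hour(t1)))
```

Because the bucket is the partition key, fetching all keys for a time range becomes a handful of single-partition reads instead of a scan without the partition key.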





Re: Large primary keys

2016-04-11 Thread James Carman
Why does the text need to be the key?

On Mon, Apr 11, 2016 at 6:04 PM Robert Wille <rwi...@fold3.com> wrote:

> I have a need to be able to use the text of a document as the primary key
> in a table. These texts are usually less than 1 KB, but can sometimes be tens
> of KB in size. Would it be better to use a digest of the text as the key?
> I have a background process that will occasionally need to do a full table
> scan and retrieve all of the texts, so using the digest doesn’t eliminate
> the need to store the text. Anyway, is it better to keep primary keys
> small, or is C* okay with large primary keys?
>
> Robert
>
>


Re: Large primary keys

2016-04-11 Thread Bryan Cheng
While large primary keys (within reason) should work, IMO anytime you're
doing equality testing you are really better off minimizing the size of the
key. Huge primary keys will also have very negative impacts on your key
cache. I would err on the side of the digest, but I've never had a need for
large keys so perhaps someone who has used them before would have a
different perspective.

On Mon, Apr 11, 2016 at 2:43 PM, Robert Wille <rwi...@fold3.com> wrote:

> I have a need to be able to use the text of a document as the primary key
> in a table. These texts are usually less than 1 KB, but can sometimes be tens
> of KB in size. Would it be better to use a digest of the text as the key?
> I have a background process that will occasionally need to do a full table
> scan and retrieve all of the texts, so using the digest doesn’t eliminate
> the need to store the text. Anyway, is it better to keep primary keys
> small, or is C* okay with large primary keys?
>
> Robert
>
>
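
Note: to make the size argument concrete, here is a minimal Python sketch of deriving a fixed-size key from the document text. SHA-256 is an assumption (the thread only says "a digest"), and document_key is a hypothetical helper name.

```python
import hashlib


def document_key(text: str) -> bytes:
    """Return a fixed-size key for a document of any length.

    SHA-256 is an assumption; the thread only says "a digest".
    """
    # Encode explicitly so the same text always yields the same bytes.
    return hashlib.sha256(text.encode("utf-8")).digest()


short_doc = "A short document."
long_doc = "A much longer document. " * 2000  # roughly 48 KB of text

# Both keys are 32 bytes regardless of the size of the source text.
print(len(document_key(short_doc)), len(document_key(long_doc)))
```

The key stays at 32 bytes whether the text is under 1 KB or tens of KB, which keeps equality tests and key-cache entries small. The text itself still has to be stored as a regular column if it needs to be read back.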


Large primary keys

2016-04-11 Thread Robert Wille
I have a need to be able to use the text of a document as the primary key in a 
table. These texts are usually less than 1 KB, but can sometimes be tens of KB 
in size. Would it be better to use a digest of the text as the key? I have a 
background process that will occasionally need to do a full table scan and 
retrieve all of the texts, so using the digest doesn’t eliminate the need to 
store the text. Anyway, is it better to keep primary keys small, or is C* okay 
with large primary keys?

Robert