Re: Tokenization and SAI query syntax

2023-08-14 Thread Jon Haddad
I was thinking a subproject like I’d normally use with Gradle. Is there an advantage to moving it out completely? On 2023/08/13 18:34:38 Caleb Rackliffe wrote: > We’ve already started down the path of using a git sub-module for the Accord library. That could be an option at some point. > >

Re: Tokenization and SAI query syntax

2023-08-13 Thread Caleb Rackliffe
We’ve already started down the path of using a git sub-module for the Accord library. That could be an option at some point. > On Aug 13, 2023, at 12:53 PM, Jon Haddad wrote: > > Functions make sense to me too. In addition to the reasons listed, if we acknowledge that functions in

Re: Tokenization and SAI query syntax

2023-08-13 Thread Jon Haddad
Functions make sense to me too. In addition to the reasons listed, if we acknowledge that functions in predicates are inevitable, then it makes total sense to use them here. I think this is the most forward thinking approach. Assuming this happens, one thing that would be great down the

Re: Tokenization and SAI query syntax

2023-08-07 Thread Benedict
Yep, this sounds like the potentially least bad approach for now. Sorry Caleb, I jumped in without properly reading the thread and assumed we were proposing changes to CQL. If it’s clear we’re dropping into a sub-language and providing a sub-query to it that’s SAI-specific, that gives us pretty

Re: Tokenization and SAI query syntax

2023-08-07 Thread Josh McKenzie
Been chatting a bit w/Caleb about this offline and poking around to better educate myself. > using functions (ignoring the implementation complexity) at least removes ambiguity. This, plus using functions lets us kick the can down the road a bit in terms of landing on an integrated grammar

Re: Tokenization and SAI query syntax

2023-08-07 Thread Caleb Rackliffe
> I do not think we should start using lucene syntax for it, it will make people think they can do everything else lucene allows. I'm sure we won't be supporting everything Lucene allows, but this is going to evolve. Right off the bat, if you introduce support for tokenization and filtering,

Re: Tokenization and SAI query syntax

2023-08-07 Thread Atri Sharma
Why not start with SQLish operators supported by many databases (LIKE and CONTAINS)? On Mon, Aug 7, 2023 at 10:01 PM J. D. Jordan wrote: > I am also -1 on directly exposing lucene like syntax here. Besides being ugly, SAI is not lucene, I do not think we should start using lucene syntax for

Re: Tokenization and SAI query syntax

2023-08-07 Thread J. D. Jordan
I am also -1 on directly exposing lucene like syntax here. Besides being ugly, SAI is not lucene, I do not think we should start using lucene syntax for it, it will make people think they can do everything else lucene allows. On Aug 7, 2023, at 5:13 AM, Benedict wrote: I’m strongly opposed to : It

Re: Tokenization and SAI query syntax

2023-08-07 Thread Caleb Rackliffe
@Benedict I'm not particularly keen to try to graft the Lucene syntax into CQL itself, to be clear. What I'm proposing is more along the lines of allowing that syntax via "expr" and leaving what Lucene systems would call "filters" in predicates currently expressible by CQL. On Mon, Aug 7, 2023 at

Re: Tokenization and SAI query syntax

2023-08-07 Thread Benedict
I’m strongly opposed to : It is very dissimilar to our current operators. CQL is already not the prettiest language, but let’s not make it a total mishmash. On 7 Aug 2023, at 10:59, Mike Adamson wrote: I am also in agreement with 'column : token' in that 'I don't hate it' but I'd like to offer an

Re: Tokenization and SAI query syntax

2023-08-07 Thread Mike Adamson
I am also in agreement with 'column : token' in that 'I don't hate it' but I'd like to offer an alternative to this in 'column HAS token'. HAS is currently not a keyword that we use so wouldn't cause any brain conflicts. While I don't hate ':' I have a particular dislike of the lucene search

Re: Tokenization and SAI query syntax

2023-08-03 Thread Jon Haddad
Assuming SAI is a superset of SASI, and we were to set up something so that SASI indexes auto convert to SAI, this gives even more weight to my point regarding how differing behavior for the same syntax can lead to issues. Imo the best case scenario results in the user not even noticing their

Re: Tokenization and SAI query syntax

2023-08-03 Thread Jon Haddad
Yes, I understand that. What I'm trying to point out is the potential confusion with having the same syntax behave differently for different index types. I'm not holding this view strongly, I'd just like folks to consider the impact to the end user, who in my experience is great with foot

Re: Tokenization and SAI query syntax

2023-08-02 Thread Caleb Rackliffe
For what it's worth, I'd very much like to completely remove SASI from the codebase for 6.0. The only remaining functionality gaps at the moment are LIKE (prefix/suffix) queries and its limited tokenization capabilities, both of which already have SAI Phase 2 Jiras. On Wed, Aug 2, 2023 at 7:20 PM

Re: Tokenization and SAI query syntax

2023-08-02 Thread Jeremiah Jordan
SASI just uses “=” for the tokenized equality matching, which is the exact thing this discussion is about changing/not liking. > On Aug 2, 2023, at 7:18 PM, J. D. Jordan wrote: > > I do not think LIKE actually applies here. LIKE is used for prefix, contains, or suffix searches in SASI

Re: Tokenization and SAI query syntax

2023-08-02 Thread J. D. Jordan
I do not think LIKE actually applies here. LIKE is used for prefix, contains, or suffix searches in SASI depending on the index type. This is about exact matching of tokens. > On Aug 2, 2023, at 5:53 PM, Jon Haddad wrote: > > Certain bits of functionality also already exist on the SASI side

Re: Tokenization and SAI query syntax

2023-08-02 Thread Jon Haddad
Certain bits of functionality also already exist on the SASI side of things, but I'm not sure how much overlap there is. Currently, there's a LIKE keyword that handles token matching, although it seems to have some differences from the feature set in SAI. That said, there seems to be enough

Re: Tokenization and SAI query syntax

2023-08-01 Thread Caleb Rackliffe
Here are some additional bits of prior art, if anyone finds them useful: The Stratio Lucene Index - https://github.com/Stratio/cassandra-lucene-index#examples Stratio was the reason C* added the "expr" functionality. They embedded something similar to ElasticSearch JSON, which probably isn't my

Re: Tokenization and SAI query syntax

2023-07-24 Thread Josh McKenzie
> `column CONTAINS term`. Contains is used by both Java and Python for substring searches, so at least some users will be surprised by term-based behavior. I wonder whether users are in their "programming language" headspace or in their "querying a database" headspace when interacting with

Re: Tokenization and SAI query syntax

2023-07-24 Thread Benedict
I have a strong preference not to use the name of an SQL operator, since it precludes us later providing the SQL standard operator to users. What about CONTAINS TOKEN term? Or CONTAINS TERM term? On 24 Jul 2023, at 13:34, Andrés de la Peña wrote: `column = term` is definitively problematic because

Re: Tokenization and SAI query syntax

2023-07-24 Thread Andrés de la Peña
`column = term` is definitively problematic because it creates an ambiguity when the queried column belongs to the primary key. For some queries we wouldn't know whether the user wants a primary key query using regular equality or an index query using the analyzer. `term_matches(column, term)`

Tokenization and SAI query syntax

2023-07-24 Thread Jonathan Ellis
Hi all, With phase 1 of SAI wrapping up, I’d like to start the ball rolling on aligning around phase 2 features. In particular, we need to nail down the syntax for doing non-exact string matches. We have a proof of concept that includes full Lucene analyzer and filter functionality – just the