Re: Different ideas for querying unique and non-unique records

2017-08-30 Thread Rick Leir
Susheel, Just a guess, but carrot2.org might be useful. But it might be 
overkill. Cheers -- Rick

On August 30, 2017 7:40:08 AM MDT, Susheel Kumar  wrote:
>Hello,
>
>I am looking for different ideas/suggestions to solve the use case I am
>working on.
>
>We have a couple of fields in the schema along with id: business_email and
>personal_email. We need to return all records based on unique business and
>personal emails.
>
>The criterion for a unique record is that neither its business nor its
>personal email is repeated in any other record.
>The criterion for non-unique records is that if any business or personal
>email repeats in other records, then all those records are non-unique.
>E.g., considering the documents below:
>- for unique records, only id=1 should be returned (since john.doe is
>not present in any other record's personal or business email)
>- for non-unique records, id=2,3 should be returned (since isabel.dora is
>present in multiple records; it doesn't matter whether it is present in the
>business or personal email)
>
>Documents
>===
>{id:1,business_email_s:john@abc.com,personal_email_s:john@abc.com}
>{id:2,business_email_s:isabel.d...@abc.com}
>{id:3,personal_email_s:isabel.d...@abc.com}
>
>I am able to solve this using a Streaming Expression query, but I am not sure
>whether performance will become a bottleneck, as the streaming expression is
>quite big. So I am looking for different ideas, like using de-dupe or handling
>this during ingestion/pre-processing, without impacting performance much.
>
>Thanks,
>Susheel

-- 
Sorry for being brief. Alternate email is rickleir at yahoo dot com 

Different ideas for querying unique and non-unique records

2017-08-30 Thread Susheel Kumar
Hello,

I am looking for different ideas/suggestions to solve the use case I am
working on.

We have a couple of fields in the schema along with id: business_email and
personal_email. We need to return all records based on unique business and
personal emails.

The criterion for a unique record is that neither its business nor its
personal email is repeated in any other record.
The criterion for non-unique records is that if any business or personal
email repeats in other records, then all those records are non-unique.
E.g., considering the documents below:
- for unique records, only id=1 should be returned (since john.doe is
not present in any other record's personal or business email)
- for non-unique records, id=2,3 should be returned (since isabel.dora is
present in multiple records; it doesn't matter whether it is present in the
business or personal email)

Documents
===
{id:1,business_email_s:john@abc.com,personal_email_s:john@abc.com}
{id:2,business_email_s:isabel.d...@abc.com}
{id:3,personal_email_s:isabel.d...@abc.com}

I am able to solve this using a Streaming Expression query, but I am not sure
whether performance will become a bottleneck, as the streaming expression is
quite big. So I am looking for different ideas, like using de-dupe or handling
this during ingestion/pre-processing, without impacting performance much.
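
To make the pre-processing idea concrete, here is a rough sketch of how the classification could be done outside Solr before or during ingestion. It is only an illustration, not what I am running: the function name split_unique is made up, the second email address is a hypothetical stand-in for the one in the sample documents, and lowercasing the emails before comparing them is an assumption on my part.

```python
from collections import defaultdict

# Field names taken from the sample documents above.
EMAIL_FIELDS = ("business_email_s", "personal_email_s")


def split_unique(docs):
    """Return (unique_ids, non_unique_ids) for a list of dict documents.

    A record is non-unique if any of its emails also appears, in either
    field, on a different record. Hypothetical helper, for illustration only.
    """
    # Map each email to the set of record ids that mention it, so an email
    # repeated within one record (as in id=1) is counted only once.
    ids_by_email = defaultdict(set)
    for doc in docs:
        for field in EMAIL_FIELDS:
            email = doc.get(field)
            if email:
                # Assumption: emails are compared case-insensitively.
                ids_by_email[email.lower()].add(doc["id"])

    # Every record sharing an email with another record is non-unique.
    non_unique = set()
    for ids in ids_by_email.values():
        if len(ids) > 1:
            non_unique |= ids

    unique = {doc["id"] for doc in docs} - non_unique
    return sorted(unique), sorted(non_unique)


docs = [
    {"id": 1, "business_email_s": "john@abc.com", "personal_email_s": "john@abc.com"},
    {"id": 2, "business_email_s": "isabel@example.com"},  # hypothetical address
    {"id": 3, "personal_email_s": "isabel@example.com"},  # hypothetical address
]

unique_ids, non_unique_ids = split_unique(docs)
print(unique_ids, non_unique_ids)  # [1] [2, 3]
```

The result could then be written to each document as a flag field at index time, so that the query side becomes a simple filter. The catch is that a newly added record can turn an existing unique record into a non-unique one, so the flag on the older records would need to be updated as well.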

Thanks,
Susheel