Re: Mutli term synonyms

2015-04-29 Thread Kaushik
 



Re: Mutli term synonyms

2015-04-29 Thread Roman Chyla
I'm not sure I understand. The autophrasing filter lets the parser see
all the tokens, so that multi-token synonyms can be identified and
parsed. So if you are using the same analyzer at query and index time,
both should see the same tokens.

Are you using multi-token synonyms, or just entries that look like
multi-token synonyms? In the first case the tokens are separated by a
null byte. In the second case they are just strings that happen to
contain whitespace, and your synonym file must contain exactly the same
entries as your analyzer sees them (and in the same order; or you have
to use the same analyzer to load the synonym files).

Can you post the relevant part of your schema.xml?


Note: I can confirm that multi-token synonym expansion can be made to
work, even in complex cases (we do it), but if you need multi-token
synonyms you will likely also need a smarter query parser. Sometimes
your users will enter query strings that contain overlapping synonym
entries. To handle that, you will have to know how to generate all
possible 'reads'. Example:

synonym:

foo bar, foobar
hey foo, heyfoo

user input:

hey foo bar

possible readings:

((hey foo) +bar) OR (hey +(foo bar))

I'm simplifying here; the fun starts when you are dealing with a phrase query :)
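
To make the idea of generating all 'reads' concrete, here is a minimal sketch in Python. It is not the parser described above, only an illustration under simple assumptions: the input is already split on whitespace, and the hypothetical `readings` function treats each multi-token synonym entry as a tuple of tokens. It enumerates every way of grouping the input into single tokens and known phrases, so it also returns the plain reading with no phrase at all.

```python
from typing import List, Set, Tuple

Segment = Tuple[str, ...]


def readings(tokens: List[str], phrases: Set[Segment]) -> List[List[Segment]]:
    """Enumerate every split of tokens into single tokens and known phrases."""
    if not tokens:
        return [[]]
    results: List[List[Segment]] = []
    # Option 1: consume the next token on its own.
    for rest in readings(tokens[1:], phrases):
        results.append([(tokens[0],)] + rest)
    # Option 2: consume any known multi-token phrase that starts here.
    for phrase in phrases:
        n = len(phrase)
        if n > 1 and tuple(tokens[:n]) == phrase:
            for rest in readings(tokens[n:], phrases):
                results.append([phrase] + rest)
    return results


# The overlapping entries from the example above (illustrative data only).
phrases = {("foo", "bar"), ("hey", "foo")}
all_readings = readings("hey foo bar".split(), phrases)
for reading in all_readings:
    print(" + ".join("(" + " ".join(segment) + ")" for segment in reading))
```

A real query parser would then turn each reading into a query clause and OR the clauses together; that step depends on the parser and is left out here.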




Re: Mutli term synonyms

2015-04-29 Thread Roman Chyla
 



Re: Mutli term synonyms

2015-04-29 Thread Roman Chyla

   
  
 



Re: Mutli term synonyms

2015-04-29 Thread Kaushik
   
  
 



Re: Mutli term synonyms

2015-04-29 Thread Kaushik
 it, but it doesn't.
   
What could I be doing wrong?
   
 

   
  
 



Re: Mutli term synonyms

2015-04-29 Thread Roman Chyla
   <field name="name">Polysorbate 20</field>
   </doc>

 So when I query Solr /autophrase for tween 20 or FEMA NO. 2915, I expect to
 see the record containing Polysorbate 20, i.e.

 http://localhost:8983/solr/collection1/autophrase?q=tween+20&wt=json&indent=true
 should have retrieved it, but it doesn't.

 What could I be doing wrong?


Re: Mutli term synonyms

2015-04-28 Thread Kaushik
Hi there,

I tried the solution described at
https://lucidworks.com/blog/solution-for-multi-term-synonyms-in-lucenesolr-using-the-auto-phrasing-tokenfilter/
That solution works when the indexed data does not contain alphanumerics
or special characters. But in my case the synonyms look something like
the below.


 T-MAZ 20  POLYOXYETHYLENE (20) SORBITAN MONOLAURATE
 SORBITAN MONODODECANOATE  POLY(OXY-1,2-ETHANEDIYL) DERIVATIVE
 POLYOXYETHYLENE SORBITAN MONOLAURATE  POLYSORBATE 20 [MART.]
 SORBIMACROGOL LAURATE 300  POLYSORBATE 20 [FHFI]  FEMA NO. 2915

They contain alphanumerics, special characters, spaces, etc. Is there a way
to implement synonyms even in such a case?

Thanks,
Kaushik




Mutli term synonyms

2015-04-20 Thread Kaushik
Hello,

Reading up on synonyms, it looks like there is no real solution for
multi-term synonyms. Is that right? I have a use case where I need to map
one multi-term phrase to another, e.g. Tween 20 needs to be translated to
Polysorbate 20.

Any thoughts as to how this can be achieved?

Thanks,
Kaushik


RE: Mutli term synonyms

2015-04-20 Thread Davis, Daniel (NIH/NLM) [C]
Handling MeSH descriptor preferred terms and such is similar. I encountered
this during evaluation of Solr for a project here at NLM. We decided to use
Solr for different projects instead. I considered the following approaches:
 - Use a custom tokenizer at index time that indexes all of the multi-term
   alternatives.
 - Index the data, and then have an enrichment process that queries on each
   source synonym and generates an update to add the target synonyms.
   Follow this with an optimize.
 - During the indexing process, but before sending the data to Solr, process
   the data to tokenize it and add synonyms to another field.

Both the custom tokenizer and the enrichment process share the feature that
they use Solr's own tokenizer rather than duplicating it. The enrichment
process seems to me workable only in environments where you can re-index all
data periodically, that is, where there is no continuous stream of data that
needs to be indexed relatively quickly once it is generated. The last method,
pre-processing the data, seems the least desirable to me from a blue-sky
perspective, but it is probably the easiest to implement and the most
independent of Solr.
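
As a rough illustration of the last approach (pre-processing before the data reaches Solr), here is a minimal Python sketch. Everything in it is an assumption made for the example: the `SYNONYMS` table, the `enrich` function, and the `name_synonyms` field name are hypothetical, and the matching is a crude lowercase-and-collapse-whitespace comparison rather than Solr's own analysis, which is exactly the duplication of tokenizer logic mentioned above.

```python
import re
from typing import Dict, List

# Hypothetical synonym table: each source phrase maps to its target phrases.
SYNONYMS: Dict[str, List[str]] = {
    "tween 20": ["polysorbate 20"],
    "fema no. 2915": ["polysorbate 20"],
}


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace; a crude stand-in for Solr's analysis."""
    return re.sub(r"\s+", " ", text.strip().lower())


def enrich(doc: Dict[str, object], source_field: str, target_field: str) -> Dict[str, object]:
    """Return a copy of doc with matching target synonyms listed in target_field."""
    text = normalize(str(doc.get(source_field, "")))
    found: List[str] = []
    for source, targets in SYNONYMS.items():
        # Match the phrase only when it is not embedded in a longer word or number.
        pattern = r"(?<!\w)" + re.escape(source) + r"(?!\w)"
        if re.search(pattern, text):
            for target in targets:
                if target not in found:
                    found.append(target)
    enriched = dict(doc)
    enriched[target_field] = found
    return enriched


doc = {"id": "1", "name": "Tween  20 solution"}
enriched = enrich(doc, "name", "name_synonyms")
print(enriched)
```

The enriched document would then be sent to Solr as usual, with the extra field searched alongside the original one. Keeping the synonyms in a separate field leaves the original text untouched, which makes the step easy to rerun when the synonym table changes.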

Hope this helps,

Dan Davis, Systems/Applications Architect (Contractor),
Office of Computer and Communications Systems,
National Library of Medicine, NIH
