RE: Solr Phonetic Search Highlight issue in search results

2013-04-02 Thread Soumyanayan Kar
Thanks a lot Erick for trying this out.

Will wait for a reply from your end.

Thanks & Regards,

Soumya.


-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: 01 April 2013 05:46 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr Phonetic Search Highlight issue in search results

Good question, you're causing me to think... about code I know very little
about <G>.

So rather than spouting off, I tried it and... it works fine for me, either
with or without using fast vector highlighter on, admittedly, a very simple
test.

So I think I'd try peeling off all the extra stuff you've put into your
configs (sorry, I don't have time right now to try to reproduce) and get the
very simple case working, then build the rest back up and see where the
problem begins.

Sorry for the mis-direction!

Erick



RE: Solr Phonetic Search Highlight issue in search results

2013-03-31 Thread Soumyanayan Kar
Hi Erick,

Thanks for the reply. But help me understand this: if Solr is able to
isolate the two documents which contain the term "fact", the phonetic
equivalent of the search term "fakt", then why is it unable to highlight
the terms based on the same logic it uses to search the documents?

Also, it is correctly highlighting the results in other searches which are
also approximate rather than exact, e.g. fuzzy or synonym search. In these
cases too, the highlighted terms in the search results are far from the
actual search term, but they are still highlighted correctly.

Maybe I am getting it completely wrong but it looks like there is something
wrong with my implementation.

Thanks & Regards,

Soumya.


-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: 27 March 2013 06:07 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr Phonetic Search Highlight issue in search results

How would you expect it to highlight successfully? The term is "fakt",
there's nothing built in (and, indeed, couldn't be) to un-phoneticize it into
"fact" and apply that to the Content field. The whole point of phonetic
processing is to do a lossy translation from the word into some variant,
losing precision all the way.

So this behavior is unsurprising...

Best
Erick
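
To illustrate the "lossy translation" point, here is a minimal sketch in
Python. It is an assumption-laden illustration: it uses American Soundex
rather than the DoubleMetaphone encoder configured in this thread (chosen
only because it is short), and the soundex function is written for this
example, not taken from Solr. Several spellings collapse to one code, so
the code alone cannot be mapped back to a single original word.

```python
# American Soundex, used only as a stand-in phonetic encoder.
# NOTE: the thread's field type uses DoubleMetaphone, a different algorithm.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(word):
    """Return the 4-character Soundex code for a word (empty string if no letters)."""
    word = "".join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return ""
    encoded = word[0].upper()
    prev = CODES.get(word[0], "")
    for ch in word[1:]:
        digit = CODES.get(ch, "")
        # Emit a digit only when it differs from the previous letter's code.
        if digit and digit != prev:
            encoded += digit
        # 'h' and 'w' do not separate letters that share a code; vowels do.
        if ch not in "hw":
            prev = digit
    # Pad with zeros and truncate to the fixed length of four.
    return (encoded + "000")[:4]


# Different spellings end up with the same code.
same_code = soundex("fact") == soundex("fakt")
```

Both "fact" and "fakt" encode to F230 in this sketch, which is why a
phonetic field can match documents even though the stored text never
contains the literal query term.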





Solr Phonetic Search Highlight issue in search results

2013-03-26 Thread Soumyanayan Kar
When we issue a query with phonetic search, Solr returns the correct
documents but no highlights. When we use stemming or synonym searches, we
get the proper highlights.

For example, when we execute a phonetic query for the term "fakt"
(ContentSearchPhonetic:fakt) in the Solr Admin interface, it returns two
documents containing the term "fact" (the phonetic token equivalent), but the
list of highlights is empty, as shown in the response below.

 

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">16</int>
    <lst name="params">
      <str name="q">ContentSearchPhonetic:fakt</str>
      <str name="wt">xml</str>
    </lst>
  </lst>
  <result name="response" numFound="2" start="0">
    <doc>
      <long name="DocId">1</long>
      <str name="DocTitle">Doc 1</str>
      <str name="Content">Anyway, this game was excellent and was well
worth the time.  The graphics are truly amazing and the sound track was
pretty pleasant also. The  preacher was in  fact a thief.</str>
      <long name="_version_">1430480998833848320</long>
    </doc>
    <doc>
      <long name="DocId">2</long>
      <str name="DocTitle">Doc 2</str>
      <str name="Content">stunning. The  preacher was in  fact an
excellent thief who  had stolen the original manuscript of Hamlet  from an
exhibit on the  Riviera, where  he also  acquired his remarkable and
tan.</str>
      <long name="_version_">1430480998841188352</long>
    </doc>
  </result>
  <lst name="highlighting">
    <lst name="1"/>
    <lst name="2"/>
  </lst>
</response>
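
When troubleshooting, it may help to check the highlighting section
programmatically. The following Python sketch parses a trimmed-down copy of
the response above and lists the document keys whose highlight entry is
empty. The function name and the shortened sample are assumptions made for
illustration; only the element layout comes from the response shown here.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the response above (stored fields omitted for brevity).
SAMPLE_RESPONSE = """<response>
  <result name="response" numFound="2" start="0">
    <doc><long name="DocId">1</long></doc>
    <doc><long name="DocId">2</long></doc>
  </result>
  <lst name="highlighting">
    <lst name="1"/>
    <lst name="2"/>
  </lst>
</response>"""


def docs_without_highlights(xml_text):
    """Return the keys of documents whose highlighting entry has no snippets."""
    root = ET.fromstring(xml_text)
    highlighting = root.find("lst[@name='highlighting']")
    if highlighting is None:
        # Highlighting was not requested or not returned at all.
        return []
    # Each child <lst name="KEY"> holds one element per highlighted field;
    # an element with no children means nothing was highlighted for that doc.
    return [entry.get("name")
            for entry in highlighting.findall("lst")
            if len(entry) == 0]


empty_docs = docs_without_highlights(SAMPLE_RESPONSE)
```

For the sample, both document keys ("1" and "2") are reported, matching the
empty highlight lists in the response.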

 

Relevant section of Solr schema:

 

<field name="DocId" type="long" indexed="true" stored="true" required="true"/>
<field name="DocTitle" type="string" indexed="false" stored="true" required="true"/>
<field name="Content" type="text_general" indexed="false" stored="true" required="true"/>

<field name="ContentSearch" type="text_general" indexed="true" stored="false" multiValued="true"/>
<field name="ContentSearchStemming" type="text_stem" indexed="true" stored="false" multiValued="true"/>
<field name="ContentSearchPhonetic" type="text_phonetic" indexed="true" stored="false" multiValued="true"/>
<field name="ContentSearchSynonym" type="text_synonym" indexed="true" stored="false" multiValued="true"/>

<uniqueKey>DocId</uniqueKey>

<copyField source="Content" dest="ContentSearch"/>
<copyField source="Content" dest="ContentSearchStemming"/>
<copyField source="Content" dest="ContentSearchPhonetic"/>
<copyField source="Content" dest="ContentSearchSynonym"/>



<fieldType name="text_stem" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SnowballPorterFilterFactory"/>
  </analyzer>
</fieldType>

<fieldType name="text_phonetic" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.PhoneticFilterFactory" encoder="DoubleMetaphone" inject="false"/>
  </analyzer>
</fieldType>

<fieldType name="text_synonym" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
  </analyzer>
</fieldType>

 

Relevant section of Solr config:

 

<requestHandler name="/select" class="solr.SearchHandler">
  <!-- default values for query parameters can be specified, these
       will be overridden by parameters in the request
    -->
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <int name="rows">100</int>
    <str name="df">ContentSearch</str>
    <bool name="hl">true</bool>
    <str name="hl.fl">Content</str>
    <str name="f.Content.hl.fragsize">150</str>
    <str name="f.Content.hl.snippets">40</str>
  </lst>
</requestHandler>
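
Since the handler defaults can be overridden by request parameters, one way
to narrow the problem down may be to send the highlighting parameters
explicitly instead of relying on the defaults. The Python sketch below only
builds such a request URL and does not send it. The host, port and core name
in the example are assumptions, and build_select_url is a hypothetical
helper; q, wt, hl and hl.fl are the parameters already used in this message.

```python
from urllib.parse import urlencode


def build_select_url(base_url, query, highlight_field):
    """Build a /select URL with highlighting requested explicitly."""
    params = [
        ("q", query),                # e.g. ContentSearchPhonetic:fakt
        ("wt", "xml"),
        ("hl", "true"),              # same as the handler default above
        ("hl.fl", highlight_field),  # the stored field to build snippets from
    ]
    return base_url + "?" + urlencode(params)


# Assumed local Solr instance and core name, for illustration only.
url = build_select_url(
    "http://localhost:8983/solr/collection1/select",
    "ContentSearchPhonetic:fakt",
    "Content",
)
```

Changing one parameter at a time in such a request keeps the test case
small while the rest of the configuration is being simplified.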

<searchComponent class="solr.HighlightComponent" name="highlight">
  <highlighting>
    <!-- Configure the standard fragmenter -->
    <!-- This could most likely be commented out in the default case -->
    <fragmenter name="gap"
                default="true"
                class="solr.highlight.GapFragmenter">
      <lst name="defaults">
        <int name="hl.fragsize">100</int>
      </lst>
    </fragmenter>

    <!-- A regular-expression-based fragmenter
         (for sentence extraction)
      -->
    <fragmenter name="regex"
                class="solr.highlight.RegexFragmenter">
      <lst name="defaults">
        <!-- slightly smaller fragsizes work better because of slop -->
        <int name="hl.fragsize">70</int>
        <!-- allow 50% slop on fragment sizes -->
        <float name="hl.regex.slop">0.5</float>
        <!-- a basic sentence pattern -->
        <str name="hl.regex.pattern">[-\w ,/\n\&quot;&apos;]{20,200}</str>
      </lst>
    </fragmenter>

 

Has anyone experienced this kind of behaviour before? Need some direction
for troubleshooting.

 

Soumya.

 

 



Advanced Search Option in Solr corresponding to DtSearch options

2013-02-06 Thread Soumyanayan Kar
Hi,

 

We are replacing the search and indexing module in an application, moving
from DtSearch to Solr with SolrNet as the .NET Solr client library.

 

We are relatively new to Solr/Lucene and would need some help/direction to
understand the more advanced search options in Solr.

 

The current application supports the following search options using
DtSearch:

 

1) Word(s) or phrase
2) Exact words or phrases
3) Not these words or phrases
4) One or more of words (A OR B OR C)
5) Proximity of word within n words of another word
6) Numeric range - From - To
7) Options
   - Stemming (search* finds searching or searches)
   - Synonym (search finds seek or look)
   - Fuzzy within n letters (p%arts finds paris)
   - Phonic homonyms (#Smith also finds Smithe and Smythe)

 

As an example, this is the search query that gets generated to be posted to
DtSearch for the use case below:

1. Search Phrase: generic collection
2. Exact Phrase: linq
3. Not these words: sql
4. One or more of these words: ICollection or ArrayList or Hashtable
5. Proximity: csharp within 4 words of language
6. Options:
   a. Stemming
   b. Synonym
   c. Fuzzy within 2 letters
   d. Phonic homonyms

 

Search Query: generic* collection* generic collection #generic #collection
g%%eneric c%%ollection linq  -sql ICollection OR ArrayList OR Hashtable
csharp w/4 language

 

We have been able to do simple searches (a single-term search in file
content) with highlights in Solr. Now we need to replace these options
with their Solr/Lucene equivalents.
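
As a possible starting point, several of the options listed above have
rough counterparts in the standard Lucene query syntax: quoted phrases,
-term for exclusion, OR groups, "a b"~N for proximity and term~N for fuzzy
matching. Stemming, synonyms and phonetic matching are usually handled by
the field's analyzer rather than by query operators. The Python sketch below
builds such a query string. build_query is a hypothetical helper, the field
name ContentSearch is an assumption borrowed from the schema posted elsewhere
in this archive, and the mapping from the DtSearch operators is approximate
(for instance, phrase slop is not exactly "within n words").

```python
def quote(text):
    """Wrap text in double quotes, escaping backslashes and embedded quotes."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


def build_query(field, required=(), exact=None, excluded=(),
                any_of=(), near=None, fuzzy=()):
    """Build a Lucene-syntax query string.

    Single words are inserted as-is; special characters in them are
    not escaped in this sketch.
    """
    clauses = ["+%s:%s" % (field, word) for word in required]
    if exact:
        clauses.append("+%s:%s" % (field, quote(exact)))
    clauses += ["-%s:%s" % (field, word) for word in excluded]
    if any_of:
        clauses.append("+%s:(%s)" % (field, " OR ".join(any_of)))
    if near:
        first, second, distance = near
        # Proximity: a phrase query with slop.
        clauses.append("+%s:%s~%d" % (field, quote(first + " " + second), distance))
    # Fuzzy terms are optional clauses with a maximum edit distance.
    clauses += ["%s:%s~%d" % (field, word, dist) for word, dist in fuzzy]
    return " ".join(clauses)


example = build_query(
    "ContentSearch",
    required=["generic", "collection"],
    exact="linq",
    excluded=["sql"],
    any_of=["ICollection", "ArrayList", "Hashtable"],
    near=("csharp", "language", 4),
    fuzzy=[("generic", 2), ("collection", 2)],
)
```

Whether each clause should be required or optional depends on how the
DtSearch query is meant to combine its parts, so the + prefixes here are a
design choice made for the example, not a definitive translation.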

 

Can anybody provide some direction on what we should be looking at, and where?

 

Thanks & Regards,

 

Soumya.

 

 



Issue with multiple records in full text search

2013-01-29 Thread Soumyanayan Kar
Hi,

 

We are trying to use Solr for a text-based search solution in a web
application. The documents being indexed are essentially text-based
files such as *.txt, *.pdf, etc. We are using the Tika extraction plugin
to extract the text content from the files and storing it in a
text_general type field in the Solr schema file. Relevant part of the
schema file:

 

<field name="CaseId" type="long" indexed="true" stored="true" required="true"/>
<field name="CaseTitle" type="string" indexed="false" stored="true" required="true"/>
<field name="CaseNumber" type="string" indexed="false" stored="true" required="true"/>
<field name="MediaType" type="int" indexed="true" stored="true" required="true"/>
<field name="MediaId" type="string" indexed="true" stored="true" required="true"/>
<field name="CaptionName" type="string" indexed="false" stored="true" required="true"/>
<field name="MediaPath" type="string" indexed="false" stored="true" required="true"/>
<field name="MimeType" type="string" indexed="false" stored="true" required="false"/>
<field name="DocumentNumber" type="string" indexed="false" stored="true" required="false"/>
<field name="DeponentFullName" type="string" indexed="false" stored="true" required="false"/>
<field name="DepositionDate" type="date" indexed="false" stored="true" required="false"/>
<field name="DocCreatedDate" type="date" indexed="false" stored="true" required="false"/>
<field name="DocModifiedDate" type="date" indexed="false" stored="true" required="false"/>
<field name="Content" type="text_general" indexed="false" stored="true" required="true"/>
<field name="WorkgroupIdList" type="text_general" indexed="true" stored="true" required="true" multiValued="true"/>

<field name="ContentSearch" type="text_general" indexed="true" stored="false" multiValued="true"/>
<field name="_version_" type="long" indexed="true" stored="true"/>

<uniqueKey>MediaId</uniqueKey>

<copyField source="Content" dest="ContentSearch"/>

 

We are using a .NET-based solution and the SolrNet client to communicate
with Solr.

The Content field is supposed to store the text content of the file, and the
ContentSearch field will be used for executing the search.

While the documents are getting indexed properly, when executing a search we
get only the first occurrence of the search term returned for each document.

For example, suppose a.txt and b.pdf are indexed and the search term "case"
occurs multiple times in both documents (a.txt: 7 hits, b.pdf: 10 hits).
When we search for "case" against both documents, we get two records
returned, which are the first occurrences of the search term in the
respective docs, while we expected 17 hits.

We used Luke to inspect the index records but could not find anything
apparently wrong.

Is this something to do with the type (text_general) of the search field, or
with the way we are loading the entire content of the file into one index
document?

 

Thanks & Regards,

Soumya.

 

 




RE: Issue with multiple records in full text search

2013-01-29 Thread Soumyanayan Kar
Thanks Jack for the explanation.

But let's say my requirement is to return all occurrences of the search
term, along with the text snippet around each of them, for every document
in the search scope. How do we go about achieving that with Solr?

Thanks & Regards,

Soumya.



-----Original Message-----
From: Jack Krupansky [mailto:j...@basetechnology.com]
Sent: 29 January 2013 08:00 PM
To: solr-user@lucene.apache.org
Subject: Re: Issue with multiple records in full text search

The number of "hits" of a term in a Solr document impacts the score, but
still only counts as one "hit" in the numFound count. Solr doesn't track
"hits" for individual term occurrences, except that you could check the
term frequency of a specific term in a specific document if you wanted,
using a function query - tf(field,term) - which can also be included in the
"fl" field list.

To be clear - Solr has no concept of "records", just documents and fields.

-- Jack Krupansky
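
A sketch of what such a request might look like, in Python. It only builds
the parameter dictionary and does not contact Solr. The tf(field,term) entry
in fl follows Jack's description. The highlighting parameters (hl, hl.fl,
hl.snippets) are an addition not mentioned in his reply: hl.snippets caps
the number of snippets returned per field and defaults to 1, so raising it
may be one way to get more than the first occurrence per document. The
helper name is hypothetical, the field names are taken from the schema
quoted in this thread, and the term is assumed to be given in its indexed
form.

```python
def occurrence_request_params(search_field, stored_field, id_field,
                              term, max_snippets):
    """Build Solr request parameters as a plain dict (no request is sent)."""
    return {
        "q": "%s:%s" % (search_field, term),
        # tf(field,term) is the function query from the reply above; listing
        # it in fl returns the term frequency for each matching document.
        "fl": "%s,tf(%s,'%s')" % (id_field, search_field, term),
        # Highlighting parameters: an assumption, not part of the reply.
        "hl": "true",
        "hl.fl": stored_field,
        "hl.snippets": str(max_snippets),
        "wt": "xml",
    }


params = occurrence_request_params(
    "ContentSearch", "Content", "MediaId", "case", 20)
```

With these values the fl parameter becomes MediaId,tf(ContentSearch,'case'),
so each returned document carries its own occurrence count, while numFound
still counts documents.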
