Re: Solr Deleting Docs after Indexing
It was indeed the duplicate IDs. Somehow I thought I had it unique all the way.

Thanks,
Kaushik

On Mon, Sep 11, 2017 at 3:21 PM, Susheel Kumar <susheel2...@gmail.com> wrote:
> Do all 4 documents have the same docID (unique key)?
>
> On Mon, Sep 11, 2017 at 2:44 PM, Kaushik <kaushika...@gmail.com> wrote:
> > I am using Solr 5.3 and have a custom SolrJ application to write to Solr.
> > When I index using this application, I expect to see 4 documents indexed.
> > But for some strange reason, 3 documents get deleted and there is always
> > only 1 document in the index. I say that because the final tally on the
> > Solr Admin console is:
> >
> > Num Docs: 1
> > Max Doc: 4
> > Deleted Docs: 3
> >
> > How and where in Solr/logs can I find why the documents are being deleted?
> >
> > Thanks,
> > Kaushik
Solr Deleting Docs after Indexing
I am using Solr 5.3 and have a custom SolrJ application to write to Solr. When I index using this application, I expect to see 4 documents indexed. But for some strange reason, 3 documents get deleted and there is always only 1 document in the index. I say that because the final tally on the Solr Admin console is:

Num Docs: 1
Max Doc: 4
Deleted Docs: 3

How and where in Solr/logs can I find why the documents are being deleted?

Thanks,
Kaushik
Re: Number of occurrences in Solr Documents
Thanks to Susheel and Shawn. Unfortunately the Solr version we have is Solr 5.3, and it does not include the totaltermfrequency feature. Is there any downside to using TermVectorFrequency, like performance issues?

On Thu, Jun 29, 2017 at 11:49 AM, Susheel Kumar <susheel2...@gmail.com> wrote:
> That's even better. Thanks, Shawn.
>
> On Thu, Jun 29, 2017 at 11:45 AM, Shawn Heisey <apa...@elyograg.org> wrote:
> > On 6/29/2017 8:40 AM, Kaushik wrote:
> > > We are trying to get the most frequently used words in a collection.
> > > My understanding is that this can be done using facet.field=content_txt.
> > > An example content_txt value is "The fox jumped over another fox". In
> > > such a scenario, I am expecting the facet to return "fox" with a count
> > > of 2. However, we end up getting "fox" with a value of 1. It appears we
> > > are getting the total number of documents that match the query, as
> > > opposed to the total number of times the word occurred. How can the
> > > latter be achieved?
> >
> > Facets count the number of documents, not the number of terms.
> >
> > You might be after the terms component.
> >
> > https://lucene.apache.org/solr/guide/6_6/the-terms-component.html
> >
> > This generally works across the entire index, while facets can operate
> > on documents that match a query.
> >
> > Thanks,
> > Shawn
Number of occurrences in Solr Documents
Hello,

We are trying to get the most frequently used words in a collection. My understanding is that this can be done using facet.field=content_txt. An example content_txt value is "The fox jumped over another fox". In such a scenario, I am expecting the facet to return "fox" with a count of 2. However, we end up getting "fox" with a value of 1. It appears we are getting the total number of documents that match the query, as opposed to the total number of times the word occurred. How can the latter be achieved?

Thanks,
AK
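For reference, a sketch of the terms component request suggested later in this thread (the core name and handler path are assumptions, and the /terms request handler must be enabled in solrconfig.xml): it returns the highest-frequency indexed terms in a field, which is what the per-term count above is after.

```
http://localhost:8983/solr/collection1/terms?terms.fl=content_txt&terms.limit=10&terms.sort=count&wt=json
```

Note this counts index-wide term frequencies rather than operating on the result set of a query.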
How does using cacheKey and lookup behave?
I use cacheKey, cacheLookup, and SortedMapBackedCache in the Data Import Handler of Solr 5.x to join two or more entities. Does this give me an equivalent of SQL's inner join? If so, how can I get something similar to a left join?

Thank you,
Kaushik
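For context, a minimal sketch of the setup being described (table and column names are hypothetical): the child entity is loaded into a SortedMapBackedCache keyed on cacheKey, and each parent row looks up matching rows via cacheLookup. As far as I understand, a parent row with no cache match is still indexed, with the child fields simply absent, so the behavior resembles a left outer join rather than an inner join; an inner join would need to be expressed in the SQL query itself.

```xml
<document>
  <entity name="parent" query="SELECT id, name FROM parent_table">
    <entity name="child"
            query="SELECT parent_id, detail FROM child_table"
            cacheImpl="SortedMapBackedCache"
            cacheKey="parent_id"
            cacheLookup="parent.id"/>
  </entity>
</document>
```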
Is there Solr limitation on size for document retrieval?
Hello,

Is there a limit on the size of a document that can be indexed and rendered by Solr? We use Solr 5.3.1, and while we are able to index a document of 40 MB size without any issue, we are unable to retrieve the indexed SolrDocument. Is there any configuration that we can use to spit out the entire document?

Also, the only reason why we need the whole document is the highlighting feature. It would be great if we could just get a snippet of the text, instead of the entire content field, for highlighting.

Thanks,
Kaushik
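On the snippet question, a hedged sketch of highlighting parameters that return fragments without the full stored field (the core name, field names, and query term are assumptions, not from the post): fl limits the fields returned per document, while hl.fl, hl.snippets, and hl.fragsize control the snippets. For a 40 MB document, hl.maxAnalyzedChars may also matter, since by default only the beginning of a large field is analyzed for highlighting.

```
http://localhost:8983/solr/collection1/select?q=content:fox&fl=id,score&hl=true&hl.fl=content&hl.snippets=3&hl.fragsize=200&hl.maxAnalyzedChars=1000000
```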
Adding multiple entities to single core
Hi Admin,

I am a newbie to SOLR. I want to establish multiple entities within a single core, so that each entity pulls data from one of two different tables and indexes it. Please help me out. Attached are my schema.xml and data-config.xml files.

Looking forward to a positive response.

--Naman

data-config.xml:

<dataConfig>
  <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/db_ipc" user="root" password="" batchSize="1"/>
  <document name="listing">
    <entity name="listings_data" pk="property_id" query="SELECT * FROM member_listing">
      <field column="l_id" template="l_${member_listing.id}" template="listing"/>
      <field column="id" name="id" template="listing"/>
      <field column="property_id" name="property_id" template="listing"/>
      <field column="member_id" name="member_id" template="listing"/>
      <field column="property_type_id" name="property_type_id" template="listing"/>
      <field column="property_for" name="property_for" template="listing"/>
      <field column="l_location_id" name="location_id" template="listing"/>
      <field column="l_location" name="location" template="listing"/>
      <field column="l_city" name="city" template="listing"/>
      <field column="l_city_other" name="city_other" template="listing"/>
      <field column="price" name="price" template="listing"/>
      <field column="area" name="area" template="listing"/>
      <field column="area_unit" name="area_unit" template="listing"/>
      <field column="area_in_sqfeet" name="area_in_sqfeet" template="listing"/>
      <field column="is_negotiable" name="is_negotiable" template="listing"/>
      <field column="deposit_amount" name="deposit_amount" template="listing"/>
      <field column="bedrooms" name="bedrooms" template="listing"/>
      <field column="reposted_date" name="reposted_date" template="listing"/>
      <field column="contact_name" name="contact_name" template="listing"/>
      <field column="contact_phone" name="contact_phone" template="listing"/>
      <field column="contact_mobile" name="contact_mobile" template="listing"/>
      <field column="contact_email" name="contact_email" template="listing"/>
      <field column="property_address" name="property_address" template="listing"/>
      <field column="project_society" name="project_society" template="listing"/>
      <field column="furnished" name="furnished" template="listing"/>
      <field column="age_of_construction" name="age_of_construction" template="listing"/>
    </entity>
    <entity name="user_data" pk="member_id" query="SELECT * FROM member">
      <field column="m_id" template="l_${member.member_id}" template="member"/>
      <field column="member_id" name="m_member_id" template="member"/>
      <field column="username" name="m_username" template="member"/>
      <field column="password" name="m_password" template="member"/>
      <field column="fullname" name="m_fullname" template="member"/>
      <field column="email" name="m_email" template="member"/>
      <field column="address" name="m_address" template="member"/>
      <field column="city" name="m_city" template="member"/>
      <field column="locality_id" name="m_locality_id" template="member"/>
      <field column="mobile" name="m_mobile" template="member"/>
      <field column="member_type" name="m_member_type" template="member"/>
    </entity>
  </document>
</dataConfig>

schema.xml:

<?xml version="1.0" encoding="UTF-8"?>
<!-- Solr managed schema - automatically generated - DO NOT EDIT -->
<schema name="example-data-driven-schema" version="1.5">
  <uniqueKey>id</uniqueKey>
  <fieldType name="ancestor_path" class="solr.TextField">
    <analyzer type="index">
      <tokenizer class="solr.KeywordTokenizerFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.PathHierarchyTokenizerFactory" delimiter="/"/>
    </analyzer>
  </fieldType>
  <fieldType name="binary" class="solr.BinaryField"/>
  <fieldType name="boolean" class="solr.BoolField" sortMissingLast="true"/>
  <fieldType name="booleans" class="solr.BoolField" multiValued="true" sortMissingLast="true"/>
  <fieldType name="currency" class="solr.CurrencyField" precisionStep="8" currencyConfig="currency.xml" defaultCurrency="USD"/>
  <fieldType name="date" class="solr.TrieDateField" precisionStep="0" positionIncrementGap="0"/>
  <fieldType name="dates" class="solr.TrieDateField" precisionStep="0" multiValued="true" positionIncrementGap="0"/>
  <fieldType name="descendent_path" class="solr.TextField">
    <analyzer type="index">
      <tokenizer class="solr.PathHierarchyTokenizerFactory" delimiter="/"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.KeywordTokenizerFactory"/>
    </analyzer>
  </fieldType>
  <fieldType name="double" class="solr.TrieDoubleField" precisionStep="0" positionIncrementGap="0"/>
  <fieldType name="doubles" class="solr.TrieDoubleField" precisionStep="0" multiValued="true" positionIncrementGap="0"/>
  <fieldType name="float" class="solr.TrieFloatField" precisionStep="0" positionIncrementGap="0"/>
  <fieldType name="floats" class="solr.TrieFloatField" precisionStep="0" multiValued="true" positionIncrementGap="0"/>
  <fieldType name="ignored" class="solr.StrField" multiValued="true" indexed="false" stored="false"/>
  <fieldType name="int" class="solr.TrieIntField" precisionStep="0" positionIncrementGap="0"/>
  <fieldType name="ints" class="solr.TrieIntField" precisionStep="0" multiValued="true" positionIncrementGap="0"/>
  <fieldType name="location" class="solr.LatLonType" subFieldSuffix="_coordinate"/>
  <fieldType
Re: Injecting synonyms into Solr
I am facing the same problem; currently I am resorting to a custom program to create this file. Hopefully there is a better solution out there.

Thanks,
Kaushik

On Thu, Apr 30, 2015 at 3:58 AM, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:
> Hi,
>
> Does anyone know of a faster method of populating the synonyms.txt file,
> instead of manually typing the words into the file, given that there could
> be thousands of synonyms?
>
> Regards,
> Edwin
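For what it's worth, the sort of "custom program" mentioned above can be quite small. Here is a minimal sketch; the input format (one group of equivalent terms per record) is an assumption, not something from the thread:

```python
def to_synonym_lines(groups):
    """Turn groups of equivalent terms into Solr synonyms.txt lines.

    Each group becomes one comma-separated line, the format
    SynonymFilterFactory expects for equivalent (expand=true) synonyms.
    """
    lines = []
    for group in groups:
        # Normalize whitespace and drop empty entries.
        terms = [t.strip() for t in group if t.strip()]
        if len(terms) >= 2:  # a synonym line needs at least two terms
            lines.append(",".join(terms))
    return lines

# Example: groups might come from a CSV file or a database query.
groups = [
    ["TWEEN 20", "POLYSORBATE 20", "T-MAZ 20"],
    ["Fuji ", "Gala", ""],  # messy input gets cleaned up
]
print("\n".join(to_synonym_lines(groups)))
```

Writing the result to synonyms.txt and reloading the core would then pick up the generated entries.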
Re: Multi term synonyms
Hi Roman,

Following is my use case:

*Schema.xml*
...
<field name="name" type="text_autophrase" indexed="true" stored="true"/>

<fieldType name="text_autophrase" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="com.lucidworks.analysis.AutoPhrasingTokenFilterFactory"
            phrases="autophrases.txt" includeTokens="false" replaceWhitespaceWith="X"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
  </analyzer>
</fieldType>

*SolrConfig.xml*
...
<requestHandler name="/autophrase" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <int name="rows">10</int>
    <str name="df">name</str>
    <str name="defType">autophrasingParser</str>
  </lst>
</requestHandler>

<queryParser name="autophrasingParser" class="com.lucidworks.analysis.AutoPhrasingQParserPlugin">
  <str name="phrases">autophrases.txt</str>
  <str name="replaceWhitespaceWith">X</str>
</queryParser>

*Synonyms.txt*
PEG-20 SORBITAN LAURATE,POLYOXYETHYLENE 20 SORBITAN MONOLAURATE,TWEEN 20,POLYSORBATE 20 [USAN],POLYSORBATE 20 [INCI],POLYSORBATE 20 [II],POLYSORBATE 20 [HSDB],TWEEN-20,PEG-20 SORBITAN,PEG-20 SORBITAN [VANDF],POLYSORBATE-20,POLYSORBATE 20,SORETHYTAN MONOLAURATE,T-MAZ 20,POLYOXYETHYLENE (20) SORBITAN MONOLAURATE,SORBITAN MONODODECANOATE,POLY(OXY-1,2-ETHANEDIYL) DERIVATIVE,POLYOXYETHYLENE SORBITAN MONOLAURATE,POLYSORBATE 20 [MART.],SORBIMACROGOL LAURATE 300,POLYSORBATE 20 [FHFI],FEMA NO. 2915,POLYSORBATE 20 [FCC],POLYSORBATE 20 [WHO-DD],POLYSORBATE 20 [VANDF]

*Autophrases.txt*
Has all the above phrases in one column.

*Indexed document*
<doc>
  <field name="id">31</field>
  <field name="name">Polysorbate 20</field>
</doc>

So when I query the Solr /autophrase handler for "tween 20" or "FEMA NO. 2915", I expect to see the record containing Polysorbate 20, i.e.
http://localhost:8983/solr/collection1/autophrase?q=tween+20&wt=json&indent=true
should have retrieved it, but it doesn't. What could I be doing wrong?

On Wed, Apr 29, 2015 at 2:10 AM, Roman Chyla <roman.ch...@gmail.com> wrote:
> I'm not sure I understand - the autophrasing filter will allow the parser
> to see all the tokens, so that they can be parsed (and multi-token
> synonyms) identified. So if you are using the same analyzer at query and
> index time, they should be able to see the same stuff.
>
> Are you using multi-token synonyms, or just entries that look like multi
> synonyms? (In the first case, the tokens are separated by a null byte; in
> the second case, they are just strings, even with whitespace.) Your
> synonym file must contain exactly the same entries as your analyzer sees
> them (and in the same order; or you have to use the same analyzer to load
> the synonym files).
>
> Can you post the relevant part of your schema.xml?
>
> Note: I can confirm that multi-token synonym expansion can be made to
> work, even in complex cases - we do it - but likely, if you need
> multi-token synonyms, you will also need a smarter query parser.
> Sometimes your users will use query strings that contain overlapping
> synonym entries; to handle that, you will have to know how to generate
> all possible 'reads'. Example:
>
> synonyms:
>   foo bar, foobar
>   hey foo, heyfoo
>
> user input: hey foo bar
> possible readings: ((hey foo) +bar) OR (hey +(foo bar))
>
> I'm simplifying it here; the fun starts when you are seeing a phrase
> query :)
>
> On Tue, Apr 28, 2015 at 10:31 AM, Kaushik <kaushika...@gmail.com> wrote:
> > Hi there,
> >
> > I tried the solution provided in
> > https://lucidworks.com/blog/solution-for-multi-term-synonyms-in-lucenesolr-using-the-auto-phrasing-tokenfilter/
> > The mentioned solution works when the indexed data does not have
> > alphanumerics or special characters. But in my case the synonyms are
> > something like the below:
> >
> > T-MAZ 20
> > POLYOXYETHYLENE (20) SORBITAN MONOLAURATE
> > SORBITAN MONODODECANOATE
> > POLY(OXY-1,2-ETHANEDIYL) DERIVATIVE
> > POLYOXYETHYLENE SORBITAN MONOLAURATE
> > POLYSORBATE 20 [MART.]
> > SORBIMACROGOL LAURATE 300
> > POLYSORBATE 20 [FHFI]
> > FEMA NO. 2915
> >
> > They have alphanumerics, special characters, spaces, etc. Is there a way
> > to implement synonyms even in such a case?
> >
> > Thanks,
> > Kaushik
> >
> > On Mon, Apr 20, 2015 at 11:03 AM, Davis, Daniel (NIH/NLM) [C]
> > <daniel.da...@nih.gov> wrote:
> > > Handling MESH descriptor preferred terms and such is similar
Re: Multi term synonyms
Hi Roman,

When I used debugQuery via
http://localhost:8983/solr/collection1/autophrase?q=tween+20&wt=json&indent=true&debugQuery=true
I see the following in the response. The autophrase plugin seems to be doing its part, just not the synonym expansion. When you say use phrase queries, what do you mean? Please clarify.

response: {
  numFound: 0,
  start: 0,
  docs: []
},
debug: {
  rawquerystring: "tween 20",
  querystring: "tween 20",
  parsedquery: "name:tweenx20",
  parsedquery_toString: "name:tweenx20",
  explain: {},

Thank you,
Kaushik

On Wed, Apr 29, 2015 at 4:00 PM, Roman Chyla <roman.ch...@gmail.com> wrote:
> Pls post the output of the request with debugQuery=true. Do you see the
> synonyms being expanded? Probably not. You can go to the admin interface
> and, in the analyzer section, play with the input until you see the
> synonyms. Use phrase queries too. That will be helpful to eliminate the
> autophrase filter.
>
> On Apr 29, 2015 6:18 AM, Kaushik <kaushika...@gmail.com> wrote:
> > Hi Roman, Following is my use case: *Schema.xml*...
Re: analyzer, indexAnalyzer and queryAnalyzer
Hi Doug,

Nice explanation of the query parsers. If you get a chance, can you please take a quick look at the issue I am facing with multi term synonyms as well?
http://lucene.472066.n3.nabble.com/Mutli-term-synonyms-tt4200960.html#none
is the problem I am facing. I am now able to perform multi term searches on most phrases, barring the ones which have special characters used in SOLR, i.e. [], etc. Your help is much appreciated.

Thanks,
Kaushik

On Wed, Apr 29, 2015 at 9:24 PM, Doug Turnbull <dturnb...@opensourceconnections.com> wrote:
> So Solr has the idea of a query parser. The query parser is a convenient
> way of passing a search string to Solr and having Solr parse it into
> underlying Lucene queries. You can see a list of query parsers here:
> http://wiki.apache.org/solr/QueryParser
>
> What this means is that the query parser does work to pull terms into
> individual clauses *before* analysis is run. It's a parsing layer that
> sits outside the analysis chain. This creates problems like the "sea
> biscuit" problem, whereby we declare "sea biscuit" as a query-time synonym
> of "seabiscuit". As you may know, synonyms are checked during analysis.
> However, if the query parser splits up "sea" from "biscuit" before running
> analysis, the query-time analyzer will fail. The string "sea" is brought
> by itself to the query-time analyzer and of course won't match "sea
> biscuit". Same with the string "biscuit" in isolation. If the full string
> "sea biscuit" was brought to the analyzer, it would see [sea] next to
> [biscuit] and declare it a synonym of "seabiscuit". Thanks to the query
> parser, the analyzer has lost the association between the terms, and both
> terms aren't brought together to the analyzer.
>
> My colleague John Berryman wrote a pretty good blog post on this:
> http://opensourceconnections.com/blog/2013/10/27/why-is-multi-term-synonyms-so-hard-in-solr/
>
> There's several solutions out there that attempt to address this problem.
> One from Ted Sullivan at Lucidworks:
> https://lucidworks.com/blog/solution-for-multi-term-synonyms-in-lucenesolr-using-the-auto-phrasing-tokenfilter/
> Another popular one is the hon-lucene-synonyms plugin:
> http://lucene.apache.org/solr/5_1_0/solr-core/org/apache/solr/search/FieldQParserPlugin.html
> Yet another work-around is to use the field query parser:
> http://lucene.apache.org/solr/5_1_0/solr-core/org/apache/solr/search/FieldQParserPlugin.html
>
> I also tend to write my own query parsers, so on the one hand it's
> annoying that query parsers have the problems above; on the flip side,
> Solr makes it very easy to implement whatever parsing you think is
> appropriate with a small bit of Java/Lucene knowledge.
>
> Hopefully that explanation wasn't too deep, but it's an important thing to
> know about Solr. Are you asking out of curiosity, or do you have a
> specific problem?
>
> Thanks
> -Doug
>
> On Wed, Apr 29, 2015 at 6:32 PM, Steven White <swhite4...@gmail.com> wrote:
> > Hi Doug,
> >
> > I don't understand what you mean by the following:
> >
> > "For example, if a user searches for q=hot dogs&defType=edismax&qf=title
> > body, the *query parser* *not* the *analyzer* first turns the query
> > into:"
> >
> > If I have an indexAnalyzer and queryAnalyzer in a fieldType that are
> > 100% identical, does the example you provided still stand? If so, why?
> > Or do you mean something totally different by "query parser"?
> >
> > Thanks
> > Steve
> >
> > On Wed, Apr 29, 2015 at 4:18 PM, Doug Turnbull
> > <dturnb...@opensourceconnections.com> wrote:
> > > *1) If the content of indexAnalyzer and queryAnalyzer are exactly the
> > > same, that's the same as if I have an analyzer only, right?*
> > >
> > > 1) Yes
> > >
> > > *2) Under the hood, all three are the same thing when it comes to what
> > > kind of data and configuration attributes they can take, right?*
> > >
> > > 2) Yes. Both take in text and output a token stream.
> > >
> > > *What I'm trying to figure out is this: beside being able to configure
> > > a fieldType to have different analyzer settings at index and query
> > > time, there is nothing else that's unique about each.*
> > >
> > > The only thing to look out for in Solr land is the query parser. Most
> > > Solr query parsers treat whitespace as meaningful. For example, if a
> > > user searches for q=hot dogs&defType=edismax&qf=title body, the *query
> > > parser* *not* the *analyzer* first turns the query into:
> > >
> > > (title:hot title:dog) | (body:hot body:dog)
> > >
> > > each word of which *then* gets analyzed. This is because the query
> > > parser tries to be smart and turn "hot dog" into hot OR dog, or more
> > > specifically making them two must clauses. This trips quite a few
> > > folks up; you can use the field query parser, which uses the field as
> > > a phrase query.
> > >
> > > Hope that helps
> > >
> > > --
> > > *Doug Turnbull* | Search Relevance Consultant | OpenSource
> > > Connections, LLC | 240.476.9983 |
> > > http://www.opensourceconnections.com
> > > Author: Taming Search http://manning.com/turnbull from Manning
> > > Publications
> > > This e-mail and all contents, including attachments, is considered
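Doug's "sea biscuit" point can be illustrated with a toy model. This is an illustration of the concept only, not Solr's actual code; the analyzer and parser below are hypothetical stand-ins:

```python
# Multi-term synonym table, as a query-time SynonymFilter might hold it.
SYNONYMS = {"sea biscuit": "seabiscuit"}

def analyze(text):
    """Toy analyzer: lowercase, then apply multi-term synonyms to the whole string."""
    text = text.lower()
    return [SYNONYMS.get(text, text)]

def parse_then_analyze(query):
    """Toy query parser: split on whitespace *before* analysis, like many Solr parsers."""
    tokens = []
    for word in query.split():
        tokens.extend(analyze(word))
    return tokens

# Analyzing the whole string finds the synonym...
print(analyze("Sea Biscuit"))           # the full phrase matches the synonym entry
# ...but parsing first destroys the association between the terms:
print(parse_then_analyze("Sea Biscuit"))
```

The second call never sees "sea biscuit" as one string, so the synonym entry can never fire, which is exactly the failure mode described above.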
Re: Multi term synonyms
Hi Roman,

"Tween 20" also did not retrieve results for me. So I replaced the whitespaces in synonyms.txt with 'x', and now when I search, I get the results back. One problem however still exists: when I search for POLYSORBATE 20[MART.], which is a synonym for POLYSORBATE 20, I get an error as below.

msg: "org.apache.solr.search.SyntaxError: Cannot parse 'polysORbate 20[mart.] ':
Encountered \" \"]\" \"] \"\" at line 1, column 20.
Was expecting one of:
    \"TO\" ...
    RANGE_QUOTED ...
    RANGE_GOOP ...",
code: 400

If I am able to solve this, I think I am pretty close to the solution. Any thoughts there? I appreciate your help on this matter.

Thank you,
Kaushik

On Wed, Apr 29, 2015 at 5:48 PM, Roman Chyla <roman.ch...@gmail.com> wrote:
> Hi Kaushik,
>
> I meant to compare "tween 20" against "tween 20". Your autophrase filter
> replaces whitespace with x, but your synonym filter expects whitespaces.
> Try that.
>
> Roman
>
> On Apr 29, 2015 2:27 PM, Kaushik <kaushika...@gmail.com> wrote:
> > Hi Roman,
> >
> > When I used debugQuery via
> > http://localhost:8983/solr/collection1/autophrase?q=tween+20&wt=json&indent=true&debugQuery=true
> > I see the following in the response. The autophrase plugin seems to be
> > doing its part, just not the synonym expansion. When you say use phrase
> > queries, what do you mean? Please clarify.
> >
> > response: { numFound: 0, start: 0, docs: [] },
> > debug: { rawquerystring: "tween 20", querystring: "tween 20",
> > parsedquery: "name:tweenx20", parsedquery_toString: "name:tweenx20",
> > explain: {},
> >
> > Thank you,
> > Kaushik
> >
> > On Wed, Apr 29, 2015 at 4:00 PM, Roman Chyla <roman.ch...@gmail.com> wrote:
> > > Pls post the output of the request with debugQuery=true. Do you see
> > > the synonyms being expanded? Probably not. You can go to the admin
> > > interface and, in the analyzer section, play with the input until you
> > > see the synonyms. Use phrase queries too. That will be helpful to
> > > eliminate the autophrase filter.
> > >
> > > On Apr 29, 2015 6:18 AM, Kaushik <kaushika...@gmail.com> wrote:
> > > > Hi Roman, Following is my use case: *Schema.xml*...
Re: Multi term synonyms
Hi there,

I tried the solution provided in
https://lucidworks.com/blog/solution-for-multi-term-synonyms-in-lucenesolr-using-the-auto-phrasing-tokenfilter/
The mentioned solution works when the indexed data does not have alphanumerics or special characters. But in my case the synonyms are something like the below:

T-MAZ 20
POLYOXYETHYLENE (20) SORBITAN MONOLAURATE
SORBITAN MONODODECANOATE
POLY(OXY-1,2-ETHANEDIYL) DERIVATIVE
POLYOXYETHYLENE SORBITAN MONOLAURATE
POLYSORBATE 20 [MART.]
SORBIMACROGOL LAURATE 300
POLYSORBATE 20 [FHFI]
FEMA NO. 2915

They have alphanumerics, special characters, spaces, etc. Is there a way to implement synonyms even in such a case?

Thanks,
Kaushik

On Mon, Apr 20, 2015 at 11:03 AM, Davis, Daniel (NIH/NLM) [C] <daniel.da...@nih.gov> wrote:
> Handling MESH descriptor preferred terms and such is similar. I
> encountered this during evaluation of Solr for a project here at NLM. We
> decided to use Solr for different projects instead. I considered the
> following approaches:
>
> - Use a custom tokenizer at index time that indexes all of the
>   multiple-term alternatives.
> - Index the data, and then have an enrichment process that queries on
>   each source synonym and generates an update to add the target synonyms.
>   Follow this with an optimize.
> - During the indexing process, but before sending the data to Solr,
>   process the data to tokenize and add synonyms to another field.
>
> Both the custom tokenizer and the enrichment process share the feature
> that they use Solr's own tokenizer rather than duplicate it. The
> enrichment process seems to me only workable in environments where you
> can re-index all data periodically, so no continuous stream of data to
> index that needs to be handled relatively quickly once it is generated.
> The last method of pre-processing the data seems the least desirable to
> me from a blue-sky perspective, but is probably the easiest to implement
> and the most independent of Solr.
>
> Hope this helps,
>
> Dan Davis, Systems/Applications Architect (Contractor),
> Office of Computer and Communications Systems,
> National Library of Medicine, NIH
>
> -----Original Message-----
> From: Kaushik [mailto:kaushika...@gmail.com]
> Sent: Monday, April 20, 2015 10:47 AM
> To: solr-user@lucene.apache.org
> Subject: Multi term synonyms
>
> Hello,
>
> Reading up on synonyms it looks like there is no real solution for multi
> term synonyms. Is that right? I have a use case where I need to map one
> multi term phrase to another, i.e. Tween 20 needs to be translated to
> Polysorbate 40. Any thoughts as to how this can be achieved?
>
> Thanks,
> Kaushik
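Dan's third approach (pre-processing documents to add synonyms to a separate field before sending them to Solr) can be sketched roughly like this; the field names and the synonym table are hypothetical, not from the thread:

```python
# Hypothetical synonym table: canonical value -> known alternative names.
SYNONYMS = {
    "tween 20": ["polysorbate 20", "t-maz 20"],
}

def add_synonym_field(doc, source_field="name", target_field="name_synonyms"):
    """Copy any known synonyms of the source field's value into a separate
    field, so Solr can match them without query-time synonym expansion."""
    value = doc.get(source_field, "").lower()
    expansions = SYNONYMS.get(value, [])
    if expansions:
        # A multiValued field in the schema would hold all the alternatives.
        doc[target_field] = expansions
    return doc

doc = add_synonym_field({"id": "31", "name": "Tween 20"})
print(doc["name_synonyms"])
```

Searching across both name and name_synonyms (e.g. via qf) would then match either the stored value or any of its alternatives, at the cost of keeping this mapping outside Solr.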
Order of Copy Field and Analyzer
Hello,

What is the order in which these occur?

- Copy field
- Analyzer

The other way of asking the above question, I guess, is: if I copy a _txt field to a _t field, does the analyzer of _t get the original text sent to the _txt field, or the analyzed tokens from it?

Thanks,
Kaushik
Correct usage for Synonyms.txt
Is my understanding of the synonyms.txt configuration correct?

1. When the user can search from a list of synonyms and the searchable document can have any synonym, the configuration should be like below:

   Fuji, Gala, Braeburn, Crisp => Fuji, Gala, Braeburn, Crisp

2. When the user can search from a list of synonyms and the searchable document can only have a preferred term (e.g. Apple):

   Apple, Fuji, Gala, Braeburn, Crisp
   OR
   Fuji, Gala, Braeburn, Crisp => Apple

Is there any other format that I am missing?

Thank you,
Kaushik
Multi term synonyms
Hello,

Reading up on synonyms, it looks like there is no real solution for multi term synonyms. Is that right? I have a use case where I need to map one multi term phrase to another, i.e. Tween 20 needs to be translated to Polysorbate 40. Any thoughts as to how this can be achieved?

Thanks,
Kaushik
Re: generate uuid/id for a table which does not have any primary key
Have you tried "select <concatenated fields> as id, name, age"?

On Thu, Apr 16, 2015 at 3:34 PM, Vishal Swaroop <vishal@gmail.com> wrote:
> Just wondering if there is a way to generate uuid/id in data-config
> without using a combination of fields in the query...
>
> data-config.xml:
> <?xml version="1.0" encoding="UTF-8" ?>
> <dataConfig>
>   <dataSource batchSize="2000" name="test" type="JdbcDataSource"
>     driver="oracle.jdbc.OracleDriver" url="jdbc:oracle:thin:@ldap:"
>     user="myUser" password="pwd"/>
>   <document>
>     <entity name="test_entity" docRoot="true" dataSource="test"
>       query="select name, age from test_user"></entity>
>   </document>
> </dataConfig>
>
> On Thu, Apr 16, 2015 at 3:18 PM, Vishal Swaroop <vishal@gmail.com> wrote:
>> Thanks Kaushik, Erick. Though I can populate uuid by using a
>> combination of fields, I need to change the type to string, else it
>> throws "Invalid UUID String":
>> <field name="uuid" type="string" indexed="true" stored="true"
>>   required="true" multiValued="false"/>
>> a) I will have ~80 million records and wonder whether performance
>>    might be an issue.
>> b) So, during an update I can still use the combination of fields as
>>    uuid?
>>
>> On Thu, Apr 16, 2015 at 2:44 PM, Erick Erickson
>> <erickerick...@gmail.com> wrote:
>>> This seems relevant:
>>> http://stackoverflow.com/questions/16914324/solr-4-missing-required-field-uuid
>>>
>>> Best,
>>> Erick
>>>
>>> On Thu, Apr 16, 2015 at 11:38 AM, Kaushik <kaushika...@gmail.com> wrote:
>>>> You seem to have defined the field but are not populating it in
>>>> the query. Use a combination of fields to come up with a unique id
>>>> that can be assigned to uuid. Does that make sense?
>>>>
>>>> Kaushik
>>>>
>>>> On Thu, Apr 16, 2015 at 2:25 PM, Vishal Swaroop
>>>> <vishal@gmail.com> wrote:
>>>>> How can I generate a uuid/id (maybe in data-config.xml...) for a
>>>>> table which does not have any primary key?
>>>>>
>>>>> Scenario: Using DIH I need to import data from a database, but
>>>>> the table does not have any primary key. I do have uuid defined
>>>>> in schema.xml as:
>>>>> <field name="uuid" type="uuid" indexed="true" stored="true"
>>>>>   required="true" multiValued="false"/>
>>>>> <uniqueKey>uuid</uniqueKey>
>>>>>
>>>>> data-config.xml:
>>>>> <?xml version="1.0" encoding="UTF-8" ?>
>>>>> <dataConfig>
>>>>>   <dataSource batchSize="2000" name="test" type="JdbcDataSource"
>>>>>     driver="oracle.jdbc.OracleDriver" url="jdbc:oracle:thin:@ldap:"
>>>>>     user="myUser" password="pwd"/>
>>>>>   <document>
>>>>>     <entity name="test_entity" docRoot="true" dataSource="test"
>>>>>       query="select name, age from test_user"></entity>
>>>>>   </document>
>>>>> </dataConfig>
>>>>>
>>>>> Error: Document is missing mandatory uniqueKey field: uuid
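One way to follow the "combination of fields" suggestion without a real primary key is to build the id inside the DIH query itself. A minimal sketch against the Oracle source from this thread (the `||` concatenation and the choice of `name`/`age` are assumptions; collisions remain possible if no column combination is truly unique, and the schema field would need to be a string type as noted above):

```xml
<!-- Sketch only: derive the uniqueKey from concatenated columns -->
<entity name="test_entity" docRoot="true" dataSource="test"
        query="select name || '_' || age as uuid, name, age from test_user">
</entity>
```

On Solr 4.x and later, another option is the UUIDUpdateProcessorFactory update processor, which assigns a random UUID when the field is absent; the trade-off is that every re-import mints new ids, so re-imported rows no longer overwrite their earlier documents.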
Re: generate uuid/ id for table which do not have any primary key
You seem to have defined the field but are not populating it in the query. Use a combination of fields to come up with a unique id that can be assigned to uuid. Does that make sense?

Kaushik

On Thu, Apr 16, 2015 at 2:25 PM, Vishal Swaroop <vishal@gmail.com> wrote:
> How can I generate a uuid/id (maybe in data-config.xml...) for a table
> which does not have any primary key?
>
> Scenario: Using DIH I need to import data from a database, but the
> table does not have any primary key. I do have uuid defined in
> schema.xml as:
> <field name="uuid" type="uuid" indexed="true" stored="true"
>   required="true" multiValued="false"/>
> <uniqueKey>uuid</uniqueKey>
>
> data-config.xml:
> <?xml version="1.0" encoding="UTF-8" ?>
> <dataConfig>
>   <dataSource batchSize="2000" name="test" type="JdbcDataSource"
>     driver="oracle.jdbc.OracleDriver" url="jdbc:oracle:thin:@ldap:"
>     user="myUser" password="pwd"/>
>   <document>
>     <entity name="test_entity" docRoot="true" dataSource="test"
>       query="select name, age from test_user"></entity>
>   </document>
> </dataConfig>
>
> Error: Document is missing mandatory uniqueKey field: uuid
Problem with SOLR Collection creation
Hello,

We have deployed a solr.war file to a WebLogic server. The web.xml has been modified to have the path to the Solr home as follows:

<env-entry>
  <env-entry-name>solr/home</env-entry-name>
  <env-entry-type>java.lang.String</env-entry-type>
  <env-entry-value>D:\SOLR\4.7.0\RegulatoryReview</env-entry-value>
</env-entry>

The deployment of Solr comes up fine. In the D:\SOLR\4.7.0\RegulatoryReview directory we have an RR folder, under which the conf directory with the required config files is present (solrconfig.xml, schema.xml, etc.). But when I try to add the collection to Solr through the admin console, I get the following error:

Thursday, August 28, 2014 10:06:37 AM ERROR SolrCore
org.apache.solr.common.SolrException: Error CREATEing SolrCore 'RegulatoryReview': Unable to create core: RegulatoryReview
Caused by: class org.apache.solr.search.LRUCache

org.apache.solr.common.SolrException: Error CREATEing SolrCore 'RR': Unable to create core: RR
Caused by: class org.apache.solr.search.LRUCache
  at org.apache.solr.handler.admin.CoreAdminHandler.handleCreateAction(CoreAdminHandler.java:546)
  at org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:152)
  at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
  at org.apache.solr.servlet.SolrDispatchFilter.handleAdminRequest(SolrDispatchFilter.java:733)
  at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:268)
  at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:218)
  at weblogic.servlet.internal.FilterChainImpl.doFilter(FilterChainImpl.java:57)
  at weblogic.servlet.internal.WebAppServletContext$ServletInvocationAction.wrapRun(WebAppServletContext.java:3730)
  at weblogic.servlet.internal.WebAppServletContext$ServletInvocationAction.run(WebAppServletContext.java:3696)
  at weblogic.security.acl.internal.AuthenticatedSubject.doAs(AuthenticatedSubject.java:321)
  at weblogic.security.service.SecurityManager.runAs(SecurityManager.java:120)
  at weblogic.servlet.internal.WebAppServletContext.securedExecute(WebAppServletContext.java:2273)
  at weblogic.servlet.internal.WebAppServletContext.execute(WebAppServletContext.java:2179)
  at weblogic.servlet.internal.ServletRequestImpl.run(ServletRequestImpl.java:1490)
  at weblogic.work.ExecuteThread.execute(ExecuteThread.java:256)
  at weblogic.work.ExecuteThread.run(ExecuteThread.java:221)
Caused by: org.apache.solr.common.SolrException: Unable to create core: RR
  at org.apache.solr.core.CoreContainer.recordAndThrow(CoreContainer.java:989)
  at org.apache.solr.core.CoreContainer.create(CoreContainer.java:606)
  at org.apache.solr.handler.admin.CoreAdminHandler.handleCreateAction(CoreAdminHandler.java:509)
  at org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:152)
  at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
  at org.apache.solr.servlet.SolrDispatchFilter.handleAdminRequest(SolrDispatchFilter.java:732)
  at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:268)
  at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:217)
  at weblogic.servlet.internal.FilterChainImpl.doFilter(FilterChainImpl.java:56)
  ... 9 more
Caused by: org.apache.solr.common.SolrException: Could not load config file D:\SOLR\4.7.0\RegulatoryReview\RR\solrconfig.xml
  at org.apache.solr.core.CoreContainer.createFromLocal(CoreContainer.java:530)
  at org.apache.solr.core.CoreContainer.create(CoreContainer.java:597)
  at org.apache.solr.handler.admin.CoreAdminHandler.handleCreateAction(CoreAdminHandler.java:509)
  at org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:152)
  at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
  at org.apache.solr.servlet.SolrDispatchFilter.handleAdminRequest(SolrDispatchFilter.java:733)
  at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:268)
  at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:218)
  at weblogic.servlet.internal.FilterChainImpl.doFilter(FilterChainImpl.java:57)
  ... 9 more
Caused by: java.lang.ClassCastException: class org.apache.solr.search.LRUCache
  at java.lang.Class.asSubclass(Class.java:3027)
  at
Re: Problem with SOLR Collection creation
The issue I was facing was that there were additional libraries on the classpath that were conflicting and not required. Removed those and the problem disappeared.

Thank you,
Kaushik

On Thu, Aug 28, 2014 at 11:50 AM, Shawn Heisey <s...@elyograg.org> wrote:
> On 8/28/2014 8:28 AM, Kaushik wrote:
>> Hello, We have deployed a solr.war file to a WebLogic server. The
>> web.xml has been modified to have the path to the Solr home as
>> follows:
>> <env-entry>
>>   <env-entry-name>solr/home</env-entry-name>
>>   <env-entry-type>java.lang.String</env-entry-type>
>>   <env-entry-value>D:\SOLR\4.7.0\RegulatoryReview</env-entry-value>
>> </env-entry>
>> The deployment of Solr comes up fine. In the
>> D:\SOLR\4.7.0\RegulatoryReview directory we have an RR folder, under
>> which the conf directory with the required config files is present
>> (solrconfig.xml, schema.xml, etc.). But when I try to add the
>> collection to Solr through the admin console, I get the following
>> error:
>> Thursday, August 28, 2014 10:06:37 AM ERROR SolrCore
>> org.apache.solr.common.SolrException: Error CREATEing SolrCore
>> 'RegulatoryReview': Unable to create core: RegulatoryReview
>> Caused by: class org.apache.solr.search.LRUCache
>
> It would seem there's a problem with the cache config in your
> solrconfig.xml, or that there's some kind of problem with the Solr
> jars contained within the war. No testing is done with WebLogic, so
> it's always possible it's a class conflict with WebLogic itself, but I
> would bet on a config problem first.
>
>> The issue, I believe, is that it is trying to find
>> D:\SOLR\4.7.0\RegulatoryReview\RR\solrconfig.xml, ignoring the conf
>> directory in which it should be finding it. What am I doing wrong?
>
> This is SOLR-5814, a bug in the log messages, not the program logic. I
> thought it had been fixed by 4.8, but the issue is still unresolved.
> https://issues.apache.org/jira/browse/SOLR-5814
>
> Thanks,
> Shawn
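For anyone hitting the same ClassCastException: the usual culprit is two copies of the Lucene/Solr classes reachable from different classloaders. One quick way to find candidates is to look for the same artifact under more than one library directory. A self-contained sketch (the /tmp/cp_demo tree and jar names below are made up for illustration; in practice, point `find` at WEB-INF/lib and the container's own lib directories):

```shell
# Build a throwaway stand-in for the server's library folders
# (illustrative names only).
mkdir -p /tmp/cp_demo/server_lib /tmp/cp_demo/webinf_lib
touch /tmp/cp_demo/server_lib/lucene-core-4.7.0.jar
touch /tmp/cp_demo/webinf_lib/lucene-core-4.6.0.jar
touch /tmp/cp_demo/webinf_lib/solr-core-4.7.0.jar

# Strip the version suffix from each jar's basename; any artifact name
# that appears more than once is a classpath-conflict candidate.
result=$(find /tmp/cp_demo -name '*.jar' -exec basename {} \; \
  | sed 's/-[0-9][0-9.]*\.jar$//' | sort | uniq -d)
echo "$result"    # lucene-core

rm -rf /tmp/cp_demo
```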
How to delete documents
From a database table, we have figured out a way to do the full load and the delta loads. However, there are scenarios where some of the DB rows get deleted. How can we have such documents deleted from SOLR indices? Thanks, Kaushik
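Since the full and delta loads already run through DIH, one option is DIH's `deletedPkQuery`, which executes during a delta-import and removes the returned keys from the index. A sketch, assuming the deletions are recorded somewhere queryable (the `item_audit` table and all column names below are invented for illustration):

```xml
<!-- Sketch only: entity, table, and column names are assumptions -->
<entity name="item" pk="id"
        query="select * from item"
        deltaQuery="select id from item
                    where last_modified > '${dataimporter.last_index_time}'"
        deletedPkQuery="select id from item_audit
                    where action = 'DELETE'
                    and deleted_at > '${dataimporter.last_index_time}'">
</entity>
```

If the database rows disappear outright and leave no audit trail, the usual alternative is for the indexing application to send an explicit delete-by-id or delete-by-query (e.g. `<delete><query>...</query></delete>`) to the /update handler, followed by a commit.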
Re: Faceting on multivalued field
Are you suggesting changing the DB query of the nested entity that fetches the comments (the query is in my post), or can something be done during indexing, e.g. using Transformers?

Thanks,
Kaushik

On Mon, Apr 4, 2011 at 8:07 AM, Erick Erickson <erickerick...@gmail.com> wrote:
> Why not count them on the way in and just store that number along with
> the original e-mail?
>
> Best,
> Erick
>
> On Sun, Apr 3, 2011 at 10:10 PM, Kaushik Chakraborty
> <kaych...@gmail.com> wrote:
>> Ok. My expectation was that since comment_post_id is a multiValued
>> field, it would appear multiple times (i.e. once for each comment),
>> and hence faceting on that field would give me the count of
>> occurrences. My requirement is getting a total for every document,
>> i.e. finding the number of comments per post in the whole corpus. To
>> explain it more clearly, I'm getting a result XML something like
>> this:
>>
>> <str name="post_id">46</str>
>> <str name="post_text">Hello World</str>
>> <str name="person_id">20</str>
>> <arr name="comment_id"><str>9</str><str>10</str></arr>
>> <arr name="comment_person_id"><str>19</str><str>2</str></arr>
>> <arr name="comment_post_id"><str>46</str><str>46</str></arr>
>> <arr name="comment_text"><str>Hello - from World</str><str>Hi</str></arr>
>> <lst name="facet_fields">
>>   <lst name="comment_post_id">
>>     *<int name="46">1</int>*
>>
>> I need the count to be 2, as post 46 has 2 comments. What other way
>> can I approach this?
>>
>> Thanks,
>> Kaushik
>>
>> On Mon, Apr 4, 2011 at 4:29 AM, Erick Erickson
>> <erickerick...@gmail.com> wrote:
>>> Hmmm, I think you're misunderstanding faceting. It's counting the
>>> number of documents that have a particular value. So if you're
>>> faceting on comment_post_id, there is one and only one document
>>> with that value (assuming that the comment_post_ids are unique),
>>> which is what's being reported. This will be quite expensive on a
>>> large corpus, BTW. Is your task to show the totals for *every*
>>> document in your corpus, or just the ones in a display page?
>>> Because if the latter, your app could just count up the number of
>>> elements in the XML returned for the multiValued comments field. If
>>> that's not relevant, could you explain a bit more why you need this
>>> count?
>>>
>>> Best,
>>> Erick
>>>
>>> On Sun, Apr 3, 2011 at 2:31 PM, Kaushik Chakraborty
>>> <kaych...@gmail.com> wrote:
>>>> Hi, My index contains a root entity Post and a child entity
>>>> Comments. Each post can have multiple comments.
>>>>
>>>> data-config.xml:
>>>> <document>
>>>>   <entity name="posts" transformer="TemplateTransformer"
>>>>     dataSource="jdbc" query="">
>>>>     <field column="post_id"/>
>>>>     <field column="post_text"/>
>>>>     <field column="person_id"/>
>>>>     <entity name="comments" dataSource="jdbc"
>>>>       query="select * from comments where post_id = ${posts.post_id}">
>>>>       <field column="comment_id"/>
>>>>       <field column="comment_text"/>
>>>>       <field column="comment_person_id"/>
>>>>       <field column="comment_post_id"/>
>>>>     </entity>
>>>>   </entity>
>>>> </document>
>>>>
>>>> The schema has all columns of the comment entity as multiValued
>>>> fields, and all fields are indexed and stored. My requirement is
>>>> to count the number of comments for each post. The approach I'm
>>>> taking is to query on *:* and facet the result on comment_post_id
>>>> so that it gives the count of comments for that post. But I'm
>>>> getting an incorrect result: if a post has 2 comments, the
>>>> multivalued fields are populated all right, but the facet count
>>>> comes out as 1 (for that post_id). What else do I need to do?
>>>>
>>>> Thanks,
>>>> Kaushik
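Erick's "count them on the way in" suggestion can also be done in the parent entity's SQL rather than with a transformer; a sketch reusing the thread's table names (the correlated subquery and the `comment_count` field are assumptions, and the field would also need to be declared in schema.xml as a stored integer type):

```xml
<!-- Sketch only: compute the per-post comment count at index time -->
<entity name="posts" dataSource="jdbc"
        query="select p.*,
               (select count(*) from comments c
                where c.post_id = p.post_id) as comment_count
               from posts p">
  <field column="comment_count"/>
  <!-- existing post and nested comment fields as before -->
</entity>
```

The count is then an ordinary field on each post document, so no faceting is needed to display it.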
Faceting on multivalued field
Hi,

My index contains a root entity Post and a child entity Comments. Each post can have multiple comments.

data-config.xml:
<document>
  <entity name="posts" transformer="TemplateTransformer"
    dataSource="jdbc" query="">
    <field column="post_id"/>
    <field column="post_text"/>
    <field column="person_id"/>
    <entity name="comments" dataSource="jdbc"
      query="select * from comments where post_id = ${posts.post_id}">
      <field column="comment_id"/>
      <field column="comment_text"/>
      <field column="comment_person_id"/>
      <field column="comment_post_id"/>
    </entity>
  </entity>
</document>

The schema has all columns of the comment entity as multiValued fields, and all fields are indexed and stored. My requirement is to count the number of comments for each post. The approach I'm taking is to query on *:* and facet the result on comment_post_id so that it gives the count of comments for that post. But I'm getting an incorrect result: if a post has 2 comments, the multivalued fields are populated all right, but the facet count comes out as 1 (for that post_id). What else do I need to do?

Thanks,
Kaushik
Re: SOLR DIH importing MySQL text column as a BLOB
The query is there in the data-config.xml, and it fetches as expected from the database.

Thanks,
Kaushik

On Wed, Mar 16, 2011 at 9:21 PM, Gora Mohanty <g...@mimirtech.com> wrote:
> On Wed, Mar 16, 2011 at 2:29 PM, Stefan Matheis
> <matheis.ste...@googlemail.com> wrote:
>> Kaushik, I just remembered an ML post from a few weeks ago with the
>> same problem while importing geo-data
>> (http://lucene.472066.n3.nabble.com/Solr-4-0-Spatial-Search-How-to-tp2245592p2254395.html).
>> The solution was: CAST( CONCAT( lat, ',', lng ) AS CHAR ). At that
>> time I searched a little bit for the reason, and AFAIK there was a
>> bug in mysql/jdbc which produces that binary output under certain
>> conditions [...]
>
> As Stefan mentions, there might be a way to solve this. Could you show
> us the query in DIH that you are using when you get this BLOB, i.e.,
> the SELECT statement that goes to the database? It might also be
> instructive for you to try that same SELECT directly in a mysql
> interface.
>
> Regards,
> Gora
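Following Stefan's suggestion, the usual fix is to CAST the MySQL `text` columns to CHAR in the DIH SELECT, so the JDBC driver hands Solr strings instead of byte arrays. A sketch against the columns from this thread, abbreviated to the two text-typed columns (that `message` and `bio` are the only `text` columns is an assumption about the table definitions):

```xml
<!-- Sketch only: CAST text columns so they arrive as strings -->
<entity name="posts" dataSource="jdbc"
        query="select CAST(p.message AS CHAR) as solr_post_message,
               CAST(pr.bio AS CHAR) as solr_post_bio,
               p.id as solr_post_status_message_id
               from posts p, profiles pr
               where p.person_id = pr.person_id">
  <!-- remaining columns and <field> mappings as in the original config -->
</entity>
```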
SOLR DIH importing MySQL text column as a BLOB
I have a column for posts in MySQL of type `text`, and I've tried the corresponding field types for it in Solr's `schema.xml`, e.g. `string`, `text`, `text_ws`. But whenever I import it using the DIH, it gets imported as a BLOB object. I checked: this happens only for columns of type `text` and not for `varchar` (those get indexed as strings). Hence the posts field is not searchable. I found out about this issue, after repeated search failures, when I did a `*:*` query on Solr. A sample response:

<result name="response" numFound="223" start="0" maxScore="1.0">
  <doc>
    <float name="score">1.0</float>
    <str name="solr_post_bio">[B@10a33ce2</str>
    <date name="solr_post_created_at">2011-02-21T07:02:55Z</date>
    <str name="solr_post_email">test.acco...@gmail.com</str>
    <str name="solr_post_first_name">Test</str>
    <str name="solr_post_last_name">Account</str>
    <str name="solr_post_message">[B@2c93c4f1</str>
    <str name="solr_post_status_message_id">1</str>
  </doc>

The `data-config.xml`:

<document>
  <entity name="posts" dataSource="jdbc"
    query="select p.person_id as solr_post_person_id,
           pr.first_name as solr_post_first_name,
           pr.last_name as solr_post_last_name,
           u.email as solr_post_email,
           p.message as solr_post_message,
           p.id as solr_post_status_message_id,
           p.created_at as solr_post_created_at,
           pr.bio as solr_post_bio
           from posts p, users u, profiles pr
           where p.person_id = u.id and p.person_id = pr.person_id
           and p.type='StatusMessage'">
    <field column="solr_post_person_id"/>
    <field column="solr_post_first_name"/>
    <field column="solr_post_last_name"/>
    <field column="solr_post_email"/>
    <field column="solr_post_message"/>
    <field column="solr_post_status_message_id"/>
    <field column="solr_post_created_at"/>
    <field column="solr_post_bio"/>
  </entity>
</document>

The `schema.xml`:

<fields>
  <field name="solr_post_status_message_id" type="string" indexed="true"
    stored="true" required="true"/>
  <field name="solr_post_message" type="text_ws" indexed="true"
    stored="true" required="true"/>
  <field name="solr_post_bio" type="text" indexed="false" stored="true"/>
  <field name="solr_post_first_name" type="string" indexed="false" stored="true"/>
  <field name="solr_post_last_name" type="string" indexed="false" stored="true"/>
  <field name="solr_post_email" type="string" indexed="false" stored="true"/>
  <field name="solr_post_created_at" type="date" indexed="false" stored="true"/>
</fields>
<uniqueKey>solr_post_status_message_id</uniqueKey>
<defaultSearchField>solr_post_message</defaultSearchField>

Thanks,
Kaushik