Re: Use a different folder for schema.xml
You can include one XML file in another, something like:

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE document [
  <!ENTITY resourcedb SYSTEM "file:/some/absolute/path/a.xml">
]>
<resource>
  <childofb>&resourcedb;</childofb>
</resource>

- Ravish

On Wed, Aug 22, 2012 at 8:56 AM, Alexander Cougarman acoug...@bwc.org wrote: Thanks, Lance. Please forgive my ignorance, but what do you mean by soft links/XML include feature? Can you provide an example? Thanks again. Sincerely, Alex

-----Original Message----- From: Lance Norskog [mailto:goks...@gmail.com] Sent: 22 August 2012 9:55 AM To: solr-user@lucene.apache.org Subject: Re: Use a different folder for schema.xml

It is possible to store the entire conf/ directory somewhere. To store only the schema.xml file, try soft links or the XML include feature: conf/schema.xml includes from somewhere else.

On Tue, Aug 21, 2012 at 11:31 PM, Alexander Cougarman acoug...@bwc.org wrote: Hi. For our Solr instance, we need to put the schema.xml file in a different location than where it resides now. Is this possible? Thanks. Sincerely, Alex

-- Lance Norskog goks...@gmail.com
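A minimal sketch of the XInclude approach Lance mentions, assuming the XML parser in use supports XInclude (Solr's config loading does in 1.4+); the file name and path below are placeholders, not from this thread:

```xml
<!-- conf/schema.xml: pull shared pieces from an external location. -->
<schema name="example" version="1.2"
        xmlns:xi="http://www.w3.org/2001/XInclude">
  <xi:include href="file:///shared/solr/types-and-fields.xml"/>
</schema>
```

Compared with the entity trick, XInclude does not require a DOCTYPE declaration, at the cost of depending on parser support.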
Re: Solr - case-insensitive search does not work
<filter class="solr.LowerCaseFilterFactory"/> is already present in your field type definition (it appears twice now). Are you adding quotes around your query by any chance? Ravish

On Wed, Aug 22, 2012 at 11:31 AM, meghana meghana.rav...@amultek.com wrote: I want to apply case-insensitive search for the field *myfield* in Solr. I googled a bit and found that I need to apply *LowerCaseFilterFactory* to the field type, and the field should be of solr.TextField. I applied that in my *schema.xml* and re-indexed the data, but my search still seems to be case-sensitive. Below is the search that I perform:

http://localhost:8080/solr/select?q=myfield:"cloud university"&hl=on&hl.snippets=99&hl.fl=myfield

Below is the definition for the field type:

<fieldType name="text_en_splitting" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_en.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_en.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>

and below is my field definition:

<field name="myfield" type="text_en_splitting" indexed="true" stored="true"/>

Not sure what is wrong with this. Please help me resolve it. Thanks -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-case-insensitive-search-do-not-work-tp4002605.html Sent from the Solr - User mailing list archive at Nabble.com.
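One way to narrow this down (a sketch, not from the thread: the type and field names below are made up) is to copy the field into a deliberately minimal lowercase-only type; if that copy matches case-insensitively, the problem is in the richer text_en_splitting chain or in the query syntax rather than in Solr itself:

```xml
<fieldType name="text_lower_debug" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="myfield_lower" type="text_lower_debug" indexed="true" stored="false"/>
<copyField source="myfield" dest="myfield_lower"/>
```

Re-index once after adding the copyField, then compare q=myfield_lower:... against q=myfield:... for the same terms.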
Re: Solr - case-insensitive search does not work
OK. Try without quotes, like myfield:cloud+university, and see if it has any effect. Also, try both queries with debugging turned on and post the output (http://wiki.apache.org/solr/CommonQueryParameters#Debugging). It must be a field configuration issue, or the double quotes are causing some analyzers not to run on your query. Hope this helps. Ravish On Wed, Aug 22, 2012 at 12:11 PM, meghana meghana.rav...@amultek.com wrote: @Ravish Bhagdev, Yes, I am adding double quotes around my search, as shown in my post. Like: myfield:"cloud university" -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-case-insensitive-search-do-not-work-tp4002605p4002610.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr - case-insensitive search does not work
Also, try comparing your field configuration to Solr's default text field and see if you can spot any differences. Ravish On Wed, Aug 22, 2012 at 1:09 PM, Ravish Bhagdev ravish.bhag...@gmail.com wrote: OK. Try without quotes, like myfield:cloud+university, and see if it has any effect. Also, try both queries with debugging turned on and post the output (http://wiki.apache.org/solr/CommonQueryParameters#Debugging). It must be a field configuration issue, or the double quotes are causing some analyzers not to run on your query. Hope this helps. Ravish On Wed, Aug 22, 2012 at 12:11 PM, meghana meghana.rav...@amultek.com wrote: @Ravish Bhagdev, Yes, I am adding double quotes around my search, as shown in my post. Like: myfield:"cloud university" -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-case-insensitive-search-do-not-work-tp4002605p4002610.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr - case-insensitive search does not work
Did you see my message about the debugging parameters? Try that and see what's happening behind the scenes. I can confirm that by default queries are NOT case-sensitive. Ravish On Wed, Aug 22, 2012 at 2:45 PM, meghana meghana.rav...@amultek.com wrote: Hi Ravish, the definition for text_en_splitting in the Solr default schema and mine are the same.. still it's not working... any idea? -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-case-insensitive-search-do-not-work-tp4002605p4002645.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr Score threshold 'reasonably', independent of results returned
Commercial solutions often expose a percentage that is meant to signify the quality of a match. Solr's score is relative, and you cannot tell just by looking at the value whether a result is relevant enough to be on the first page. The score depends on what else is in the index, so it is not easy to normalize in the way you suggest. Ravish On Wed, Aug 22, 2012 at 4:03 PM, Mou mouna...@gmail.com wrote: Hi, I think this totally depends on your requirements and is thus applicable per user scenario. The score does not have any absolute meaning; it is always relative to the query. If you want to watch some particular queries and show results with scores above a previously set threshold, you can use this. If I always have that x% threshold in place, there may be many queries which would not return anything, and I certainly do not want that. -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-Score-threshold-reasonably-independent-of-results-returned-tp4002312p4002673.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: No Effect of omitNorms and omitTermFreqAndPositions when using MLT handler?
Ahh, is this because I have to override DefaultSimilarity to turn off tf/idf scoring? But that will apply to all fields, and to general search on text fields as well. Is there a way to apply a custom similarity to specific field types or fields only? Is there no way of turning TF/IDF off without this? Thanks, Ravish

On Mon, May 21, 2012 at 10:24 AM, Ravish Bhagdev ravish.bhag...@gmail.com wrote: Hi All, I was wondering if omitNorms has any effect on the MLT handler at all? I'm using schema version 1.2 with Solr 1.4 and have defined a couple of fields which I want to use for MLT lookup, and I don't want factors like field length or TF/IDF to affect the scores. The definitions are as below:

<fieldType name="lowercase" class="solr.TextField" positionIncrementGap="100" omitNorms="true" omitTermFreqAndPositions="true">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<fieldType name="text_nonorms" class="solr.TextField" positionIncrementGap="100" omitNorms="true" omitTermFreqAndPositions="true">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

<!-- and the fields that use the above field types are -->
<field name="PROFILE_TAGS" type="lowercase" indexed="true" stored="true" multiValued="true" termVectors="true"/>
<field name="PROFILE_TAGS_TXT" type="text_nonorms" indexed="true" stored="true" multiValued="true" termVectors="true"/>

In my solrconfig.xml I have defined the following for my MLT request handler:

<requestHandler name="/mlt" class="solr.MoreLikeThisHandler">
  <lst name="defaults">
    <str name="mlt.fl">PROFILE_TAGS,PROFILE_TAGS_TXT</str>
    <str name="mlt.qf">PROFILE_TAGS^10.0 PROFILE_TAGS_TXT^2.0</str>
    <int name="mlt.mindf">1</int>
    <int name="mlt.mintf">1</int>
    <str name="fl">id,score</str>
    <str name="mlt.fl">PROFILE_TAGS,PROFILE_TAGS_TXT</str>
  </lst>
</requestHandler>

However, when I run my query as follows:

http://localhost:9090/solr/mlt?fl=*,score&start=0&q=id:4417454.matchRecord&qt=/mlt&fq=targetDB:ConnectMeDB&rows=1000&debugQuery=on

the debug scoring info shows the following:

<str name="5042172.matchRecord">
0.17156276 = (MATCH) product of:
  1.4296896 = (MATCH) sum of:
    0.24737607 = (MATCH) weight(PROFILE_TAGS_TXT:system^5.0 in 1472), product of:
      0.06376338 = queryWeight(PROFILE_TAGS_TXT:system^5.0), product of:
        5.0 = boost
        3.8795946 = idf(docFreq=538, maxDocs=9598)
        0.0032871156 = queryNorm
      3.8795946 = (MATCH) fieldWeight(PROFILE_TAGS_TXT:system in 1472), product of:
        1.0 = tf(termFreq(PROFILE_TAGS_TXT:system)=1)
        3.8795946 = idf(docFreq=538, maxDocs=9598)
        1.0 = fieldNorm(field=PROFILE_TAGS_TXT, doc=1472)
    0.65193653 = (MATCH) weight(PROFILE_TAGS_TXT:adapt^5.0 in 1472), product of:
      0.10351306 = queryWeight(PROFILE_TAGS_TXT:adapt^5.0), product of:
        5.0 = boost
        6.298109 = idf(docFreq=47, maxDocs=9598)
        0.0032871156 = queryNorm
      6.298109 = (MATCH) fieldWeight(PROFILE_TAGS_TXT:adapt in 1472), product of:
        1.0 = tf(termFreq(PROFILE_TAGS_TXT:adapt)=1)
        6.298109 = idf(docFreq=47, maxDocs=9598)
        1.0 = fieldNorm(field=PROFILE_TAGS_TXT, doc=1472)
    0.530377 = (MATCH) weight(PROFILE_TAGS_TXT:optic^5.0 in 1472), product of:
      0.093365155 = queryWeight(PROFILE_TAGS_TXT:optic^5.0), product of:
        5.0 = boost
        5.6806736 = idf(docFreq=88, maxDocs=9598)
        0.0032871156 = queryNorm
      5.6806736 = (MATCH) fieldWeight(PROFILE_TAGS_TXT:optic in 1472), product of:
        1.0 = tf(termFreq(PROFILE_TAGS_TXT:optic)=1)
        5.6806736 = idf(docFreq=88, maxDocs=9598)
        1.0 = fieldNorm(field=PROFILE_TAGS_TXT, doc=1472)
  0.12 = coord(3/25)
</str>

This seems to suggest that TF/IDF is being performed on these fields! Also, does it make any difference whether I specify omitNorms in the field definition vs. the fieldType definition? I will appreciate any help with this. Thanks, Ravish
Re: No Effect of omitNorms and omitTermFreqAndPositions when using MLT handler?
I found this: https://issues.apache.org/jira/browse/LUCENE-2236 So it seems this feature is not supported in Solr 1.4 at all. Is there any possible workaround? If not, I'll have to consider splitting my schema into two, which would be quite a big change :( - Ravish

On Mon, May 21, 2012 at 11:03 AM, Ravish Bhagdev ravish.bhag...@gmail.com wrote: Ahh, is this because I have to override DefaultSimilarity to turn off tf/idf scoring? But that will apply to all fields, and to general search on text fields as well. Is there a way to apply a custom similarity to specific field types or fields only? Is there no way of turning TF/IDF off without this? Thanks, Ravish

On Mon, May 21, 2012 at 10:24 AM, Ravish Bhagdev ravish.bhag...@gmail.com wrote: Hi All, I was wondering if omitNorms has any effect on the MLT handler at all? I'm using schema version 1.2 with Solr 1.4 and have defined a couple of fields which I want to use for MLT lookup, and I don't want factors like field length or TF/IDF to affect the scores. ...
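For reference, per-field similarity did eventually land: later Solr releases (4.x onward) accept a <similarity> element inside a fieldType, provided the schema's global similarity is a SchemaSimilarityFactory. This is only the eventual upgrade path, not available in 1.4; the fieldType below is a sketch with made-up names:

```xml
<!-- Solr 4.x+ only: per-fieldType similarity. -->
<fieldType name="tags_flat" class="solr.TextField" omitNorms="true">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <similarity class="solr.BM25SimilarityFactory"/>
</fieldType>
```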
Re: A tool for frequent re-indexing...
Thanks. This is useful to know as well. I was actually after the SolrEntityProcessor (http://wiki.apache.org/solr/DataImportHandler#SolrEntityProcessor), which I failed to notice until it was pointed out in a previous reply, because I'm still on 1.4. Cheers, Ravish

On Fri, Apr 6, 2012 at 11:01 AM, Valeriy Felberg valeri.felb...@gmail.com wrote: I've implemented something like what is described in https://issues.apache.org/jira/browse/SOLR-3246. The idea is to add an update request processor at the end of the update chain in the core you want to copy. The processor converts the SolrInputDocument to XML (there is a utility method for doing this) and dumps the XML into a file which can be fed into Solr again with curl. If you have many documents, you will probably want to distribute the XML files into different directories using some common prefix in the id field.

On Fri, Apr 6, 2012 at 5:18 AM, Ahmet Arslan iori...@yahoo.com wrote: I am considering writing a small tool that would read from one Solr core and write to another as a means of quickly re-indexing data. I have a large-ish set (hundreds of thousands) of documents that I've already parsed with Tika, and I keep changing bits and pieces in the schema and config to try new things. Instead of having to go through the process of re-indexing from docs (and some DBs), I thought it may be much faster to just read from one core and write into a new core with the new schema, analysers and/or settings. I was wondering if anyone else has done anything similar already? It would be handy if I can use this sort of thing to spin off another core, write to it, and then swap the two cores, discarding the older one. You might find these relevant: https://issues.apache.org/jira/browse/SOLR-3246 http://wiki.apache.org/solr/DataImportHandler#SolrEntityProcessor
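For anyone on a newer release, a sketch of what the SolrEntityProcessor route looks like (Solr 3.6+); the URL, query, and rows values are placeholders:

```xml
<!-- data-config.xml for a DataImportHandler that reads every document
     from an existing core and re-indexes it against the new schema. -->
<dataConfig>
  <document>
    <entity name="source"
            processor="SolrEntityProcessor"
            url="http://localhost:8983/solr/oldcore"
            query="*:*"
            rows="500"/>
  </document>
</dataConfig>
```

Note that this copies only stored fields, so it works for the swap-cores workflow above only if everything you index is also stored.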
Re: pagerank??
You might want to look into Nutch and its LinkRank instead of Solr for this. To obtain such information, you need a crawler to crawl through the links; that is not what Solr is meant for. Rav On Wed, Apr 4, 2012 at 8:46 AM, Bing Li lbl...@gmail.com wrote: According to my knowledge, Solr cannot support this. In my case, I get data by keyword matching from Solr and then rank the data by PageRank after that. Thanks, Bing On Wed, Apr 4, 2012 at 6:37 AM, Manuel Antonio Novoa Proenza mano...@estudiantes.uci.cu wrote: Hello, I have many indexed documents in my Solr index. Please let me know of any efficient way to calculate the PageRank of the indexed websites. 10mo. ANIVERSARIO DE LA CREACION DE LA UNIVERSIDAD DE LAS CIENCIAS INFORMATICAS... CONECTADOS AL FUTURO, CONECTADOS A LA REVOLUCION http://www.uci.cu http://www.facebook.com/universidad.uci http://www.flickr.com/photos/universidad_uci
Re: Incrementally updating a VERY LARGE field - Is this possible?
Updating a single field is not possible in Solr; the whole record has to be rewritten. 300 MB is still not that big a file. Have you tried doing the indexing (if it's only a one-time thing) by giving it ~2 GB of -Xmx? A single file of that size is unusual! May I ask what it is? Rav On Tue, Apr 3, 2012 at 7:32 PM, vybe3142 vybe3...@gmail.com wrote: Some days ago, I posted about an issue with SOLR running out of memory when attempting to index large text files (say 300 MB). Details at http://lucene.472066.n3.nabble.com/Solr-Tika-crashing-when-attempting-to-index-large-files-td3846939.html Two things I need to point out: 1. I don't need Tika for content extraction, as the files are already in plain text format. 2. The heap space error was caused by a futile Tika/SOLR attempt at creating the corresponding huge XML document in memory. I've decided to develop a custom handler that 1. reads the file text directly, and 2. attempts to create a SOLR document and directly add the text data to the corresponding field. One approach I've taken is to read manageable chunks of text data sequentially from the file and process them. We've used this approach successfully with Lucene in the past and I'm attempting to make it work with SOLR too. I got most of the work done yesterday, but need a bit of guidance w.r.t. point 2: how can I update the same field multiple times? Looking at the SOLR source, processor.addField() merely a. adds to the in-memory field map and b. attempts to write EVERYTHING to the index later on. In my situation, (a) eventually causes a heap space error. Here's part of the handler code. Thanks much -- View this message in context: http://lucene.472066.n3.nabble.com/Incremantally-updating-a-VERY-LARGE-field-Is-this-possibe-tp3881945p3881945.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Incrementally updating a VERY LARGE field - Is this possible?
Yes, I think there are good reasons why it works like that. The focus of a search system is to be efficient on the query side, at the cost of being less efficient on storage. Note, however, that by default a field is truncated to the first 10,000 tokens (maxFieldLength in solrconfig.xml), which you may also need to change. But I guess if it's going out of memory you might have already done this? Ravish On Wed, Apr 4, 2012 at 1:34 PM, Mikhail Khludnev mkhlud...@griddynamics.com wrote: There is https://issues.apache.org/jira/browse/LUCENE-3837 but I suppose it's too far from completion. On Wed, Apr 4, 2012 at 2:48 PM, Ravish Bhagdev ravish.bhag...@gmail.com wrote: Updating a single field is not possible in Solr; the whole record has to be rewritten. 300 MB is still not that big a file. Have you tried doing the indexing (if it's only a one-time thing) by giving it ~2 GB of -Xmx? A single file of that size is unusual! May I ask what it is? Rav On Tue, Apr 3, 2012 at 7:32 PM, vybe3142 vybe3...@gmail.com wrote: Some days ago, I posted about an issue with SOLR running out of memory when attempting to index large text files (say 300 MB). Details at http://lucene.472066.n3.nabble.com/Solr-Tika-crashing-when-attempting-to-index-large-files-td3846939.html Two things I need to point out: 1. I don't need Tika for content extraction, as the files are already in plain text format. 2. The heap space error was caused by a futile Tika/SOLR attempt at creating the corresponding huge XML document in memory. I've decided to develop a custom handler that 1. reads the file text directly, and 2. attempts to create a SOLR document and directly add the text data to the corresponding field. One approach I've taken is to read manageable chunks of text data sequentially from the file and process them. We've used this approach successfully with Lucene in the past and I'm attempting to make it work with SOLR too. I got most of the work done yesterday, but need a bit of guidance w.r.t. point 2.
How can I achieve updating the same field multiple times? Looking at the SOLR source, processor.addField() merely a. adds to the in-memory field map and b. attempts to write EVERYTHING to the index later on. In my situation, (a) eventually causes a heap space error. Here's part of the handler code. Thanks much -- View this message in context: http://lucene.472066.n3.nabble.com/Incremantally-updating-a-VERY-LARGE-field-Is-this-possibe-tp3881945p3881945.html Sent from the Solr - User mailing list archive at Nabble.com. -- Sincerely yours Mikhail Khludnev ge...@yandex.ru http://www.griddynamics.com mkhlud...@griddynamics.com
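The truncation cap mentioned in this thread lives in solrconfig.xml. A sketch of raising the pre-4.0 knob (it was later removed in Solr 4.0 in favour of LimitTokenCountFilterFactory):

```xml
<!-- solrconfig.xml (Solr 1.x/3.x): tokens beyond this count are silently
     dropped at index time; the default is 10000. -->
<maxFieldLength>2147483647</maxFieldLength>
```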
Re: Search for library returns 0 results, but search for marion library returns many results
Yes, can you check whether the results you get for marion library match on marion or on library? By default Solr uses OR between words (the default operator is specified in schema.xml). You can also easily check this by enabling highlighting. Ravish On Wed, Apr 4, 2012 at 4:11 PM, Joshua Sumali jsum...@kobo.com wrote: Did you try appending debugQuery=on to get more information? -----Original Message----- From: Sean Adams-Hiett [mailto:s...@advantage-companies.com] Sent: Wednesday, April 04, 2012 10:43 AM To: solr-user@lucene.apache.org Subject: Search for library returns 0 results, but search for marion library returns many results This is cross-posted on Drupal.org: http://drupal.org/node/1515046 Summary: I have a fairly clean install of Drupal 7 with Apachesolr-1.0-beta18. I have created a content type called document with a number of fields. I am working with 30k+ records, most of which are related to Marion, IA in some way. A search for library (without the quotes) returns no results, while a search for marion library returns thousands of results. That doesn't make any sense to me at all. Details:
- Drupal 7 (latest stable version)
- Apachesolr-1.0-beta18
- Custom content type with many fields
- LAMP stack running on a CentOS Linode
- PHP 5.2.x
I also checked this through the Solr admin interface, running the same searches with similar results, so I can't rule out the possibility that something is configured wrong... but since I am using the solrconfig.xml and schema.xml files provided with the modules, it is also possible that the issue lies there. I have watched the logs, and during the searches that produce no results but should, there is no output in the log besides the regular [INFO] lines about the query. I am stumped and past a deadline with this project, so any help would be greatly appreciated. -- Sean Adams-Hiett Director of Development The Advantage Companies s...@advantage-companies.com www.advantage-companies.com
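For completeness, the default operator is set in schema.xml; a sketch of forcing AND semantics so multi-word queries require all terms:

```xml
<!-- schema.xml: with AND as the default operator, a query for
     marion library only matches documents containing both terms. -->
<solrQueryParser defaultOperator="AND"/>
```

This would make the two-word query behave like the one-word query rather than explain the missing library matches, but it rules the operator in or out quickly.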
Re: Tags and Folksonomies
Hi Hoss, I am not sure why you suggest payloads for ranking documents with more frequent tags above those with fewer tags. Won't the term-frequency part of the relevancy score ensure this by default? If you make tags a 'lowercase' field (with full-value tokenisation), the frequency of tags in the multivalued field should improve the score for doc A in the scenario below. Payloads, I thought, would be more useful when you want some tags in a record to be weighted more than others? Or maybe I have missed some point. Thanks, Rav On Tue, Apr 3, 2012 at 1:02 AM, Chris Hostetter hossman_luc...@fucit.org wrote: : Suppose I have content which has title and description. Users can tag content : and search content based on tag, title and description. Tag has more : weightage. : : Any inputs on how indexing and retrieval will work given there is content : and tags using Solr? Has anyone implemented search based on collaborative : tagging? The simple approach would be to have your 3 fields and search them with weighted boosting, giving more importance to the tag field. Where things get more complicated is when you want docA to score higher for the query boat than docB because 100 users have tagged docA with boat, but only 5 users have tagged docB boat. The canonical way to deal with this would be using payloads to boost the weight of a term -- the DelimitedPayloadTokenFilterFactory can help with this at index time, but off the top of my head I don't think any of the existing Solr QParsers will build the necessary PayloadTermQuery, so you might have to roll your own -- there are a few Jira issues with patches that you might be able to re-use or get inspired by... https://issues.apache.org/jira/browse/SOLR-1485 -Hoss
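A sketch of the index-time half of the payload approach Hoss describes (the type name is made up, and a query parser that builds PayloadTermQuery is still needed at search time, as he notes):

```xml
<fieldType name="tags_payload" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Tags are indexed as term|weight, e.g. house|100.0, so the tag
         count becomes a payload instead of a repeated term. -->
    <filter class="solr.DelimitedPayloadTokenFilterFactory"
            delimiter="|" encoder="float"/>
  </analyzer>
</fieldType>
```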
Re: Apache Solr not indexing complete PDF file using Tika
I'd also suggest extracting the text with tika-app (shipped with the Tika distribution as an executable jar, e.g. java -jar tika-app.jar --text file.pdf) on the PDF(s) in question, to see whether the problem is with extraction or with indexing. Rav On Mon, Apr 2, 2012 at 1:55 PM, Erick Erickson erickerick...@gmail.com wrote: You can index 2B tokens, so upping maxFieldLength should have fixed your problem, at least as far as Solr is concerned. How many tokens get indexed? I'm not as familiar with Tika, but there may be some kind of parameter there (although I don't remember this coming up before)... Did you restart Solr after making the change to solrconfig.xml? If you're seeing 10,000 tokens or so, that's the default for maxFieldLength. I'd recommend stopping Solr, running rm -rf <solr home>/data/index, and restarting Solr just to be sure you're not seeing leftover junk; you'll have to re-index your docs after changing the maxFieldLength param. Best Erick On Mon, Apr 2, 2012 at 7:19 AM, Manoj Saini manoj.sa...@stigasoft.com wrote: Hello Guys, I am using Apache Solr 3.3.0 with Tika 1.0. I have PDF files which I am pushing into Solr for content searching. Solr is indexing the PDF files and I can see them in the Solr admin interface for search. But the issue is that Solr is not indexing the whole file content; it indexes only up to a limited size. Am I missing something, some configuration, or is this the behavior of Solr? I have tried updating solrconfig.xml: I updated ramBufferSizeMB and maxFieldLength. Thanks Manoj Saini Thanks, Best Regards, Manoj Saini | Sr. Software Engineer | Stigasoft m: +91 98 1034 1281 | e: manoj.sa...@stigasoft.com | w: www.stigasoft.com
Re: Tags and Folksonomies
OK, yes, that's true. Although I'd expect term vectors to just increment the term count when a tag is re-applied (if you have term vectors enabled), increasing a boost stored as a payload each time an existing tag is re-applied may be a more sensible approach if that is the case. You'll still have to rewrite the whole record for this, though, as it's not possible to 'update' a specific field value in Solr, for efficiency reasons. Rav On Tue, Apr 3, 2012 at 4:50 PM, Chris Hostetter hossman_luc...@fucit.org wrote: : I am not sure why you suggest Payload for ranking documents with more : frequent tags above those with fewer tags. Wont the term frequency part of : relevancy score ensure this by default? If you make tags a 'lowercase' Sorry, yes... absolutely - if you use omitNorms=false on the tags field, and add these two docs... { id: doc1; tags: [house, house, house, boat] } { id: doc2; tags: [house, boat, car, vegas] } ...then doc1 will score higher on a query for tags:house. My suggestion to use payloads was because sending the same value many, many times (ie: if 100,000 users apply the tag house, you would need to index that doc with the word house repeated 100,000 times) can be prohibitive. -Hoss
Re: ExtractingRequestHandler
(A bit off-topic, but...) I understand that Solr isn't meant to 'store' everything, but because highlighting matches requires a field to be stored, I would expect most people end up storing full document content in their indexes? I can't think of any good workaround for this... Rav On Sun, Apr 1, 2012 at 6:15 PM, Erick Erickson erickerick...@gmail.com wrote: Yes, you can, but... Generally, storing the raw input in Solr is not the best approach. The problem is that pretty soon you get a huge index that contains *everything*. Solr was not intended to be a data store. Besides, you would then need to store the binary form of the file; Solr only deals with text, not markup. Most people index the text in Solr, plus enough information for the application to know where to fetch the original document when the user drills down (e.g. file path, database PK, etc.). Would that work for your situation? Best Erick On Sat, Mar 31, 2012 at 3:55 PM, spr...@gmx.eu wrote: Hi, I want to index various file types in Solr; this can easily be done with ExtractingRequestHandler. But I also need the extracted content back. I know ext.extract.only, but then nothing gets indexed, right? Can I index the document AND get the content back as with ext.extract.only? In a single request? Thank you
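A sketch of the two-request workaround implied here: call the extracting handler in extract-only mode to get the text back, then post that text yourself so it is both indexed and stored (and therefore highlightable). The id and field names below are placeholders:

```xml
<!-- Second request: a plain XML update posting the text returned by the
     extract-only call into a stored field. -->
<add>
  <doc>
    <field name="id">doc-42</field>
    <field name="content">...text returned by the extract-only request...</field>
  </doc>
</add>
```

The cost is extracting each document once but sending its text over the wire twice.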
Re: Position Solr results
Hi, I don't believe Solr has anything built in that will do this for you. You will likely have to just get the IDs and lookup at what position the ID you are referring to occurs (using Java or other programming language/scripts). Rav On Sun, Apr 1, 2012 at 5:54 PM, Manuel Antonio Novoa Proenza mano...@estudiantes.uci.cu wrote: hi Marcelo In that sense I think the score does not help. The score is a number that I determined at that position results generated are a given site. For example : I perform the following query : q = university Solr generates several results among which is that of a certain website. Does solr some mechanism to let me know that posción is this result? I reiterate that my English is very bad so I use a translator , anyway then send you what I mean in Spanish. thank you very much Manuel hola Marcelo En ese sentido creo que el score no me sirve. El score es un numero que no me determina en que posición de los resultados generados se encuentra un determinado sitio. Por ejemplo: Yo realizo la siguiente consulta: q= universidad Solr genera varios resultados entre los que se encuentra el de un determinado sitio web. ¿Cuenta solr con algún mecanismo que me permita saber en que posción se encuentra este resultado? Te reitero que mi inglés es muy malo por eso uso un traductor, de todas formas a continuación te envío lo que quiero decir en español. Muchas gracias Manuel Saludos... Manuel Antonio Novoa Proenza Universidad de las Ciencias Informáticas Email: mano...@estudiantes.uci.cu - Mensaje original - De: Marcelo Carvalho Fernandes mcf2...@gmail.com Para: solr-user@lucene.apache.org Enviados: Domingo, 1 de Abril 2012 5:14:50 Asunto: Re: Position Solr results Try using the score field in the search results. --- Marcelo Carvalho Fernandes On Friday, March 30, 2012, Manuel Antonio Novoa Proenza mano...@estudiantes.uci.cu wrote: Hi I'm not good with English, and for this reason I had to resort to a translator. I have the following question ... 
How can I get the position at which a certain website appears in the Solr results generated for a given search query? Regards ManP 10mo. ANIVERSARIO DE LA CREACION DE LA UNIVERSIDAD DE LAS CIENCIAS INFORMATICAS... CONECTADOS AL FUTURO, CONECTADOS A LA REVOLUCION http://www.uci.cu http://www.facebook.com/universidad.uci http://www.flickr.com/photos/universidad_uci -- Marcelo Carvalho Fernandes +55 21 8272-7970 +55 21 2205-2786
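Rav's suggestion above can be sketched in a few lines: once the client has fetched the ordered result IDs (e.g. with fl=id&rows=N), the rank is just the index of the target ID in that list. The ID values below are purely illustrative:

```java
import java.util.List;

public class ResultRank {

    // Return the 1-based position of docId in the ordered result IDs,
    // or -1 if the document is not in the page of results fetched.
    static int positionOf(List<String> resultIds, String docId) {
        int idx = resultIds.indexOf(docId);
        return idx < 0 ? -1 : idx + 1;
    }

    public static void main(String[] args) {
        // IDs as they would come back, in score order, from a query
        List<String> ids = List.of("doc7", "doc2", "doc9", "doc4");
        System.out.println(positionOf(ids, "doc9")); // prints 3
    }
}
```

For a site that may rank deep in the results, you would page through with increasing start= until the ID turns up, counting as you go.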
Highlighting matched interesting terms in MoreLikeThisHandler...
Hi All, I wonder if anyone else has had a requirement similar to this: I'm using the MLT handler to return matching documents, matched on a specific field, which works perfectly. But I also want to be able to show which interesting terms matched for a given result set. If there were a way of listing these terms, or something like snippet highlighting, I would be able to do this. But as far as I know this is not supported at all? I came upon the following very old thread from 2009 when looking for a solution: http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200905.mbox/%3cf3ce8ddb0905010807l14f08470mf7dc961d872f7...@mail.gmail.com%3E I wonder if there has been any resolution on this. Has this been considered as a new feature request yet? Has anyone else with a similar requirement found a workaround? I believe Autonomy supports this kind of matching and what-matched functionality, so it must be a popular requirement... Thanks, Ravish
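For what it's worth, the MLT handler can at least report the interesting terms it extracted from the source document (globally for the query, not per matched result) via the mlt.interestingTerms parameter. A sketch of such a request; the host, port, and field names are illustrative:

```
http://localhost:8983/solr/mlt?q=id:SP2514N&mlt.fl=features&mlt.interestingTerms=details
```

With mlt.interestingTerms=details the response lists each interesting term with its boost; =list returns just the terms. It does not tell you which of those terms matched in each returned document, which is the part that remains unsupported.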
Fwd: Using MLT Handler to find similar documents but also filter similar documents by a keyword.
I will appreciate any comments or help on this. Thanks. Rav -- Forwarded message -- From: Ravish Bhagdev ravish.bhag...@gmail.com Date: Fri, Mar 2, 2012 at 12:12 AM Subject: Using MLT Handler to find similar documents but also filter similar documents by a keyword. To: solr-user@lucene.apache.org Hi, Apologies if this has been answered before; I tried searching for it and didn't find anything answering this exactly. I want to find similar documents using the MLT handler on some specified fields, but I also want to filter down the returned matches with some keywords. I looked at the example provided at http://wiki.apache.org/solr/MoreLikeThisHandler : /solr/mlt?q=id:SP2514N&mlt.fl=manu,cat&mlt.mindf=1&mlt.mintf=1&fq=inStock:true&mlt.interestingTerms=details which specifies a filter query using fq. My understanding is that the first document returned by the query (q=id:SP2514N) is used for performing the matching, and that fq actually affects this result rather than the matched documents returned by MLT. Am I right or wrong? That is, is the fq in the above example going to filter the MLT match results, or will it just affect the initial query used to pick the document to match by? If the former, that is what I want to do, but is fq the way to do it? Can I use this fq on any kind of text/string field? I hope my question makes sense; it is a bit hard to explain, so I am sorry if not! Thanks, Ravish
Re: highlight issue
Also, I'm not entirely sure wild-cards are supported in text-based fields, only on strings. Although things may have changed in recent versions of Solr, I am not sure. R On Thu, Dec 1, 2011 at 3:55 PM, Koji Sekiguchi k...@r.email.ne.jp wrote: Suppose my search query is *Rak*. In my database I have the name *Rakesh Chaturvedi*. I am getting *<em>Rak</em><em>Rak</em>esh Chaturvedi* as the response. Same is the case with the following names. Search Dhar -- highlight <em>Dhar</em><em>Dhar</em>mesh Darshan Search Suda -- highlight <em>Suda</em><em>Suda</em>rshan Faakir Can someone help me? I am using the following filters for index and query.
<fieldType name="text_autofill" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" preserveOriginal="1"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="50" side="front"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" preserveOriginal="1"/>
  </analyzer>
</fieldType>
I don't think the Highlighter can support an n-gram field. Can you try commenting out EdgeNGramFilterFactory, re-indexing, and then highlighting? koji -- Check out Query Log Visualizer for Apache Solr http://www.rondhuit-demo.com/loganalyzer/loganalyzer.html http://www.rondhuit.com/en/
Re: Solr messing up the UK GBP (pound) symbol in response, even though the Java environment variable file.encoding is set to UTF-8....
Thanks Chris. Yes, changing connector settings not just in solr but also in all webapps that were sending queries into it solved the problem! Appreciate the help. R On Tue, Sep 13, 2011 at 6:11 PM, Chris Hostetter hossman_luc...@fucit.orgwrote: : Any idea why solr is unable to return the pound sign as-is? : : I tried typing in £ 1 million in Solr admin GUI and got following response. ... : str name=q£ 1 million/str ... : Here is my Java Properties I got also from admin interface: ... : catalina.home = : /home/rbhagdev/SCCRepos/SCC_Platform/search/solr/target/ Looks like you are using tomcat, so I suspect you are getting bit by this... https://wiki.apache.org/solr/SolrTomcat#URI_Charset_Config If that's not the problem, please try running the example/exampledocs/test_utf8.sh script against your Solr instance (you'll need to change the URL variable to match your host:port) -Hoss
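The fix described on the SolrTomcat wiki page linked above amounts to declaring the URI encoding on Tomcat's HTTP connector, in every Tomcat instance that sends or receives the queries. A sketch of the relevant server.xml fragment; the port number is illustrative:

```xml
<!-- Tomcat server.xml: decode GET query parameters as UTF-8 -->
<Connector port="8080" protocol="HTTP/1.1" URIEncoding="UTF-8"/>
```

Without URIEncoding, Tomcat decodes query-string bytes as ISO-8859-1, which mangles multi-byte UTF-8 characters like the pound sign before Solr ever sees them.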
Solr messing up the UK GBP (pound) symbol in response, even though the Java environment variable file.encoding is set to UTF-8....
Any idea why solr is unable to return the pound sign as-is? I tried typing in £ 1 million in Solr admin GUI and got following response.
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">5</int>
    <lst name="params">
      <str name="indent">on</str>
      <str name="start">0</str>
      <str name="q">£ 1 million</str>
      <str name="rows">10</str>
      <str name="version">2.2</str>
    </lst>
  </lst>
  <result name="response" numFound="0" start="0"/>
</response>
Here are the Java properties I also got from the admin interface:
java.runtime.name = Java(TM) SE Runtime Environment
sun.boot.library.path = /usr/lib/jvm/java-6-sun-1.6.0.26/jre/lib/amd64
java.vm.version = 20.1-b02
solr.data.dir = target/solr_data
shared.loader =
java.vm.vendor = Sun Microsystems Inc.
java.vendor.url = http://java.sun.com/
path.separator = :
java.vm.name = Java HotSpot(TM) 64-Bit Server VM
tomcat.util.buf.StringCache.byte.enabled = true
file.encoding.pkg = sun.io
user.country = GB
sun.java.launcher = SUN_STANDARD
sun.os.patch.level = unknown
java.vm.specification.name = Java Virtual Machine Specification
user.dir = /home/rbhagdev/SCCRepos/SCC_Platform/search/solr
java.runtime.version = 1.6.0_26-b03
java.awt.graphicsenv = sun.awt.X11GraphicsEnvironment
java.endorsed.dirs = /usr/lib/jvm/java-6-sun-1.6.0.26/jre/lib/endorsed
os.arch = amd64
java.io.tmpdir = /tmp
line.separator =
java.vm.specification.vendor = Sun Microsystems Inc.
java.naming.factory.url.pkgs = org.apache.naming
os.name = Linux
classworlds.conf = /usr/share/maven2/bin/m2.conf
sun.jnu.encoding = UTF-8
java.library.path = /usr/lib/jvm/java-6-sun-1.6.0.26/jre/lib/amd64/server:/usr/lib/jvm/java-6-sun-1.6.0.26/jre/lib/amd64:/usr/lib/jvm/java-6-sun-1.6.0.26/jre/../lib/amd64:/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
java.specification.name = Java Platform API Specification
java.class.version = 50.0
sun.management.compiler = HotSpot 64-Bit Tiered Compilers
os.version = 2.6.38-11-generic
user.home = /home/rbhagdev
user.timezone = Europe/London
catalina.useNaming = true
java.awt.printerjob = sun.print.PSPrinterJob
java.specification.version = 1.6
file.encoding = UTF-8
solr.solr.home = src/test/resources/solr_home
catalina.home = /home/rbhagdev/SCCRepos/SCC_Platform/search/solr/target/tomcat
user.name = rbhagdev
java.class.path = /usr/share/maven2/boot/classworlds.jar
java.naming.factory.initial = org.apache.naming.java.javaURLContextFactory
package.definition = sun.,java.,org.apache.catalina.,org.apache.coyote.,org.apache.tomcat.,org.apache.jasper.
java.vm.specification.version = 1.0
sun.arch.data.model = 64
java.home = /usr/lib/jvm/java-6-sun-1.6.0.26/jre
sun.java.command = org.codehaus.classworlds.Launcher tomcat:run-war
java.specification.vendor = Sun Microsystems Inc.
user.language = en
java.vm.info = mixed mode
java.version = 1.6.0_26
java.ext.dirs = /usr/lib/jvm/java-6-sun-1.6.0.26/jre/lib/ext:/usr/java/packages/lib/ext
securerandom.source = file:/dev/./urandom
sun.boot.class.path = /usr/lib/jvm/java-6-sun-1.6.0.26/jre/lib/resources.jar:/usr/lib/jvm/java-6-sun-1.6.0.26/jre/lib/rt.jar:/usr/lib/jvm/java-6-sun-1.6.0.26/jre/lib/sunrsasign.jar:/usr/lib/jvm/java-6-sun-1.6.0.26/jre/lib/jsse.jar:/usr/lib/jvm/java-6-sun-1.6.0.26/jre/lib/jce.jar:/usr/lib/jvm/java-6-sun-1.6.0.26/jre/lib/charsets.jar:/usr/lib/jvm/java-6-sun-1.6.0.26/jre/lib/modules/jdk.boot.jar:/usr/lib/jvm/java-6-sun-1.6.0.26/jre/classes
java.vendor = Sun Microsystems Inc.
server.loader =
maven.home = /usr/share/maven2
catalina.base = /home/rbhagdev/SCCRepos/SCC_Platform/search/solr/target/tomcat
file.separator = /
java.vendor.url.bug = http://java.sun.com/cgi-bin/bugreport.cgi
common.loader = ${catalina.home}/lib,${catalina.home}/lib/*.jar
sun.cpu.endian = little
sun.io.unicode.encoding = UnicodeLittle
package.access = sun.,org.apache.catalina.,org.apache.coyote.,org.apache.tomcat.,org.apache.jasper.,sun.beans.
sun.desktop = gnome
sun.cpu.isalist =
Thanks, Ravish
Getting sum of all terms count in dataset instead of document count using TermsComponent....(and TermsComponent vs Facets)
Hi Guys, I need a bit of help. I want to produce a frequency analysis of all tokens inside my Solr index from a specific (content) field. When I use TermsComponent or facet counts, what I get is how many records or documents each term appears in (which again confuses me as to what the difference is; is it that facets are restricted to terms in the result set and TermsComponent is not restricted by the query?). Is there a way to get the total term count (not per document, but summed across the whole index)? I have tried searching in the archives and across the web, but the closest match I found is this: http://search-lucene.com/m/of5Fn1PUOHU/ It is suggested in this post that I can paste the mentioned lines of code into TermsComponent.java and it should work. However, the code seems to have changed since, and when I try this, the class TermDocs is not even recognized. I was wondering if there is any other way, using Lucene or Solr, to do this. I will be very grateful for any reply. If it helps, below is the code I am running right now, which gives me document count and not term count.
String queryString = "document:*";
SolrQuery solrQuery = new SolrQuery();
solrQuery.setQuery(queryString);
solrQuery.setQueryType("/terms");
solrQuery.setTerms(true);
solrQuery.setTermsLimit(20);
solrQuery.setParam("terms.fl", "document");
solrQuery.setTermsSortString("count");
QueryResponse solrResp = conf._solr.executeQuery(solrQuery, 0, 10);
TermsResponse termsResp = solrResp.getTermsResponse();
List<TermsResponse.Term> terms = termsResp.getTerms("document");
Ignore the conf object and the _solr variable; that's just my internal singleton object. Thanks, Ravish Bhagdev
Re: Getting sum of all terms count in dataset instead of document count using TermsComponent....(and TermsComponent vs Facets)
Yes, you are right. Ignore the query (document:*); it won't matter if I have it for TermsComponent, I guess. I've compiled the current source from head, but have also tried 1.4.1. Any idea how to go about finding a solution to this? Thanks, Ravish On Sun, Feb 27, 2011 at 1:56 PM, Ahmet Arslan iori...@yahoo.com wrote: I want to produce a frequency analysis of all tokens inside my Solr index from a specific (content) field. When I use TermsComponent or facet counts, what I get is how many records or documents each term appears in (which again confuses me as to what the difference is; is it that facets are restricted to terms in the result set and TermsComponent is not restricted by the query?). Is there a way to get the total term count (not per document but across the whole index)? Terms Component does not respect the q= parameter. In other words, it is not restricted by the query. I have tried searching in the archives and across the web but the closest match I found is this: http://search-lucene.com/m/of5Fn1PUOHU/ It is suggested in this post that I can paste the mentioned lines of code into TermsComponent.java and it should work. However, the code seems to have changed since, and when I try this, the class TermDocs is not even recognized. What version of solr are you using?
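The distinction this thread keeps circling is document frequency (how many documents contain a term, which is what TermsComponent and facet counts report) versus total term frequency (occurrences summed across the whole index, which is what's wanted here). A toy illustration of the difference, independent of Solr's APIs:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class TermCounts {

    // docs: each inner list is the token stream of one document's field.
    // Returns term -> {document frequency, total term frequency}.
    static Map<String, int[]> count(List<List<String>> docs) {
        Map<String, int[]> stats = new HashMap<>();
        for (List<String> doc : docs) {
            Set<String> seenInDoc = new HashSet<>();
            for (String term : doc) {
                int[] s = stats.computeIfAbsent(term, k -> new int[2]);
                s[1]++;                          // ttf: every occurrence counts
                if (seenInDoc.add(term)) s[0]++; // df: once per document
            }
        }
        return stats;
    }

    public static void main(String[] args) {
        Map<String, int[]> stats = count(List.of(List.of("a", "a", "b"), List.of("a")));
        // "a" is in 2 documents (df) but occurs 3 times in total (ttf)
        System.out.println(stats.get("a")[0] + " " + stats.get("a")[1]); // prints 2 3
    }
}
```

Solr's stock components report the df column; getting the ttf column at the time of this thread meant dropping down to the Lucene index reader, which is why the linked patch touches TermsComponent.java.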
very quick question that will help me greatly... OR query syntax when using fields for solr dataset....
Hi Guys, I've been trying various combinations but unable to perform a OR query for a specific field in my solr schema. I have a string field called myfield and I want to return all documents that have this field which either matches abc or xyz So all records that have myfield=abc and all records that have myfield=xyz should be returned (union) What should my query be? I have tried (myfield=abc OR myfield=xyz) which works, but only returns all the documents that contain xyz in that field, which I find quite weird. I have tried running this as fq query as well but same result! It is such a simple thing but I can't find right syntax after going through a lot of documentation and searching. Will appreciate any quick reply or examples, thanks very much. Ravish
Re: very quick question that will help me greatly... OR query syntax when using fields for solr dataset....
Hi Jan, Thanks for the reply. I have tried the first variation in your example (and again after reading your reply). It returns no results! Note: it is not a multivalued field. I think when you use example 1 below, it looks for both xyz and abc in the same field for the same document; what I'm trying to get is all records that match either of the two. I hope I am making sense. Thanks, Ravish On Tue, Feb 15, 2011 at 1:47 PM, Jan Høydahl jan@cominvent.com wrote: http://wiki.apache.org/solr/SolrQuerySyntax Examples: q=myfield:(xyz OR abc) q={!lucene q.op=OR df=myfield}xyz abc q=xyz OR abc&defType=edismax&qf=myfield PS: If using type=string, you will not match individual words inside the field, only an exact case-sensitive match of the whole field. Use some variant of text if this is not what you want. -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com On 15. feb. 2011, at 14.39, Ravish Bhagdev wrote: Hi Guys, I've been trying various combinations but am unable to perform an OR query for a specific field in my solr schema. I have a string field called myfield and I want to return all documents that have this field matching either abc or xyz. So all records that have myfield=abc and all records that have myfield=xyz should be returned (union). What should my query be? I have tried (myfield=abc OR myfield=xyz), which works, but only returns all the documents that contain xyz in that field, which I find quite weird. I have tried running this as an fq query as well, but same result! It is such a simple thing, but I can't find the right syntax after going through a lot of documentation and searching. Will appreciate any quick reply or examples, thanks very much. Ravish
Re: very quick question that will help me greatly... OR query syntax when using fields for solr dataset....
Arghhh.. I think it's the regexp parser messing things up (just looked at the debugQuery output and it's parsing incorrectly some / kind of characters I had). I think I can clean the data of these characters, or maybe there is a way to escape them... Ravish On Tue, Feb 15, 2011 at 1:54 PM, Ravish Bhagdev ravish.bhag...@gmail.com wrote: Hi Jan, Thanks for the reply. I have tried the first variation in your example (and again after reading your reply). It returns no results! Note: it is not a multivalued field. I think when you use example 1 below, it looks for both xyz and abc in the same field for the same document; what I'm trying to get is all records that match either of the two. I hope I am making sense. Thanks, Ravish On Tue, Feb 15, 2011 at 1:47 PM, Jan Høydahl jan@cominvent.com wrote: http://wiki.apache.org/solr/SolrQuerySyntax Examples: q=myfield:(xyz OR abc) q={!lucene q.op=OR df=myfield}xyz abc q=xyz OR abc&defType=edismax&qf=myfield PS: If using type=string, you will not match individual words inside the field, only an exact case-sensitive match of the whole field. Use some variant of text if this is not what you want. -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com On 15. feb. 2011, at 14.39, Ravish Bhagdev wrote: Hi Guys, I've been trying various combinations but am unable to perform an OR query for a specific field in my solr schema. I have a string field called myfield and I want to return all documents that have this field matching either abc or xyz. So all records that have myfield=abc and all records that have myfield=xyz should be returned (union). What should my query be? I have tried (myfield=abc OR myfield=xyz), which works, but only returns all the documents that contain xyz in that field, which I find quite weird. I have tried running this as an fq query as well, but same result! It is such a simple thing, but I can't find the right syntax after going through a lot of documentation and searching.
Will appreciate any quick reply or examples, thanks very much. Ravish
Re: very quick question that will help me greatly... OR query syntax when using fields for solr dataset....
Hi Erick, I've managed to fix the problem; it was to do with not encoding certain characters. Escaped with \ and it all works fine now :) . Sorry, I was just being insane; looking at the debugQuery output helped. I know about the string field; this is kind of a uuid field that I am storing, so it is desired that it always be an exact match, which is why I chose that type. I am going to start looking at all that is available as Analyzers soon; something that does string distance matching would be cool. Ravish On Tue, Feb 15, 2011 at 2:30 PM, Erick Erickson erickerick...@gmail.com wrote: You might look at the analysis page from the admin console for the field in question; it'll show you what various parts of the analysis chain do. But I agree with Jan, having your field as a string type is a red flag. This field is NOT analyzed, parsed, or filtered. For instance, if a doc has a value for the field of: [My life], only [My life] will match. Not [my], not [life], not even [my life] (ignore all brackets, but quotes are often confused with phrases). It may well be that this is the exact behavior you want, but this is often a point of confusion. Best Erick On Tue, Feb 15, 2011 at 9:00 AM, Ravish Bhagdev ravish.bhag...@gmail.com wrote: Arghhh.. I think it's the regexp parser messing things up (just looked at the debugQuery output and it's parsing incorrectly some / kind of characters I had). I think I can clean the data of these characters or maybe there is a way to escape them... Ravish On Tue, Feb 15, 2011 at 1:54 PM, Ravish Bhagdev ravish.bhag...@gmail.com wrote: Hi Jan, Thanks for the reply. I have tried the first variation in your example (and again after reading your reply). It returns no results! Note: it is not a multivalued field. I think when you use example 1 below, it looks for both xyz and abc in the same field for the same document; what I'm trying to get is all records that match either of the two. I hope I am making sense.
Thanks, Ravish On Tue, Feb 15, 2011 at 1:47 PM, Jan Høydahl jan@cominvent.com wrote: http://wiki.apache.org/solr/SolrQuerySyntax Examples: q=myfield:(xyz OR abc) q={!lucene q.op=OR df=myfield}xyz abc q=xyz OR abc&defType=edismax&qf=myfield PS: If using type=string, you will not match individual words inside the field, only an exact case-sensitive match of the whole field. Use some variant of text if this is not what you want. -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com On 15. feb. 2011, at 14.39, Ravish Bhagdev wrote: Hi Guys, I've been trying various combinations but am unable to perform an OR query for a specific field in my solr schema. I have a string field called myfield and I want to return all documents that have this field matching either abc or xyz. So all records that have myfield=abc and all records that have myfield=xyz should be returned (union). What should my query be? I have tried (myfield=abc OR myfield=xyz), which works, but only returns all the documents that contain xyz in that field, which I find quite weird. I have tried running this as an fq query as well, but same result! It is such a simple thing, but I can't find the right syntax after going through a lot of documentation and searching. Will appreciate any quick reply or examples, thanks very much. Ravish
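The fix Ravish describes (backslash-escaping query syntax characters such as /) is what SolrJ's ClientUtils.escapeQueryChars does for you. A self-contained sketch of the same idea; note the exact character set varies by version, and / only became a query metacharacter in later Solr releases:

```java
public class QueryEscape {

    // Characters with meaning in the Lucene/Solr query syntax. This set
    // mirrors SolrJ's ClientUtils.escapeQueryChars; '/' is included for
    // later Solr versions where it introduces a regexp query.
    private static final String SPECIAL = "\\+-!():^[]\"{}~*?|&;/";

    // Backslash-escape query syntax characters in a raw term.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (SPECIAL.indexOf(c) >= 0) sb.append('\\');
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("some/uuid:value")); // prints some\/uuid\:value
    }
}
```

Escaping the raw term before building myfield:(... OR ...) avoids the parse surprises seen in the debugQuery output above.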
Re: Are there any restrictions on what kind of or how many fields you can use in a Pivot Query? I get a ClassCastException when I use some of my string fields, and don't when I use some other string fields
Looks like its a bug? Is it not? Ravish On Tue, Feb 15, 2011 at 4:03 PM, Ravish Bhagdev ravish.bhag...@gmail.comwrote: When include some of the fields in my search query: SEVERE: java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to [Lorg.apache.solr.common.util.ConcurrentLRUCache$CacheEntry; at org.apache.solr.common.util.ConcurrentLRUCache$PQueue.myInsertWithOverflow(ConcurrentLRUCache.java:377) at org.apache.solr.common.util.ConcurrentLRUCache.markAndSweep(ConcurrentLRUCache.java:329) at org.apache.solr.common.util.ConcurrentLRUCache.put(ConcurrentLRUCache.java:144) at org.apache.solr.search.FastLRUCache.put(FastLRUCache.java:131) at org.apache.solr.search.SolrIndexSearcher.getDocSet(SolrIndexSearcher.java:904) at org.apache.solr.handler.component.PivotFacetHelper.doPivots(PivotFacetHelper.java:121) at org.apache.solr.handler.component.PivotFacetHelper.doPivots(PivotFacetHelper.java:126) at org.apache.solr.handler.component.PivotFacetHelper.process(PivotFacetHelper.java:85) at org.apache.solr.handler.component.FacetComponent.process(FacetComponent.java:84) at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:231) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1298) at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:340) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:240) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206) at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233) at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191) at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127) at 
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102) at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109) at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298) at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:857) at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:588) at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489) at java.lang.Thread.run(Thread.java:662) Works with some fields not with others... What could be the problem? It is hard to know with just that exception as it refers to solr's internal files...any indicators will help me debug. Thanks, Ravish
Why solr relies on solr.solr.home???
Hi, This may be a naive question, but do we really need to have the solr.solr.home variable for a solr installation? It is a bit annoying to modify tomcat settings in an automated install. If I create a packaged application, how do I ensure a normal user can install it without having to modify tomcat batch or shell files (or service settings in the case of an msi installer)? If that is not possible, what is the easiest way to automate the process (cross-platform)? Also, is it possible to run solr without hosting it in an http container? Why do we need a webapp to index or query?? Ravi
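One answer to the packaging question: solr.solr.home does not have to be set by editing Tomcat's startup scripts. Solr also looks it up as a JNDI entry (java:comp/env/solr/home), which a packaged install can ship as a per-webapp context fragment dropped into Tomcat's conf directory. A sketch; all paths are illustrative:

```xml
<!-- $CATALINA_HOME/conf/Catalina/localhost/solr.xml -->
<Context docBase="/opt/myapp/solr.war">
  <Environment name="solr/home" type="java.lang.String"
               value="/opt/myapp/solr-home" override="true"/>
</Context>
```

Because the fragment is just a file the installer copies alongside the war, no batch or shell files need to be touched.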
Re: Incremental indexing of database
Can't you write triggers for the database tables you want to index? That way you can keep track of all kinds of changes and updates, not just the addition of new records. Ravish On Tue, Jul 22, 2008 at 8:15 PM, anshuljohri [EMAIL PROTECTED] wrote: Hi, In my project I have to index a whole database which contains text data only. If I follow an incremental indexing approach, my problem is how to pick the delta data from the database. Is there any utility in solr to keep track of the last indexed record? Or is there any other approach to solve this problem? Thanks, Anshul Johri -- View this message in context: http://www.nabble.com/Incremental-indexing-of-database-tp18596613p18596613.html Sent from the Solr - User mailing list archive at Nabble.com.
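Besides triggers, another common approach is a last_modified timestamp column plus a recorded last-index time; this is essentially what Solr's DataImportHandler delta-import does with its deltaQuery and last_index_time. A toy sketch of the bookkeeping, with hypothetical table and column names:

```java
import java.time.Instant;

public class DeltaQuery {

    // Build the SQL used on each incremental run to fetch only changed rows.
    // Table and column names are hypothetical; a last_modified column kept
    // up to date by the application (or a trigger) is assumed.
    static String deltaSql(String table, Instant lastRun) {
        return "SELECT * FROM " + table + " WHERE last_modified > '" + lastRun + "'";
    }

    public static void main(String[] args) {
        // After each successful index pass, persist the run time and reuse it here.
        System.out.println(deltaSql("documents", Instant.parse("2008-07-22T00:00:00Z")));
    }
}
```

A trigger-maintained audit table catches deletes as well, which a plain timestamp column cannot.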
Re: Is it possible to add synonyms run time?
Yes, I'm fairly new as well. So do you mean adding words to the query, effectively doing an OR between synonymous terms? That sounds like a simple way of doing it; if this works, what makes indexing with synonyms useful? Ravish On Jan 25, 2008 2:42 PM, Jon Lehto [EMAIL PROTECTED] wrote: Hi Ravish, You may want to think about the synonym dictionary as being a tool on the side, rather than each indexed document having a copy of the synonyms. At indexing time, one might normalize synonyms to a single value, and at query time do the same to get the match. Alternately, use the synonym dictionary at run-time to expand a user's query terms, like a thesaurus. That said, I'm new to the tool, and not clear on how synonyms are implemented. Jon = From: Ravish Bhagdev [EMAIL PROTECTED] Date: 2008/01/25 Fri AM 08:24:33 CST To: solr-user@lucene.apache.org Subject: Is it possible to add synonyms run time? As I understood from the available documentation, synonyms need to be defined before starting the indexing process. Is it possible to add synonyms at run time such that all index fields of all documents get updated? Does it work for newly added documents at least? Also, how can each user of the application define his own set of synonyms that others are oblivious to (others get normal results without the synonyms considered)? Thanks, Ravish
Is it possible to add synonyms run time?
As I understood from the available documentation, synonyms need to be defined before starting the indexing process. Is it possible to add synonyms at run time such that all index fields of all documents get updated? Does it work for newly added documents at least? Also, how can each user of the application define his own set of synonyms that others are oblivious to (others get normal results without the synonyms considered)? Thanks, Ravish
Re: Is it possible to add synonyms run time?
I see, thanks a lot for this, makes things clear now. So just to make sure I understand this bit, by injecting synonyms at query time you mean basically adding terms implicitly to keywords behind the scenes before passing it to solr? Or is there are more conventional method or interface that is being suggested? Thanks for all the help! Ravish On Jan 25, 2008 3:59 PM, Erick Erickson [EMAIL PROTECTED] wrote: To me, it's really a question of where the work should be done given your problem space. Injecting synonyms at index time allows the queries to be simpler/faster. Injecting the synonyms at query time gets complex but is more flexible. As always, it's a time/space tradeoff. If you're willing to pay the space penalty for increased query speed, inject at index time. Otherwise you can inject at query time. And the query-time injection performance hit may not be trivial. Consider, for instance, span queries. Do you want to pay the price at query time for, say a BooleanQuery that is composed of 5 SpanQueries where each term in each SpanQuery consists of several OR clauses because of synonym injection? Perhaps you do and perhaps you don't. It all depends upon what your data looks like and what your performance criteria are. And you can do other tricks. Consider rather than indexing all the terms, only index the canonical term. That is, consider hit and the synonyms strike, popular, punch. you could index hit for any of the 4 terms, then do the same substitution for your query. Which would make your index smaller *and* your queries faster. But you're right. Injecting synonyms at index time really requires a fixed synonym list that doesn't vary by user. So if you want synonym lists on a per-user basis, you're probably going to have to inject synonyms at query time. Best Erick On Jan 25, 2008 9:46 AM, Ravish Bhagdev [EMAIL PROTECTED] wrote: Yes, I'm fairly new as well. So do you mean adding words to the query effectively doing an or between synonymous terms? 
That sounds simple way of doing it, if this works, what makes indexing with synonyms useful? Ravish On Jan 25, 2008 2:42 PM, Jon Lehto [EMAIL PROTECTED] wrote: Hi Ravish, You may want to think about the synonym dictionary as being a tool on the side, rather than each indexed document having a copy of the synonyms. At indexing time, one might normalize synonyms to a single value, and at query time do the same to get the match. Alternately, use the synonym dictionary at run-time to expand a user's query terms, like a thesaurus. That said, I'm new to the tool, and not clear on how synonyms are implemented. Jon = From: Ravish Bhagdev [EMAIL PROTECTED] Date: 2008/01/25 Fri AM 08:24:33 CST To: solr-user@lucene.apache.org Subject: Is it possible to add synonyms run time? As I understood from available documentation, synonyms need to be defined before starting the indexing process. Is it possible to add synonyms at run time such that all index fields of all documents get updated? Does it work for newly added documents atleast? Also, how to make each user of application define his own set of synonyms that others should be oblivious to (others get normal results without synon considered) Thanks, Ravish
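Erick's query-time option (per-user synonyms ORed into the query string before it is sent to Solr) can be sketched as plain query rewriting; the field name and synonym map here are purely illustrative:

```java
import java.util.List;
import java.util.Map;

public class SynonymExpand {

    // Rewrite user terms into OR groups using a per-user synonym map,
    // before the query string is handed to Solr.
    static String expand(String field, List<String> terms, Map<String, List<String>> synonyms) {
        StringBuilder q = new StringBuilder();
        for (String term : terms) {
            if (q.length() > 0) q.append(' ');
            List<String> alts = synonyms.getOrDefault(term, List.of());
            if (alts.isEmpty()) {
                q.append(field).append(':').append(term);
            } else {
                q.append(field).append(":(").append(term);
                for (String alt : alts) q.append(" OR ").append(alt);
                q.append(')');
            }
        }
        return q.toString();
    }

    public static void main(String[] args) {
        Map<String, List<String>> userSyns = Map.of("hit", List.of("strike", "punch"));
        System.out.println(expand("text", List.of("hit", "run"), userSyns));
        // prints text:(hit OR strike OR punch) text:run
    }
}
```

Because the map is looked up per request, each user can carry a different dictionary, which is exactly what index-time injection cannot give you.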
Re: SOLR X FAST
Stability and better support (at great cost, obviously). On Dec 11, 2007 10:20 PM, William Silva [EMAIL PROTECTED] wrote: Hi, Why use FAST and not SOLR? For example, what will FAST offer that will justify the investment? I would like a matrix comparing both. Thanks, William. On Dec 11, 2007 8:15 PM, Matthew Runo [EMAIL PROTECTED] wrote: I think it all depends, what do you want out of Solr or FAST? Thanks! Matthew Runo Software Developer 702.943.7833 On Dec 11, 2007, at 2:09 PM, William Silva wrote: Hi, What is the best way to compare SOLR and FAST Search? Thanks, William.
Re: SOLR X FAST
Could you please elaborate on what you mean by "ingestion pipeline" and "horizontal scalability"? I apologize if this is a stupid question that everyone else on the forum is familiar with. Thanks, Ravi

On Dec 12, 2007 1:09 AM, Nuno Leitao [EMAIL PROTECTED] wrote: It depends. If you are looking for a small-sized index (gigabytes rather than dozens or hundreds of gigabytes, or terabytes) with relatively simple requirements (a few facets, simple tokenization, English-only linguistics, etc.), Solr is likely to be appropriate for most cases. FAST however gives you great horizontal scalability, out-of-the-box linguistics for many languages (including CJK), contextual and scope searching, a web, file and database crawler, a programmable ingestion pipeline, etc. Regards. --Nuno

On 11 Dec 2007, at 22:09, William Silva wrote: Hi, What is the best way to compare SOLR and FAST Search? Thanks, William.
Re: SOLR 1.2 - Updates sent containing fields that are not on the Schema fail silently
Yup, I do remember that happening to me before. Is this intentionally so? Ravish

On Nov 28, 2007 1:41 PM, Daniel Alheiros [EMAIL PROTECTED] wrote: Hi, I experienced a very unpleasant problem recently, when my search indexing adaptor was changed to add some new fields. The problem is my schema didn't follow those changes (the new fields weren't added to it), and after that SOLR was silently ignoring all documents I sent. Neither the SOLR Java client nor the SOLR server returned me an error code or log message. On the server side, nothing was logged, and the client received a standard success return. Why didn't my documents get indexed, with the new fields simply ignored? That is what I think it was supposed to do. Please let me know your thoughts. Regards, Daniel http://www.bbc.co.uk/ This e-mail (and any attachments) is confidential and may contain personal views which are not the views of the BBC unless specifically stated. If you have received it in error, please delete it from your system. Do not use, copy or disclose the information in any way nor act in reliance on it and notify the sender immediately. Please note that the BBC monitors e-mails sent or received. Further communication will signify your consent to this.
Re: index size
Hi All, I'm facing a similar problem. I want to index an entire document as a field, but I also want to be able to retrieve snippets (like Google/Nutch return in the results page below the links). To achieve this I have to keep the document field stored, right? When I do this my index becomes a huge 10 GB index, because I have 10K docs but each is a very lengthy HTML file. Is there any better solution? Why is the index created by Nutch so small in comparison (about 27 MB) when it still returns snippets? Ravish

On 10/9/07, Kevin Lewandowski [EMAIL PROTECTED] wrote: Late reply on this but I just wanted to say thanks for the suggestions. I went through my whole schema and was storing things that didn't need to be stored and indexing a lot of things that didn't need to be indexed. Just completed a full reindex and it's a much more reasonable size now. Kevin

On 8/20/07, Mike Klaas [EMAIL PROTECTED] wrote: On 17-Aug-07, at 2:03 PM, Kevin Lewandowski wrote: Are there any tips on reducing the index size or what factors most impact index size? My index has 2.7 million documents and is 200 gigabytes and growing. Most documents are around 2-3kb and there are about 30 indexed fields. An ls -sh will tell you roughly where the space is being occupied. There is something strange going on: 2.5kB * 2.7m is only 6GB, and I have trouble imagining where the 30-fold index size expansion is coming from. -Mike
Fwd: solr, snippets and stored field in nutch...
Hey guys, check out this thread I opened on the Nutch mailing list. It looks like Solr could benefit from reusing Nutch's segment-based storage strategy for efficiency in returning snippets, summaries etc. without using Lucene stored fields? Was this considered before? Ravish

-- Forwarded message -- From: Dennis Kubes [EMAIL PROTECTED] Date: Oct 11, 2007 11:27 PM Subject: Re: snippets and stored field in nutch... To: [EMAIL PROTECTED] The reason it is stored in the segments instead of the index is to allow summarizers to be run on the content of hits to produce the summaries that appear in the search results. Summarizers are pluggable, and the actual content used to produce the summary can change. And summaries can be changed without re-fetching or re-indexing. If a summary were stored in the index, re-indexing would have to occur to make changes. Also, the way the search process works, Nutch returns hits (basically document ids). These hits are then sorted and deduped and the best x number (usually 10) returned. Only for these 10 best hits are hit details (fields in the index) and summaries retrieved. So there is something to be said about the amount of data being pushed over the network. Dennis Kubes

Ravish Bhagdev wrote: Ah, I see, didn't know that, thanks! Interesting that Nutch stores it in a different structure (segments) and doesn't reuse the Lucene strategy of storing within the index. Any particular reason why? Is there any other use of the segments data structure except to return snippets? Cheers, Ravish

On 10/11/07, John H. Lee [EMAIL PROTECTED] wrote: Hi Ravish. You are correct that Nutch does not store document content in the Lucene index. The content *is* stored in the Nutch segment, which is where snippets come from. Hope this helps. -J

On Oct 11, 2007, at 12:08 PM, Ravish Bhagdev wrote: Hey All, Am I right in believing that in Lucene/Nutch, to be able to return content or a snippet for a search query, the field to be returned has to be stored?
AFAIK, by default, Nutch does not store the document field, am I right? If so, how does it manage to return snippets? Wouldn't the index be quite huge if Nutch were storing the document field by default? I would appreciate any help/comments, as I'm a bit lost with this. Ravi
Re: unable to figure out nutch type highlighting in solr....
Thanks all for the help. Just to make sure I understand correctly, am I right in summarizing it this way, then?

No significance of using HTML: unlike Nutch, Solr doesn't parse HTML, so it ignores the anchors, titles etc. and is not good for PageRank-esque indexing.

HTMLAnalyser (by which you probably mean HTMLStripWhitespaceTokenizer?): its main purpose is to allow users to index HTML code; it will strip the HTML tags and index the contents, but if used for getting snippets in results, the em tags may end up in the wrong locations.

To avoid using the HTML analyser, strip out the tags yourself and send only text to Solr for indexing, using one of the normal analysers. Highlighting should be accurate in this case.

(Query, esp. for Adrian): if you are indexing XHTML, do you replace tags with entities before giving it to Solr? If so, when you get back snippets, do you get tags or entities, or do you convert them back to tags for presentation? What's the best way out? It would help me a lot if you briefly explained your configuration. Do let me know if my assumptions are wrong! Cheers, Ravish

On 10/5/07, Chris Hostetter [EMAIL PROTECTED] wrote: : In general, I don't recommend indexing HTML content straight to Solr. None of : the Solr contributors do this so the use case hasn't received a lot of love. I second that comment ... the HTML stripping code was never intended to be an HTML parser; it was designed to be a workaround for dealing with dirty data where people had unwanted HTML tags in what should be plain text. Indexing it as-is with some analyzers would result in words like script, strong, and class matching lots of docs where the words never really appear in the text. If you have well-formed HTML documents, use an HTML parser to extract the real content. -Hoss
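Hoss's suggestion, strip the tags yourself and send plain text to Solr, can be sketched with Python's stdlib HTML parser. This is only an illustration of the idea, not the parser the thread recommends:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect only text content, dropping tags plus <script>/<style> bodies."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip = 0  # nesting depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip:
            self.skip -= 1

    def handle_data(self, data):
        if not self.skip:
            self.parts.append(data)

def strip_html(html):
    """Return the visible text of `html` with whitespace normalized."""
    p = TextExtractor()
    p.feed(html)
    return " ".join(" ".join(p.parts).split())

print(strip_html('<p>Hello <b>world</b><script>x=1;</script></p>'))
```

Indexing the output of something like this, rather than the raw markup, also keeps tag names such as `script` and `class` out of the index, which is exactly the false-match problem Hoss describes.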
Re: unable to figure out nutch type highlighting in solr....
Thanks Adrian. I'm very new to Solr myself, so I'm struggling a bit in the initial stages... One last one: when you send HTML to Solr, do you also replace special chars and tags with named entities? I did this and the HTML stripper doesn't seem to recognise the tags :-S, while if I try to input HTML as-is, the indexer throws exceptions (as having tags within XML tags is obviously not valid). How do I do this part? Ravish

On 10/5/07, Adrian Sutton [EMAIL PROTECTED] wrote: On 05/10/2007, at 4:07 PM, Ravish Bhagdev wrote: (Query, esp. for Adrian): if you are indexing XHTML, do you replace tags with entities before giving it to Solr? If so, when you get back snippets, do you get tags or entities, or do you convert them back to tags for presentation? What's the best way out? It would help me a lot if you briefly explained your configuration. We happen to develop an HTML editor, so we know 100% for certain that the XHTML is valid XML. Given that, we just throw the raw XHTML at Solr, which uses the HTMLStripWhitespaceTokenizer. However, at this stage we haven't configured highlighting at all, so our index is used for search and retrieving a document ID. At some point I'd like to add highlighting, and it sounds like the best way to do so would be to index the document text instead of the HTML. Beyond that, we also use Solr as an optimization for extracting information such as what content was most recently changed, which pages link to others etc. On the page linking, we actually identify what pages are linked to prior to indexing and store them as a separate field - Solr itself has no understanding of the linking. Oh and I should note, I'm very new to Solr so I'm probably not doing things the best way, but I'm getting great results anyway. Regards, Adrian Sutton http://www.symphonious.net
Re: unable to figure out nutch type highlighting in solr....
Thanks all for the very valuable contributions; I understand these aspects of Solr much better now, but... But a different use-case might be for the highlighting to encompass the markup rather than just the text, e.g. <span class="highlighted"><topic type="location">Paris</topic></span>, which would have to be accomplished some other way. Yes, exactly. And I think Nutch handles this somehow, as I remember using it for indexing HTML and then getting back snippets with accurate highlighting placed within the HTML snippets. Is there potential for code reuse from Nutch? Maybe this is a topic for the solr developer list? Or has it already been considered? Best, Ravish
Re: Indexing HTML
Hi Erik, All, I escaped the HTML text into entities before sending it to Solr, and indexing went fine. The problem now is that when I get back a snippet with highlighted text for the HTML field, it's not well formed, as the highlighting sometimes doesn't include the entire tag if one is present. For example:

<lst name="0008369D"> <arr name="document"> <str>ound-color: #FF; text-align: left; text-indent: 0px; <em>line-heigh</em>t: normal ; margin-top: 0px; margin-ri</str> </arr> </lst> <lst name="0008369B"> <arr name="document"> <str>/TR&gt; br / &lt;TR align=&quot;left<em>&quot; va</em>lign=&quot;middle&quot; style=&quot; height: 28.80px;q</str> </arr> </lst> </lst>

Because of this I cannot present the resulting HTML in a webpage. Is it possible to strip out all HTML tags completely in the result set? Would you recommend sending stripped-out text to Solr instead? But doesn't Solr use HTML features while searching (anchors/titles etc.)? Why is there no documentation about indexing HTML specifically using Solr? How does Nutch do it? Does it strip out the HTML in the snippets it returns? Any help will be appreciated. Thanks, Ravi

On 8/27/07, Erik Hatcher [EMAIL PROTECTED] wrote: On Aug 27, 2007, at 10:00 AM, Michael Kimsal wrote: What's odd about this is that the error seems to indicate that I did. Actually the error message looks like you escaped too much. You should _not_ escape the field tag, only the contents of it. Erik The full text (minus the stack trace) was org.xmlpull.v1.XmlPullParserException: parser must be on START_TAG or TEXT to read text (position: START_TAG seen ...<field name="line"><a href=foobar>... @4:37) Or is that just a byproduct of how SOLR reports the errors back - always escaping them? Thanks guys - I'll have another crack at this tonight. On 8/27/07, Erik Hatcher [EMAIL PROTECTED] wrote: Michael, I think the issue is that you're not escaping the field values.
Send something like this to Solr instead: <field name="line">&lt;a href=foobar&gt;&lt;b&gt;&lt;i&gt;linktext&lt;/i&gt;&lt;/b&gt;&lt;/a&gt;</field> Erik

On Aug 27, 2007, at 9:29 AM, Michael Kimsal wrote: Hello, I'm trying to index individual lines of an HTML file, and I'm hitting this error: TEXT must be immediately followed by END_TAG and not START_TAG I've got something that looks like

<add>
<doc>
<field name="id">4</field>
<field name="line"><a href=foobar><b><i>linktext</i></b></a></field>
</doc>
</add>

Actually, that sample code above, as its own data file POSTed to SOLR, throws parser must be on START_TAG or TEXT to read text (position: START_TAG seen ...<field name="line"><a href=foobar>... @4:37 as an error. Any clues as to how I can do this? I'd like to keep the original copy of each line intact in the index. Thanks! -- Michael Kimsal http://webdevradio.com
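Erik's advice, escape the contents but not the field tag itself, can be automated before posting. A minimal Python sketch using the stdlib escaper; the field name is the one from the thread, and the helper name is made up:

```python
from xml.sax.saxutils import escape

def solr_field(name, value):
    """Build a <field> element with the value escaped but the tag itself intact.
    escape() replaces &, < and > with entities, which is what the
    update message needs."""
    return '<field name="%s">%s</field>' % (name, escape(value))

print(solr_field("line", '<a href="foobar"><b><i>linktext</i></b></a>'))
```

The original copy of the line survives intact in the index; the escaping only protects it while it travels inside the update XML.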
unable to figure out nutch type highlighting in solr....
I have tried very hard to follow the documentation and forum threads that answer questions about how to return snippets with highlights for the searched term using Solr (which Nutch does with such ease). I will be really grateful if someone can guide me through the basics; I have made sure that the field to be highlighted is stored in the index etc. Still I can't figure out why it doesn't return the snippet and instead returns the whole document. I have tried all the different highlight parameters with variations, but have no idea what's happening. Can I test highlighting using the given full search interface option? At the moment it just returns XML with the full document between the field tags. Please find attached my conf files as well.

<?xml version="1.0" encoding="UTF-8" ?>
<!--
 Licensed to the Apache Software Foundation (ASF) under one or more
 contributor license agreements. See the NOTICE file distributed with
 this work for additional information regarding copyright ownership.
 The ASF licenses this file to You under the Apache License, Version 2.0
 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
     http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
-->

<config>
  <!-- Set this to 'false' if you want solr to continue working after it has
       encountered a severe configuration error. In a production environment,
       you may want solr to keep working even if one handler is mis-configured.
       You may also set this to 'false' by setting the system property:
       -Dsolr.abortOnConfigurationError=false -->
  <abortOnConfigurationError>${solr.abortOnConfigurationError:true}</abortOnConfigurationError>

  <!-- Used to specify an alternate directory to hold all index data other
       than the default ./data under the Solr home. If replication is in use,
       this should match the replication configuration. -->
  <!-- <dataDir>./solr/data</dataDir> -->

  <indexDefaults>
    <!-- Values here affect all index writers and act as a default unless overridden. -->
    <useCompoundFile>false</useCompoundFile>
    <mergeFactor>5</mergeFactor>
    <maxBufferedDocs>100</maxBufferedDocs>
    <maxMergeDocs>2147483647</maxMergeDocs>
    <maxFieldLength>1</maxFieldLength>
    <writeLockTimeout>1000</writeLockTimeout>
    <commitLockTimeout>1</commitLockTimeout>
  </indexDefaults>

  <mainIndex>
    <!-- options specific to the main on-disk lucene index -->
    <useCompoundFile>false</useCompoundFile>
    <mergeFactor>5</mergeFactor>
    <maxBufferedDocs>1000</maxBufferedDocs>
    <maxMergeDocs>2147483647</maxMergeDocs>
    <maxFieldLength>1</maxFieldLength>
    <!-- If true, unlock any held write or commit locks on startup.
         This defeats the locking mechanism that allows multiple processes to
         safely access a lucene index, and should be used with care. -->
    <unlockOnStartup>false</unlockOnStartup>
  </mainIndex>

  <!-- the default high-performance update handler -->
  <updateHandler class="solr.DirectUpdateHandler2">
    <!-- A prefix of "solr." for class names is an alias that causes solr to
         search appropriate packages, including
         org.apache.solr.(search|update|request|core|analysis) -->
    <!-- autocommit pending docs if certain criteria are met
    <autoCommit>
      <maxDocs>1</maxDocs>
      <maxTime>1000</maxTime>
    </autoCommit>
    -->
    <autoCommit>
      <maxDocs>1000</maxDocs>
      <maxTime>1000</maxTime>
    </autoCommit>
    <!-- The RunExecutableListener executes an external command.
         exe - the name of the executable to run
         dir - dir to use as the current working directory. default="."
         wait - the calling thread waits until the executable returns. default="true"
         args - the arguments to pass to the program. default=nothing
         env - environment variables to set. default=nothing
    -->
    <!-- A postCommit event is fired after every commit or optimize command
    <listener event="postCommit" class="solr.RunExecutableListener">
      <str name="exe">snapshooter</str>
      <str name="dir">solr/bin</str>
      <bool name="wait">true</bool>
      <arr name="args"> <str>arg1</str> <str>arg2</str> </arr>
      <arr name="env"> <str>MYVAR=val1</str> </arr>
    </listener>
    -->
    <!-- A postOptimize event is fired only after every optimize command, useful
         in conjunction with index distribution to only distribute optimized indicies
    <listener event="postOptimize" class="solr.RunExecutableListener">
      <str name="exe">snapshooter</str>
      <str name="dir">solr/bin</str>
      <bool name="wait">true</bool>
    </listener>
    -->
  </updateHandler>

  <query>
    <!-- Maximum number of clauses in a boolean query... can affect range or prefix queries that expand
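For the original question, the highlighting parameters go on the select URL. A minimal sketch of building such a request with Python's stdlib; the host, port and field name are illustrative, not taken from the poster's setup:

```python
from urllib.parse import urlencode

def highlight_url(query, field, host="localhost", port=8983):
    """Build a Solr select URL that asks for highlighted snippets of `field`."""
    params = {
        "q": query,
        "hl": "on",          # turn highlighting on
        "hl.fl": field,      # field(s) to highlight; must be stored
        "hl.snippets": 3,    # max snippets per field
        "hl.fragsize": 100,  # snippet length in characters
    }
    return "http://%s:%d/solr/select?%s" % (host, port, urlencode(params))

print(highlight_url("myfield:cloud", "myfield"))
```

Note that the highlighted snippets come back in a separate highlighting section of the response; the result block still contains the whole stored field, which may be why the full document appears to be returned.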
Processing solr response....
Hi, Apologies if this has been asked before, but I couldn't find anything when I searched... I have been looking at the Solr Java examples. I've been using Nutch/Lucene before, which returns query results nicely in a class with url, title and snippet (summary), while Solr seems to return XML with the score and other details, with just the url field. Is there a way to avoid having to deal with XML after each query? I want to avoid parsing; it would be much better if I could get results directly into a Java data structure like a List or Map etc. through the API. Also, can anyone point me to some example or documentation which clarifies the XML returned by Solr, and how to get variations of it, including specifying what exactly I would see in the XML, like which particular fields etc.? Hope I'm making sense. Thanks, Ravi
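Without a client library, the XML response can be flattened into basic data structures in a few lines. A Python sketch of the idea against the standard <response>/<result>/<doc> layout; the sample response is fabricated for illustration:

```python
import xml.etree.ElementTree as ET

def docs_from_response(xml_text):
    """Parse a Solr XML response into a list of dicts, one per <doc>."""
    root = ET.fromstring(xml_text)
    docs = []
    for doc in root.iter("doc"):
        fields = {}
        for child in doc:          # <str name="...">, <float name="...">, ...
            fields[child.get("name")] = child.text
        docs.append(fields)
    return docs

# Fabricated sample of the response shape discussed in the thread.
SAMPLE = """<response>
  <result name="response" numFound="1" start="0">
    <doc>
      <float name="score">1.5</float>
      <str name="url">http://example.com/</str>
    </doc>
  </result>
</response>"""

print(docs_from_response(SAMPLE))
```

In Java the same walk over the response tree yields the List/Map structure the poster is after; which fields appear per doc is controlled by the fl request parameter.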
Indexing longer documents using Solr...memory issue after index grows to about 800 MB...
Hi, The problem:
- I have about 11K HTML documents to index.
- I'm trying to index these documents (along with 3 more small string fields) so that when I search within the doc field (the field with the HTML file content), I can get results with snippets or highlights, as I get when using Nutch.
- While going through the Wiki I noticed that if I need to do highlighting on a particular field, I have to make sure it is indexed and stored.
But when I try to do the above, after indexing about 3K files, which creates an index of about 800 MB (which is fine, as the files are quite lengthy), it keeps giving out-of-heap-space errors. Things I've tried without much help:
- Increasing the memory of Tomcat
- Playing around with settings like autoCommit (documents and time)
- Reducing mergeFactor to 5
- Reducing maxBufferedDocs to 100
My question is also: if it's required to store fields in the index to be able to do highlighting/return field content, how does Nutch/Lucene do it without that (because the index for the same documents created using Nutch is much, much smaller)? Also, when trying to query the partially added documents with field highlighting on (and a particular field set), it doesn't seem to have any effect. As you can see, I'm very confused about how to proceed. I hope I'm being clear though :-S Thanks, Ravi
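One general way to keep indexing memory bounded, not something prescribed in this thread, is to send the documents in small batches and commit periodically instead of buffering everything in one pass. A minimal sketch of the batching logic with stand-in post/commit callables; the batch size is an arbitrary example:

```python
def batches(items, size):
    """Yield successive fixed-size chunks of `items`."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def index_all(docs, post, commit, batch_size=100):
    """Post documents batch by batch, committing after each batch so the
    server can flush its buffers instead of accumulating everything."""
    sent = 0
    for batch in batches(docs, batch_size):
        post(batch)    # e.g. an HTTP POST of an <add> message with these docs
        commit()       # e.g. an HTTP POST of <commit/>
        sent += len(batch)
    return sent

# Tiny demo with stand-in functions instead of real HTTP calls:
log = []
n = index_all(list(range(250)), post=log.append, commit=lambda: None, batch_size=100)
print(n, [len(b) for b in log])
```

The autoCommit settings the poster already tried serve the same purpose on the server side; batching on the client side additionally caps the size of each update request.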
Indexing HTML content... (Embed HTML into XML?)
Hello, Sorry for the stupid question. I'm trying to index an HTML file as one of the fields in Solr. I've set up an appropriate analyzer in the schema, but I'm not sure how to add the HTML content to Solr. Encapsulating HTML content within the field tag is obviously not valid XML. How do I add the HTML content? Hope the query is clear. Thanks, Ravi
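One way to sidestep the embedding problem is to build the update message with an XML library, which escapes the field contents automatically so the HTML never collides with the surrounding tags. A minimal Python sketch; the field names are illustrative:

```python
import xml.etree.ElementTree as ET

def build_add(doc_fields):
    """Build a Solr <add><doc>...</doc></add> update message.
    ElementTree escapes &, < and > in the field values for us, so raw
    HTML can be passed in without hand-escaping."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in doc_fields.items():
        field = ET.SubElement(doc, "field", name=name)
        field.text = value
    return ET.tostring(add, encoding="unicode")

msg = build_add({"id": "4", "content": "<p>Hello & welcome</p>"})
print(msg)
```

The server unescapes the entities on ingest, so the analyzer configured in the schema still sees the original HTML markup.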