Re: charfilter doesn't do anything
Yes, but that filters HTML and not the specific tag I want.

On 7. Sep 2013, at 7:51 PM, Erick Erickson wrote:

Hmmm, have you looked at:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory
Not quite the body, perhaps, but might it help?

On Fri, Sep 6, 2013 at 11:33 AM, Andreas Owen a...@conx.ch wrote:

Ok, I have HTML pages of the form <html>...<!--body-->content i want<!--/body-->...</html>. I want to extract (index, store) only what lies between the body comments. I thought the RegexTransformer would be best because XPath doesn't work in Tika and I can't nest an XPathEntityProcessor to use XPath. What I have also found out is that the HTML parser from Tika cuts my body comments out and tries to make well-formed HTML, which I would like to switch off.

On 6. Sep 2013, at 5:04 PM, Shawn Heisey wrote:

On 9/6/2013 7:09 AM, Andreas Owen wrote:

I've managed to get it working if I use the RegexTransformer and the string is on the same line in my Tika entity. But when the string spans multiple lines it isn't working, even though I tried (?s) to set the DOTALL flag.

<entity name="tika" processor="TikaEntityProcessor" url="${rec.url}" dataSource="dataUrl" onError="skip" htmlMapper="identity" format="html" transformer="RegexTransformer">
  <field column="text_html" regex="&lt;body&gt;(.+)&lt;/body&gt;" replaceWith="QQQ" sourceColName="text" />
</entity>

Then I tried it like this and I get a stack overflow:

<field column="text_html" regex="&lt;body&gt;((.|\n|\r)+)&lt;/body&gt;" replaceWith="QQQ" sourceColName="text" />

In JavaScript this works, but maybe only because I used a small string.

Sounds like we've got an XY problem here: http://people.apache.org/~hossman/#xyproblem
How about you tell us *exactly* what you'd actually like to have happen, and then we can find a solution for you? It sounds a little bit like you're interested in stripping all the HTML tags out. Perhaps the HTMLStripCharFilter?
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory

Something that I already said: by using the KeywordTokenizer, you won't be able to search for individual words in your HTML input. The entire input string is treated as a single token, and therefore ONLY exact entire-field matches (or certain wildcard matches) will be possible.
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.KeywordTokenizerFactory

Note that no matter what you do to your data with the analysis chain, Solr will always return the text that was originally indexed in search results. If you need to affect what gets stored as well, perhaps you need an Update Processor.

Thanks,
Shawn
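Shawn's closing point deserves a concrete illustration: an update request processor chain runs before documents are written to the index, so unlike an analysis-chain char filter it changes what gets stored. A minimal sketch for solrconfig.xml, assuming the raw HTML arrives in a field named text (the chain and field names here are illustrative, not taken from the thread):

<updateRequestProcessorChain name="strip-html">
  <!-- strips HTML/XML markup from the raw field value before it is stored and indexed -->
  <processor class="solr.HTMLStripFieldUpdateProcessorFactory">
    <str name="fieldName">text</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

A handler opts in with update.chain=strip-html, or the chain can be marked default="true".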
Re: subindex
Hi Erick, it makes sense. Thank you for this. peyman

On Sep 5, 2013, at 4:11 PM, Erick Erickson erickerick...@gmail.com wrote:

Nope. You can do this if you've stored _all_ the fields (with the exception of _version_ and the destinations of copyField directives), but there's no way I know of to do what you want if you haven't. If you have, you'd essentially be spinning through all your docs and re-indexing just the fields you care about. But if you still have access to your original docs, this would be slower and more complicated than just re-indexing from scratch.

Best,
Erick

On Wed, Sep 4, 2013 at 1:51 PM, Peyman Faratin pey...@robustlinks.com wrote:

Hi, is there a way to build a new (smaller) index from an existing (larger) index, where the smaller index contains a subset of the fields of the larger index? thank you
Re: charfilter doesn't do anything
I tried this and it seems to work when added to the standard Solr example in 4.4:

<field name="body" type="text_html_body" indexed="true" stored="true" />

<fieldType name="text_html_body" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="^.*&lt;body&gt;(.*)&lt;/body&gt;.*$" replacement="$1" />
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

That char filter retains only the text between <body> and </body>. Is that what you wanted?

Indexing this data:

curl 'localhost:8983/solr/update?commit=true' -H 'Content-type: application/json' -d '[{"id": "doc-1", "body": "abc <body>A test.</body> def"}]'

And querying with these commands:

curl "http://localhost:8983/solr/select/?q=*:*&indent=true&wt=json" shows all data.
curl "http://localhost:8983/solr/select/?q=body:test&indent=true&wt=json" shows the body text.
curl "http://localhost:8983/solr/select/?q=body:abc&indent=true&wt=json" shows nothing (outside of body).
curl "http://localhost:8983/solr/select/?q=body:def&indent=true&wt=json" shows nothing (outside of body).
curl "http://localhost:8983/solr/select/?q=body:body&indent=true&wt=json" shows nothing; the HTML tag is stripped.

In your original query, you didn't show us what your default field (the df parameter) was.

-- Jack Krupansky

-----Original Message----- From: Andreas Owen Sent: Sunday, September 08, 2013 5:21 AM To: solr-user@lucene.apache.org Subject: Re: charfilter doesn't do anything

Yes, but that filters HTML and not the specific tag I want. [...]
Expunge deleting using excessive transient disk space
Hi, in order to delete part of my index I run a delete-by-query that is intended to erase 15% of the docs. I added these params to solrconfig.xml:

<mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
  <int name="maxMergeAtOnce">2</int>
  <int name="maxMergeAtOnceExplicit">2</int>
  <double name="maxMergedSegmentMB">5000.0</double>
  <double name="reclaimDeletesWeight">10.0</double>
  <double name="segmentsPerTier">15.0</double>
</mergePolicy>

The extra params were added in order to promote merging of old segments while restricting the transient disk space that can be used (as I have only 15GB per shard). This procedure failed with a "no space left on device" exception, although proper calculations show that these params should not exceed the transient free disk space I have. Looking at the infostream I can see that the first merges do succeed, but older segments are kept referenced and thus cannot be deleted until all the merging is done. Is there any way of overcoming this?
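For context, a deletion of this kind is typically issued as a delete-by-query followed by an expunging commit; a sketch with a placeholder query (expungeDeletes asks Lucene to merge away segments containing deletions at commit time, which is what triggers the merge pressure described above):

curl 'http://localhost:8983/solr/update?commit=true&expungeDeletes=true' -H 'Content-type: text/xml' --data-binary '<delete><query>doctype:obsolete</query></delete>'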
Profiling Solr Lucene for query
Hello all. Looking at the 10% slowest queries, I get very bad performance (~60 sec per query). These queries have lots of conditions on my main field (more than a hundred), including phrase queries, and rows=1000, though I return only ids. I can quite firmly say that this bad performance is due to a slow-storage issue (beyond my control for now). Despite this I want to improve my performance. As taught in school, I started profiling these queries; the data from a ~1 minute profile is located here: http://picpaste.com/pics/IMG_20130908_132441-ZyrfXeTY.1378637843.jpg

Main observation: most of the time I am waiting in readVInt, whose stack trace (2 out of 2 thread dumps) is:

catalina-exec-3870 - Thread t@6615
java.lang.Thread.State: RUNNABLE
    at org.apache.lucene.store.DataInput.readVInt(DataInput.java:108)
    at org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnumFrame.loadBlock(BlockTreeTermsReader.java:2357)
    at org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.seekExact(BlockTreeTermsReader.java:1745)
    at org.apache.lucene.index.TermContext.build(TermContext.java:95)
    at org.apache.lucene.search.PhraseQuery$PhraseWeight.<init>(PhraseQuery.java:221)
    at org.apache.lucene.search.PhraseQuery.createWeight(PhraseQuery.java:326)
    at org.apache.lucene.search.BooleanQuery$BooleanWeight.<init>(BooleanQuery.java:183)
    at org.apache.lucene.search.BooleanQuery.createWeight(BooleanQuery.java:384)
    at org.apache.lucene.search.BooleanQuery$BooleanWeight.<init>(BooleanQuery.java:183)
    at org.apache.lucene.search.BooleanQuery.createWeight(BooleanQuery.java:384)
    at org.apache.lucene.search.BooleanQuery$BooleanWeight.<init>(BooleanQuery.java:183)
    at org.apache.lucene.search.BooleanQuery.createWeight(BooleanQuery.java:384)
    at org.apache.lucene.search.IndexSearcher.createNormalizedWeight(IndexSearcher.java:675)
    at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:297)

So I do actually wait for IO as expected, but I might be page-faulting too many times while looking up the term blocks (the .tim file), i.e. while locating the term. As I am reindexing now, would it be useful to lower the termInterval (default 128)? Since the FSTs (the .tip files) are small (a few tens to hundreds of MB) there is no memory contention; could I lower this param to 8, for example? The benefit of lowering the term interval would be to force the FST into memory (into the JVM, thanks to the NRTCachingDirectory), as I do not control the term dictionary file (OS caching loads an average of 6% of it).

General configs:
Solr 4.3
36 shards, each with a few million docs
These 36 servers (each server has 2 replicas) run virtualized, with 16GB memory each (4GB for the JVM, 12GB left for OS caching), consuming 260GB of disk mounted for the index files.
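For reference, the interval the poster refers to has historically been configurable in solrconfig.xml roughly as below. This is only a sketch: the default BlockTree postings format in Solr 4.x does not honor termIndexInterval (it applies to codecs that actually keep a term-index interval), so verify against your codec before relying on it:

<indexConfig>
  <termIndexInterval>32</termIndexInterval>
</indexConfig>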
Solr suggest - How to define solr suggest as case insensitive
My suggest (spellchecker) is returning case-sensitive answers. (I use it to autocomplete; "dog" and "Dog" return different phrases.)

My suggest is defined as follows. In solrconfig:

<searchComponent class="solr.SpellCheckComponent" name="suggest">
  <lst name="spellchecker">
    <str name="name">suggest</str>
    <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
    <str name="lookupImpl">org.apache.solr.spelling.suggest.tst.TSTLookup</str>
    <str name="field">suggest</str> <!-- the indexed field to derive suggestions from -->
    <float name="threshold">0.005</float>
    <str name="buildOnCommit">true</str>
    <!-- <str name="sourceLocation">american-english</str> -->
  </lst>
</searchComponent>

<requestHandler class="org.apache.solr.handler.component.SearchHandler" name="/suggest">
  <lst name="defaults">
    <str name="spellcheck">true</str>
    <str name="spellcheck.dictionary">suggest</str>
    <str name="spellcheck.onlyMorePopular">true</str>
    <str name="spellcheck.count">5</str>
    <str name="spellcheck.collate">true</str>
  </lst>
  <arr name="components">
    <str>suggest</str>
  </arr>
</requestHandler>

In schema:

<field name="suggest" type="phrase_suggest" indexed="true" stored="true" required="false" multiValued="true"/>

and

<copyField source="Name" dest="suggest"/>

and

<fieldtype name="phrase_suggest" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="([^\p{L}\p{M}\p{N}\p{Cs}]*[\p{L}\p{M}\p{N}\p{Cs}\_]+:)|([^\p{L}\p{M}\p{N}\p{Cs}])+" replacement=" " replace="all"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
  </analyzer>
</fieldtype>
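One thing worth checking here (a sketch of a possible fix, not verified against this exact setup): the SpellCheckComponent accepts a queryAnalyzerFieldType, which controls how the incoming query text is analyzed before lookup. Pointing it at the same lowercasing type used for the suggest field should make "Dog" and "dog" produce the same lookup key:

<searchComponent class="solr.SpellCheckComponent" name="suggest">
  <!-- analyze incoming query text with the same lowercasing chain as the suggest field -->
  <str name="queryAnalyzerFieldType">phrase_suggest</str>
  <lst name="spellchecker">
    <str name="name">suggest</str>
    <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
    <str name="lookupImpl">org.apache.solr.spelling.suggest.tst.TSTLookup</str>
    <str name="field">suggest</str>
    <str name="buildOnCommit">true</str>
  </lst>
</searchComponent>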
Re: unknown _stream_source_info while indexing rich doc in solr
Error got resolved, thanks a lot Sir. I have been trying for days to resolve it.

On Fri, Sep 6, 2013 at 11:36 PM, Chris Hostetter-3 [via Lucene] ml-node+s472066n4088604...@n3.nabble.com wrote:

: it shows type as undefined for dynamic field ignored_* , and I am using

That means the running solr instance does not know anything about a dynamic field named ignored_* -- it doesn't exist.

: but on the admin page it shows schema :

the page showing the schema file just tells you what's on disk -- it has no way of knowing if you modified that file after starting up solr. ... Wait a minute ... i see your problem now...

: </fields>
: <dynamicField name="ignored_*" type="ignored" indexed="false" stored="true" multiValued="true"/>

...your <dynamicField/> declaration needs to be inside your <fields> block.

-Hoss
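In other words, the corrected schema.xml placement looks like this (other declarations elided):

<fields>
  <!-- ... other field declarations ... -->
  <dynamicField name="ignored_*" type="ignored" indexed="false" stored="true" multiValued="true"/>
</fields>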
Re: Indexing pdf files - question.
Error got resolved; the solution was that <dynamicField/> must be within the <fields> tag.

On Sun, Sep 8, 2013 at 3:31 AM, Furkan KAMACI furkankam...@gmail.com wrote:

Could you show us the logs you get when you start your web container?

2013/9/4 Nutan Shinde nutanshinde1...@gmail.com

My solrconfig.xml is:

<requestHandler name="/update/extract" class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="fmap.content">desc</str> <!-- to map this field of my table which is defined as shown below in schema.xml -->
    <str name="lowernames">true</str>
    <str name="uprefix">attr_</str>
    <str name="captureAttr">true</str>
  </lst>
</requestHandler>
<lib dir="../../extract" regex=".*\.jar" />

Schema.xml:

<fields>
  <field name="doc_id" type="integer" indexed="true" stored="true" multiValued="false"/>
  <field name="name" type="text" indexed="true" stored="true" multiValued="false"/>
  <field name="path" type="text" indexed="true" stored="true" multiValued="false"/>
  <field name="desc" type="text_split" indexed="true" stored="true" multiValued="false"/>
</fields>
<types>
  <fieldType name="string" class="solr.StrField" />
  <fieldType name="integer" class="solr.IntField" />
  <fieldType name="text" class="solr.TextField" />
</types>
<dynamicField name="*_i" type="integer" indexed="true" stored="true"/>
<uniqueKey>doc_id</uniqueKey>

I have created an extract directory, copied all the required .jar and solr-cell jar files into it, and given its path in the <lib> tag in solrconfig.xml. When I try this:

curl "http://localhost:8080/solr/update/extract?literal.doc_id=1&commit=true" -F myfile=@solr-word.pdf

on Windows 7, I get "/solr/update/extract is not available" and sometimes an access denied error. I tried resolving this through the net, but in vain, as all the solutions are related to Linux and I am working on Windows. Please help me and provide solutions related to Windows. I referred to the Apache_solr_4_Cookbook. Thanks a lot.
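For comparison, the stock Solr 4.x example solrconfig.xml pulls in Tika and Solr Cell with <lib> directives like the following (paths are relative to the core's instance directory and may need adjusting to your layout; a missing jar is a common cause of the handler reporting itself unavailable):

<lib dir="../../contrib/extraction/lib" regex=".*\.jar" />
<lib dir="../../dist/" regex="solr-cell-\d.*\.jar" />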
Re: Expunge deleting using excessive transient disk space
Right, but you should have at least as much free space as your total index size, and I don't see the total index size (but I'm just glancing). I'm not entirely sure you can precisely calculate the maximum free space needed relative to the amount used for merging; some of the people who wrote that code can probably tell you more. I'd _really_ try to get more disk space. The amount of engineer time spent trying to tune this is way more expensive than a disk...

Best,
Erick

On Sun, Sep 8, 2013 at 11:51 AM, Manuel Le Normand manuel.lenorm...@gmail.com wrote:

Hi, in order to delete part of my index I run a delete-by-query that is intended to erase 15% of the docs. [...]
Dynamic Field
Hi all, I am using Solr dynamic fields. I am storing data in the following format:

id  batch_*  job_*

So for a doc, data is stored like:

id  batch_21  job_21  job_22  batch_22  ...
1   120       0       1       121       ...

Using the Luke request handler I found that there are currently more than 5k fields and 300 docs, and the field count keeps increasing because of the dynamic fields. So I am worried about Solr performance or any unknown issues this could cause. If somebody has experienced this, please tell me, and please suggest the correct way to handle it. Are there any alternatives to dynamic fields? Can we store information like the below?

id   jobs           batch
21   {21:0, 22:1}   {21:120, 22:121}
Re: Some highlighted snippets aren't being returned
Hi again Everyone, I didn't get any replies to this, so I thought I'd re-send in case anyone missed it and has any thoughts. Thanks, Eric

On Aug 7, 2013, at 1:51 PM, Eric O'Hanlon elo2...@columbia.edu wrote:

Hi Everyone, I'm facing an issue in which my solr query is returning highlighted snippets for some, but not all, results. For reference, I'm searching through an index that contains web crawls of human-rights-related websites. I'm running solr as a webapp under Tomcat and I've included the query's solr params from the Tomcat log:

webapp=/solr-4.2 path=/select params={facet=true&sort=score+desc&group.limit=10&spellcheck.q=Unangan&f.mimetype_code.facet.limit=7&hl.simple.pre=<code>&q.alt=*:*&f.organization_type__facet.facet.limit=6&f.language__facet.facet.limit=6&hl=true&f.date_of_capture_.facet.limit=6&group.field=original_url&hl.simple.post=</code>&facet.field=domain&facet.field=date_of_capture_&facet.field=mimetype_code&facet.field=geographic_focus__facet&facet.field=organization_based_in__facet&facet.field=organization_type__facet&facet.field=language__facet&facet.field=creator_name__facet&hl.fragsize=600&f.creator_name__facet.facet.limit=6&facet.mincount=1&qf=text^1&hl.fl=contents&hl.fl=title&hl.fl=original_url&wt=ruby&f.geographic_focus__facet.facet.limit=6&defType=edismax&rows=10&f.domain.facet.limit=6&q=Unangan&f.organization_based_in__facet.facet.limit=6&q.op=AND&group=true&hl.usePhraseHighlighter=true} hits=8 status=0 QTime=108

For the query above (which can be simplified to say: find all documents that contain the word "unangan" and return facets, highlights, etc.), I get five search results. Only three of these are returning highlighted snippets. Here's the highlighting portion of the solr response (note: printed in ruby notation because I'm receiving this response in a Rails app):

"highlighting"=>
{"20100602195444/http://www.kontras.org/uu_ri_ham/UU%20Nomor%2023%20Tahun%202002%20tentang%20Perlindungan%20Anak.pdf"=>{},
 "20100902203939/http://www.kontras.org/uu_ri_ham/UU%20Nomor%2023%20Tahun%202002%20tentang%20Perlindungan%20Anak.pdf"=>{},
 "20111202233029/http://www.kontras.org/uu_ri_ham/UU%20Nomor%2023%20Tahun%202002%20tentang%20Perlindungan%20Anak.pdf"=>{},
 "20100618201646/http://www.komnasham.go.id/portal/files/39-99.pdf"=>{"contents"=>[...actual snippet is returned here...]},
 "20100902235358/http://www.komnasham.go.id/portal/files/39-99.pdf"=>{"contents"=>[...actual snippet is returned here...]},
 "20110302213056/http://www.komnasham.go.id/publikasi/doc_download/2-uu-no-39-tahun-1999"=>{"contents"=>[...actual snippet is returned here...]},
 "20110302213102/http://www.komnasham.go.id/publikasi/doc_view/2-uu-no-39-tahun-1999?tmpl=component&format=raw"=>{"contents"=>[...actual snippet is returned here...]},
 "20120303113654/http://www.iwgia.org/iwgia_files_publications_files/0028_Utimut_heritage.pdf"=>{}}

I have eight (as opposed to five) results above because I'm also doing a grouped query, grouping by a field called original_url, and this leads to five grouped results. I've confirmed that my highlight-lacking results DO contain the word "unangan", as expected, and this term appears in a text field that's indexed and stored and is searched in all text searches. For example, one of the search results is for a crawl of this document:

http://www.iwgia.org/iwgia_files_publications_files/0028_Utimut_heritage.pdf

And if you view that document on the web, you'll see that it does contain "unangan". Has anyone seen this before? And does anyone have any good suggestions for troubleshooting/fixing the problem? Thanks! - Eric
Re: Some highlighted snippets aren't being returned
Zip up all your configs.

Bill Bell
Sent from mobile

On Sep 8, 2013, at 3:00 PM, Eric O'Hanlon elo2...@columbia.edu wrote:

Hi again Everyone, I didn't get any replies to this, so I thought I'd re-send in case anyone missed it and has any thoughts. Thanks, Eric [...]
Re: Dynamic Field
1. First, tell us how your application intends to use and query your data. That will be a guide to how your data should be stored.
2. Flatten your data.
3. Use dynamic and multivalued fields only in moderation.

-- Jack Krupansky

-----Original Message----- From: anurag.jain Sent: Sunday, September 08, 2013 3:49 PM To: solr-user@lucene.apache.org Subject: Dynamic Field

Hi all, I am using Solr dynamic fields. I am storing data in the following format: id, batch_*, job_* [...]
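To make the flattening suggestion concrete (a sketch only; the one-document-per-(id, batch) layout and field names are assumptions, not Jack's exact proposal), the batch_*/job_* dynamic fields could become fixed fields on one document per student-batch pair:

curl 'http://localhost:8983/solr/update?commit=true' -H 'Content-type: application/json' -d '
[{"id": "1-21", "student_id": 1, "batch": 21, "job": 0, "batch_value": 120},
 {"id": "1-22", "student_id": 1, "batch": 22, "job": 1, "batch_value": 121}]'

This keeps the field count constant no matter how many batches appear.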
SOLR index Recovery availability
Hi Team, need your suggestions/views on the approach I have in place for SOLR availability and recovery.

I am running SOLR 3.5 and have around 30k documents indexed in my SOLR core. I have configured SOLR to hold 5k documents in each segment at a time. I periodically commit and optimize my SOLR index. I have delta indexing in place to index new documents in SOLR. Very rarely I face an index corruption issue; to fix it I have a CheckIndex -fix job in place as well. However, sometimes this job can delete the corrupt segment (meaning a loss of 5k documents until I fully re-index SOLR).

I have a few follow-up questions on this case:
1. How can I avoid the loss of 5k documents (CheckIndex -fix)? Shall I reduce the number of documents per segment? Is there an alternate solution?
2. If I start taking periodic backups (snapshots) of the entire index, shall I just replace my data/index folder from the backup folder if corruption is found? Is this a good implementation?
3. Any other good solution or suggestion to have maximum index availability at all times?

Thanks in advance for giving your time. Atul
Re: SOLR index Recovery availability
This sounds very complicated for only 30K documents. Put them all on one server and give it enough memory so that the index can all be in file buffers. If there is a disaster, reindex everything; that should only take a few minutes. And don't optimize.

wunder

On Sep 8, 2013, at 3:01 PM, atuldj.jadhav wrote:

Hi Team, need your suggestions/views on the approach I have in place for SOLR availability and recovery. [...]

--
Walter Underwood
wun...@wunderwood.org
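On the snapshot question specifically, the ReplicationHandler available in Solr 3.5 can take a backup of the index on demand, which is less error-prone than copying data/index by hand. A sketch, assuming the handler is registered at the default /replication path:

curl "http://localhost:8983/solr/replication?command=backup"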
multiple update processor chains.
Is it possible to have multiple update processor chains run by default? I've tried adding multiple update.chain parameters for the UpdateRequestHandler but it didn't seem to work. Wondering if it's even possible. Thanks, msj
Data import
Hi, is it possible to disable document updates when running the data import full-import command? Thanks
RE: Some highlighted snippets aren't being returned
Eric, your example document is quite long. Are you setting hl.maxAnalyzedChars? If you don't, the highlighter you appear to be using will not look past the first 51,200 characters of the document for snippet candidates.

http://wiki.apache.org/solr/HighlightingParameters#hl.maxAnalyzedChars

-- Bryan

-----Original Message----- From: Eric O'Hanlon [mailto:elo2...@columbia.edu] Sent: Sunday, September 08, 2013 2:01 PM To: solr-user@lucene.apache.org Subject: Re: Some highlighted snippets aren't being returned

Hi again Everyone, I didn't get any replies to this, so I thought I'd re-send in case anyone missed it and has any thoughts. Thanks, Eric [...]
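A sketch of Bryan's suggestion in query form (the value is arbitrary here; set it at least as large as the longest document you need highlighted, at some CPU cost per request):

curl "http://localhost:8983/solr/select?q=unangan&hl=true&hl.fl=contents&hl.maxAnalyzedChars=1000000&wt=ruby&indent=true"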
Re: Tweaking boosts for more search results variety
Sorry for the delayed response. The limitations in this scenario: we have 5 million indexed documents from only about 1000 sites. If results are grouped by site, we will not be able to show more than a couple of pages for many search keywords.

Example: a search for "Solr" has 1000 matches, but only from 20 sites. Of these 20 sites:
10 sites are of sitetype A - boost 5
7 sites are of sitetype B - boost 2
3 sites are of sitetype C - boost 1

Limitation 1: if these are grouped by site, only 20 results would be displayed over 2 pages (10 per page). We still want to display all the results. For a better user experience we would ideally like the 10 results on page 1 to come from 10 distinct sites of sitetype A (which already has the higher boost), or in a real-world scenario from 7-8 distinct sites. In our current setup we see something like 7 matches on a page from a single site.

Limitation 2: inverse document frequency (IDF) would have helped here, but then our preferential boost for sitetypes is ignored and some results from sitetype C would come out on top due to the IDF boost.

What we want to achieve is some way to control the variety of sites displayed in search results with the preferential boosts still in place. Thanks in advance.

On Sun, Sep 8, 2013 at 6:36 AM, Furkan KAMACI furkankam...@gmail.com wrote:

What do you mean by *these limitations*? Do you want to do multiple groupings at the same time?

2013/9/6 Sai Gadde gadde@gmail.com

Thank you Jack for the suggestion. We can try grouping by site. But considering that the number of sites is only about 1000 against an index size of 5 million, one can expect most of the hits to be hidden, and for certain specific keywords only a handful of actual results could be displayed if results are grouped by site. We already group on a signature field to identify duplicate content in these 5 million+ docs, but there the number of duplicates is only about 3-5% maximum. Is there any workaround for these limitations with grouping? Thanks, Shyam

On Thu, Sep 5, 2013 at 9:16 PM, Jack Krupansky j...@basetechnology.com wrote:

The grouping (field collapsing) feature somewhat addresses this: group by a site field, and then if more than one or a few top pages are from the same site they get grouped or collapsed so that you can see more sites in a few results. See:
http://wiki.apache.org/solr/FieldCollapsing
https://cwiki.apache.org/confluence/display/solr/Result+Grouping

-- Jack Krupansky

-----Original Message----- From: Sai Gadde Sent: Thursday, September 05, 2013 2:27 AM To: solr-user@lucene.apache.org Subject: Tweaking boosts for more search results variety

Our index is aggregated content from various sites on the web. We want a good user experience by showing multiple sites in the search results. In our setup we are seeing most of the results from the same site at the top. Here is some information regarding queries and schema:

site - String field. We have about 1000 sites in the index.
sitetype - String field. We have 3 site types.
omitNorms=true for both fields.
Doc count varies largely based on site and sitetype, by a factor of 10 - 1000 times.
Total index size is about 5 million docs.
Solr version: 4.0

In our queries we have a fixed, preferential boost for certain sites. sitetype has a different fixed boost for each of its 3 possible values. We turned off inverse document frequency (IDF) for these boosts to work properly. Other text fields are boosted based on search keywords only.

With this setup we often see a bunch of hits from a single site followed by the next site, and so on. Is there any solution that gives results from a variety of sites while still keeping the preferential boosts in place?
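One option worth testing against both limitations (a sketch; whether the boost interplay survives needs verification): keep grouping by site, but cap documents per site and flatten the groups back into a single paginated list with group.main:

curl "http://localhost:8983/solr/select?q=solr&group=true&group.field=site&group.limit=2&group.main=true&rows=10"

With group.main=true the grouped result is returned as a flat, boost-ordered list in which each site contributes at most group.limit documents, trading some depth per site for variety.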
Re: Data import
What do you specifically mean by "disable document update"? Do you mean in-place updates? Or do you mean you want to run the import but not actually populate the Solr collection with processed documents? It might help to explain the business-level goal you are trying to achieve, or the specific error that you are seeing and trying to avoid.

Regards,
Alex.

Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)

On Mon, Sep 9, 2013 at 6:42 AM, Luís Portela Afonso meligalet...@gmail.com wrote:

Hi, is it possible to disable document updates when running the data import full-import command? Thanks
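If the goal is simply to keep a full-import from first wiping the existing documents (a guess at the intent), the DataImportHandler command accepts a clean parameter, which defaults to true for full-import:

curl "http://localhost:8983/solr/dataimport?command=full-import&clean=false&commit=true"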
Re: Solr4.4 or zookeeper 3.4.5 do not support too many collections? more than 600?
Thank you Erick, that's very useful to me. I have already started to merge the logs of the collections into 15 collections. But there's another question: if I merge 1000 collections into 1 collection, the new collection will have about 20GB of data and about 30M records. On one Solr server I will create 15 such big collections. So I don't know whether Solr can support such big data in one collection (20GB of data with 30M records) or on one Solr server (15*20GB of data with 15*30M records)? Or do I need to buy new servers to install Solr on and do sharding to support that?
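If one 20GB/30M-record collection does turn out to be too much for a single node, SolrCloud can split it across machines at creation time via the Collections API; a sketch with placeholder names and counts:

curl "http://localhost:8983/solr/admin/collections?action=CREATE&name=merged_logs&numShards=4&replicationFactor=2"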
Re: multiple update processor chains.
Only one chain per handler. But then you can define any sequence inside the chain, so why do you care about multiple chains?

Regards,
Alex.

Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)

On Mon, Sep 9, 2013 at 5:43 AM, mike st. john mstj...@gmail.com wrote:

Is it possible to have multiple update processor chains run by default? I've tried adding multiple update.chain parameters for the UpdateRequestHandler but it didn't seem to work. Wondering if it's even possible. Thanks, msj
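A common pattern, then, is a single default chain that strings the desired processors together and ends with RunUpdateProcessorFactory (a sketch; the processor list and chain name are illustrative):

<updateRequestProcessorChain name="everything" default="true">
  <processor class="solr.UUIDUpdateProcessorFactory"/>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <!-- RunUpdateProcessorFactory actually indexes the document; without it nothing is written -->
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

A handler can also select it explicitly with <str name="update.chain">everything</str> in its defaults.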
Re: Loading a SpellCheck dynamically
Hi, thanks for the response. Per your instructions, I have set up additional request handlers to handle language-specific selects:

<!-- generic query -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <int name="rows">10</int>
    <str name="df">text</str>
    <str name="spellcheck">true</str>
    <str name="spellcheck.collate">true</str>
    <str name="spellcheck.onlyMorePopular">false</str>
    <str name="spellcheck.extendedResults">false</str>
    <str name="spellcheck.count">3</str>
  </lst>
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>

<!-- English-specific query -->
<requestHandler name="/select_en" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <int name="rows">10</int>
    <str name="df">text</str>
    <str name="spellcheck">true</str>
    <str name="spellcheck.collate">true</str>
    <str name="spellcheck.onlyMorePopular">false</str>
    <str name="spellcheck.extendedResults">false</str>
    <str name="spellcheck.count">3</str>
  </lst>
  <arr name="last-components">
    <str>spellcheck_en</str>
  </arr>
</requestHandler>

While it may require additional setup, I think it works quite elegantly and allows me to do more language-targeted queries in addition to spell suggest. Thanks again. Cheers, Hayden

On 06/09/13 16:35, Shalin Shekhar Mangar wrote:

My guess is that you have a single request handler defined with all your language-specific spellcheck components. This is why you see spellcheck values from all spellcheckers. If the above is true, then I don't think there is a way to choose one specific spellchecker component. The alternative is to define multiple request handlers with a one-to-one mapping to the spellcheck components. Then you can send a request to one particular request handler and the corresponding spellcheck component will return its response.

On Thu, Sep 5, 2013 at 11:29 PM, Mr Havercamp mrhaverc...@gmail.com wrote:

I currently have multiple spellchecks configured in my solrconfig.xml to handle a variety of different spell suggestions in different languages. In the snippet below, I have a catch-all spellcheck as well as an English-only one for more accurate matching (i.e. my schema.xml is set up to capture English-only fields to an English-specific textSpell_en field, and also to capture to a generic textSpell field):

---solrconfig.xml---

<searchComponent name="spellcheck_en" class="solr.SpellCheckComponent">
  <str name="queryAnalyzerFieldType">textSpell_en</str>
  <lst name="spellchecker">
    <str name="name">default</str>
    <str name="field">spell_en</str>
    <str name="spellcheckIndexDir">./spellchecker_en</str>
    <str name="buildOnOptimize">true</str>
  </lst>
</searchComponent>

<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <str name="queryAnalyzerFieldType">textSpell</str>
  <lst name="spellchecker">
    <str name="name">default</str>
    <str name="field">spell</str>
    <str name="spellcheckIndexDir">./spellchecker</str>
    <str name="buildOnOptimize">true</str>
  </lst>
</searchComponent>

My question is: when I query my Solr index, am I able to load, say, just spellcheck values from the spellcheck_en spellchecker rather than from both? This would be useful if I were to start implementing additional language spellchecks, e.g. spellcheck_ja, spellcheck_fr, etc. Thanks for any insights. Cheers, Hayden
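For completeness, a query against the English-specific handler then looks something like this (host, port, and the query term are placeholders):

curl "http://localhost:8983/solr/select_en?q=colour&wt=json&indent=true"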
Searching solr on school name during year
Hi, currently I have a student search which allows me to search for documents in a school. I am looking at adding year search to the existing schema, which would enable users to search for students in a school during a given year. I have a proposed schema change to add the year component to facilitate this search.

Existing schema (no year information currently):

<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<field name="name" type="text_general" indexed="true" stored="true" />
<field name="schoolName" type="text_general" indexed="true" stored="true" multiValued="true"/>

Current sample data:

name: Borris Mayers
schoolName: Canterbury University

New schema:

<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<field name="name" type="text_general" indexed="true" stored="true" />
<field name="schoolName" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="schoolNameWithTermOriginal" type="string" indexed="false" stored="true" multiValued="true"/>

Sample data:

name: Borris Mayers
schoolName: Canterbury University, start_2001, year_2001, year_2002, year_2003, year_2004, year_2005, end_2005
schoolNameWithTermOriginal: Canterbury University||2001-2005

Please suggest whether this is a correct approach or whether there is a better way to do the same. I am using Solr 4.3. Thanks, Rohit Kumar
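Under the proposed schema, a students-at-a-school-during-a-year search would presumably combine the school name with one of the injected year tokens; a sketch (URL-encoded quotes; everything except the field name is hypothetical):

curl "http://localhost:8983/solr/select?q=schoolName:(%22Canterbury+University%22+AND+year_2003)&wt=json&indent=true"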