Re: Some highlighted snippets aren't being returned
Hi Eric,

As Bryan suggests, you should look at appropriately setting hl.fragsize and hl.maxAnalyzedChars for long documents. One issue I find with your search request is that in trying to highlight across three separate fields, you have added each of them as a separate request param:

  hl.fl=contents&hl.fl=title&hl.fl=original_url

The way to do it (http://wiki.apache.org/solr/HighlightingParameters#hl.fl) is to pass them as one comma- (or space-) separated value:

  hl.fl=contents,title,original_url

Regards,
Aloke

On 9/9/13, Bryan Loofbourrow bloofbour...@knowledgemosaic.com wrote:

Eric,

Your example document is quite long. Are you setting hl.maxAnalyzedChars? If you don't, the highlighter you appear to be using will not look past the first 51,200 characters of the document for snippet candidates.

http://wiki.apache.org/solr/HighlightingParameters#hl.maxAnalyzedChars

-- Bryan

-----Original Message-----
From: Eric O'Hanlon [mailto:elo2...@columbia.edu]
Sent: Sunday, September 08, 2013 2:01 PM
To: solr-user@lucene.apache.org
Subject: Re: Some highlighted snippets aren't being returned

Hi again Everyone,

I didn't get any replies to this, so I thought I'd re-send in case anyone missed it and has any thoughts.

Thanks,
Eric

On Aug 7, 2013, at 1:51 PM, Eric O'Hanlon elo2...@columbia.edu wrote:

Hi Everyone,

I'm facing an issue in which my Solr query is returning highlighted snippets for some, but not all, results. For reference, I'm searching through an index that contains web crawls of human-rights-related websites. I'm running Solr as a webapp under Tomcat, and I've included the query's Solr params from the Tomcat log:

  webapp=/solr-4.2 path=/select params={facet=true&sort=score+desc&group.limit=10&spellcheck.q=Unangan&f.mimetype_code.facet.limit=7&hl.simple.pre=<code>&q.alt=*:*&f.organization_type__facet.facet.limit=6&f.language__facet.facet.limit=6&hl=true&f.date_of_capture_.facet.limit=6&group.field=original_url&hl.simple.post=</code>&facet.field=domain&facet.field=date_of_capture_&facet.field=mimetype_code&facet.field=geographic_focus__facet&facet.field=organization_based_in__facet&facet.field=organization_type__facet&facet.field=language__facet&facet.field=creator_name__facet&hl.fragsize=600&f.creator_name__facet.facet.limit=6&facet.mincount=1&qf=text^1&hl.fl=contents&hl.fl=title&hl.fl=original_url&wt=ruby&f.geographic_focus__facet.facet.limit=6&defType=edismax&rows=10&f.domain.facet.limit=6&q=Unangan&f.organization_based_in__facet.facet.limit=6&q.op=AND&group=true&hl.usePhraseHighlighter=true} hits=8 status=0 QTime=108

For the query above (which can be simplified to: find all documents that contain the word "unangan" and return facets, highlights, etc.), I get five search results. Only three of these are returning highlighted snippets.
Here's the highlighting portion of the Solr response (note: printed in Ruby notation because I'm receiving this response in a Rails app):

  "highlighting"=>{
    "20100602195444/http://www.kontras.org/uu_ri_ham/UU%20Nomor%2023%20Tahun%202002%20tentang%20Perlindungan%20Anak.pdf"=>{},
    "20100902203939/http://www.kontras.org/uu_ri_ham/UU%20Nomor%2023%20Tahun%202002%20tentang%20Perlindungan%20Anak.pdf"=>{},
    "20111202233029/http://www.kontras.org/uu_ri_ham/UU%20Nomor%2023%20Tahun%202002%20tentang%20Perlindungan%20Anak.pdf"=>{},
    "20100618201646/http://www.komnasham.go.id/portal/files/39-99.pdf"=>{"contents"=>[...actual snippet is returned here...]},
    "20100902235358/http://www.komnasham.go.id/portal/files/39-99.pdf"=>{"contents"=>[...actual snippet is returned here...]},
    "20110302213056/http://www.komnasham.go.id/publikasi/doc_download/2-uu-no-39-tahun-1999"=>{"contents"=>[...actual snippet is returned here...]},
    "20110302213102/http://www.komnasham.go.id/publikasi/doc_view/2-uu-no-39-tahun-1999?tmpl=component&format=raw"=>{"contents"=>[...actual snippet is returned here...]},
    "20120303113654/http://www.iwgia.org/iwgia_files_publications_files/0028_Utimut_heritage.pdf"=>{}}

I have eight (as opposed to five) results above because I'm also doing a grouped query, grouping by a field called original_url, and this leads to five grouped results.

I've confirmed that my highlight-lacking results DO contain the word "unangan", as expected, and this term appears in a text field that's indexed and stored and that is searched for all text searches. For example, one of the search results is for a crawl of this document:

  http://www.iwgia.org/iwgia_files_publications_files/0028_Utimut_heritage.pdf

And if you view that document on the web, you'll see that it does contain "unangan".

Has anyone seen this before? And does anyone have any good suggestions for troubleshooting/fixing the problem?

Thanks!
- Eric
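Putting the two fixes from this thread together, a minimal sketch of the relevant highlighting parameters. Everything here comes from Eric's request except the hl.maxAnalyzedChars value, which is an assumption — pick a number at least as large as your longest document (the default is 51200):

  q=Unangan
  &hl=true
  &hl.fl=contents,title,original_url
  &hl.fragsize=600
  &hl.usePhraseHighlighter=true
  &hl.maxAnalyzedChars=1000000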
Re: multiple update processor chains.
Alexandre, it was set up with multiple processors and working fine. I just noticed that the docs mention you can have multiple chains; it seemed to make sense to have the ability to chain the defined processors in order without needing to merge them into a single update processor definition.

thanks
msj

On Mon, Sep 9, 2013 at 12:28 AM, Alexandre Rafalovitch arafa...@gmail.com wrote:

Only one chain per handler. But then you can define any sequence inside the chain, so why do you care about multiple chains?

Regards,
Alex.

Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)

On Mon, Sep 9, 2013 at 5:43 AM, mike st. john mstj...@gmail.com wrote:

Is it possible to have multiple chains run by default? I've tried adding multiple update.chains for the UpdateRequestHandler, but it didn't seem to work. Wondering if it's even possible.

Thanks
msj
Re: multiple update processor chains.
Which section in the docs specifically? I thought it was multiple chains per config file, but you had to choose your specific chain for individual processors. I might be wrong though.

Regards,
Alex.

Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)

On Mon, Sep 9, 2013 at 1:51 PM, mike st. john mstj...@gmail.com wrote:

Alexandre, it was set up with multiple processors and working fine. I just noticed that the docs mention you can have multiple chains; it seemed to make sense to have the ability to chain the defined processors in order without needing to merge them into a single update processor definition.

thanks
msj

On Mon, Sep 9, 2013 at 12:28 AM, Alexandre Rafalovitch arafa...@gmail.com wrote:

Only one chain per handler. But then you can define any sequence inside the chain, so why do you care about multiple chains?

Regards,
Alex.

On Mon, Sep 9, 2013 at 5:43 AM, mike st. john mstj...@gmail.com wrote:

Is it possible to have multiple chains run by default? I've tried adding multiple update.chains for the UpdateRequestHandler, but it didn't seem to work. Wondering if it's even possible.

Thanks
msj
Ideal Server Environment
Hi guys,

I am trying to set up a LIVE environment for my project that uses Apache Solr along with PHP/MySQL. The index holds heavy data (many GB). Can someone please recommend the best server for this?

Thanks a lot.

--
Regards,
Raheel Hasan
Re: Ideal Server Environment
Also, I wonder whether Solr will require a fast processor, high memory, or high storage:

1) For indexing
2) For querying

On Mon, Sep 9, 2013 at 12:36 PM, Raheel Hasan raheelhasan@gmail.com wrote:

Hi guys, I am trying to set up a LIVE environment for my project that uses Apache Solr along with PHP/MySQL. The index holds heavy data (many GB). Can someone please recommend the best server for this? Thanks a lot.

--
Regards,
Raheel Hasan
Re: multiple update processor chains.
You're correct, it's not specifically for update.chain. My mistake.

thanks
msj

On Mon, Sep 9, 2013 at 3:34 AM, Alexandre Rafalovitch arafa...@gmail.com wrote:

Which section in the docs specifically? I thought it was multiple chains per config file, but you had to choose your specific chain for individual processors. I might be wrong though.

Regards,
Alex.

Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)

On Mon, Sep 9, 2013 at 1:51 PM, mike st. john mstj...@gmail.com wrote:

Alexandre, it was set up with multiple processors and working fine. I just noticed that the docs mention you can have multiple chains; it seemed to make sense to have the ability to chain the defined processors in order without needing to merge them into a single update processor definition.

thanks
msj

On Mon, Sep 9, 2013 at 12:28 AM, Alexandre Rafalovitch arafa...@gmail.com wrote:

Only one chain per handler. But then you can define any sequence inside the chain, so why do you care about multiple chains?

Regards,
Alex.

On Mon, Sep 9, 2013 at 5:43 AM, mike st. john mstj...@gmail.com wrote:

Is it possible to have multiple chains run by default? I've tried adding multiple update.chains for the UpdateRequestHandler, but it didn't seem to work. Wondering if it's even possible.

Thanks
msj
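To make the resolution concrete, here is a hedged sketch of how a named chain is defined and then selected per handler in solrconfig.xml. The chain name and processor list are illustrative assumptions, not from the thread:

  <updateRequestProcessorChain name="mychain">
    <processor class="solr.LogUpdateProcessorFactory"/>
    <processor class="solr.DistributedUpdateProcessorFactory"/>
    <processor class="solr.RunUpdateProcessorFactory"/>
  </updateRequestProcessorChain>

  <requestHandler name="/update" class="solr.UpdateRequestHandler">
    <lst name="defaults">
      <str name="update.chain">mychain</str>
    </lst>
  </requestHandler>

This matches Alexandre's point: many chains may be defined, but each handler (or each request, via the update.chain parameter) selects exactly one.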
Re: Solr4.4 or zookeeper 3.4.5 do not support too many collections? more than 600?
If you want to have more collections, you need to configure the -Djute.maxbuffer variable on both ZooKeeper and Solr to override the default limitation. In ZooKeeper you can configure it in the zookeeper-env.sh file. On Solr, pass the variable like the others.

Note: in both cases the configured value needs to be the same, or bad things can happen.

--
Yago Riveiro
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)

On Monday, September 9, 2013 at 5:01 AM, diyun2008 wrote:

Thank you Erick. It's very useful to me. I have already started to merge lots of collections into 15 collections. But there's another question: if I merge 1000 collections into 1 collection, the new collection will have about 20G of data and about 30M records. On one Solr server I will create 15 such big collections. So I don't know whether Solr can support such big data in one collection (20G of data with 30M records) or on one Solr server (15*20G of data with 15*30M records)? Or do I need to buy new servers to install Solr and do sharding to support that?

--
View this message in context: http://lucene.472066.n3.nabble.com/Solr4-4-or-zookeeper-3-4-5-do-not-support-too-many-collections-more-than-600-tp4088689p4088802.html
Sent from the Solr - User mailing list archive at Nabble.com (http://Nabble.com).
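A hedged sketch of what Yago describes. The 10 MB value is an illustrative assumption (the ZooKeeper default is 1 MB), and it must be identical on every ZooKeeper and Solr node:

  # zookeeper-env.sh (every ZooKeeper node)
  JVMFLAGS="$JVMFLAGS -Djute.maxbuffer=10485760"

  # Solr start command (every Solr node)
  java -Djute.maxbuffer=10485760 ... -jar start.jar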
Re: Dynamic Field
Hi:

As you posted, a possibility could be to define the fields jobs and batch as multivalued and use partial update (http://solr.pl/en/2012/07/09/solr-4-0-partial-documents-update/) to add new values to those fields.

Hope it helps.

On Sun, Sep 8, 2013 at 9:49 PM, anurag.jain anurag.k...@gmail.com wrote:

Hi all, I am using Solr dynamic fields. I am storing data in the following format:

  id  batch_*  job_*

So for a doc, data is stored like:

  id  batch_21  job_21  job_22  batch_22  ...
  1   120       0       1       121       ...

Using the Luke request handler I found that there are currently more than 5k fields and 300 docs, and the fields are always increasing because of the dynamic fields. So I am worried about Solr performance or any unknown issues which could hit Solr. If somebody has experienced this, please tell me, and please tell me the correct way to handle these issues. Are there any alternatives to dynamic fields? Can we store information like below?

  id  jobs            batch
  21  {21:0, 22:1}    {21:120, 22:121}

--
View this message in context: http://lucene.472066.n3.nabble.com/Dynamic-Field-tp4088775.html
Sent from the Solr - User mailing list archive at Nabble.com.
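A hedged sketch of the atomic ("partial") update the linked article describes, assuming jobs and batch are declared multivalued and stored. The field names follow the thread; the "batchId:value" string encoding of each entry is an assumption, since Solr fields cannot hold nested maps directly:

  curl 'http://localhost:8983/solr/update?commit=true' -H 'Content-type:application/json' -d '
  [{"id":"1",
    "jobs":  {"add": "21:0"},
    "batch": {"add": "21:120"}}]'

Note that atomic updates require all other fields to be stored, because Solr rebuilds the document internally when applying the update.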
help regarding custom query which returns custom output
hi all,

I have a requirement: I have implemented full-text search, autosuggestion and spell-correction functionality in Solr, but they are all running on different cores, so I have to call three different request handlers to get the results, which adds unnecessary delay. I wanted to know whether there is any solution where I call just one request URL and get all three results in one JSON response from Solr.

thanx
regards
rohan
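One common approach for the single-core case — a hedged sketch, not an answer from the thread — is to fold spell-correction into the main search handler by attaching the spellcheck search component, so one request returns both results and suggestions (with wt=json, in one JSON response). Whether this fits a multi-core layout like Rohan's is a separate question:

  <requestHandler name="/select" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="spellcheck">true</str>
      <str name="spellcheck.collate">true</str>
    </lst>
    <arr name="last-components">
      <str>spellcheck</str>
    </arr>
  </requestHandler>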
Re: Ideal Server Environment
On Mon, 2013-09-09 at 09:39 +0200, Raheel Hasan wrote:

  Also, I wonder if Solr will require High processor? High Memory or High Storage?
  1) For Indexing

* Processor
* Bulk read/write.

  2) For querying

* Processor, only if you have complex queries
* Fast random I/O reads, which can be accomplished either by having enough RAM to cache most or all of your index, or by using SSDs.

Your question is much too generic to go into specific hardware. Read

  https://wiki.apache.org/lucene-java/ImproveIndexingSpeed
  https://wiki.apache.org/lucene-java/ImproveSearchingSpeed
  https://wiki.apache.org/solr/SolrPerformanceProblems

then build a test instance, measure and scale from there.

- Toke Eskildsen
Re: Ideal Server Environment
ok, thanks for the reply. Also, could you tell me if CentOS or Ubuntu will be better?

On Mon, Sep 9, 2013 at 3:17 PM, Toke Eskildsen t...@statsbiblioteket.dk wrote:

On Mon, 2013-09-09 at 09:39 +0200, Raheel Hasan wrote:

  Also, I wonder if Solr will require High processor? High Memory or High Storage?
  1) For Indexing

* Processor
* Bulk read/write.

  2) For querying

* Processor, only if you have complex queries
* Fast random I/O reads, which can be accomplished either by having enough RAM to cache most or all of your index, or by using SSDs.

Your question is much too generic to go into specific hardware. Read

  https://wiki.apache.org/lucene-java/ImproveIndexingSpeed
  https://wiki.apache.org/lucene-java/ImproveSearchingSpeed
  https://wiki.apache.org/solr/SolrPerformanceProblems

then build a test instance, measure and scale from there.

- Toke Eskildsen

--
Regards,
Raheel Hasan
Re: Profiling Solr Lucene for query
are you querying your shards via a frontend solr? We have noticed that querying becomes much faster if results merging can be avoided.

Dmitry

On Sun, Sep 8, 2013 at 6:56 PM, Manuel Le Normand manuel.lenorm...@gmail.com wrote:

Hello all,

Looking at the 10% slowest queries, I get very bad performance (~60 sec per query). These queries have lots of conditions on my main field (more than a hundred), including phrase queries, and rows=1000. I do return only ids, though. I can quite firmly say that this bad performance is due to a slow-storage issue (beyond my control for now). Despite this I want to improve my performance.

As taught in school, I started profiling these queries; the data from a ~1 minute profile is located here: http://picpaste.com/pics/IMG_20130908_132441-ZyrfXeTY.1378637843.jpg

Main observation: most of the time I wait for readVInt, whose stack trace (2 out of 2 thread dumps) is:

  catalina-exec-3870 - Thread t@6615
  java.lang.Thread.State: RUNNABLE
    at org.apache.lucene.store.DataInput.readVInt(DataInput.java:108)
    at org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.loadBlock(BlockTreeTermsReader.java:2357)
    at org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.seekExact(BlockTreeTermsReader.java:1745)
    at org.apache.lucene.index.TermContext.build(TermContext.java:95)
    at org.apache.lucene.search.PhraseQuery$PhraseWeight.<init>(PhraseQuery.java:221)
    at org.apache.lucene.search.PhraseQuery.createWeight(PhraseQuery.java:326)
    at org.apache.lucene.search.BooleanQuery$BooleanWeight.<init>(BooleanQuery.java:183)
    at org.apache.lucene.search.BooleanQuery.createWeight(BooleanQuery.java:384)
    at org.apache.lucene.search.BooleanQuery$BooleanWeight.<init>(BooleanQuery.java:183)
    at org.apache.lucene.search.BooleanQuery.createWeight(BooleanQuery.java:384)
    at org.apache.lucene.search.BooleanQuery$BooleanWeight.<init>(BooleanQuery.java:183)
    at org.apache.lucene.search.BooleanQuery.createWeight(BooleanQuery.java:384)
    at org.apache.lucene.search.IndexSearcher.createNormalizedWeight(IndexSearcher.java:675)
    at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:297)

So I do actually wait for IO as expected, but I might be page faulting too many times while looking for the TermBlocks (.tim file), i.e. locating the term.

As I am reindexing now, would it be useful to lower the termInterval (default 128)? As the FSTs (.tip files) are small (a few 10-100 MB), there are no memory contentions; could I lower this param to 8, for example? The benefit of lowering the term interval would be to force the FST into memory (JVM - thanks to the NRTCachingDirectory), as I do not control the term dictionary file (OS caching loads an average of 6% of it).

General configs: solr 4.3; 36 shards, each with a few million docs. These 36 servers (each server has 2 replicas) run virtualized, 16GB memory each (4GB for the JVM, 12GB left for OS caching), consuming 260GB of disk mounted for the index files.
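For reference, the knob Manuel mentions lives in solrconfig.xml; a hedged sketch, where the value 8 is his own suggestion:

  <indexConfig>
    <termIndexInterval>8</termIndexInterval>
  </indexConfig>

One caveat worth hedging on: the default 4.x codec uses a BlockTree terms index rather than the classic every-Nth-term index, so this setting may simply be ignored unless a codec that honors it is in use.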
Facet Sort with non ASCII Characters
Dear Solr users,

Is there a plan to add support for alphabetical facet sorting with non-ASCII characters?

Best regards,
Sandro

Sandro Zbinden
Software Engineer
Facet sort descending
Dear Solr users,

Is there a plan to add a descending sort order for facet queries?

Best regards,
Sandro

Sandro Zbinden
Software Engineer
Re: More on topic of Meta-search/Federated Search with Solr
Hi Dan,

You might want to take a look at Pazpar2 [1], an open-source federated search engine with first-class support for Solr (in addition to standard information-retrieval protocols like Z39.50/SRU).

[1] http://www.indexdata.com/pazpar2

On Thu, Sep 5, 2013 at 9:55 PM, Paul Libbrecht p...@hoplahup.net wrote:

Hello list,

A student of a friend of mine did his master's on that topic, especially on federated ranking. I have copied his text here: http://direct.hoplahup.net/tmp/FederatedRanking-Koblischke-2009.pdf

Feel free to contact me to reach Robert Koblischke with questions.

Paul

On 28 Aug 2013, at 20:35, Dan Davis wrote:

On Mon, Aug 26, 2013 at 9:06 PM, Amit Jha shanuu@gmail.com wrote:

Would you like to create something like http://knimbus.com?

I work at the National Library of Medicine. We are moving our library catalog to a newer platform, and we will probably include articles. The articles' content and metadata are available from a number of web-scale discovery services such as Primo, Summon, EBSCO's EDS, and EBSCO's traditional API. Most libraries use open-source solutions to avoid the cost of purchasing an expensive enterprise search platform. We are big; we already have a closed-source enterprise search engine (and our own home-grown Entrez search used for PubMed). Since we can already do federated search with the above, I am evaluating the effort of adding such a capability to Apache Solr. Because NLM data is used in the open relevancy project, we actually have the relevancy decisions to judge whether we have done a good job of it.

I obviously think it would be Fun to add federated search to Apache Solr. *Standard disclosure* - my opinions do not represent the opinions of NIH or NLM. Fun is no reason to spend taxpayer money. Enhancing Apache Solr would reduce the risk of putting all our eggs in one basket, and there may be some other relevant benefits. We do use Apache Solr here for more than one other project... so keep up the good work even if my working group decides to go with the closed-source solution.

--
Cheers,
Jakub
Re: Ideal Server Environment
On Mon, 2013-09-09 at 12:42 +0200, Raheel Hasan wrote: Also, could you tell me if CentOS or Ubuntu will be better? You are asking for short answers to complex questions. There is nothing inherent in Solr that favours one Linux installation over another. CentOS is aimed at the enterprise, so I _guess_ that it will be preferable if you have a sysadmin to handle the underlying system for you. If you are to manage it yourself, I would recommend Ubuntu as it is aimed at end-users. - Toke Eskildsen
Re: Solr Cell Question
Thanks Erick,

This is how I was doing it, but when I saw the Solr Cell stuff I figured I'd give it a go. What I ended up doing is the following:

  ModifiableSolrParams params = indexer.index(artifact);
  params.add("fmap.content", "my_custom_field");
  params.add("extractFormat", "text");
  ContentStreamUpdateRequest up = new ContentStreamUpdateRequest("/update/extract");
  up.setParams(params);
  FileStream f = new FileStream(new File(...));
  up.addContentStream(f);

On Fri, Sep 6, 2013 at 9:54 AM, Erick Erickson erickerick...@gmail.com wrote:

It's always frustrating when someone replies with "Why not do it a completely different way?". But I will anyway :).

There's no requirement at all that you send things to Solr to make Solr Cell (aka Tika) do its tricks. Since you're already in SolrJ anyway, why not just parse on the client? This has the advantage of allowing you to offload the Tika processing from Solr, which can be quite expensive. You can use the same Tika jars that come with Solr or download whatever version you want from the Tika project. That way, you can exercise much better control over what's done.

Here's a skeletal program with indexing from a DB mixed in, but it shouldn't be hard at all to pull the DB parts out: http://searchhub.org/dev/2012/02/14/indexing-with-solrj/

FWIW,
Erick

On Thu, Sep 5, 2013 at 5:28 PM, Jamie Johnson jej2...@gmail.com wrote:

Is it possible to configure Solr Cell to only extract and store the body of a document when indexing? I'm currently doing the following, which I thought would work:

  ModifiableSolrParams params = new ModifiableSolrParams();
  params.set("defaultField", "content");
  params.set("xpath", "/xhtml:html/xhtml:body/descendant::node()");
  ContentStreamUpdateRequest up = new ContentStreamUpdateRequest("/update/extract");
  up.setParams(params);
  FileStream f = new FileStream(new File(...));
  up.addContentStream(f);
  up.setAction(ACTION.COMMIT, true, true);
  solrServer.request(up);

But the resulting content is as follows:

  <arr name="content_mvtxt">
    <str/>
    <str>null</str>
    <str>ISO-8859-1</str>
    <str>text/plain; charset=ISO-8859-1</str>
    <str>Just a little test</str>
  </arr>

What I had hoped for was just:

  <arr name="content_mvtxt">
    <str>Just a little test</str>
  </arr>
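A hedged sketch of Erick's alternative: parse with Tika on the client and send only the extracted body text to Solr. The field names ("id", "my_custom_field") and the Solr URL are assumptions, not from the thread; the Tika and SolrJ calls are the standard 4.x-era ones:

  import java.io.File;
  import java.io.FileInputStream;
  import java.io.InputStream;

  import org.apache.solr.client.solrj.impl.HttpSolrServer;
  import org.apache.solr.common.SolrInputDocument;
  import org.apache.tika.metadata.Metadata;
  import org.apache.tika.parser.AutoDetectParser;
  import org.apache.tika.sax.BodyContentHandler;

  public class ClientSideExtract {
      public static void main(String[] args) throws Exception {
          File input = new File(args[0]);

          // Tika extraction on the client; -1 disables the default
          // 100k-character write limit on the body handler.
          AutoDetectParser parser = new AutoDetectParser();
          BodyContentHandler handler = new BodyContentHandler(-1);
          Metadata metadata = new Metadata();
          try (InputStream in = new FileInputStream(input)) {
              parser.parse(in, handler, metadata);
          }

          // Send only the body text; Solr never sees the raw file.
          SolrInputDocument doc = new SolrInputDocument();
          doc.addField("id", input.getAbsolutePath());     // assumed key
          doc.addField("my_custom_field", handler.toString());
          HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr");
          server.add(doc);
          server.commit();
      }
  }

Because extraction happens client-side, the metadata noise (charsets, content types) that Jamie saw in content_mvtxt never reaches the index unless explicitly added from the Metadata object.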
SOLR 4 stopwords and token positions
Hi Everyone,

I'm migrating from Solr 3.x to 4.x, and I'm required to keep the results as close as possible to before. So I'm running some tests and found some differences.

My query is: title_search_pt:(geladeira/refrigerador)

And the parsed query becomes: MultiPhraseQuery(title_search_pt:"(refriger geladeir) (refriger geladeir)")

This is identical in both instances (3.x and 4.x), so that's not the problem.

My document is: "balcão refrigerado e geladeira frigorifica"

which, after analysis, becomes: "balca refriger geladeir frigorif"

That is also identical in both versions, *except for the token positions*. Notice how 'e' disappears, because it is a stopword.

In Solr 3.x the positions are: 1, 2, 3, 4
In Solr 4.x the positions are: 1, 2, 4, 5

Could that be the problem? I've posted a question before here: phrase queries on punctuation (http://stackoverflow.com/questions/15314460/solr-generates-phrase-queries-on-punctuation), and I believe that, combined with the token-position issue, it is causing the discrepancies. I couldn't find any documentation/changelog about token positions with stopwords; hell, I can barely google Solr 4-specific things.

Can this be solved? I wish I could fix the original StackOverflow answer (prevent phrase-query generation on punctuation), but I could live with fixing the token-position thing at least (remember that if things work as before, then I am able to upgrade to 4.x).

Thank you in advance.

PS: just in case, I'm adding the schema (version=1.5) part:

  <fieldtype name="text_pt" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
    <analyzer type="index">
      <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="-" replacement="IIIHYPHENIII"/>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.PatternReplaceFilterFactory" pattern="IIIHYPHENIII" replacement="-"/>
      <filter class="solr.ASCIIFoldingFilterFactory"/>
      <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" preserveOriginal="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="false" words="portugueseStopWords.txt"/>
      <filter class="solr.BrazilianStemFilterFactory"/>
      <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="-" replacement="IIIHYPHENIII"/>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.PatternReplaceFilterFactory" pattern="IIIHYPHENIII" replacement="-"/>
      <filter class="solr.ASCIIFoldingFilterFactory"/>
      <filter class="solr.SynonymFilterFactory" ignoreCase="true" synonyms="portugueseSynonyms.txt" expand="true"/>
      <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" preserveOriginal="1" catenateNumbers="0" catenateAll="0" protected="protwords.txt"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="false" words="portugueseStopWords.txt"/>
      <filter class="solr.BrazilianStemFilterFactory"/>
      <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
  </fieldtype>
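One hedged pointer, not from the thread: the 1,2,3,4 vs 1,2,4,5 difference shown above is exactly what the StopFilterFactory position-increment behavior controls. In 3.x-style configurations it could be switched off; the attribute still parsed in early 4.x but was deprecated (and removed in 5.0), so treat this as a stopgap that may require an older luceneMatchVersion:

  <filter class="solr.StopFilterFactory" ignoreCase="false"
          words="portugueseStopWords.txt"
          enablePositionIncrements="false"/>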
Re: How to Manage RAM Usage at Heavy Indexing
Is there anything that says something about that bug?

2013/8/28 Dan Davis dansm...@gmail.com:

This could be an operating-systems problem rather than a Solr problem. CentOS 6.4 (Linux kernel 2.6.32) may have some issues with page flushing, and I would read up on that. The VM parameters can be tuned in /etc/sysctl.conf.

On Sun, Aug 25, 2013 at 4:23 PM, Furkan KAMACI furkankam...@gmail.com wrote:

Hi Erick;

I wanted to get a quick answer, that's why I asked my question that way. The error is as follows:

  INFO - 2013-08-21 22:01:30.978; org.apache.solr.update.processor.LogUpdateProcessor; [collection1] webapp=/solr path=/update params={wt=javabin&version=2} {add=[com.deviantart.reachmehere:http/gallery/, com.deviantart.reachstereo:http/, com.deviantart.reachstereo:http/art/SE-mods-313298903, com.deviantart.reachtheclouds:http/, com.deviantart.reachthegoddess:http/, com.deviantart.reachthegoddess:http/art/retouched-160219962, com.deviantart.reachthegoddess:http/badges/, com.deviantart.reachthegoddess:http/favourites/, com.deviantart.reachthetop:http/art/Blue-Jean-Baby-82204657 (1444006227844530177), com.deviantart.reachurdreams:http/, ... (163 adds)]} 0 38790
  ERROR - 2013-08-21 22:01:30.979; org.apache.solr.common.SolrException; java.lang.RuntimeException: [was class org.eclipse.jetty.io.EofException] early EOF
    at com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)
    at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731)
    at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3657)
    at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809)
    at org.apache.solr.handler.loader.XMLLoader.readDoc(XMLLoader.java:393)
    at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:245)
    at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:173)
    at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1812)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:639)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:345)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:141)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1307)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:453)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
    at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:560)
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1072)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:382)
    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1006)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
    at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
    at org.eclipse.jetty.server.Server.handle(Server.java:365)
    at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:485)
    at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
    at org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:937)
    at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:998)
    at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:948)
    at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
    at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
    at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
    at java.lang.Thread.run(Thread.java:722)
  Caused by: org.eclipse.jetty.io.EofException: early EOF
    at org.eclipse.jetty.server.HttpInput.read(HttpInput.java:65)
    at java.io.InputStream.read(InputStream.java:101)
    at
Re: Solr4.4 or zookeeper 3.4.5 do not support too many collections? more than 600?
Thank you Yago. That seems somewhat strange. Do you know of an official document detailing this? I really need more evidence to make a decision; I mean, I need to compare the two methods and find out which has more advantages in terms of performance and cost. And I will change my parameter to do more testing. I have 15K collections at least. If you have more experience, I would very much appreciate more advice from you.

--
View this message in context: http://lucene.472066.n3.nabble.com/Solr4-4-or-zookeeper-3-4-5-do-not-support-too-many-collections-more-than-600-tp4088689p4088873.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Connection Established but waiting for response for a long time.
  <Set name="ThreadPool">
    <New class="org.eclipse.jetty.util.thread.QueuedThreadPool">
      <Set name="minThreads">10</Set>
      <Set name="maxThreads">1</Set>
      <Set name="detailedDump">false</Set>
    </New>
  </Set>

  <Call name="addConnector">
    <Arg>
      <New class="org.eclipse.jetty.server.bio.SocketConnector">
        <Set name="host"><SystemProperty name="jetty.host"/></Set>
        <Set name="port"><SystemProperty name="jetty.port" default="8983"/></Set>
        <Set name="maxIdleTime">5000</Set>
        <Set name="requestHeaderSize">65536</Set>
        <Set name="lowResourceMaxIdleTime">1500</Set>
        <Set name="statsOn">false</Set>
      </New>
    </Arg>
  </Call>

Everything is default except for:

  <Set name="maxIdleTime">5000</Set>
  <Set name="requestHeaderSize">65536</Set>

Thanks,
Qun

--
View this message in context: http://lucene.472066.n3.nabble.com/Connection-Established-but-waiting-for-response-for-a-long-time-tp4088587p4088874.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr4.4 or zookeeper 3.4.5 do not support too many collections? more than 600?
I just found this option, -Djute.maxbuffer, in the ZooKeeper admin document. But it's listed under "Unsafe Options", and I can't really tell what it means. Maybe it will bring some stability problems? Does someone have real practical experience with this parameter? I will have at least 15K collections, or I will have to merge them down to a small number.

--
View this message in context: http://lucene.472066.n3.nabble.com/Solr4-4-or-zookeeper-3-4-5-do-not-support-too-many-collections-more-than-600-tp4088689p4088878.html
Sent from the Solr - User mailing list archive at Nabble.com.
Stemming and protwords configuration
Hi,

We have a Solr server using stemming:

  <filter class="solr.SnowballPorterFilterFactory" language="French" protected="protwords.txt"/>

I would like to query the French words "frais" and "fraise" separately, so I put the word "fraise" in the protwords.txt file.

- When I query the word "fraise", no documents indexed with the word "frais" are found.
- When I query the word "frais", I get documents indexed with the word "fraise".

Is there a way to not match the "fraise" documents in the second situation? I hope this is clear.

Thanks for your reply.

Christophe
Re: collections api setting dataDir
On 9/7/2013 2:25 PM, mike st. john wrote:

  yes the collections api ignored it, what i ended up doing was just building out some fairness in regards to creating the cores and calling coreadmin to create the cores, seemed to work ok. Only issue i'm having now, and i'm still investigating, is subsequent queries are returning different counts.

Every time I have seen distributed queries return different counts on different runs, it is because documents with the same value in the UniqueKey field exist in more than one shard. If you are letting SolrCloud route your documents automatically, this shouldn't happen ... but if you are using distrib=false or a router that doesn't do it automatically, then it could.

The Collections API doesn't do the dataDir parameter. I suspect this is because you could pass an absolute path in, which would break things because every core would be trying to use the same dataDir. If you want a directory other than ${instanceDir}/data for dataDir, then you will need to create each core individually rather than use the Collections API.

Java does have the capability to determine whether a path is relative or absolute, but it is safer to just ignore that parameter, especially given the fact that a single cloud is usually on many servers, and there's no reason those servers can't be running wildly different operating systems. Half your cloud could be on a Linux/UNIX OS and half of it could be on Windows. I personally find it better to let the Collections API do its thing and use the default.

Thanks,
Shawn
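A hedged sketch of the per-core alternative Shawn describes: creating each core individually through the CoreAdmin API, which does accept a dataDir. The host, collection, shard, config and path names are all placeholders:

  http://host:8983/solr/admin/cores?action=CREATE
    &name=mycoll_shard1_replica1
    &collection=mycoll
    &shard=shard1
    &collection.configName=myconf
    &dataDir=/indexes/mycoll/shard1

Note the caveat above still applies: a dataDir must be unique per core, so an absolute path like this only makes sense when you control the layout on every node.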
Re: collections api setting dataDir
hi, i've sorted it all out. Basically a few replicas had failed and the counts on those replicas were less than on the leader. I basically killed the index on those replicas and let them recover.

Thanks for the help.

msj

On Mon, Sep 9, 2013 at 11:08 AM, Shawn Heisey s...@elyograg.org wrote:

On 9/7/2013 2:25 PM, mike st. john wrote:

  yes the collections api ignored it, what i ended up doing was just building out some fairness in regards to creating the cores and calling coreadmin to create the cores, seemed to work ok. Only issue i'm having now, and i'm still investigating, is subsequent queries are returning different counts.

Every time I have seen distributed queries return different counts on different runs, it is because documents with the same value in the UniqueKey field exist in more than one shard. If you are letting SolrCloud route your documents automatically, this shouldn't happen ... but if you are using distrib=false or a router that doesn't do it automatically, then it could.

The Collections API doesn't do the dataDir parameter. I suspect this is because you could pass an absolute path in, which would break things because every core would be trying to use the same dataDir. If you want a directory other than ${instanceDir}/data for dataDir, then you will need to create each core individually rather than use the Collections API.

Java does have the capability to determine whether a path is relative or absolute, but it is safer to just ignore that parameter, especially given the fact that a single cloud is usually on many servers, and there's no reason those servers can't be running wildly different operating systems. Half your cloud could be on a Linux/UNIX OS and half of it could be on Windows. I personally find it better to let the Collections API do its thing and use the default.

Thanks,
Shawn
Re: Data import
When I run dataimport?command=full-import&clean=false, Solr adds new documents with the information. But if the same information already exists with the same uniqueKey, it replaces the existing document with a new one; it does not update the document, it creates a new one. Is that possible? I'm indexing RSS feeds. I ran the RSS example that ships with the Solr examples, and it does that.

On Sep 9, 2013, at 4:10 AM, Alexandre Rafalovitch arafa...@gmail.com wrote:

What do you specifically mean by "disable document update"? Do you mean in-place update? Or do you mean you want to run the import but not actually populate the Solr collection with processed documents? It might help to explain the business-level goal you are trying to achieve, or the specific error that you are perhaps seeing and trying to avoid.

Regards,
Alex.

Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)

On Mon, Sep 9, 2013 at 6:42 AM, Luís Portela Afonso meligalet...@gmail.com wrote:

Hi, is it possible to disable document update when running the data import, full-import command?

Thanks
Re: How to Manage RAM Usage at Heavy Indexing
On 9/9/2013 10:35 AM, P Williams wrote:

  Is it odd that my index is ~16GB but top shows 30GB in virtual memory? Would the extra be for the field and filter caches I've increased in size?

This should probably be a new thread, but it might have some applicability here, so I'm replying.

I have noticed some inconsistencies in memory reporting on Linux with regard to Solr. Here's a screenshot of top on one of my production systems, sorted by memory: https://www.dropbox.com/s/ylxm0qlcegithzc/prod-top-sort-mem.png

The virtual memory size for the top process is right in line with my index size, plus a few gig for the java heap. Something to note as you ponder these numbers: my java heap is only 6GB. Java has allocated the entire 6GB. The other two java processes are homegrown Solr-related applications.

What's odd is the resident and shared memory sizes. I have pretty much convinced myself that the shared memory size is misreported. If you add up the numbers for cached and free, you get a total of 53659264 ... about 11GB shy of the 64GB total memory. If the reported resident memory for the Solr java process (17GB) were accurate, this would exceed total physical memory by several gigabytes and there would be swap in use, but as you can see, there is no swap in use.

Recently I overheard a conversation between Lucene committers in a Lucene IRC channel that seemed to be discussing this phenomenon. There is apparently some issue with certain mmap modes that results in the operating-system shared-memory number going up even though no actual memory is being consumed.

Thanks,
Shawn
Re: Searching solr on school name during year
You could either add two separate fields, one for start year and another for end year, and then use range queries to include all docs, e.g.:

  name: Boris
  start_year: 2001
  end_year: 2005

Or you could have just one field and put the years a student has attended the school into it as multiple values:

  name: Boris
  year: 2001 2002 2003 2004 2005

I think the second approach would complete your objective.

--
View this message in context: http://lucene.472066.n3.nabble.com/Searching-solr-on-school-name-during-year-tp4088817p4088910.html
Sent from the Solr - User mailing list archive at Nabble.com.
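A hedged sketch of the two query shapes described above; the field names are illustrative, and 2003 stands in for the year being asked about:

  # two-field range approach: student attended during 2003
  q=name:Boris AND start_year:[* TO 2003] AND end_year:[2003 TO *]

  # multivalued-year approach
  q=name:Boris AND year:2003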
Re: Data import
Any form of indexing will always replace a document, never update it. If you don't want replacements, don't use a uniqueKey in your schema, and sort on time/date etc.

But I still don't get one thing: if I have two indexes that I try to merge, and both indexes have some documents with the same unique ids, they don't overwrite each other. Instead, what I have is two documents with the same unique id. Why does this happen? Anyone have any clues?

--
View this message in context: http://lucene.472066.n3.nabble.com/Data-import-tp4088789p4088921.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr suggest - How to define solr suggest as case insensitive
This is probably because your dictionary is made up of all-lowercase tokens, but when you query the spell-checker, similar analysis doesn't happen. Ideally, when you query the spellchecker, you would send lowercased queries.

--
View this message in context: http://lucene.472066.n3.nabble.com/Solr-suggest-How-to-define-solr-suggest-as-case-insensitive-tp4088764p4088918.html
Sent from the Solr - User mailing list archive at Nabble.com.
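A hedged sketch of the usual fix: build the suggest dictionary from a field whose analyzer lowercases, so index-side tokens and analyzed queries agree. The field and type names here are assumptions:

  <fieldType name="textSuggest" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

  <field name="suggest_text" type="textSuggest" indexed="true" stored="false"/>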
Re: solr suggestion -
Don't do any analysis on the field you are using for suggestions. What is happening here is that, at query time and at indexing time, the tokens are being broken on whitespace. So effectively, "at" is being taken as one token and "l" is being taken as another token, for which you get two different suggestions.

--
View this message in context: http://lucene.472066.n3.nabble.com/solr-suggestion-tp4087841p4088919.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: How to Manage RAM Usage at Heavy Indexing
Hi,

I've been seeing the same thing on CentOS: high physical-memory use with low JVM-memory use. I came to the conclusion that this is expected behaviour. Using top, I noticed that my solr user's java process has virtual memory allocated of about twice the size of the index; actual use is within the limits I set when Jetty starts. I infer from this that 98% of physical memory is being used to cache the index. Walter, Erick and others are constantly reminding people on-list to have RAM the size of the index available -- I think 98% physical-memory use is exactly why.

Here is an excerpt from Uwe Schindler's well-written piece (http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html) which explains in greater detail:

*Basically mmap does the same like handling the Lucene index as a swap file. The mmap() syscall tells the O/S kernel to virtually map our whole index files into the previously described virtual address space, and make them look like RAM available to our Lucene process. We can then access our index file on disk just like it would be a large byte[] array (in Java this is encapsulated by a ByteBuffer interface to make it safe for use by Java code). If we access this virtual address space from the Lucene code we don't need to do any syscalls, the processor's MMU and TLB handles all the mapping for us. If the data is only on disk, the MMU will cause an interrupt and the O/S kernel will load the data into file system cache. If it is already in cache, MMU/TLB map it directly to the physical memory in file system cache. It is now just a native memory access, nothing more! We don't have to take care of paging in/out of buffers, all this is managed by the O/S kernel. Furthermore, we have no concurrency issue, the only overhead over a standard byte[] array is some wrapping caused by Java's ByteBuffer interface (it is still slower than a real byte[] array, but that is the only way to use mmap from Java and is much faster than all other directory implementations shipped with Lucene). We also waste no physical memory, as we operate directly on the O/S cache, avoiding all Java GC issues described before.*

Is it odd that my index is ~16GB but top shows 30GB in virtual memory? Would the extra be for the field and filter caches I've increased in size?

I went through a few Java tuning steps relating to OutOfMemoryErrors when using DataImportHandler with Solr. The first thing is that when using the FileEntityProcessor, for each file in the file system to be indexed an entry is made and stored in heap before any indexing actually occurs. When I started pointing this at very large directories, I started running out of heap. One work-around is to divide the job up into smaller batches, but I was able to allocate more memory so that everything fit. The next thing is that with more memory allocated, the limiting factor was too many open files. After allowing the solr user to open more files I was able to get past this as well. There was a sweet spot where indexing with just enough memory was slow enough that I didn't experience the "too many open files" error, but why go slow? Now I'm able to index ~4M documents (newspaper articles and full-text monographs) in about 7 hours.

I hope someone will correct me if I'm wrong about anything I've said here, and especially if there is a better way to do things.

Best of luck,
Tricia

On Wed, Aug 28, 2013 at 12:12 PM, Dan Davis dansm...@gmail.com wrote:

This could be an operating-systems problem rather than a Solr problem. CentOS 6.4 (Linux kernel 2.6.32) may have some issues with page flushing, and I would read up on that. The VM parameters can be tuned in /etc/sysctl.conf.

On Sun, Aug 25, 2013 at 4:23 PM, Furkan KAMACI furkankam...@gmail.com wrote:

Hi Erick;

I wanted to get a quick answer, that's why I asked my question that way. The error is as follows:

  INFO - 2013-08-21 22:01:30.978; org.apache.solr.update.processor.LogUpdateProcessor; [collection1] webapp=/solr path=/update params={wt=javabin&version=2} {add=[com.deviantart.reachmehere:http/gallery/, com.deviantart.reachstereo:http/, com.deviantart.reachstereo:http/art/SE-mods-313298903, com.deviantart.reachtheclouds:http/, com.deviantart.reachthegoddess:http/, com.deviantart.reachthegoddess:http/art/retouched-160219962, com.deviantart.reachthegoddess:http/badges/, com.deviantart.reachthegoddess:http/favourites/, com.deviantart.reachthetop:http/art/Blue-Jean-Baby-82204657 (1444006227844530177), com.deviantart.reachurdreams:http/, ... (163 adds)]} 0 38790
  ERROR - 2013-08-21 22:01:30.979; org.apache.solr.common.SolrException; java.lang.RuntimeException: [was class org.eclipse.jetty.io.EofException] early EOF
    at com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)
    at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731)
    at
Re: Profiling Solr Lucene for query
Hi Dmitry,

I have Solr 4.3, and every query is distributed and merged back for ranking purposes. What do you mean by "frontend Solr"?

On Mon, Sep 9, 2013 at 2:12 PM, Dmitry Kan solrexp...@gmail.com wrote:

are you querying your shards via a frontend solr? We have noticed that querying becomes much faster if results merging can be avoided.

Dmitry

On Sun, Sep 8, 2013 at 6:56 PM, Manuel Le Normand manuel.lenorm...@gmail.com wrote:

Hello all,

Looking at the 10% slowest queries, I get very bad performance (~60 sec per query). These queries have lots of conditions on my main field (more than a hundred), including phrase queries, and rows=1000. I do return only ids, though. I can quite firmly say that this bad performance is due to a slow-storage issue (beyond my control for now). Despite this I want to improve my performance.

As taught in school, I started profiling these queries; the data from a ~1 minute profile is located here: http://picpaste.com/pics/IMG_20130908_132441-ZyrfXeTY.1378637843.jpg

Main observation: most of the time I wait for readVInt, whose stack trace (2 out of 2 thread dumps) is:

  catalina-exec-3870 - Thread t@6615
  java.lang.Thread.State: RUNNABLE
    at org.apache.lucene.store.DataInput.readVInt(DataInput.java:108)
    at org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.loadBlock(BlockTreeTermsReader.java:2357)
    at org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.seekExact(BlockTreeTermsReader.java:1745)
    at org.apache.lucene.index.TermContext.build(TermContext.java:95)
    at org.apache.lucene.search.PhraseQuery$PhraseWeight.<init>(PhraseQuery.java:221)
    at org.apache.lucene.search.PhraseQuery.createWeight(PhraseQuery.java:326)
    at org.apache.lucene.search.BooleanQuery$BooleanWeight.<init>(BooleanQuery.java:183)
    at org.apache.lucene.search.BooleanQuery.createWeight(BooleanQuery.java:384)
    at org.apache.lucene.search.BooleanQuery$BooleanWeight.<init>(BooleanQuery.java:183)
    at org.apache.lucene.search.BooleanQuery.createWeight(BooleanQuery.java:384)
    at org.apache.lucene.search.BooleanQuery$BooleanWeight.<init>(BooleanQuery.java:183)
    at org.apache.lucene.search.BooleanQuery.createWeight(BooleanQuery.java:384)
    at org.apache.lucene.search.IndexSearcher.createNormalizedWeight(IndexSearcher.java:675)
    at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:297)

So I do actually wait for IO as expected, but I might be page faulting too many times while looking for the TermBlocks (.tim file), i.e. locating the term. As I am reindexing now, would it be useful to lower the termInterval (default 128)? As the FSTs (.tip files) are small (a few 10-100 MB), there are no memory contentions; could I lower this param to 8, for example? The benefit of lowering the term interval would be to force the FST into memory (JVM - thanks to the NRTCachingDirectory), as I do not control the term dictionary file (OS caching loads an average of 6% of it).

General configs: solr 4.3; 36 shards, each with a few million docs. These 36 servers (each server has 2 replicas) run virtualized, 16GB memory each (4GB for the JVM, 12GB left for OS caching), consuming 260GB of disk mounted for the index files.
Re: Expunge deleting using excessive transient disk space
I can only agree with the 50% free-space recommendation. Unfortunately I do not have that at the current time; I'm at 10% free disk (out of 300GB for each server). I'm aware it is very low.

Does it seem reasonable to adapt the current merge policy (or write a new one) so that it frees the transient disk space after every merge instead of waiting for all of them to complete? Where can I get such an answer (from the people who wrote the code)?

Thanks

On Sun, Sep 8, 2013 at 9:30 PM, Erick Erickson erickerick...@gmail.com wrote:

Right, but you should have at least as much free space as your total index size, and I don't see the total index size (but I'm just glancing). I'm not entirely sure you can precisely calculate the maximum free space you have relative to the amount needed for merging; some of the people who wrote that code can probably tell you more.

I'd _really_ try to get more disk space. The amount of engineer time spent trying to tune this is way more expensive than a disk...

Best,
Erick

On Sun, Sep 8, 2013 at 11:51 AM, Manuel Le Normand manuel.lenorm...@gmail.com wrote:

Hi,

In order to delete part of my index I run a delete-by-query that intends to erase 15% of the docs. I added these params to solrconfig.xml:

  <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
    <int name="maxMergeAtOnce">2</int>
    <int name="maxMergeAtOnceExplicit">2</int>
    <double name="maxMergedSegmentMB">5000.0</double>
    <double name="reclaimDeletesWeight">10.0</double>
    <double name="segmentsPerTier">15.0</double>
  </mergePolicy>

The extra params were added in order to promote merging of old segments but with a restriction on the transient disk that can be used (as I have only 15GB per shard). This procedure failed with a "no space left on device" exception, although proper calculations show that these params should cause no usage in excess of the transient free disk space I have. Looking at the infostream I can see that the first merges do succeed, but older segments are kept referenced and thus cannot be deleted until all the merging is done. Is there any way of overcoming this?
Re: Facet Sort with non ASCII Characters
On Mon, Sep 9, 2013 at 7:16 AM, Sandro Zbinden zbin...@imagic.ch wrote:

  Is there a plan to add support for alphabetical facet sorting with non-ASCII characters?

The entire Unicode range should already work. Can you give an example of what you would like to see?

-Yonik
http://lucidworks.com
Re: Expunge deleting using excessive transient disk space
10% free space is guaranteed to cause problems. That is a faulty installation.

Explain to ops that Solr needs double the minimum index size. This is required for normal operation. That isn't extra, it is required for merges. Solr makes copies instead of doing record locking. The merge design is essential for speed.

If they don't provide that, it will break, and it will be their fault. If they don't want to provide that, they need a different search engine.

Adapting the merge policy to work with only 10% free space is not reasonable. When one segment is bigger than 10% (and it will be), merging that segment will fail.

wunder

On Sep 9, 2013, at 12:24 PM, Manuel Le Normand wrote:

I can only agree with the 50% free-space recommendation. Unfortunately I do not have that at the current time; I'm at 10% free disk (out of 300GB for each server). I'm aware it is very low.

Does it seem reasonable to adapt the current merge policy (or write a new one) so that it frees the transient disk space after every merge instead of waiting for all of them to complete? Where can I get such an answer (from the people who wrote the code)?

Thanks

On Sun, Sep 8, 2013 at 9:30 PM, Erick Erickson erickerick...@gmail.com wrote:

Right, but you should have at least as much free space as your total index size, and I don't see the total index size (but I'm just glancing). I'm not entirely sure you can precisely calculate the maximum free space you have relative to the amount needed for merging; some of the people who wrote that code can probably tell you more.

I'd _really_ try to get more disk space. The amount of engineer time spent trying to tune this is way more expensive than a disk...

Best,
Erick

On Sun, Sep 8, 2013 at 11:51 AM, Manuel Le Normand manuel.lenorm...@gmail.com wrote:

Hi,

In order to delete part of my index I run a delete-by-query that intends to erase 15% of the docs. I added these params to solrconfig.xml:

  <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
    <int name="maxMergeAtOnce">2</int>
    <int name="maxMergeAtOnceExplicit">2</int>
    <double name="maxMergedSegmentMB">5000.0</double>
    <double name="reclaimDeletesWeight">10.0</double>
    <double name="segmentsPerTier">15.0</double>
  </mergePolicy>

The extra params were added in order to promote merging of old segments but with a restriction on the transient disk that can be used (as I have only 15GB per shard). This procedure failed with a "no space left on device" exception, although proper calculations show that these params should cause no usage in excess of the transient free disk space I have. Looking at the infostream I can see that the first merges do succeed, but older segments are kept referenced and thus cannot be deleted until all the merging is done. Is there any way of overcoming this?

--
Walter Underwood
wun...@wunderwood.org
Re: Profiling Solr Lucene for query
Hello Manuel,

A 1-minute sampling brings too little data. Lowering the term index interval should help; however, I don't know how the FST really behaves with it. It definitely helped in 3.x. Would you mind if I ask which OS you have and which Directory implementation is actually used?

On Sun, Sep 8, 2013 at 7:56 PM, Manuel Le Normand manuel.lenorm...@gmail.com wrote:

Hello all,

Looking at the 10% slowest queries, I get very bad performance (~60 sec per query). These queries have lots of conditions on my main field (more than a hundred), including phrase queries, and rows=1000. I do return only ids, though. I can quite firmly say that this bad performance is due to a slow-storage issue (beyond my control for now). Despite this I want to improve my performance.

As taught in school, I started profiling these queries; the data from a ~1 minute profile is located here: http://picpaste.com/pics/IMG_20130908_132441-ZyrfXeTY.1378637843.jpg

Main observation: most of the time I wait for readVInt, whose stack trace (2 out of 2 thread dumps) is:

  catalina-exec-3870 - Thread t@6615
  java.lang.Thread.State: RUNNABLE
    at org.apache.lucene.store.DataInput.readVInt(DataInput.java:108)
    at org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.loadBlock(BlockTreeTermsReader.java:2357)
    at org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.seekExact(BlockTreeTermsReader.java:1745)
    at org.apache.lucene.index.TermContext.build(TermContext.java:95)
    at org.apache.lucene.search.PhraseQuery$PhraseWeight.<init>(PhraseQuery.java:221)
    at org.apache.lucene.search.PhraseQuery.createWeight(PhraseQuery.java:326)
    at org.apache.lucene.search.BooleanQuery$BooleanWeight.<init>(BooleanQuery.java:183)
    at org.apache.lucene.search.BooleanQuery.createWeight(BooleanQuery.java:384)
    at org.apache.lucene.search.BooleanQuery$BooleanWeight.<init>(BooleanQuery.java:183)
    at org.apache.lucene.search.BooleanQuery.createWeight(BooleanQuery.java:384)
    at org.apache.lucene.search.BooleanQuery$BooleanWeight.<init>(BooleanQuery.java:183)
    at org.apache.lucene.search.BooleanQuery.createWeight(BooleanQuery.java:384)
    at org.apache.lucene.search.IndexSearcher.createNormalizedWeight(IndexSearcher.java:675)
    at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:297)

So I do actually wait for IO as expected, but I might be page faulting too many times while looking for the TermBlocks (.tim file), i.e. locating the term. As I am reindexing now, would it be useful to lower the termInterval (default 128)? As the FSTs (.tip files) are small (a few 10-100 MB), there are no memory contentions; could I lower this param to 8, for example? The benefit of lowering the term interval would be to force the FST into memory (JVM - thanks to the NRTCachingDirectory), as I do not control the term dictionary file (OS caching loads an average of 6% of it).

General configs: solr 4.3; 36 shards, each with a few million docs. These 36 servers (each server has 2 replicas) run virtualized, 16GB memory each (4GB for the JVM, 12GB left for OS caching), consuming 260GB of disk mounted for the index files.

--
Sincerely yours
Mikhail Khludnev
Principal Engineer, Grid Dynamics
http://www.griddynamics.com
mkhlud...@griddynamics.com
Re: charfilter doesn't do anything
Did you in fact try my suggested example? If not, please do so. -- Jack Krupansky

-----Original Message----- From: Andreas Owen Sent: Monday, September 09, 2013 4:42 PM To: solr-user@lucene.apache.org Subject: Re: charfilter doesn't do anything

i index html pages with a lot of lines and not just a string with the body tag. it doesn't work with proper html files, even though i took all the new lines out. html-file:

<html>nav-content<body>nur das will ich sehen</body>footer-content</html>

("nur das will ich sehen" is German for "only this is what I want to see")

solr update debug output:

text_html: ["<html>\r\n\r\n<meta name=\"Content-Encoding\" content=\"ISO-8859-1\">\r\n<meta name=\"Content-Type\" content=\"text/html; charset=ISO-8859-1\">\r\n<title></title>\r\n\r\n<body>nav-content nur das will ich sehenfooter-content</body></html>"]

On 8. Sep 2013, at 3:28 PM, Jack Krupansky wrote: I tried this and it seems to work when added to the standard Solr example in 4.4:

<field name="body" type="text_html_body" indexed="true" stored="true" />

<fieldType name="text_html_body" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="^.*&lt;body&gt;(.*)&lt;/body&gt;.*$" replacement="$1"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

That char filter retains only text between <body> and </body>. Is that what you wanted? Indexing this data:

curl 'localhost:8983/solr/update?commit=true' -H 'Content-type:application/json' -d '[{"id":"doc-1","body":"abc <body>A test.</body> def"}]'

And querying with these commands:

curl "http://localhost:8983/solr/select/?q=*:*&indent=true&wt=json"        -- shows all data
curl "http://localhost:8983/solr/select/?q=body:test&indent=true&wt=json"  -- shows the body text
curl "http://localhost:8983/solr/select/?q=body:abc&indent=true&wt=json"   -- shows nothing (outside of body)
curl "http://localhost:8983/solr/select/?q=body:def&indent=true&wt=json"   -- shows nothing (outside of body)
curl "http://localhost:8983/solr/select/?q=body:body&indent=true&wt=json"  -- shows nothing, HTML tag stripped

In your original query, you didn't show us what your default field, the df parameter, was. -- Jack Krupansky

-----Original Message----- From: Andreas Owen Sent: Sunday, September 08, 2013 5:21 AM To: solr-user@lucene.apache.org Subject: Re: charfilter doesn't do anything

yes but that filters html and not the specific tag i want.

On 7. Sep 2013, at 7:51 PM, Erick Erickson wrote: Hmmm, have you looked at: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory Not quite the body, perhaps, but might it help?

On Fri, Sep 6, 2013 at 11:33 AM, Andreas Owen a...@conx.ch wrote: ok i have html pages with <html>.<!--body-->content i want<!--/body-->.</html>. i want to extract (index, store) only what is between the body-comments. i thought regexTransformer would be the best because xpath doesn't work in tika and i can't nest an XPathEntityProcessor to use xpath. what i have also found out is that the htmlparser from tika cuts my body-comments out and tries to make well-formed html, which i would like to switch off.

On 6. Sep 2013, at 5:04 PM, Shawn Heisey wrote: On 9/6/2013 7:09 AM, Andreas Owen wrote: i've managed to get it working if i use the regexTransformer and the string is on the same line in my tika entity. but when the string is multilined it isn't working, even though i tried (?s) to set the dotall flag.
<entity name="tika" processor="TikaEntityProcessor" url="${rec.url}"
        dataSource="dataUrl" onError="skip" htmlMapper="identity"
        format="html" transformer="RegexTransformer">
  <field column="text_html" regex="&lt;body&gt;(.+)&lt;/body&gt;"
         replaceWith="QQQ" sourceColName="text"/>
</entity>

then i tried it like this and i get a stackoverflow:

<field column="text_html" regex="&lt;body&gt;((.|\n|\r)+)&lt;/body&gt;"
       replaceWith="QQQ" sourceColName="text"/>

in javascript this works, but maybe because i only used a small string.

Sounds like we've got an XY problem here. http://people.apache.org/~hossman/#xyproblem How about you tell us *exactly* what you'd actually like to have happen, and then we can find a solution for you? It sounds a little bit like you're interested in stripping all the HTML tags out. Perhaps the HTMLStripCharFilter? http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory Something that I already said: By using the KeywordTokenizer, you won't be able to search for individual words in your HTML input. The entire input string is treated as a single token, and therefore ONLY exact entire-field matches (or certain wildcard matches) will be possible. http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.KeywordTokenizerFactory Note that no matter what you do to your data with the analysis chain, Solr will always return the text that was originally indexed in search results. If you need to affect what gets stored as well, perhaps you need an Update Processor. Thanks, Shawn
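A side note on the stack overflow Andreas hits above: ((.|\n|\r)+) is a classic construct for triggering excessive backtracking and StackOverflowError in Java regexes on long inputs, and the conventional fix is the inline DOTALL flag rather than enumerating newline characters. A sketch of that variant for both places (untested against his files; the (?s) prefix is the suggested change, not something confirmed to work in the thread):

<!-- char filter: let "." match newlines via the inline (?s) flag -->
<charFilter class="solr.PatternReplaceCharFilterFactory"
            pattern="(?s)^.*&lt;body&gt;(.*)&lt;/body&gt;.*$"
            replacement="$1"/>

<!-- the same idea for the DIH RegexTransformer field -->
<field column="text_html" regex="(?s)&lt;body&gt;(.+)&lt;/body&gt;"
       replaceWith="QQQ" sourceColName="text"/>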
eDismax Phrase Field Boosts on Single Terms
I am curious how the dismax parser handles single-term queries and phrase boosts. For example, say I had a query q=bars with the following dismax parameters: qf=categories and pf=categories^100. I would expect that the parser would match on the qf parameter but then also match again on the pf parameter and apply the boost. I am not seeing this. Should I be? The reason I was trying to avoid applying both a qf and a pf boost is that I do want to boost on values like "Bars and Restaurants", and a pf boost makes the most sense there, rather than boosting any document that contains "Bar" or "Restaurant". Thanks in advance. Jeff
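For anyone reading this in the archives: the behavior Jeff describes is expected, because pf builds a phrase query from the analyzed terms of q, and a single term yields no phrase to boost. A quick way to see it for yourself (hostname and field names are just the ones from Jeff's example) is to compare the parsed queries with debugQuery:

curl "http://localhost:8983/solr/select?defType=edismax&q=bars&qf=categories&pf=categories^100&debugQuery=true"
curl "http://localhost:8983/solr/select?defType=edismax&q=bars+and+restaurants&qf=categories&pf=categories^100&debugQuery=true"

With the single-term q, the parsedquery output should contain no phrase clause for the pf field; with the multi-term q it should.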
Re: charfilter doesn't do anything
i index html pages with a lot of lines and not just a string with the body tag. it doesn't work with proper html files, even though i took all the new lines out. html-file:

<html>nav-content<body>nur das will ich sehen</body>footer-content</html>

solr update debug output:

text_html: ["<html>\r\n\r\n<meta name=\"Content-Encoding\" content=\"ISO-8859-1\">\r\n<meta name=\"Content-Type\" content=\"text/html; charset=ISO-8859-1\">\r\n<title></title>\r\n\r\n<body>nav-content nur das will ich sehenfooter-content</body></html>"]

On 8. Sep 2013, at 3:28 PM, Jack Krupansky wrote: I tried this and it seems to work when added to the standard Solr example in 4.4: ...
Re: Profiling Solr Lucene for query
Hi Manuel, The frontend solr instance is the one that does not have its own index and does the merging of the results. Is this the case? If yes, are all 36 shards always queried? Dmitry

On Mon, Sep 9, 2013 at 10:11 PM, Manuel Le Normand manuel.lenorm...@gmail.com wrote: Hi Dmitry, I have solr 4.3 and every query is distributed and merged back for ranking purposes. What do you mean by frontend solr?

On Mon, Sep 9, 2013 at 2:12 PM, Dmitry Kan solrexp...@gmail.com wrote: are you querying your shards via a frontend solr? We have noticed that querying becomes much faster if results merging can be avoided. Dmitry

On Sun, Sep 8, 2013 at 6:56 PM, Manuel Le Normand manuel.lenorm...@gmail.com wrote: ...
Re: charfilter doesn't do anything
i tried but that isn't working either; it wants a data stream. i'll have to check how to post json instead of xml

On 10. Sep 2013, at 12:52 AM, Jack Krupansky wrote: Did you at least try the pattern I gave you? The point of the curl was the data, not how you send the data. You can just use the standard Solr simple post tool. -- Jack Krupansky

-----Original Message----- From: Andreas Owen Sent: Monday, September 09, 2013 6:40 PM To: solr-user@lucene.apache.org Subject: Re: charfilter doesn't do anything

i've downloaded curl and tried it in the command prompt and power shell on my win 2008r2 server; that's why i used my dataimporter with a single-line html file and copy/pasted the lines into schema.xml

On 9. Sep 2013, at 11:20 PM, Jack Krupansky wrote: ...
Re: unknown _stream_source_info while indexing rich doc in solr
: Subject: Re: unknown _stream_source_info while indexing rich doc in solr : : Error got resolved, thanks a lot Sir. I have been trying for days to : resolve it.

Users shouldn't have to worry about problems like this ... i'll try to make this less error prone... https://issues.apache.org/jira/browse/SOLR-5228 -Hoss
Re: charfilter doesn't do anything
i've downloaded curl and tried it in the command prompt and power shell on my win 2008r2 server; that's why i used my dataimporter with a single-line html file and copy/pasted the lines into schema.xml

On 9. Sep 2013, at 11:20 PM, Jack Krupansky wrote: Did you in fact try my suggested example? If not, please do so. -- Jack Krupansky ...
Re: Facet sort descending
: Is there a plan to add a descending sort order for facet queries ? : Best regards Sandro I don't understand your question. if you specify multiple facet.query params, then the constraint counts are returned in the order they were initially specified -- there is no need for server side sorting, because they all come back (as opposed to facet.field where the number of constraints can be unbounded and you may request just the top X using facet.limit) If you are asking about facet.field and using facet.sort to specify the order of the constraints for each field, then no -- i don't believe anyone is currently working on adding options for descending sort. I don't think it would be hard to add if someone wanted to ... I just don't know that there has ever been enough demand for anyone to look into it. -Hoss
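For the facet.query case, that ordering behavior is easy to see with a request like the following (field name and ranges invented for illustration):

curl "http://localhost:8983/solr/select?q=*:*&rows=0&facet=true&facet.query=price:[0+TO+100]&facet.query=price:[100+TO+*]"

The two counts come back under facet_queries in exactly the order the params were given, so a client that wants descending counts can simply sort the returned values itself.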
Re: Solr suggest - How to define solr suggest as case insensitive
: This is probably because your dictionary is made up of all lower-case tokens, : but when you query the spell-checker similar analysis doesn't happen. The ideal : case would be that when you query the spellchecker you send lower-case queries

You can init the SpellCheckComponent with a queryAnalyzerFieldType option that will control what analysis happens. ie...

<!-- This field type's analyzer is used by the QueryConverter to tokenize the value of the q parameter -->
<str name="queryAnalyzerFieldType">phrase_suggest</str>

...it would be nice if this defaulted to using the fieldType of the field you configure on the Suggester, but not all impls are based on the index (you might be using an external dict file), so it has to be explicitly configured, and it defaults to using a simple WhitespaceAnalyzer. -Hoss
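In context, that option sits at the top level of the searchComponent definition. A minimal sketch, assuming a lowercasing field type named phrase_suggest exists in schema.xml (both names come from Hoss's snippet; the field name and lookupImpl are placeholders, and the exact lookupImpl class varies by Solr version):

<searchComponent name="suggest" class="solr.SpellCheckComponent">
  <str name="queryAnalyzerFieldType">phrase_suggest</str>
  <lst name="spellchecker">
    <str name="name">suggest</str>
    <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
    <str name="lookupImpl">org.apache.solr.spelling.suggest.tst.TSTLookupFactory</str>
    <str name="field">suggest_field</str>
  </lst>
</searchComponent>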
Re: charfilter doesn't do anything
Did you at least try the pattern I gave you? The point of the curl was the data, not how you send the data. You can just use the standard Solr simple post tool. -- Jack Krupansky

-----Original Message----- From: Andreas Owen Sent: Monday, September 09, 2013 6:40 PM To: solr-user@lucene.apache.org Subject: Re: charfilter doesn't do anything

i've downloaded curl and tried it in the command prompt and power shell on my win 2008r2 server; that's why i used my dataimporter with a single-line html file and copy/pasted the lines into schema.xml

On 9. Sep 2013, at 11:20 PM, Jack Krupansky wrote: ...
Re: Solr4.4 or zookeeper 3.4.5 do not support too many collections? more than 600?
If you have 15K collections, I guess that you are doing custom sharding and not using collection sharding. My first approach was the same as what you are doing; in fact, I had the same lot-of-cores issue. I use -Djute.maxbuffer without any issue. In recent versions, Solr implements a way to do sharding using a prefix in your ID, so I replaced my lot of cores with a collection with shards. Now, with the splitshard feature, you can split the shards that reach a considerable size. Downside: I don't know if the splitshard feature honors the compositeId defined at collection creation. Recommendation: if you don't want the lot-of-cores issue to bite you with some kind of weird or anomalous behavior, try to reduce the cores as much as possible and split shards as necessary when performance starts to hurt in your environment. -- Yago Riveiro Sent with Sparrow (http://www.sparrowmailapp.com/?sig)

On Monday, September 9, 2013 at 3:09 PM, diyun2008 wrote: I just found this option -Djute.maxbuffer in the zookeeper admin document. But it's listed under "Unsafe Options" and I can't really tell what that means. Maybe it will bring some stability problems? Does someone have real practical experience with this parameter? I will have at least 15K collections, or I will have to merge them down to smaller numbers.
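For the record, the prefix routing Yago mentions is the compositeId document router. A rough sketch of the flow (collection name, shard count and IDs are made up for illustration; compositeId is the default router when a collection is created with numShards in Solr 4.1+, and later releases also accept an explicit router.name=compositeId parameter):

# create a collection whose documents are routed by ID prefix
curl "http://localhost:8983/solr/admin/collections?action=CREATE&name=tenants&numShards=4"

# documents sharing an ID prefix are co-located on one shard, e.g.:
#   tenantA!doc1   tenantA!doc2   tenantB!doc7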
Re: charfilter doesn't do anything
Use XML then. Although you will need to escape the XML special characters as I did in the pattern. The point is simply: quickly and simply find the simple test scenario that illustrates the problem. -- Jack Krupansky

-----Original Message----- From: Andreas Owen Sent: Monday, September 09, 2013 7:05 PM To: solr-user@lucene.apache.org Subject: Re: charfilter doesn't do anything

i tried but that isn't working either; it wants a data stream. i'll have to check how to post json instead of xml

On 10. Sep 2013, at 12:52 AM, Jack Krupansky wrote: ...
Re: Expunge deleting using excessive transient disk space
: Looking at the infostream I can see that the first merges do succeed but : older segments are kept in reference and thus cannot be deleted until all : the merging is done.

I suspect what you are seeing is that file handles for the older segments are kept open (and thus the bytes on disk for those old segments are not freed up for use by new segments) because the existing IndexReaders still need to use them until the merge process completes, a new IndexReader/IndexSearcher is opened and warmed, *and* all in-flight requests that used the old IndexSearcher have completed. -Hoss
Re: Data import
: Any form of indexing would always replace a document and never update it.

At a very low level this is true, but Solr does support Atomic Updates (aka Partial Updates) that allow a client to specify only the values of an existing document they want to change, and Solr will handle everything on the server side.

: But i still dont get one thing, if i have two indexes that i try to merge : and both the indexes have some documents with same unique ids, they dont : overwrite each other. Instead what i have is two documents with same unique : id. Why does this happen? Anyone any clues?

This seems like a completely unrelated question -- please start a new thread and provide full details of your situation and question in order for people to try to assist you... https://people.apache.org/~hossman/#threadhijack Thread Hijacking on Mailing Lists: When starting a new discussion on a mailing list, please do not reply to an existing message; instead, start a fresh email. Even if you change the subject line of your email, other mail headers still track which thread you replied to, and your question is hidden in that thread and gets less attention. It makes following discussions in the mailing list archives particularly difficult. -Hoss
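For readers unfamiliar with the feature Hoss mentions, an atomic update request looks roughly like this (ID and field names invented for illustration):

curl 'http://localhost:8983/solr/update?commit=true' -H 'Content-type:application/json' -d '
[{"id":"doc-1",
  "price":{"set":9.99},
  "tags":{"add":"sale"},
  "views":{"inc":1}}]'

Here "set" replaces a field's value, "add" appends to a multivalued field, and "inc" increments a numeric one. This requires the updateLog to be enabled and the fields involved to be stored, since Solr reconstructs the rest of the document internally.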
Re: Data import
: When i run dataimport/?command=full-import&clean=false, solr adds new : documents with the information. But if the same information already : exists with the same uniquekey, it replaces the existing document with a : new one. : It does not update the document, it creates a new one. Is that possible?

I'm not certain that i'm understanding your question. It is possible using Atomic Updates, but you have to be explicit about what/how you want Solr to use the new information (ie: when to replace, when to add to a multivalued field, when to increment a numeric field, etc...) https://wiki.apache.org/solr/Atomic_Updates

I don't think DIH has any straightforward syntax for letting you configure this easily, but as long as you put a map in each field (ie: via ScriptTransformer perhaps) containing the single modifier => value pair you want applied to that field, it should work.

: I'm indexing rss feeds. I run the rss example that exists in the solr : examples, and it does that.

Can you please be more specific about what you would like to see happen, so we can better understand what your actual goal is? It's really not clear if using Atomic Updates is the easiest way to achieve what you're after, or if I'm just completely misunderstanding your question... https://wiki.apache.org/solr/UsingMailingLists -Hoss
Re: Data import
So I'm indexing RSS feeds. I'm running the data import full-import command with a cron job. It runs every 15 minutes and indexes a lot of RSS feeds from many sources. With the cron job, I do an http request using curl, to the address http://localhost:port/solr/core/dataimport/?command=full-import&clean=false

When it runs, if the rss source has a feed that is already indexed on solr, it updates the existing document. So if the source has the same information as the destination, it updates the information at the destination. I want to prevent that. Is that explicit enough? I can try to provide some examples. Thanks

On Tuesday, September 10, 2013, Chris Hostetter wrote: ... -- Sent from Gmail Mobile
Re: Data import
: With the cron job, I do an http request using curl, to the address : http://localhost:port/solr/core/dataimport/?command=full-import&clean=false : : When it runs, if the rss source has a feed that is already indexed on solr, : it updates the existing document. : So if the source has the same information as the destination, it updates the : information at the destination. : : I want to prevent that. Is that explicit enough? I can try to provide some : examples.

Yes, specific examples would be helpful -- it's not really clear what it is that you want to prevent. Please note the URL i mentioned before and use it as a guideline for how much detail we need to understand what it is you are asking...

: Can you please be more specific about what you would like to see happen, : so we can better understand what your actual goal is? It's really not clear : https://wiki.apache.org/solr/UsingMailingLists -Hoss
Re: Data import
But with atomic updates i need to send the information, right? I want solr to index it automatically, and it is doing that. Can you look at the solr example in the source? There is an example in the example-DIH folder. Imagine that you run the URL to import the data every 15 minutes. If the same information is already indexed, solr will update it, and by update I mean delete and index again. I just want solr to simply discard the information if it already exists in the index.

On Tuesday, September 10, 2013, Chris Hostetter wrote: ... -- Sent from Gmail Mobile
Re: Data import
Sounds like you want a custom UpdateRequestProcessor chain that checks whether a document already exists with the given primary key and does not even bother passing it on to the next processor in the chain. This would make sense as an optimization, or as the first step in a complex update chain that uses a lot of external resources to pre-process the content (e.g. named entity extraction). I don't think such a URP exists at the moment? But it should be simple to write one, assuming URPs can do lookups by primary IDs and make go/no-go decisions on individual documents. Anybody know the details of this? Regards, Alex. Personal website: http://www.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)

On Tue, Sep 10, 2013 at 7:53 AM, Luis Portela Afonso meligalet...@gmail.com wrote: ...
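To make Alexandre's suggestion concrete, here is roughly how such a processor would be wired into solrconfig.xml. The factory class is hypothetical (com.example.SkipExistingDocsProcessorFactory does not exist in Solr; it is the custom code being proposed), while the surrounding chain syntax is standard:

<updateRequestProcessorChain name="skip-existing">
  <!-- hypothetical custom processor: looks up the uniqueKey and drops
       the add command if a document with that key is already indexed -->
  <processor class="com.example.SkipExistingDocsProcessorFactory"/>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

DIH could then be pointed at the chain with update.chain=skip-existing on the dataimport request or in the handler defaults (check how your Solr version exposes this parameter).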
find all two word phrases that appear in more than one document
Dear Solr Ninjas, We would like to run a query that returns two-word phrases that appear in more than one document. So, for example, take the string "Solr Ninja": since it appears in more than one document in our Solr instance, the query should return it. The query should find all such phrases across all the documents in our Solr instance, by looking at every combination of two adjacent words (forming a phrase) in the indexed documents. Any ideas on how to write this query? Thanks.
Re: Solr4.4 or zookeeper 3.4.5 do not support too many collections? more than 600?
Thank you very much for your advice.
Re: find all two word phrases that appear in more than one document
The phrases are usually called n-grams or shingles. You can probably use ShingleFilterFactory to create your shingles (possibly with outputUnigrams=false) and then use TermsComponent ( http://wiki.apache.org/solr/TermsComponent) to list the results. Regards, Alex. Personal website: http://www.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)

On Tue, Sep 10, 2013 at 8:22 AM, Ali, Saqib docbook@gmail.com wrote: ...
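A sketch of what that could look like (field and type names are invented; this indexes two-word shingles only, then lists those that occur in at least two documents via the terms component, assuming the /terms handler from the example solrconfig is enabled):

<fieldType name="shingle_pairs" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- emit only two-word shingles, no single tokens -->
    <filter class="solr.ShingleFilterFactory" minShingleSize="2"
            maxShingleSize="2" outputUnigrams="false"/>
  </analyzer>
</fieldType>
<field name="pairs" type="shingle_pairs" indexed="true" stored="false"/>

curl "http://localhost:8983/solr/terms?terms.fl=pairs&terms.mincount=2&terms.limit=100"

terms.mincount=2 restricts the listing to shingles whose document frequency is at least two, which is exactly "phrases that appear in more than one document".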
Re: Restrict Parsing duplicate file in Solr
Thanks for the response. My requirement is to make sure that if a file is already indexed, I detect it and skip it instead of replacing the existing one.
Re: find all two word phrases that appear in more than one document
Thanks Alexandre. I looked at the wiki page for the TermsComponent, but I am not sure I follow. Do you have an example or some better documentation? Thanks! :)

On Mon, Sep 9, 2013 at 8:17 PM, Alexandre Rafalovitch arafa...@gmail.com wrote: ...
Does a configuration change require a Zookeeper restart?
Hi, I have a solrcloud with two collections. I have indexed 100 million docs into the first collection. I need to make some changes to the solr configuration files, and I'm going to index the new data into the second collection. What are the steps that I should follow? Should I restart the zookeeper? Please suggest. Thanks, Prasi
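For what it's worth, a ZooKeeper restart is typically not needed for a config change in this setup; the usual sequence is to re-upload the config set and reload the collection. A sketch, assuming the stock zkcli script shipped with Solr 4.x and placeholder hostnames/paths:

# push the edited config set back up to ZooKeeper
cloud-scripts/zkcli.sh -zkhost zk1:2181 -cmd upconfig -confdir /path/to/conf -confname myconf

# make the collection pick it up
curl "http://localhost:8983/solr/admin/collections?action=RELOAD&name=collection2"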