Re: correlation between score and term frequency
Yes, that was the meaning of my question! Can you answer it? -Original Message- From: Joseph Doehr [mailto:[EMAIL PROTECTED] Sent: Monday, October 1, 2007 20:00 To: solr-user@lucene.apache.org Subject: Re: correlation between score and term frequency Hi Alex, do you mean you would like to know whether both results have the same relevance across the whole indexed content, and whether the two scores are directly comparable? [EMAIL PROTECTED] wrote: I have a question about the correlation between the score value and the term frequency. Let's assume that we have one index over one set of documents, and that a query contains only a single term. If we now search for the term "car" and get a certain score value X, and we then search for the term "football" and also get the score value X, is it certain that both values X mean the same thing? Could you explain what correlation exists between the score value and the term frequency in this scenario?
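To make the answer concrete: in Lucene's default scoring, a single-term score mixes tf with idf and length norms, so two different terms produce the same score only when those factors happen to coincide. A simplified sketch (not Solr's exact DefaultSimilarity, which also applies queryNorm and coord; the doc counts below are hypothetical):

```python
import math

def lucene_tfidf(tf, doc_freq, num_docs, field_norm=1.0):
    # Simplified single-term Lucene score: sqrt(tf) * idf^2 * fieldNorm.
    # idf follows DefaultSimilarity: 1 + ln(numDocs / (docFreq + 1)).
    idf = 1.0 + math.log(num_docs / (doc_freq + 1))
    return math.sqrt(tf) * idf * idf * field_norm

# "car" appearing 4 times, matching 99 of 1000 docs, vs. "football"
# appearing 9 times, matching 400 of 1000 docs: the scores differ even
# though both queries are "one term, one index".
score_car = lucene_tfidf(tf=4, doc_freq=99, num_docs=1000)
score_football = lucene_tfidf(tf=9, doc_freq=400, num_docs=1000)
```

So equal scores X for two different single-term queries only tell you the products of these factors were equal, not that the term frequencies were equal.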
unable to figure out nutch type highlighting in solr....
I have tried very hard to follow the documentation and forum threads that explain how to return snippets with highlighted search terms using Solr (something Nutch does with ease). I will be really grateful if someone can guide me through the basics. I have made sure that the field to be highlighted is stored in the index, etc. Still I can't figure out why it doesn't return a snippet and instead returns the whole document. I have tried all the different highlight parameters with variations, but have no idea what's happening. Can I test highlighting in the given example application using the full search interface option? At the moment it just returns XML with the full document between the field tags. Please find my conf files attached as well.

<?xml version="1.0" encoding="UTF-8" ?>
<!--
 Licensed to the Apache Software Foundation (ASF) under one or more
 contributor license agreements. See the NOTICE file distributed with
 this work for additional information regarding copyright ownership.
 The ASF licenses this file to You under the Apache License, Version 2.0
 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
 http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
-->
<config>
  <!-- Set this to 'false' if you want solr to continue working after it
       has encountered a severe configuration error. In a production
       environment, you may want solr to keep working even if one handler
       is mis-configured.
       You may also set this to false using the system property:
       -Dsolr.abortOnConfigurationError=false -->
  <abortOnConfigurationError>${solr.abortOnConfigurationError:true}</abortOnConfigurationError>

  <!-- Used to specify an alternate directory to hold all index data
       other than the default ./data under the Solr home. If replication
       is in use, this should match the replication configuration. -->
  <!-- <dataDir>./solr/data</dataDir> -->

  <indexDefaults>
    <!-- Values here affect all index writers and act as a default unless overridden. -->
    <useCompoundFile>false</useCompoundFile>
    <mergeFactor>5</mergeFactor>
    <maxBufferedDocs>100</maxBufferedDocs>
    <maxMergeDocs>2147483647</maxMergeDocs>
    <maxFieldLength>1</maxFieldLength>
    <writeLockTimeout>1000</writeLockTimeout>
    <commitLockTimeout>1</commitLockTimeout>
  </indexDefaults>

  <mainIndex>
    <!-- options specific to the main on-disk lucene index -->
    <useCompoundFile>false</useCompoundFile>
    <mergeFactor>5</mergeFactor>
    <maxBufferedDocs>1000</maxBufferedDocs>
    <maxMergeDocs>2147483647</maxMergeDocs>
    <maxFieldLength>1</maxFieldLength>
    <!-- If true, unlock any held write or commit locks on startup. This
         defeats the locking mechanism that allows multiple processes to
         safely access a lucene index, and should be used with care. -->
    <unlockOnStartup>false</unlockOnStartup>
  </mainIndex>

  <!-- the default high-performance update handler -->
  <updateHandler class="solr.DirectUpdateHandler2">
    <!-- A prefix of "solr." for class names is an alias that causes solr to
         search appropriate packages, including
         org.apache.solr.(search|update|request|core|analysis) -->
    <!-- autocommit pending docs if certain criteria are met
    <autoCommit>
      <maxDocs>1</maxDocs>
      <maxTime>1000</maxTime>
    </autoCommit>
    -->
    <autoCommit>
      <maxDocs>1000</maxDocs>
      <maxTime>1000</maxTime>
    </autoCommit>
    <!-- The RunExecutableListener executes an external command.
         exe - the name of the executable to run
         dir - dir to use as the current working directory. default="."
         wait - the calling thread waits until the executable returns. default="true"
         args - the arguments to pass to the program. default=nothing
         env - environment variables to set. default=nothing -->
    <!-- A postCommit event is fired after every commit or optimize command
    <listener event="postCommit" class="solr.RunExecutableListener">
      <str name="exe">snapshooter</str>
      <str name="dir">solr/bin</str>
      <bool name="wait">true</bool>
      <arr name="args"> <str>arg1</str> <str>arg2</str> </arr>
      <arr name="env"> <str>MYVAR=val1</str> </arr>
    </listener>
    -->
    <!-- A postOptimize event is fired only after every optimize command, useful
         in conjunction with index distribution to only distribute optimized indices
    <listener event="postOptimize" class="solr.RunExecutableListener">
      <str name="exe">snapshooter</str>
      <str name="dir">solr/bin</str>
      <bool name="wait">true</bool>
    </listener>
    -->
  </updateHandler>

  <query>
    <!-- Maximum number of clauses in a boolean query... can affect range or prefix queries that expand
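Coming back to the highlighting question itself: a minimal sketch of the request parameters that switch snippet highlighting on (the field name "text" and the local URL are assumptions). Note that hl.fl must name a stored field, and hl.fragsize=0 returns the entire field value, which looks exactly like the "whole document instead of a snippet" symptom described above:

```python
from urllib.parse import urlencode

# Hypothetical core URL and field name; hl.fl must name a *stored* field
# for snippets to come back in the highlighting section of the response.
params = {
    "q": "text:apache",
    "hl": "true",        # turn highlighting on
    "hl.fl": "text",     # field(s) to build snippets from
    "hl.fragsize": 100,  # snippet length in chars; 0 means the whole field
}
url = "http://localhost:8983/solr/select?" + urlencode(params)
```

Fetching that URL should add a <lst name="highlighting"> section after the normal result list.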
Re: Searching combined English-Japanese index
Yonik Seeley wrote: On 10/1/07, Maximilian Hütter [EMAIL PROTECTED] wrote: Yonik Seeley wrote: On 10/1/07, Maximilian Hütter [EMAIL PROTECTED] wrote: When I search using an English term, I get results, but the Japanese is not encoded correctly in the response (although it is UTF-8 encoded). One quick thing to try is the python writer (wt=python) to see the actual unicode values of what you are getting back (since the python writer automatically escapes non-ascii). That can help rule out incorrect charset handling by clients. -Yonik Thanks for the tip, it turns out that the unicode values are wrong... I mean the browser displays correctly what is sent. But I don't know how solr gets these values. OK, so they never got into the index correctly. The most likely explanation is that the charset wasn't set correctly when the update message was sent to Solr. -Yonik Are you sure they are wrong in the index? When I use the Lucene Index Monitor (http://limo.sourceforge.net/) to look at the document in the index, the Japanese is displayed correctly. I am using Jetty 6.0.1, by the way. Best regards, Max -- Maximilian Hütter blue elephant systems GmbH Wollgrasweg 49 D-70599 Stuttgart Tel: (+49) 0711 - 45 10 17 578 Fax: (+49) 0711 - 45 10 17 573 e-mail: [EMAIL PROTECTED] Registered office: Stuttgart, Amtsgericht Stuttgart, HRB 24106 Managing directors: Joachim Hörnle, Thomas Gentsch, Holger Dietrich
Re: Index multiple languages with multiple analyzers with the same field
Same here. But I can't see how this would fit unless you create an analyzer that takes a language parameter and, based on it, applies a set of filters (and sometimes you want a different, but compatible, set of filters at index and query time). It would work, but in doing so we lose the advantage of the Solr config, where we can change and experiment with alternative analyzer/tokenizer/filter combinations... What I've done is create one specific text field per language and one dismax request handler per language (using the language name or ISO code), and it is very flexible and appropriate for each language. For management simplicity I've also created a dismax handler that lets me query all documents no matter which language they are in. It may be useful for you too. Regards, Daniel Alheiros On 29/9/07 03:29, Lance Norskog [EMAIL PROTECTED] wrote: Other people custom-create a separate dynamic field for each language they want to support. The spellchecker in Solr 1.2 wants just one field to use as its word source, so this fits. We have a more complex version of this problem: we have content with both English and other languages. Searching is one problem; we also want spelling-correction dictionaries for each language. We have many world languages which need very different handling and semantics, like CJK processing. We will have to use the multiple-field trick; I don't think we can shoehorn our complexity into this technique. It is a valiant effort, though. It's possible we could separate out the different-language words in each document, put them in separate words_en_text, words_sp_text, etc. fields, and make the default search field out of <copyField source="*_text" dest="defaultText"/> Hmm.
Lance -Original Message- From: Thom Nelson [mailto:[EMAIL PROTECTED] Sent: Friday, September 28, 2007 12:07 PM To: solr-user@lucene.apache.org; [EMAIL PROTECTED] Subject: Re: Index multiple languages with multiple analyzers with the same field I had the same problem, but never found a good solution. The best solution would be a more dynamic way of determining which analyzer to return, such as some kind of conditional expression evaluation in the fieldType/analyzer element, where either the document or the query request could be used as the comparison object:

<fieldtype type="textMultiLingual" class="solr.TextField">
  <analyzer type="query" expression="request.lang == 'EN'">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldtype>

Analyzers could still be cached by adding the expression to the cache key. Unfortunately I have switched jobs, so I don't have the time or motivation to do this, but it should be a very useful addition. - Thom Wu, Daniel wrote: Hi, I know this has probably been asked before, but I was not able to find it in the mailing list, so forgive me if I repeat the same question. We are trying to build a search application that supports multiple languages. Users can potentially query in any language. The first thought that came to us was to index the text of all languages in the same field using a language-specific analyzer. As all the data is indexed in the same field, a search would just find results in the language that matches the user's query. Looking at the Solr schema, it seems each field can have one and only one analyzer. Is it possible to have multiple analyzers for the same field? Or is there any other approach that can achieve the same thing?
Daniel http://www.bbc.co.uk/
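The per-language-field approach Lance and Daniel describe can be sketched on the client side: route each language's text into its own suffixed field before posting, so the schema's language-specific analyzers apply (the field naming pattern text_en / text_es is an assumption matching a dynamic-field rule like *_en in the schema):

```python
def route_by_language(doc_id, texts_by_lang):
    # texts_by_lang maps language codes to text, e.g. {"en": "...", "es": "..."}.
    # Each language lands in its own Solr field, so each field can carry
    # its own analyzer chain in the schema.
    solr_doc = {"id": doc_id}
    for lang, text in texts_by_lang.items():
        solr_doc[f"text_{lang}"] = text
    return solr_doc

doc = route_by_language("42", {"en": "hello world", "es": "hola mundo"})
```

A copyField from all the language fields into one default field then gives the "search everything" handler Daniel mentions.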
Re: Problem with html code inside xml
Thanks, I used this solution: put <![CDATA[ my html code here ]]> in the xml to be indexed and it works; nothing to change in the xsl. In the schema I use this fieldType:

<fieldType name="html" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

-- Now a question: I created a field to index only the text of this html code, with this field type:

<fieldType name="htmlTxt" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

Everything works (the div tags and p tags are removed), but some <strong>nnn</strong> or <br/> tags are still in the text after indexing. If you have any idea how to solve this problem, that would be great. Thanks, S. Christin - On 25 Sep 2007, at 13:14, Thorsten Scherler wrote: On Tue, 2007-09-25 at 12:06 +0100, Jérôme Etévé wrote: If I understand, you want to keep the raw html code in solr like this (in your posting xml file): <field name="storyFullText"><html>...</html></field> I think you should encode your content to protect these xml entities: < becomes &lt; > becomes &gt; " becomes &quot; & becomes &amp; If you use perl, have a look at HTML::Entities. AFAIR you cannot use tags; they always get transformed to entities.
The solution is to have an xsl transformation after the response that transforms the entities back to tags. Have a look at the thread http://marc.info/?t=11677583791&r=1&w=2 and especially at http://marc.info/?l=solr-user&m=116782664828926&w=2 HTH salu2 On 9/25/07, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote: Hello, I've got a problem with html code that is embedded in an xml file. Sample source:

<content>
  <stories>
    <div class="storyTitle"> Les débats </div>
    <div class="storyIntroductionText"> Le premier tour des élections fédérales se déroulera le 21 octobre prochain. D'ici là, La 1ère vous propose plusieurs rendez-vous, dont plusieurs grands débats à l'enseigne de Forums. </div>
    <div class="paragraph">
      <div class="paragraphTitle"/>
      <div class="paragraphText"> my para texte here <br/> <br/> Vous trouverez sur cette page toutes les dates et les heures de ces différents rendez-vous ainsi que le nom et les partis des débatteurs. De plus, vous pourrez également écouter ou réécouter l'ensemble de ces émissions. </div>
    </div>

- When I make a query on solr, I get something like this in the source code of the xml result:

<td xmlns="http://www.w3.org/1999/xhtml"><span class="markup">&lt;</span><span class="start-tag">div</span> <span class="attribute-name">class</span><span class="markup">=</span><span class="attribute-value">paragraph</span><span class="markup">&gt;</span><div class="expander-content"><div class="indent"><span class="markup">&lt;</span><span class="start-tag">div</span> <span class="attribute-name">class</span><span class="markup">=</span><span class="attribute-value">paragraphTitle</span><span class="markup">/&gt;</span></div><table><tr><td class="expander">−<div class="spacer"/></td><td><span class="markup">&lt;</span> ...

It is not exactly what I want. I want to keep the html tags, that's all, without formatting. The br tags and a tags are well formed in the xml and json results, but the div tags are not kept.
- In schema.xml I've got this for the html content:

<fieldType name="html" class="solr.TextField"/>
<field name="storyFullText" type="html" indexed="true" stored="true" multiValued="true"/>

- Any help would be appreciated. Thanks in advance. S. Christin -- Thorsten Scherler thorsten.at.apache.org Open Source Java consulting, training and solutions
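The CDATA trick above can be sketched as a small helper for building the update XML. The one subtlety is that a CDATA section may not contain the terminator "]]>", which has to be split across two sections:

```python
def cdata_wrap(html):
    # A CDATA section cannot contain "]]>" -- split the terminator across
    # two adjacent CDATA sections so the document stays well-formed.
    safe = html.replace("]]>", "]]]]><![CDATA[>")
    return f"<![CDATA[{safe}]]>"

# Building one field of an update message (field name from the thread):
field = '<field name="storyFullText">' + cdata_wrap("<div>Les débats</div>") + "</field>"
```

Inside CDATA, the div tags reach the index verbatim, which is why nothing needed to change in the xsl.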
Re: Letter-number transitions - can this be turned off
Thanks for your helpful suggestions. I have considered other analyzers, but WDF has great strengths. I will experiment with maintaining transitions and then consider modifying the code. F. Knudson Mike Klaas wrote: On 30-Sep-07, at 12:47 PM, F Knudson wrote: Is there a flag to disable the letter-number transition in solr.WordDelimiterFilterFactory? We are indexing category codes and thesaurus codes for which this letter-number transition makes no sense. It is bloating the index (which is already large). Have you considered using a different analyzer? If you want to continue using WDF, you could make a quick change around line 320:

if (splitOnCaseChange == 0 && (lastType & ALPHA) != 0 && (type & ALPHA) != 0) {
    // ALPHA->ALPHA: always ignore if case isn't considered.
} else if ((lastType & UPPER) != 0 && (type & LOWER) != 0) {
    // UPPER->LOWER: Don't split
} else {
    ...

by adding a clause that catches ALPHA->NUMERIC (and vice versa) and ignores it. Another approach, which I am using locally, is to maintain the transitions but force tokens to be a minimum size (so r2d2 doesn't tokenize to four tokens but arrrdeee does). There is a patch here: http://issues.apache.org/jira/browse/SOLR-293 If you vote for it, I promise to get it in for 1.3 <g> -Mike -- View this message in context: http://www.nabble.com/Letter-number-transitions---can-this-be-turned-off-tf4544769.html#a13003019 Sent from the Solr - User mailing list archive at Nabble.com.
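Mike's second idea, keeping the transitions but enforcing a minimum subtoken size, can be sketched outside Solr like this (a rough illustration of the idea, not the SOLR-293 patch itself):

```python
import re

def split_transitions(token, min_part=3):
    # Split on letter<->digit boundaries, as WordDelimiterFilter does.
    parts = re.findall(r"[A-Za-z]+|[0-9]+", token)
    # Keep the split only if every subtoken meets the minimum length;
    # otherwise "r2d2" would explode into four one-character tokens.
    if len(parts) > 1 and all(len(p) >= min_part for p in parts):
        return parts
    return [token]

split_transitions("abc123def")  # splits: every part is >= 3 chars
split_transitions("r2d2")       # stays whole: parts are too short
```

This keeps useful transitions (long runs) while refusing to bloat the index with one-character fragments from codes.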
RE: Searching combined English-Japanese index
Python does not do Unicode strings natively; you have to use them explicitly. It is possible that your python receiver is not doing the right thing with the incoming strings. Also, Jetty has problems with UTF-8; the Wiki has more on this. Lance -Original Message- From: Maximilian Hütter [mailto:[EMAIL PROTECTED] Sent: Tuesday, October 02, 2007 1:35 AM To: solr-user@lucene.apache.org Subject: Re: Searching combined English-Japanese index Yonik Seeley wrote: On 10/1/07, Maximilian Hütter [EMAIL PROTECTED] wrote: [...] Are you sure they are wrong in the index? When I use the Lucene Index Monitor (http://limo.sourceforge.net/) to look at the document in the index, the Japanese is displayed correctly. I am using Jetty 6.0.1, by the way. Best regards, Max
Re: Searching combined English-Japanese index
On 10/2/07, Maximilian Hütter [EMAIL PROTECTED] wrote: Are you sure they are wrong in the index? It's not an issue with Jetty output encoding, since the python writer takes the string and converts it to ascii before that. Since Solr does no charset encoding itself on output, that must mean that it's in the index incorrectly. When I use the Lucene Index Monitor (http://limo.sourceforge.net/) to look at the document in the index, the Japanese is displayed correctly. I've never really used limo, but it's possible it's incorrectly interpreting what's in the index (and, by luck, doing the reverse of the transformation that got the data in there incorrectly). Try indexing a document with a unicode character specified via an entity, to remove the issues of input char encodings. For example, if a Japanese char has a unicode value of \u1234, then in the XML doc, use &#x1234; -Yonik
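Yonik's test (posting a character as a numeric character reference, so the input charset cannot corrupt it) can be sketched like this:

```python
def to_ncr(ch):
    # XML numeric character reference for a single character,
    # e.g. 'A' -> '&#x41;'. Pure ASCII, so no charset can mangle it.
    return f"&#x{ord(ch):x};"

# Build a field value entirely out of references; if this round-trips
# correctly but raw UTF-8 does not, the input charset handling is at fault.
field = "<field name='text'>" + "".join(to_ncr(c) for c in "美聯社") + "</field>"
```

If the entity-encoded document comes back from Solr correct while the raw UTF-8 one does not, the problem is the charset of the update request, not the index.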
schema for response
Hi, there, Given that there are some questions on the updated XML schema for the response in Solr 1.2: can someone point me to the XML schema? Is it documented somewhere? I'm particularly interested in the different status codes we would have in the response for either update or select. -- Regards, -Hui
Re: schema for response
Yu-Hui Jin wrote: Hi, there, Given that there are some questions on the updated XML schema for the response in Solr 1.2: can someone point me to the XML schema? Is it documented somewhere? I'm particularly interested in the different status codes we would have in the response for either update or select. In 1.2, /update and /select can share the same response format if you set <requestDispatcher handleSelect="true"> in solrconfig.xml. All status codes in 1.2 should map to standard HTTP status codes -- 200 is ok, 400 bad request, 500 some server error, etc... ryan
Re: dataset parameters suitable for lucene application
Hi There, Would you mind if I pasted your data onto the wiki page at http://wiki.apache.org/solr/SolrPerformanceData ? I think it would be helpful to get some more numbers on that page, so people can decide whether Solr is the right application for them. Thanks, Chris Harris, new Solr user On 9/26/07, Xuesong Luo [EMAIL PROTECTED] wrote: My experience so far: 200k documents were indexed in 90 mins (including db time), the index size is 200 MB, querying a keyword across all string fields (30) takes 0.3-1 sec, and querying a keyword on one field takes tens of milliseconds. -Original Message- From: Charlie Jackson [mailto:[EMAIL PROTECTED] Sent: Wednesday, September 26, 2007 8:53 AM To: solr-user@lucene.apache.org Subject: RE: dataset parameters suitable for lucene application My experiences so far with this level of data have been good. Number of records: maxed out at 8.8 million Database size: friggin huge (100+ GB) Index size: ~24 GB 1) It took me about a day to index 8 million docs using a non-optimized program I wrote. It's non-optimized in the sense that it's not multi-threaded. It batched together groups of about 5,000 docs at a time to be indexed. 2) Search times for a basic search are almost always sub-second. If we toss in some faceting, it takes a little longer, but I've hardly ever seen it go above 1-2 seconds even with the most advanced queries. Hope that helps. Charlie -Original Message- From: Law, John [mailto:[EMAIL PROTECTED] Sent: Wednesday, September 26, 2007 9:28 AM To: solr-user@lucene.apache.org Subject: dataset parameters suitable for lucene application I am new to the list and new to lucene and solr. I am considering Lucene for a potential new application and need to know how well it scales. Following are the parameters of the dataset. Number of records: 7+ million Database size: 13.3 GB Index Size: 10.9 GB My questions are simply: 1) Approximately how long would it take Lucene to index these documents? 2) What would the approximate retrieval time be (i.e.
search response time)? Can someone provide me with some informed guidance in this regard? Thanks in advance, John __ John Law Director, Platform Management ProQuest 789 Eisenhower Parkway Ann Arbor, MI 48106 734-997-4877 [EMAIL PROTECTED] www.proquest.com www.csa.com ProQuest... Start here.
Solr live at Netflix
Here at Netflix, we switched over our site search to Solr two weeks ago. We've seen zero problems with the server. We average 1.2 million queries/day on a 250K item index. We're running four Solr servers with simple round-robin HTTP load-sharing. This is all on 1.1. I've been too busy tuning to upgrade. Thanks everyone, this is a great piece of software. wunder -- Walter Underwood Search Guy, Netflix
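For context, those load figures work out to fairly modest per-server rates (assuming queries are spread evenly over the day, which real traffic of course is not):

```python
# Figures from the post above: 1.2M queries/day across 4 Solr servers.
queries_per_day = 1_200_000
servers = 4

qps_total = queries_per_day / 86_400   # seconds per day -> ~13.9 qps overall
qps_per_server = qps_total / servers   # ~3.5 qps average per server
```

Peak-hour traffic would be several times the average, but even so each server sees a comfortable query rate for a 250K-item index.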
Re: Re: Problem with html code inside xml
Hi! I'm facing a similar problem. Some HTML docs are indexed correctly and others are simply rejected, even though I encoded all problematic HTML tags as Thorsten suggested. In the following example, my_doc.xml is a valid XML file, compliant with my Solr schema fields:

$ java -jar post.jar ./my_doc.xml
SimplePostTool: version 1.2
SimplePostTool: WARNING: Make sure your XML documents are encoded in UTF-8, other encodings are not currently supported
SimplePostTool: POSTing files to http://localhost:8983/solr/update..
SimplePostTool: POSTing file solrdoc
SimplePostTool: FATAL: Connection error (is Solr running at http://localhost:8983/solr/update ?): java.io.IOException: Server returned HTTP response code: 500 for URL: http://localhost:8983/solr/update

Is there any way to get Solr to be more verbose than that? Do I need to go into the Java code to understand what happened? I'm looking for a simple solution. Thanks in advance, cheers Y. Original Message From: [EMAIL PROTECTED] Subject: Re: Problem with html code inside xml Date: Tue, 2 Oct 2007 16:15:26 +0200 To: solr-user@lucene.apache.org Thanks, I used this solution: put <![CDATA[ my html code here ]]> in the xml to be indexed and it works; nothing to change in the xsl. [...]
Re: Problem with html code inside xml
: I created a field type: : : <fieldType name="htmlTxt" class="solr.TextField" positionIncrementGap="100"> ... : Everything works (the div tags, p tags are removed) but some : <strong>nnn</strong> or <br/> tags are still in the text after indexing. I cut/pasted that fieldtype into the example schema.xml and experimented with the analysis tool (http://localhost:8983/solr/admin/analysis.jsp), and both of those examples were correctly stripped. Do you have a more specific example of something that doesn't work? Hmm... it seems like maybe the problem is examples like this... blahblah<strong>nnn</strong> ...if the tag is directly adjacent to other text, it may not get stripped off ... I'm not sure if that's specific to the HTMLStripWhitespaceTokenizer. -Hoss
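For comparison, a naive strip that replaces each tag with a space handles the adjacent-tag case Hoss describes; this is only a rough regex sketch, not what HTMLStripWhitespaceTokenizer actually does:

```python
import re

def strip_tags(text):
    # Replace each tag with a space so words glued to tags don't merge,
    # then collapse the resulting whitespace runs.
    no_tags = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", no_tags).strip()

strip_tags("blahblah<strong>nnn</strong>")  # -> "blahblah nnn"
```

The point of the sketch is the space substitution: whitespace-based tokenizers only see a tag as a boundary if something separates it from the adjacent word.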
Re: Re: Problem with html code inside xml
: SimplePostTool: FATAL: Connection error (is Solr running at http://localhost:8983/solr/update ?): java.io.IOException: Server returned HTTP response code: 500 for URL: http://localhost:8983/solr/update : : Is there any way to let Solr to be more verbose than that ? Solr outputs all errors using whatever default error page format your servlet container uses; it also logs all errors to the servlet container's logging system. This specific error indicates that post.jar could not connect to Solr at all (hence the FATAL: Connection error and the hint that perhaps Solr isn't actually running at the URL post.jar is trying to contact.) If you are using the example Jetty setup that comes with Solr, and you send a document that triggers a Solr error, post.jar will output something like this (in this specific error, the problem is that the document being posted is total gibberish, and not XML at all)... SimplePostTool: FATAL: Solr returned an error: ParseError_at_rowcol11_Message_only_whitespace_content_allowed_before_start_tag_and_not___javaxxmlstreamXMLStreamException_ParseError_at_rowcol11_Message_only_whitespace_content_allowed_before_start_tag_and_not___at_combeaxmlstreamMXParserparsePrologMXParserjava2044__at_combeaxmlstreamMXParsernextImplMXParserjava1947__at_combeaxmlstreamMXParsernextMXParserjava1333__at_orgapachesolrhandlerXmlUpdateRequestHandlerprocessUpdateXmlUpdateRequestHandlerjava148__at_orgapachesolrhandlerXmlUpdateRequestHandlerhandleRequestBodyXmlUpdateRequestHandlerjava123__at_orgapachesolrhandlerRequestHandlerBasehandleRequestRequestHandlerBasejava78__at_orgapachesolrcoreSolrCoreexecuteSolrCorejava807__at_orgapachesolrservletSolrDispatchFilterexecuteSolrDispatchFilterjava206__at_orgapachesolrservletSolrDispatchFilterdoFilterSolrDispatchFilterjava174__at_orgmortbayjettyservletServletHandler$CachedChaindoFilterServletHandlerjava1089__at_orgmortbayjettyservletServletHandlerhandleServletHandlerjava365__at_orgmortbayjettysecuritySecurityHandlerhandleSecurityHandler
java216__at_orgmortbayjettyservletSessionHandlerhandleSessionHandlerjava181__at_orgmortbayjettyhandlerContextHandlerhandleContextHandlerjava712__at_orgmortbayjettywebappWebAppContexthandleWebAppContextjava405__at_orgmortbayjettyhandlerContextHandlerCollectionhandleContextHandlerCollectionjava211__at_orgmortbayjettyhandlerHandlerCollectionhandleHandlerCollectionjava114__at_orgmortbayjettyhandlerHandlerWrapperhandleHandlerWrapperjava139__at_orgmortbayjettyServerhandleServerjava285__at_orgmortbayjettyHttpConnectionhandleRequestHttpConnectionjava502__at_orgmortbayjettyHttpConnection$RequestHandlercontentHttpConnectionjava835__at_orgmortbayjettyHttpParserparseNextHttpParserjava641__at_orgmortbayjettyHttpParserparseAvailableHttpParserjava202__at_orgmortbayjettyHttpCo -Hoss
Re: Solr live at Netflix
: Here at Netflix, we switched over our site search to Solr two weeks ago. That's great Walter ... could I persuade you to add a few notes about this to... http://wiki.apache.org/solr/PublicServers http://wiki.apache.org/solr/SolrPerformanceData -Hoss
Re: Solr live at Netflix
I think Chris Harris is doing that. I'll check it and touch it up afterwards. Avoid race conditions. --wunder On 10/2/07 4:26 PM, Chris Hostetter [EMAIL PROTECTED] wrote: : Here at Netflix, we switched over our site search to Solr two weeks ago. That's great Walter ... could I persuade you to add a few notes about this to... http://wiki.apache.org/solr/PublicServers http://wiki.apache.org/solr/SolrPerformanceData -Hoss
Re: Solr live at Netflix
Nice! And there seem to be some improvements. For example, Gamers and Gamera no longer stem to the same word :-) Tom On 10/2/07, Walter Underwood [EMAIL PROTECTED] wrote: Here at Netflix, we switched over our site search to Solr two weeks ago. We've seen zero problems with the server. We average 1.2 million queries/day on a 250K item index. We're running four Solr servers with simple round-robin HTTP load-sharing. This is all on 1.1. I've been too busy tuning to upgrade. Thanks everyone, this is a great piece of software. wunder -- Walter Underwood Search Guy, Netflix
question about bi-gram analysis on query
Hey guys, I'm trying to index a field in Chinese using the CJKTokenizer, and I'm finding that my searches against the index are not working at all. The index is created properly (checked with Luke), and when I search it with Luke the data comes back as I would expect. The analysis page of the Solr admin also shows what I would expect. On an actual search, though, nothing is found. Here are the relevant snippets from my configs:

<fieldtype name="text_zh" class="solr.TextField">
  <analyzer>
    <tokenizer class="org.apache.solr.analysis.ja.CJKTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
  </analyzer>
</fieldtype>
...
<field name="text" type="text_zh" indexed="true" stored="false" multiValued="true"/>

So if I send in 美聯社 it correctly creates two tokens: 美聯 and 聯社. And if I search for 美聯 in Luke and on the Solr analysis page, I get a hit. But on the actual search, I don't. Also, I've noticed that the parsed query in Luke is: text:美聯 聯社 and in Solr it is: text:美聯  聯社 -- there is an extra space in the Solr parsed query, and I don't know if that makes a difference. I'm really at a loss. Does anyone know why I don't get search hits back? Thanks, Dave Keene
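To clarify what the bigram analysis in the message above produces, here is a minimal Python sketch of overlapping character-bigram tokenization, the scheme a CJK bigram tokenizer uses. This is an illustration of the tokenization idea, not Solr's actual implementation; the function name is made up:

```python
def cjk_bigrams(text):
    """Produce overlapping character bigrams, as a CJK bigram tokenizer does."""
    return [text[i:i + 2] for i in range(len(text) - 1)]

# The three-character input from the message yields two overlapping bigrams.
print(cjk_bigrams("美聯社"))  # ['美聯', '聯社']
```

One common cause of the "analysis looks right but searches miss" symptom is that the query parser splits the query on whitespace before the analyzer runs, so the two bigrams are searched as separate terms (possibly AND-ed) rather than as adjacent positions, which changes matching behavior; whether that is what is happening here would need to be confirmed against the parsed-query output.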
Re: Solr live at Netflix
On Tue, 02 Oct 2007 15:26:33 -0700 Walter Underwood [EMAIL PROTECTED] wrote: Here at Netflix, we switched over our site search to Solr two weeks ago. We've seen zero problems with the server. We average 1.2 million queries/day on a 250K item index. We're running four Solr servers with simple round-robin HTTP load-sharing. Hi Walter, would you mind sharing hardware specs, OS, index size, VM settings, and OS-specific tunings? Unless that will be added to the wiki... :) Thanks in advance, B _ {Beto|Norberto|Numard} Meijome Have the courage to take your own thoughts seriously, for they will shape you. Albert Einstein I speak for myself, not my employer. Contents may be hot. Slippery when wet. Reading disclaimers makes you go blind. Writing them is worse. You have been Warned.
Re: schema for response
Got it. Thanks, Ryan. -Hui On 10/2/07, Ryan McKinley [EMAIL PROTECTED] wrote: Yu-Hui Jin wrote: Hi there, Given that there are some questions about the updated XML response schema in Solr 1.2: can someone point me to the XML schema? Is it documented somewhere? I'm particularly interested in the different status codes we would get in the response for either update or select. In 1.2, /update and /select can share the same response format if you set: <requestDispatcher handleSelect="true"> in solrconfig.xml. All status codes in 1.2 should map to standard HTTP status codes -- 200 is OK, 400 is a bad request, 500 is a server error, etc. ryan -- Regards, -Hui
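Since Solr 1.2 maps its response statuses onto standard HTTP status codes, a client can branch on the code alone. A minimal client-side sketch, assuming only the three code ranges mentioned in the thread (the function name and return strings are made up for illustration):

```python
def classify_status(code):
    """Rough client-side handling of Solr's standard HTTP status codes."""
    if code == 200:
        return "ok"
    if 400 <= code < 500:
        return "bad request"   # e.g. a malformed query string
    if code >= 500:
        return "server error"  # e.g. a misconfigured request handler
    return "unexpected"

print(classify_status(200))  # ok
```

A real client would usually also inspect the response body, since Solr includes an error message alongside the HTTP status.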
Re: searching remote indexes
Well, we do not have a Solr server; all the calls to index and search documents are done via Embedded Solr. What is the approach then? On 9/28/07, Mike Klaas [EMAIL PROTECTED] wrote: Solr's main interface is HTTP, so you can connect to that remotely. Query each machine and combine the results using your own business logic. Alternatively, you can try out the query distribution code being developed in http://issues.apache.org/jira/browse/SOLR-303 -Mike On 28-Sep-07, at 1:59 AM, Venkatraman S wrote: Resending due to lack of response: [We are using embedded Solr 1.2] I need a mechanism by which I can search over 3 remote indexes. Can I use the Lucene remote APIs to access the index created via Embedded Solr? -Venkat On 9/4/07, Venkatraman S [EMAIL PROTECTED] wrote: Hi, [I am new to Solr]. How do I search remote indexes using Solr? I am not able to find suitable documentation on this -- can you guys guide me? Regards, Venkat -- -- --
Re: searching remote indexes
Using embedded Solr, there is no (built-in) way to access remote indexes. If you want to access remote indexes, you need to run a server. Solr 1.3 (trunk) includes a Java client you may want to look at: http://wiki.apache.org/solr/Solrj If you poke around, this also includes simple ways to run Solr with embedded Jetty -- letting you run a lightweight server. ryan Venkatraman S wrote: Well, we do not have a Solr server; all the calls to index and search documents are done via Embedded Solr. What is the approach then? On 9/28/07, Mike Klaas [EMAIL PROTECTED] wrote: Solr's main interface is HTTP, so you can connect to that remotely. Query each machine and combine the results using your own business logic. Alternatively, you can try out the query distribution code being developed in http://issues.apache.org/jira/browse/SOLR-303 -Mike On 28-Sep-07, at 1:59 AM, Venkatraman S wrote: Resending due to lack of response: [We are using embedded Solr 1.2] I need a mechanism by which I can search over 3 remote indexes. Can I use the Lucene remote APIs to access the index created via Embedded Solr? -Venkat On 9/4/07, Venkatraman S [EMAIL PROTECTED] wrote: Hi, [I am new to Solr]. How do I search remote indexes using Solr? I am not able to find suitable documentation on this -- can you guys guide me? Regards, Venkat -- -- --
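The "query each machine and combine the results using your own business logic" approach suggested in this thread can be sketched as a simple client-side merge. This is a hypothetical illustration, assuming the per-server results have already been fetched over HTTP and parsed into dicts; note that raw relevance scores are only approximately comparable across separately built indexes, which is exactly the gap the SOLR-303 distributed-search work aims to close:

```python
def merge_results(result_sets, rows=10):
    """Flatten per-server result lists and keep the top `rows` by score.

    result_sets: a list of result lists, one per remote server, where each
    document is a dict carrying at least 'id' and 'score'.
    """
    merged = [doc for results in result_sets for doc in results]
    merged.sort(key=lambda doc: doc["score"], reverse=True)
    return merged[:rows]

# Results from two hypothetical servers, already fetched and parsed.
server_a = [{"id": "a1", "score": 2.0}, {"id": "a2", "score": 0.5}]
server_b = [{"id": "b1", "score": 1.5}]
print([d["id"] for d in merge_results([server_a, server_b], rows=2)])
# ['a1', 'b1']
```

A real merge would also need to handle duplicate documents indexed on more than one server and to normalize scores if the indexes differ in size or term statistics.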