Re: Checking Optimal Values for BM25
Hi Furkan, in order to change the BM25 parameter values k1 and b, the following XML snippet needs to be added to your schema.xml configuration file:

<similarity class="solr.BM25SimilarityFactory">
  <float name="k1">1.3</float>
  <float name="b">0.7</float>
</similarity>

It is even possible to specify the SimilarityFactory on individual index fields. See [1] for more details. Best Sascha [1] https://wiki.apache.org/solr/SchemaXml#Similarity On 15.12.2016 at 14:58, Furkan KAMACI wrote: Hi, Solr's default similarity is now BM25. Its parameters default to k1=1.2 and b=0.75. However, is there any way to check the effect of using different coefficients in the BM25 calculation, in order to find the optimal values? Kind Regards, Furkan KAMACI
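For illustration, a per-field-type declaration might look like the sketch below. It assumes the global similarity is solr.SchemaSimilarityFactory (the Solr 6 default); the fieldType name and the elided analyzer are placeholders:

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <!-- analyzer definition omitted -->
  <similarity class="solr.BM25SimilarityFactory">
    <float name="k1">1.3</float>
    <float name="b">0.7</float>
  </similarity>
</fieldType>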
Re: field length within BM25 score calculation in Solr 6.3
Hi, bumping my question after 10 days. Any clarification is appreciated. Best Sascha Hi folks, my Solr index consists of one document with a single-valued field "title" of type "text_general". The title field was indexed with the content: 1 2 3 4 5 6 7 8 9. The field type text_general uses a StandardTokenizer, which should result in 9 tokens. The corresponding length of field title in the given document is 9. The field type is defined as follows: I've checked that none of the nine tokens (1, 2, …, 9) is a stop word. As expected, the query title:1 returns the given document. The BM25 score of the document for the given query is 0.272. But why does Solr 6.3 state that the length of field title is 10.24?
0.27233246 = weight(title_alt:1 in 0) [SchemaSimilarity], result of:
  0.27233246 = score(doc=0,freq=1.0 = termFreq=1.0), product of:
    0.2876821 = idf(docFreq=1, docCount=1)
    0.94664377 = tfNorm, computed from:
      1.0 = termFreq=1.0
      1.2 = parameter k1
      0.75 = parameter b
      9.0 = avgFieldLength
      10.24 = fieldLength
In contrast, the value of avgFieldLength is correct. The same observation can be made if the index consists of two simple documents: doc1: title = 1 2 3 4 doc2: title = 1 2 3 4 5 6 7 8 The BM25 score calculation of doc2 is explained as:
0.14143422 = weight(title_alt:1 in 1) [SchemaSimilarity], result of:
  0.14143422 = score(doc=1,freq=1.0 = termFreq=1.0), product of:
    0.18232156 = idf(docFreq=2, docCount=2)
    0.7757405 = tfNorm, computed from:
      1.0 = termFreq=1.0
      1.2 = parameter k1
      0.75 = parameter b
      6.0 = avgFieldLength
      10.24 = fieldLength
The value of fieldLength does not match 8. Is there some "magic" applied to the value of field length that goes beyond the standard BM25 score formula? If so, what is the idea behind this modification? If not, is this a Lucene / Solr bug? Best regards, Sascha -- Sascha Szott :: KOBV/ZIB :: +49 30 84185-457
field length within BM25 score calculation in Solr 6.3
Hi folks, my Solr index consists of one document with a single-valued field "title" of type "text_general". The title field was indexed with the content: 1 2 3 4 5 6 7 8 9. The field type text_general uses a StandardTokenizer, which should result in 9 tokens. The corresponding length of field title in the given document is 9. The field type is defined as follows: I've checked that none of the nine tokens (1, 2, …, 9) is a stop word. As expected, the query title:1 returns the given document. The BM25 score of the document for the given query is 0.272. But why does Solr 6.3 state that the length of field title is 10.24?
0.27233246 = weight(title_alt:1 in 0) [SchemaSimilarity], result of:
  0.27233246 = score(doc=0,freq=1.0 = termFreq=1.0), product of:
    0.2876821 = idf(docFreq=1, docCount=1)
    0.94664377 = tfNorm, computed from:
      1.0 = termFreq=1.0
      1.2 = parameter k1
      0.75 = parameter b
      9.0 = avgFieldLength
      10.24 = fieldLength
In contrast, the value of avgFieldLength is correct. The same observation can be made if the index consists of two simple documents: doc1: title = 1 2 3 4 doc2: title = 1 2 3 4 5 6 7 8 The BM25 score calculation of doc2 is explained as:
0.14143422 = weight(title_alt:1 in 1) [SchemaSimilarity], result of:
  0.14143422 = score(doc=1,freq=1.0 = termFreq=1.0), product of:
    0.18232156 = idf(docFreq=2, docCount=2)
    0.7757405 = tfNorm, computed from:
      1.0 = termFreq=1.0
      1.2 = parameter k1
      0.75 = parameter b
      6.0 = avgFieldLength
      10.24 = fieldLength
The value of fieldLength does not match 8. Is there some "magic" applied to the value of field length that goes beyond the standard BM25 score formula? If so, what is the idea behind this modification? If not, is this a Lucene / Solr bug? Best regards, Sascha
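For reference, the tfNorm line in these explain outputs follows the standard BM25 normalization, tfNorm = (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)). Plugging in the values Solr reports reproduces its numbers exactly, which confirms that 10.24 (rather than 9 or 8) is the length that actually enters the formula:

  single-document case: (1.0 * 2.2) / (1.0 + 1.2 * (1 - 0.75 + 0.75 * 10.24 / 9.0)) ≈ 0.9466
  two-document case:    (1.0 * 2.2) / (1.0 + 1.2 * (1 - 0.75 + 0.75 * 10.24 / 6.0)) ≈ 0.7757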
Re: Problem of facet on 170M documents
Hi Ming, which Solr version are you using? In case you use one of the latest versions (4.5 or above), try the new parameter facet.threads with a reasonable value (4 to 8 gave me a massive performance speedup when working with large facets, i.e. nTerms > 10^7). -Sascha Mingfeng Yang wrote: I have an index with 170M documents, and two of the fields for each doc are source and url. And I want to know the top 500 most frequent urls from the Video source. So I did a facet with fq=source:Video&facet=true&facet.field=url&facet.limit=500, and the matching documents are about 9 million. The solr cluster is hosted on two ec2 instances, each with 4 CPUs and 32G memory. 16G is allocated for the Java heap. 4 master shards on one machine, and 4 replicas on another machine. Connected together via zookeeper. Whenever I issue the query above, the response just takes too long and the client gets timed out. Sometimes the end user is impatient, so he/she may wait a few seconds for the results, then kill the connection, and then issue the same query again and again. Then the server has to deal with multiple such heavy queries simultaneously and becomes so busy that we get "no server hosting shard" errors, probably due to lost communication between the solr node and zookeeper. Is there any way to deal with such a problem? Thanks, Ming
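For illustration, the request from this thread with threaded faceting enabled might look like the line below (host and core name are placeholders; rows=0 is only added here to suppress the document list):

http://localhost:8983/solr/collection1/select?q=*:*&fq=source:Video&facet=true&facet.field=url&facet.limit=500&facet.threads=4&rows=0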
intersection of filter queries with raw query parser
Hi folks, is it possible to use the raw query parser with a disjunctive filter query? Say, I have a field 'foo' and two values 'v1' and 'v2' (the field values are free text and can contain any character). What I want is to retrieve all documents satisfying fq=foo:(v1 OR v2). In case only one value (v1) is given, the query fq={!raw f=foo}v1 works as expected. But how can I formulate the filter query (with the raw query parser) in case two values are provided? The same question was posted on Stackoverflow (http://stackoverflow.com/questions/5637675/solr-query-with-raw-data-and-union-multiple-facet-values) two years ago, but there was only the advice to give up using the raw query parser, which is not what I want to do. Thanks in advance, Sascha
Re: Does SolrCloud support distributed IDFs?
Hi Mark, Mark Miller wrote: Still waiting on that issue. I think Andrzej should just update it to trunk and commit - it's optional and defaults to off. Go vote :) Sounds like the problem is already solved and the remaining work consists of code integration? Can somebody estimate how much work that would be? -Sascha
Does SolrCloud support distributed IDFs?
Hi folks, a known limitation of the old distributed search feature is the lack of distributed/global IDFs (#SOLR-1632). Does SolrCloud bring some improvements in this direction? Best regards, Sascha
Re: Prefix query is not analysed?
Hi, wildcard and fuzzy queries are not analyzed. -Sascha Alok Bhandari alokomprakashbhand...@gmail.com wrote: Hello, I am pushing "Chuck Follett'.?.?" into Solr, and when I query this field with the query string field:Follett'.* I am getting 0 results. The field type declared is

<fieldType name="text_email" class="solr.TextField" stored="true" indexed="true" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.UAX29URLEmailTokenizerFactory" maxTokenLength="255"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

and the parser we are using is eDismax. Is it the case that for prefix queries the text analysis is not done (and that is why I am getting 0 results), or is there something fundamentally wrong with my data/schema? -- View this message in context: http://lucene.472066.n3.nabble.com/Prefix-query-is-not-analysed-tp3992435.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Prefix query is not analysed?
Hi, I suppose you are using Solr 3.6. Then take a look at http://www.lucidimagination.com/blog/2011/11/29/whats-with-lowercasing-wildcard-multiterm-queries-in-solr/ -Sascha Alok Bhandari alokomprakashbhand...@gmail.com schrieb: Thanks for reply. If I check the debug query through solr-admin I can see that the lower case filter is applied and rawquerystring:em_to_name:Follett'.*, querystring:em_to_name:Follett'.*, parsedquery:+em_to_name:follett'.*, parsedquery_toString:+em_to_name:follett'.*, explain:{}, QParser:ExtendedDismaxQParser, I can see this query. So is it the case that only tokenization is not done for the wildcard queries but other filters specified are applied? -- View this message in context: http://lucene.472066.n3.nabble.com/Prefix-query-is-not-analysed-tp3992435p3992450.html Sent from the Solr - User mailing list archive at Nabble.com.
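For reference, the behavior described in that blog post can be controlled per field type in Solr 3.6+ through an explicit multiterm analyzer, which defines exactly which filters are applied to wildcard/prefix/fuzzy terms. The following is only a sketch; the index/query analyzers are elided and the filter choice is illustrative:

<fieldType name="text_email" class="solr.TextField" positionIncrementGap="100">
  <!-- analyzer type="index" and type="query" omitted -->
  <analyzer type="multiterm">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>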
Re: indexing documents in Apache Solr using php-curl library
Hi, perhaps it's better to use a PHP Solr client library. I used https://code.google.com/p/solr-php-client/ in a project of mine and it worked just fine. -Sascha Asif wrote: I am indexing the file using the php curl library. I am stuck here with the code:

echo "Stored in: " . "upload/" . $_FILES["file"]["name"];
$result = move_uploaded_file($_FILES["file"]["tmp_name"], "upload/" . $_FILES["file"]["name"]);
if ($result == 1) echo "<p>Upload done.</p>";
$options = getopt("f:");
$infile = $options['f'];
$url = "http://localhost:8983/solr/update/";
$filename = "upload/" . $_FILES["file"]["name"];
$handle = fopen($filename, "rb");
$contents = fread($handle, filesize($filename));
fclose($handle);
echo $url;
$post_string = file_get_contents("upload/" . $_FILES["file"]["name"]);
echo $contents;
$header = array("Content-type: text/xml; charset=utf-8");
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_HTTPHEADER, $header);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, $post_string);
curl_setopt($ch, CURLOPT_HTTP_VERSION, CURL_HTTP_VERSION_1_1);
curl_setopt($ch, CURLINFO_HEADER_OUT, 1);
$data = curl_exec($ch);
if (curl_errno($ch)) {
    print "curl_error: " . curl_error($ch);
} else {
    curl_close($ch);
    print "curl exited okay\n";
    echo "Data returned...\n";
    echo "\n";
    echo $data;
    echo "\n";
}

Nothing is showing as a result. Moreover there is nothing shown in the event log of Apache Solr. Please help me with the code.
Re: how to retrieve a doc from its docID ?
Hi, did you include the fl parameter in the Solr query URL? If that's the case, make sure that the field name 'text' is mentioned there. You should also make sure that the field definition (in schema.xml) for 'text' says stored="true", otherwise the field will not be returned. -Sascha Giovanni Gherdovich g.gherdov...@gmail.com wrote: Hi all, when querying my solr instance, the answers I get are the document IDs of my docs. Here is what one of my docs looks like:

<add>
  <doc>
    <field name="text">hello solar!</field>
    <field name="id">123</field>
  </doc>
</add>

and here is the response if I query for solar:

<response>
  <lst name="responseHeader"/>
  <result name="response" numFound="1" start="0" maxScore="1.0">
    <doc><float name="score">1.0</float><str name="id">123</str></doc>
  </result>
</response>

That is, Solr gives me only the doc ID. How do I retrieve the doc's field text given its id? cheers, Giovanni
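For illustration, with the field names from this example the query would just need fl to list the stored fields to return (host and port are placeholders):

http://localhost:8983/solr/select/?q=solar&fl=id,text,score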
Re: querying thru solritas gives me zero results
Hi, Solritas uses the dismax query parser. The dismax config parameter 'qf' specifies the index fields to be searched in. Make sure that 'name' is your default search field. -Sascha Giovanni Gherdovich g.gherdov...@gmail.com wrote: Hi all, this morning I was very proud of myself since I managed to set up solritas ( http://wiki.apache.org/solr/VelocityResponseWriter ) for the solr instance on my server (ubuntu natty). This joy lasted only half a minute, since the only query that gets more than zero results with solritas is the catchall *:* for example: http://my.server.com:8080/solr/select/?q=foobar has thousands of results, http://my.server.com:8080/solr/itas?q=foobar has none. Here are the standard and velocity request handlers from my solrconfig.xml:

<requestHandler name="standard" class="solr.SearchHandler" default="true">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
  </lst>
</requestHandler>

<queryResponseWriter name="velocity" class="org.apache.solr.request.VelocityResponseWriter"/>

<requestHandler name="/itas" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="wt">velocity</str>
    <str name="v.template">browse</str>
    <str name="title">Solr cookbook example</str>
    <str name="defType">dismax</str>
    <str name="q.alt">*:*</str>
    <str name="rows">10</str>
    <str name="fl">*,score</str>
    <str name="qf">name</str>
  </lst>
</requestHandler>

Any hint on how I can debug that? cheers, Giovanni
Re: Searching for digits with strings
Hi, as far as I know Solr does not provide such a feature. If you cannot make any assumptions on the numbers, choose an appropriate library that is able to transform between numerical and non-numerical representations and populate the search field with both versions at index-time. -Sascha Alireza Salimi alireza.sal...@gmail.com schrieb: Hi, Well that's the only solution I got so far and it would work for most of the cases, but l thought there might be some better solutions. Thanks On Wed, Jun 27, 2012 at 5:49 PM, Upayavira u...@odoko.co.uk wrote: How many numbers? 0-9? Or every number under the sun? You could achieve a limited number by using synonyms, 0 is a synonym for nought and zero, etc. Upayavira On Wed, Jun 27, 2012, at 05:22 PM, Alireza Salimi wrote: Hi, I was wondering if there's a built in solution in Solr so that you can search for documents with digits by their string representations. i.e. search for 'two' would match fields which have '2' token and vice versa. Thanks
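As a sketch of the synonym approach Upayavira mentions, a synonyms.txt used by a SynonymFilterFactory (with expand="true") could map a limited set of digits to their spelled-out forms; the entries below are illustrative, not an exhaustive list:

0, zero, nought
1, one
2, two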
Re: getting started
Hi Mari, it depends ... * How many records are stored in your MySQL databases? * How often will updates occur? * How many db records / index documents are changed per update? I would suggest to start with a single Solr core first. Thereby, you can concentrate on the basics and do not need to deal with more advanced things like sharding. In case you encounter performance issues later on, you can switch to a multi-core setup. -Sascha Mari Masuda wrote: Hello, I am new to Solr and am in the beginning planning stage of a large project and could use some advice so as not to make a huge design blunder that I will regret down the road. Currently I have about 10 MySQL databases that store information about different archival collections. For example, we have data and metadata about a political poster collection, a television program, documents and photographs of and about a famous author, etc. My job is to work with the staff archivists to come up with a standard metadata template so the 10 databases can be consolidated into one. Currently the info in these databases is accessed through 10 different sets of PHP pages that were written a long time ago for PHP 4. My plan is to write a new Java application that will handle both public display of the info as well as an administrative interface so that staff members can add or edit the records. I have decided to use Solr as the search mechanism for this project. Because the info in each of our 10 collections is slightly different (e.g., a record about a poster does not contain duration information, but a record about a TV show does) I was thinking it would be good to separate each collection's index into a separate Solr core so that commits coming from one collection do not bog down the other unrelated collections. One reservation I have is that eventually we would like to be able to type in Iraq and find records across all of the collections at once instead of having to search each collection separately. Although I don't know anything about it at this stage, I did Google sharding after reading someone's recent post on this list and it sounds like that may be a potential answer to my question. Does anyone have any advice on how I should initially set up Solr for my situation? I am slowly making my way through the wiki and RTFMing, but I wanted to see what the experts have to say because at this point I don't really know where to start. Thank you very much, Mari
Re: Solr coding
Hi, depending on your needs, take a look at Apache ManifoldCF. It adds document-level security on top of Solr. -Sascha On 23.03.2011 14:20, satya swaroop wrote: Hi All, As for my project Requirement i need to keep privacy for search of files so that i need to modify the code of solr, for example if there are 5 users and each user indexes some files as user1 - java1, c1,sap1 user2 - java2, c2,sap2 user3 - java3, c3,sap3 user4 - java4, c4,sap4 user5 - java5, c5,sap5 and if a user2 searches for the keyword java then it should be display only the file java2 and not other files so inorder to keep this filtering inside solr itself may i know where to modify the code... i will access a database to check the user indexed files and then filter the result... i didnt have any cores.. i indexed all files in a single index... Regards, satya
Re: Search failing for matched text in large field
Hi Paul, did you increase the value of the maxFieldLength parameter in your solrconfig.xml? -Sascha On 23.03.2011 17:05, Paul wrote: I'm using solr 1.4.1. I have a document that has a pretty big field. If I search for a phrase that occurs near the start of that field, it works fine. If I search for a phrase that appears even a little ways into the field, it doesn't find it. Is there some limit to how far into a field solr will search? Here's the way I'm doing the search. All I'm changing is the text I'm searching on to make it succeed or fail: http://localhost:8983/solr/my_core/select/?q=%22search+phrase%22hl=onhl.fl=text Or, if it is not related to how large the document is, what else could it possibly be related to? Could there be some character in that field that is stopping the search?
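For reference, in Solr 1.4 this setting lives in solrconfig.xml (under the indexDefaults/mainIndex sections) and defaults to 10000 tokens per field; raising it lets the whole field be indexed. A sketch, with the concrete value being just an example:

<maxFieldLength>2147483647</maxFieldLength>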
Re: Search failing for matched text in large field
On 23.03.2011 18:52, Paul wrote: I increased maxFieldLength and reindexed a small number of documents. That worked -- I got the correct results. In 3 minutes! Did you mark the field in question as stored = false? -Sascha I assume that if I reindex all my documents that all searches will become even slower. Is there any way to get all the results in a way that is quick enough that my user won't get bored waiting? Is there some optimization of this coming in solr 3.0? On Wed, Mar 23, 2011 at 12:15 PM, Sascha Szottsz...@zib.de wrote: Hi Paul, did you increase the value of the maxFieldLength parameter in your solrconfig.xml? -Sascha On 23.03.2011 17:05, Paul wrote: I'm using solr 1.4.1. I have a document that has a pretty big field. If I search for a phrase that occurs near the start of that field, it works fine. If I search for a phrase that appears even a little ways into the field, it doesn't find it. Is there some limit to how far into a field solr will search? Here's the way I'm doing the search. All I'm changing is the text I'm searching on to make it succeed or fail: http://localhost:8983/solr/my_core/select/?q=%22search+phrase%22hl=onhl.fl=text Or, if it is not related to how large the document is, what else could it possibly be related to? Could there be some character in that field that is stopping the search?
Re: Index MS office
Hi, have a look at Solr's ExtractingRequestHandler: http://wiki.apache.org/solr/ExtractingRequestHandler -Sascha On 02.02.2011 16:49, Thumuluri, Sai wrote: Good Morning, I am planning to get started on indexing MS office using ApacheSolr - can someone please direct me where I should start? Thanks, Sai Thumuluri
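To give a feel for it, posting an Office document to that handler typically looks like the following curl call (host, core, unique id, upload parameter name and file name are placeholders):

curl "http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true" -F "myfile=@report.docx"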
Re: Malformed XML with exotic characters
Hi folks, I've made the same observation when working with Solr's ExtractingRequestHandler on the command line (no browser interaction). When issuing the following curl command curl 'http://mysolrhost/solr/update/extract?extractOnly=trueextractFormat=textwt=xmlresource.name=foo.pdf' --data-binary @foo.pdf -H 'Content-type:text/xml; charset=utf-8' foo.xml Solr's XML response writer returns malformed xml, e.g., xmllint gives me: foo.xml:21: parser error : Char 0xD835 out of allowed range foo.xml:21: parser error : PCDATA invalid Char value 55349 I'm not totally sure, if this is an Tika/PDFBox issue. However, I would expect in every case that the XML output produced by Solr is well-formed even if the libraries used under the hood return garbage. -Sascha p.s. I can provide the pdf file in question, if anybody would like to see it in action. On 01.02.2011 16:43, Markus Jelsma wrote: There is an issue with the XML response writer. It cannot cope with some very exotic characters or possibly the right-to-left writing systems. The issue can be reproduced by indexing the content of the home page of wikipedia as it contains a lot of exotic matter. The problem does not affect the JSON response writer. The problem is, i am unsure whether this is a bug in Solr or that perhaps Firefox itself trips over. Here's the output of the JSONResponeWriter for a query returning the home page: { responseHeader:{ status:0, QTime:1, params:{ fl:url,content, indent:true, wt:json, q:*:*, rows:1}}, response:{numFound:6744,start:0,docs:[ { url:http://www.wikipedia.org/;, content:Wikipedia English The Free Encyclopedia 3 543 000+ articles 日 本語 フリー百科事典 730 000+ 記事 Deutsch Die freie Enzyklopädie 1 181 000+ Artikel Español La enciclopedia libre 710 000+ artículos Français L’encyclopédie libre 1 061 000+ articles Русский Свободная энциклопедия 654 000+ статей Italiano L’enciclopedia libera 768 000+ voci Português A enciclopédia livre 669 000+ artigos Polski Wolna encyklopedia 769 000+ haseł Nederlands De vrije encyclopedie 668 000+ artikelen Search • Suchen • Rechercher • Szukaj • Ricerca • 検索 • Buscar • Busca • Zoeken • Поиск • Sök • 搜尋 • Cerca • Søk • Haku • Пошук • Hledání • Keresés • Căutare • 찾기 • Tìm kiếm • Ara • Cari • Søg • بحث • Serĉu • Претрага • Paieška • Hľadať • Suk • جستجو • חיפוש • Търсене • Poišči • Cari • Bilnga العربية Български Català Česky Dansk Deutsch English Español Esperanto فارسی Français 한국어 Bahasa Indonesia Italiano עברית Lietuvių Magyar Bahasa Melayu Nederlands 日本語 Norsk (bokmål) Polski Português Română Русский Slovenčina Slovenščina Српски / Srpski Suomi Svenska Türkçe Українська Tiếng Việt Volapük Winaray 中文 100 000+ العربية • Български • Català • Česky • Dansk • Deutsch • English • Español • Esperanto • فارسی • Français • 한국어 • Bahasa Indonesia • Italiano • עברית • Lietuvių • Magyar • Bahasa Melayu • Nederlands • 日本語 • Norsk (bokmål) • Polski • Português • Русский • Română • Slovenčina • Slovenščina • Српски / Srpski • Suomi • Svenska • Türkçe • Українська • Tiếng Việt • Volapük • Winaray • 中文 10 000+ Afrikaans • Aragonés • Armãneashce • Asturianu • Kreyòl Ayisyen • Azərbaycan / آذربايجان ديلی • বাংলা • Беларуская ( Акадэмічная • Тарашкевiца ) • বিষ্ণুপ্রিযা় মণিপুরী • Bosanski • Brezhoneg • Чăваш • Cymraeg • Eesti • Ελληνικά • Euskara • Frysk • Gaeilge • Galego • ગુજરાતી • Հայերեն • हिन्दी • Hrvatski • Ido • Íslenska • Basa Jawa • ಕನ್ನಡ • ქართული • Kurdî / كوردی • Latina • Latviešu • Lëtzebuergesch • Lumbaart • Македонски • മലയാളം • मराठी • नेपाल भाषा • नेपाली • Norsk (nynorsk) • Nnapulitano • 
Occitan • Piemontèis • Plattdüütsch • Ripoarisch • Runa Simi • شاہ مکھی پنجابی • Shqip • Sicilianu • Simple English • Sinugboanon • Srpskohrvatski / Српскохрватски • Basa Sunda • Kiswahili • Tagalog • தமிழ் • తెలుగు • ไทย • اردو • Walon • Yorùbá • 粵語 • Žemaitėška 1 000+ Bahsa Acèh • Alemannisch • አማርኛ • Arpitan • ܐܬܘܪܝܐ • Avañe’ẽ • Aymar Aru • Bân-lâm-gú • Bahasa Banjar • Basa Banyumasan • Башҡорт • भोजपुरी • Bikol Central • Boarisch • བོད་ཡིག • Chavacano de Zamboanga • Corsu • Deitsch • ދިވެހި • Diné Bizaad • Eald Englisc • Emigliàn–Rumagnòl • Эрзянь • Estremeñu • Fiji Hindi • Føroyskt • Furlan • Gaelg • Gàidhlig • 贛語 • گیلکی • Hak- kâ-fa / 客家話 • Хальмг • ʻŌlelo Hawaiʻi • Hornjoserbsce • Ilokano • Interlingua • Interlingue • Ирон Æвзаг • Kapampangan • Kaszëbsczi • Kernewek • ភាសាខ្មែរ • Kinyarwanda • Коми • Кыргызча • Ladino / לאדינו • Ligure • Limburgs • Lingála • lojban • Malagasy • Malti • 文言 • Māori • مصرى • مازِرونی / Mäzeruni • Монгол • မြန်မာဘာသာ • Nāhuatlahtōlli • Nedersaksisch • Nouormand • Novial • Нохчийн • Олык Марий • O‘zbek • पाऴि • Pangasinán • ਪੰਜਾਬੀ / پنجابی • Papiamentu • پښتو • Picard •
Re: Malformed XML with exotic characters
Hi Markus, in my case the JSON response writer returns valid JSON. The same holds for the PHP response writer. -Sascha On 01.02.2011 18:44, Markus Jelsma wrote: You can exclude the input's involvement by checking if other response writers do work. For me, the JSONResponseWriter works perfectly with the same returned data in some AJAX environment. On Tuesday 01 February 2011 18:29:06 Sascha Szott wrote: Hi folks, I've made the same observation when working with Solr's ExtractingRequestHandler on the command line (no browser interaction). When issuing the following curl command curl 'http://mysolrhost/solr/update/extract?extractOnly=trueextractFormat=text; wt=xmlresource.name=foo.pdf' --data-binary @foo.pdf -H 'Content-type:text/xml; charset=utf-8' foo.xml Solr's XML response writer returns malformed xml, e.g., xmllint gives me: foo.xml:21: parser error : Char 0xD835 out of allowed range foo.xml:21: parser error : PCDATA invalid Char value 55349 I'm not totally sure, if this is an Tika/PDFBox issue. However, I would expect in every case that the XML output produced by Solr is well-formed even if the libraries used under the hood return garbage. -Sascha p.s. I can provide the pdf file in question, if anybody would like to see it in action. On 01.02.2011 16:43, Markus Jelsma wrote: There is an issue with the XML response writer. It cannot cope with some very exotic characters or possibly the right-to-left writing systems. The issue can be reproduced by indexing the content of the home page of wikipedia as it contains a lot of exotic matter. The problem does not affect the JSON response writer. The problem is, i am unsure whether this is a bug in Solr or that perhaps Firefox itself trips over. Here's the output of the JSONResponeWriter for a query returning the home page: { responseHeader:{ status:0, QTime:1, params:{ fl:url,content, indent:true, wt:json, q:*:*, rows:1}}, response:{numFound:6744,start:0,docs:[ { url:http://www.wikipedia.org/;, content:Wikipedia English The Free Encyclopedia 3 543 000+ articles 日 本語 フリー百科事典 730 000+ 記事 Deutsch Die freie Enzyklopädie 1 181 000+ Artikel Español La enciclopedia libre 710 000+ artículos Français L’encyclopédie libre 1 061 000+ articles Русский Свободная энциклопедия 654 000+ статей Italiano L’enciclopedia libera 768 000+ voci Português A enciclopédia livre 669 000+ artigos Polski Wolna encyklopedia 769 000+ haseł Nederlands De vrije encyclopedie 668 000+ artikelen Search • Suchen • Rechercher • Szukaj • Ricerca • 検索 • Buscar • Busca • Zoeken • Поиск • Sök • 搜尋 • Cerca • Søk • Haku • Пошук • Hledání • Keresés • Căutare • 찾기 • Tìm kiếm • Ara • Cari • Søg • بحث • Serĉu • Претрага • Paieška • Hľadať • Suk • جستجو • חיפוש • Търсене • Poišči • Cari • Bilnga العربية Български Català Česky Dansk Deutsch English Español Esperanto فارسی Français 한국어 Bahasa Indonesia Italiano עברית Lietuvių Magyar Bahasa Melayu Nederlands 日本語 Norsk (bokmål) Polski Português Română Русский Slovenčina Slovenščina Српски / Srpski Suomi Svenska Türkçe Українська Tiếng Việt Volapük Winaray 中文 100 000+ العربية • Български • Català • Česky • Dansk • Deutsch • English • Español • Esperanto • فارسی • Français • 한국어 • Bahasa Indonesia • Italiano • עברית • Lietuvių • Magyar • Bahasa Melayu • Nederlands • 日本語 • Norsk (bokmål) • Polski • Português • Русский • Română • Slovenčina • Slovenščina • Српски / Srpski • Suomi • Svenska • Türkçe • Українська • Tiếng Việt • Volapük • Winaray • 中文 10 000+ Afrikaans • Aragonés • Armãneashce • Asturianu • Kreyòl Ayisyen • Azərbaycan / آذربايجان ديلی • 
বাংলা • Беларуская ( Акадэмічная • Тарашкевiца ) • বিষ্ণুপ্রিযা় মণিপুরী • Bosanski • Brezhoneg • Чăваш • Cymraeg • Eesti • Ελληνικά • Euskara • Frysk • Gaeilge • Galego • ગુજરાતી • Հայերեն • हिन्दी • Hrvatski • Ido • Íslenska • Basa Jawa • ಕನ್ನಡ • ქართული • Kurdî / كوردی • Latina • Latviešu • Lëtzebuergesch • Lumbaart • Македонски • മലയാളം • मराठी • नेपाल भाषा • नेपाली • Norsk (nynorsk) • Nnapulitano • Occitan • Piemontèis • Plattdüütsch • Ripoarisch • Runa Simi • شاہ مکھی پنجابی • Shqip • Sicilianu • Simple English • Sinugboanon • Srpskohrvatski / Српскохрватски • Basa Sunda • Kiswahili • Tagalog • தமிழ் • తెలుగు • ไทย • اردو • Walon • Yorùbá • 粵語 • Žemaitėška 1 000+ Bahsa Acèh • Alemannisch • አማርኛ • Arpitan • ܐܬܘܪܝܐ • Avañe’ẽ • Aymar Aru • Bân-lâm-gú • Bahasa Banjar • Basa Banyumasan • Башҡорт • भोजपुरी • Bikol Central • Boarisch • བོད་ཡིག • Chavacano de Zamboanga • Corsu • Deitsch • ދިވެހި • Diné Bizaad • Eald Englisc • Emigliàn–Rumagnòl • Эрзянь • Estremeñu • Fiji Hindi • Føroyskt • Furlan • Gaelg • Gàidhlig • 贛語 • گیلکی • Hak- kâ-fa / 客家話 • Хальмг • ʻŌlelo Hawaiʻi • Hornjoserbsce • Ilokano
missing type check when working with pint field type
Hi folks, I've noticed an unexpected behavior while working with the various built-in integer field types (int, tint, pint). It seems as if the first two are subject to type checking, while the latter one is not. I'll give you an example based on the example schema that is shipped with Solr. When trying to index the document

<doc>
  <field name="id">1</field>
  <field name="foo_i">invalid_value</field>
  <field name="foo_ti">1</field>
  <field name="foo_pi">1</field>
</doc>

Solr responds with a NumberFormatException (the same holds when setting the value of foo_ti to invalid_value): java.lang.NumberFormatException: For input string: "invalid_value" Surprisingly, an attempt to index the document

<doc>
  <field name="id">1</field>
  <field name="foo_i">1</field>
  <field name="foo_ti">1</field>
  <field name="foo_pi">invalid_value</field>
</doc>

is successful. In the end, sorting on foo_pi leads to an exception, e.g., http://localhost:8983/solr/select?q=*:*&sort=foo_pi desc raises an HTTP 500 error:
java.lang.StringIndexOutOfBoundsException: String index out of range: 0
  at java.lang.String.charAt(String.java:686)
  at org.apache.lucene.search.FieldCache$7.parseInt(FieldCache.java:234)
  at org.apache.lucene.search.FieldCacheImpl$IntCache.createValue(FieldCacheImpl.java:457)
  at org.apache.lucene.search.FieldCacheImpl$Cache.get(FieldCacheImpl.java:224)
  at org.apache.lucene.search.FieldCacheImpl.getInts(FieldCacheImpl.java:430)
  at org.apache.lucene.search.FieldCacheImpl$IntCache.createValue(FieldCacheImpl.java:447)
  at org.apache.lucene.search.FieldCacheImpl$Cache.get(FieldCacheImpl.java:224)
  at org.apache.lucene.search.FieldCacheImpl.getInts(FieldCacheImpl.java:430)
  at org.apache.lucene.search.FieldComparator$IntComparator.setNextReader(FieldComparator.java:332)
  at org.apache.lucene.search.TopFieldCollector$OneComparatorNonScoringCollector.setNextReader(TopFieldCollector.java:94)
  at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:249)
  at org.apache.lucene.search.Searcher.search(Searcher.java:171)
  at org.apache.solr.search.SolrIndexSearcher.getDocListNC(SolrIndexSearcher.java:988)
  at org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:884)
  at org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:341)
  at org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:182)
  at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195)
  at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
  at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
  at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
  at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
  [...]
Is this a bug or did I miss something? -Sascha
Re: missing type check when working with pint field type
Hi Erick, I see the point. But what is pint (plong, pfloat, pdouble) actually intended for (sorting is not possible, no type checking is performed)? Seems to me as it is something very similar to the string type (both store and index the value verbatim). -Sascha On 18.01.2011 14:38, Erick Erickson wrote: I suspect you missed this comment in the schema file: *** Plain numeric field types that store and index the text value verbatim (and hence don't support range queries, since the lexicographic ordering isn't equal to the numeric ordering) *** So what's happening is that the field is being indexed as a text type and, I suspect, begin tokenized. The error you're getting is when trying to sort against a tokenized field which is undefined. At least that's my story and I'm sticking to it Best Erick On Tue, Jan 18, 2011 at 8:10 AM, Sascha Szottsz...@zib.de wrote: Hi folks, I've noticed an unexpected behavior while working with the various built-in integer field types (int, tint, pint). It seems as the first two ones are subject to type checking, while the latter one is not. I'll give you an example based on the example schema that is shipped out with Solr. When trying to index the document doc field name=id1/field field name=foo_iinvalid_value/field field name=foo_ti1/field field name=foo_pi1/field /doc Solr responds with a NumberFormatException (the same holds when setting the value of foo_ti to invalid_value): java.lang.NumberFormatException: For input string: invalid_value Surprisingly, an attempt to index the document doc field name=id1/field field name=foo_i1/field field name=foo_ti1/field field name=foo_piinvalid_value/field /doc is successful. In the end, sorting on foo_pi leads to an exception, e.g., http://localhost:8983/solr/select?q=*:*sort=foo_pi desc raises an HTTP 500 error: java.lang.StringIndexOutOfBoundsException: String index out of range: 0 at java.lang.String.charAt(String.java:686) at org.apache.lucene.search.FieldCache$7.parseInt(FieldCache.java:234) at org.apache.lucene.search.FieldCacheImpl$IntCache.createValue(FieldCacheImpl.java:457) at org.apache.lucene.search.FieldCacheImpl$Cache.get(FieldCacheImpl.java:224) at org.apache.lucene.search.FieldCacheImpl.getInts(FieldCacheImpl.java:430) at org.apache.lucene.search.FieldCacheImpl$IntCache.createValue(FieldCacheImpl.java:447) at org.apache.lucene.search.FieldCacheImpl$Cache.get(FieldCacheImpl.java:224) at org.apache.lucene.search.FieldCacheImpl.getInts(FieldCacheImpl.java:430) at org.apache.lucene.search.FieldComparator$IntComparator.setNextReader(FieldComparator.java:332) at org.apache.lucene.search.TopFieldCollector$OneComparatorNonScoringCollector.setNextReader(TopFieldCollector.java:94) at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:249) at org.apache.lucene.search.Searcher.search(Searcher.java:171) at org.apache.solr.search.SolrIndexSearcher.getDocListNC(SolrIndexSearcher.java:988) at org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:884) at org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:341) at org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:182) at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316) at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338) at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241) [...] Is this a bug or did I missed something? -Sascha -- Sascha Szott :: KOBV/ZIB :: sz...@zib.de :: +49 30 84185-457
Re: post search using solrj
Hi Don, you could give the HTTP method to be used as a second argument to the QueryRequest constructor: [http://lucene.apache.org/solr/api/org/apache/solr/client/solrj/request/QueryRequest.html#QueryRequest(org.apache.solr.common.params.SolrParams,%20org.apache.solr.client.solrj.SolrRequest.METHOD)] -Sascha Don Hill wrote: Hi. I am using solrj and it has been working fine. I now have a requirement to add more parameters. So many that I get a max URI exceeded error. Is there anyway using SolrQuery todo a http post so I don't have these issues? don
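A minimal SolrJ sketch of this suggestion (exception handling omitted; solrServer stands for an already-configured SolrServer instance and the query string is a placeholder):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrRequest;
import org.apache.solr.client.solrj.request.QueryRequest;
import org.apache.solr.client.solrj.response.QueryResponse;

SolrQuery query = new SolrQuery("*:*");
// with METHOD.POST the (long) parameter list travels in the request body instead of the URI
QueryRequest request = new QueryRequest(query, SolrRequest.METHOD.POST);
QueryResponse response = request.process(solrServer);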
DataImportHandler in Solr 1.4.1: exception handling in FileListEntityProcessor
Hi folks, why does FileListEntityProcessor ignore onError=continue and abort indexing if a directory or a file does not exist? I'm using both XPathEntityProcessor and FileListEntityProcessor with onError set to continue. In case a directory or file is not present, an Exception is thrown and indexing is stopped immediately. Below you can find a stack trace that is generated in case the directory /home/doe/foo does not exist:
SEVERE: Full Import failed org.apache.solr.handler.dataimport.DataImportHandlerException: 'baseDir' value: /home/doe/foo/bar.xml is not a directory Processing Document # 3
  at org.apache.solr.handler.dataimport.FileListEntityProcessor.init(FileListEntityProcessor.java:122)
  at org.apache.solr.handler.dataimport.EntityProcessorWrapper.init(EntityProcessorWrapper.java:71)
  at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:319)
  at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:383)
  at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:242)
  at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:180)
  at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:331)
  at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:389)
  at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:370)
How should I configure both processors so that missing directories and files are ignored and the indexing process does not stop immediately? Best, Sascha
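For context, the kind of data-config.xml setup being described looks roughly like the sketch below; the baseDir value comes from the stack trace, while the file pattern, forEach expression and field mappings are placeholders:

<entity name="files" processor="FileListEntityProcessor"
        baseDir="/home/doe/foo" fileName=".*\.xml"
        rootEntity="false" onError="continue">
  <entity name="records" processor="XPathEntityProcessor"
          url="${files.fileAbsolutePath}" forEach="/record"
          onError="continue">
    <!-- field mappings omitted -->
  </entity>
</entity>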
Re: DataImportHandler in Solr 1.4.1: exception handling in FileListEntityProcessor
Sorry, there was a mistake in the stack trace. The correct one is: SEVERE: Full Import failed org.apache.solr.handler.dataimport.DataImportHandlerException: 'baseDir' value: /home/doe/foo is not a directory Processing Document # 3 at org.apache.solr.handler.dataimport.FileListEntityProcessor.init(FileListEntityProcessor.java:122) at org.apache.solr.handler.dataimport.EntityProcessorWrapper.init(EntityProcessorWrapper.java:71) at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:319) at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:383) at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:242) at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:180) at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:331) at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:389) at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:370) -Sascha On 11.08.2010 15:18, Sascha Szott wrote: Hi folks, why does FileListEntityProcessor ignores onError=continue and abort indexing if a directory or a file does not exist? I'm using both XPathEntityProcessor and FileListEntityProcessor with onError set to continue. In case a directory or file is not present an Exception is thrown and indexing is stopped immediately. Below you can find a stack trace that is generated in case the directory /home/doe/foo does not exist: SEVERE: Full Import failed org.apache.solr.handler.dataimport.DataImportHandlerException: 'baseDir' value: /home/doe/foo/bar.xml is not a directory Processing Document # 3 at org.apache.solr.handler.dataimport.FileListEntityProcessor.init(FileListEntityProcessor.java:122) at org.apache.solr.handler.dataimport.EntityProcessorWrapper.init(EntityProcessorWrapper.java:71) at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:319) at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:383) at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:242) at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:180) at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:331) at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:389) at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:370) How should I configure both processors so that missing directories and files are ignored and the indexing process does not stop immediately? Best, Sascha
Re: problem with formulating a negative query
Hi, Chris Hostetter wrote: AND, OR, and NOT are just syntactic sugar for modifying the MUST, MUST_NOT, and SHOULD. The default op of OR only affects the first clause of your query (R) because it doesn't have any modifiers -- Thanks for pointing that out! -Sascha the second clause has that NOT modifier so your query is effectively... topic:R -topic:[* TO *] ...which by definition can't match anything. -Hoss
Re: problem with formulating a negative query
Hi Erick, thanks for your explanations. But why are all docs being *removed* from the set of all docs that contain R in their topic field? This would correspond to a boolean AND and would stand in conflict with the clause q.op=OR. This seems a bit strange to me. Furthermore, Smiley & Pugh stated in their Solr 1.4 book on pg. 102 that adding a subexpression containing the negative query (-[* TO *]) and the match-all-docs clause (*:*) is only a workaround. Why is this workaround necessary at all? Best, Sascha Erick Erickson wrote: This may help: http://lucene.apache.org/java/2_4_0/queryparsersyntax.html#Boolean%20operators But the clause you specified translates roughly as: find all the documents that contain R, then remove any of them that match [* TO *]. [* TO *] contains all the documents with R, so everything you just matched is removed from your results. HTH Erick On Tue, Jun 29, 2010 at 12:40 PM, Sascha Szott sz...@zib.de wrote: Hi Ahmet, it works, thanks a lot! To be honest, I have no idea what's the problem with defType=lucene&q.op=OR&df=topic&q=R NOT [* TO *] -Sascha Ahmet Arslan wrote: I have a (multi-valued) field topic in my index which does not need to exist in every document. Now, I'm struggling with formulating a query that returns all documents that either have no topic field at all *or* whose topic field value is R. Does this work? defType=lucene&q.op=OR&q=topic:R (+*:* -topic:[* TO *])
Re: Is there a way to delete multiple documents using wildcard?
Hi, you can delete all docs that match a certain query: <delete><query>uid:6-HOST*</query></delete> -Sascha bbarani wrote: Hi, I am trying to delete a group of documents using a wildcard. Something like update?commit=true -H "Content-Type: text/xml" --data-binary '<delete><doc><field name="uid">6-HOST*</field></doc></delete>' I want to delete all documents whose uid starts with 6-HOST, but this query doesn't seem to work.. Am I doing anything wrong?? Thanks, BB
Re: Is there a way to delete multiple documents using wildcard?
Hi, does /select?q=uid:6-HOST* return any documents? -Sascha bbarani wrote: Hi, Thanks a lot for your reply.. I tried the below query update?commit=true -H "Content-Type: text/xml" --data-binary '<delete><query>uid:6-HOST*</query></delete>' But even now none of the documents are getting deleted.. Am I forming the URL wrong? Thanks, BB
Re: Is there a way to delete multiple documents using wildcard?
Hi, take a look inside Solr's log file. Are there any error messages with respect to the update request? Furthermore, you could try the following two commands instead: curl "http://host:port/solr/update" --form-string 'stream.body=<delete><query>uid:6-HOST*</query></delete>' and curl "http://host:port/solr/update" --form-string 'stream.body=<commit/>' -Sascha bbarani wrote: Yeah, I am getting the results when I use the /select handler. I tried the below query.. /select?q=uid:6-HOST* Got <result name="response" numFound="52920" start="0"> Thanks BB
problem with formulating a negative query
Hi folks, I have a (multi-valued) field topic in my index which does not need to exist in every document. Now, I'm struggling with formulating a query that returns all documents that either have no topic field at all *or* whose topic field value is R. Unfortunately, the query /select?q={!lucene q.op=OR df=topic}(R NOT [* TO *]) does not return any docs, even though there are documents in my index that fulfil the specified condition, as you can deduce from the queries listed below: /select?q=topic:R returns 0 docs /select?q=-topic:[* TO *] returns 0 docs Appending debugQuery=true to the query returns:
<str name="rawquerystring">{!lucene q.op=OR df=topic}(R NOT [* TO *])</str>
<str name="querystring">{!lucene q.op=OR df=topic}(R NOT [* TO *])</str>
<str name="parsedquery">topic:R -topic:[* TO *]</str>
<str name="parsedquery_toString">topic:R -topic:[* TO *]</str>
Does anybody have a clue what is wrong here? Thanks in advance, Sascha
Re: Specifiying multiple mlt.fl fields
Hi Darren, try mlt.fl=field1 field2 Best, Sascha Darren Govoni wrote: Hi, I read the wiki and tried about a dozen variations such as: ...mlt.fl=field1mlt.fl=field2 and ...mlt.fl=field1,field2... to specify more than one MLT field and it won't take. What's the trick? Also, how to do it with SolrJ? Nothing I try works. Solr 4.0 nightly build. Any tips, very appreciated! Darren
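To sketch the SolrJ side of the question (field names are placeholders, and this assumes the MoreLikeThis component is enabled on the handler being queried), the same parameters can simply be set on the query object:

import org.apache.solr.client.solrj.SolrQuery;

SolrQuery q = new SolrQuery("id:12345");
q.set("mlt", "true");
q.set("mlt.fl", "field1 field2");  // the field list, as suggested above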
Re: federated / meta search
Hi Joe Markus, sounds good! Maybe I should better add a note on the Wiki page on federated search [1]. Thanks, Sascha [1] http://wiki.apache.org/solr/FederatedSearch Joe Calderon wrote: yes, you can use distributed search across shards with different schemas as long as the query only references overlapping fields, i usually test adding new fields or tokenizers on one shard and deploy only after i verified its working properly On Thu, Jun 17, 2010 at 1:10 PM, Markus Jelsmamarkus.jel...@buyways.nl wrote: Hi, Check out Solr sharding [1] capabilities. I never tested it with different schema's but if each node is queried with fields that it supports, it should return useful results. [1]: http://wiki.apache.org/solr/DistributedSearch Cheers. -Original message- From: Sascha Szottsz...@zib.de Sent: Thu 17-06-2010 19:44 To: solr-user@lucene.apache.org; Subject: federated / meta search Hi folks, if I'm seeing it right Solr currently does not provide any support for federated / meta searching. Therefore, I'd like to know if anyone has already put efforts into this direction? Moreover, is federated / meta search considered a scenario Solr should be able to deal with at all or is it (far) beyond the scope of Solr? To be more precise, I'll give you a short explanation of my requirements. Assume, there are a couple of Solr instances running at different places. The documents stored within those instances are all from the same domain (bibliographic records), but it can not be ensured that the schema definitions conform to 100%. But lets say, there are at least some index fields that are present in all instances (fields with the same name and type definition). Now, I'd like to perform a search on all instances at the same time (with the restriction that the query contains only those fields that overlap among the different schemas) and combine the results in a reasonable way by utilizing the score information associated with each hit. Please note, that due to legal issues it is not feasible to build a single index that integrates the documents of all Solr instances under consideration. Thanks in advance, Sascha
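For illustration, the distributed request Markus points to is an ordinary query plus a shards parameter listing the participating instances; host names, ports and the query below are placeholders:

http://host1:8983/solr/select?q=title:foo&shards=host1:8983/solr,host2:8983/solr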
federated / meta search
Hi folks, if I'm seeing it right Solr currently does not provide any support for federated / meta searching. Therefore, I'd like to know if anyone has already put efforts into this direction? Moreover, is federated / meta search considered a scenario Solr should be able to deal with at all or is it (far) beyond the scope of Solr? To be more precise, I'll give you a short explanation of my requirements. Assume, there are a couple of Solr instances running at different places. The documents stored within those instances are all from the same domain (bibliographic records), but it can not be ensured that the schema definitions conform to 100%. But lets say, there are at least some index fields that are present in all instances (fields with the same name and type definition). Now, I'd like to perform a search on all instances at the same time (with the restriction that the query contains only those fields that overlap among the different schemas) and combine the results in a reasonable way by utilizing the score information associated with each hit. Please note, that due to legal issues it is not feasible to build a single index that integrates the documents of all Solr instances under consideration. Thanks in advance, Sascha
Re: strange results with query and hyphened words
Hi Markus, the default-config for index is: filter class=solr.WordDelimiterFilterFactory generateWordParts=1 generateNumberParts=1 catenateWords=1 catenateNumbers=1 catenateAll=0 splitOnCaseChange=1/ and for query: filter class=solr.WordDelimiterFilterFactory generateWordParts=0 generateNumberParts=0 catenateWords=0 catenateNumbers=0 catenateAll=0/ That's not true. The default configuration for query-time processing is: filter class=solr.WordDelimiterFilterFactory generateWordParts=1 generateNumberParts=1 catenateWords=0 catenateNumbers=0 catenateAll=0 splitOnCaseChange=1/ By using this setting, a search for profi-auskunft will match profiauskunft. It's important to note, that WordDelimiterFilterFactory's catenate* parameters should only be used in the index-time analysis stack. Otherwise the strange behaviour (search for profi-auskunft is translated into profi followed by (auskunft or profiauskunft) you mentioned will occur. Best, Sascha -Ursprüngliche Nachricht- Von: Sascha Szott [mailto:sz...@zib.de] Gesendet: Sonntag, 30. Mai 2010 19:01 An: solr-user@lucene.apache.org Betreff: Re: strange results with query and hyphened words Hi Markus, I was facing the same problem a few days ago and found an explanation in the mail archive that clarifies my question regarding the usage of Solr's WordDelimiterFilterFactory: http://markmail.org/message/qoby6kneedtwd42h Best, Sascha markus.rietz...@rzf.fin-nrw.de wrote: i am wondering why a search term with hyphen doesn't match. my search term is prof-auskunft. in WordDelimiterFilterFactory i have catenateWords, so my understanding is that profi-auskunft would search for profiauskunft. when i use the analyse panel in solr admi i see that profi-auskunft matches a term profiauskunft. the analyse will show Query Analyzer WhitespaceTokenizerFactory profi-auskunft SynonymFilterFactory profi-auskunft StopFilterFactory profi-auskunft WordDelimiterFilterFactory term position 1 2 term text profi auskunft profiauskunft term type wordword word source start,end0,5 6,14 0,15 LowerCaseFilterFactory SnowballPorterFilterFactory why is auskunft and profiauskunft in one column. how do they get searched? when i search profiauskunft i have 230 hits, when i now search for profi-auskunft i do get less hits. when i call the search with debugQuery=on i see body:profi (auskunft profiauskunft) what does this query mean? profi and auskunft or profiauskunft? fieldType name=text_de class=solr.TextField positionIncrementGap=100 analyzer type=index charFilter class=solr.HTMLStripCharFilterFactory / tokenizer class=solr.WhitespaceTokenizerFactory/ !-- sg324 bei wortern die durch - und weitere leerzeichen getrennt sind, werden diese zusammengefuehrt. -- filter class=solr.HiphenatedWordsFilterFactory/ !-- in this example, we will only use synonyms at query time filter class=solr.SynonymFilterFactory synonyms=index_synonyms_de.txt ignoreCase=true expand=false/ -- !-- Case insensitive stop word removal. add enablePositionIncrements=true in both the index and query analyzers to leave a 'gap' for more accurate phrase queries. 
-- filter class=solr.StopFilterFactory ignoreCase=true words=de/stopwords_de.txt enablePositionIncrements=true / !-- sg324 -- filter class=solr.WordDelimiterFilterFactory generateWordParts=1 generateNumberParts=1 catenateWords=1 catenateNumbers=0 catenateAll=0 splitOnCaseChange=1/ filter class=solr.LowerCaseFilterFactory/ filter class=solr.SnowballPorterFilterFactory language=German protected=de/protwords_de.txt/ /analyzer analyzer type=query tokenizer class=solr.WhitespaceTokenizerFactory/ filter class=solr.SynonymFilterFactory synonyms=de/synonyms_de.txt ignoreCase=true expand=true/ filter class=solr.StopFilterFactory ignoreCase=true words=de/stopwords_de.txt enablePositionIncrements=true / filter class=solr.WordDelimiterFilterFactory generateWordParts=1 generateNumberParts=1 catenateWords=1 catenateNumbers=0 catenateAll=0 splitOnCaseChange=1/ filter class=solr.LowerCaseFilterFactory/ filter class=solr.SnowballPorterFilterFactory language=German protected=de/protwords_de.txt/ /analyzer /fieldType
Re: strange results with query and hyphened words
Sorry Markus, I mixed up the index and query field in analysis.jsp. In fact, I meant that a search for profiauskunft matches profi-auskunft. I'm not sure, whether the case you are dealing with (search for profi-auskunft should match profiauskunft) is appropriately addressed by the WordDelimiterFilter. What about using the PatternReplaceCharFilter at query time to eliminate all intra-word hyphens? -Sascha Sascha Szott wrote: Hi Markus, the default-config for index is: filter class=solr.WordDelimiterFilterFactory generateWordParts=1 generateNumberParts=1 catenateWords=1 catenateNumbers=1 catenateAll=0 splitOnCaseChange=1/ and for query: filter class=solr.WordDelimiterFilterFactory generateWordParts=0 generateNumberParts=0 catenateWords=0 catenateNumbers=0 catenateAll=0/ That's not true. The default configuration for query-time processing is: filter class=solr.WordDelimiterFilterFactory generateWordParts=1 generateNumberParts=1 catenateWords=0 catenateNumbers=0 catenateAll=0 splitOnCaseChange=1/ By using this setting, a search for profi-auskunft will match profiauskunft. It's important to note, that WordDelimiterFilterFactory's catenate* parameters should only be used in the index-time analysis stack. Otherwise the strange behaviour (search for profi-auskunft is translated into profi followed by (auskunft or profiauskunft) you mentioned will occur. Best, Sascha -Ursprüngliche Nachricht- Von: Sascha Szott [mailto:sz...@zib.de] Gesendet: Sonntag, 30. Mai 2010 19:01 An: solr-user@lucene.apache.org Betreff: Re: strange results with query and hyphened words Hi Markus, I was facing the same problem a few days ago and found an explanation in the mail archive that clarifies my question regarding the usage of Solr's WordDelimiterFilterFactory: http://markmail.org/message/qoby6kneedtwd42h Best, Sascha markus.rietz...@rzf.fin-nrw.de wrote: i am wondering why a search term with hyphen doesn't match. my search term is prof-auskunft. in WordDelimiterFilterFactory i have catenateWords, so my understanding is that profi-auskunft would search for profiauskunft. when i use the analyse panel in solr admi i see that profi-auskunft matches a term profiauskunft. the analyse will show Query Analyzer WhitespaceTokenizerFactory profi-auskunft SynonymFilterFactory profi-auskunft StopFilterFactory profi-auskunft WordDelimiterFilterFactory term position 1 2 term text profi auskunft profiauskunft term type word word word source start,end 0,5 6,14 0,15 LowerCaseFilterFactory SnowballPorterFilterFactory why is auskunft and profiauskunft in one column. how do they get searched? when i search profiauskunft i have 230 hits, when i now search for profi-auskunft i do get less hits. when i call the search with debugQuery=on i see body:profi (auskunft profiauskunft) what does this query mean? profi and auskunft or profiauskunft? fieldType name=text_de class=solr.TextField positionIncrementGap=100 analyzer type=index charFilter class=solr.HTMLStripCharFilterFactory / tokenizer class=solr.WhitespaceTokenizerFactory/ !-- sg324 bei wortern die durch - und weitere leerzeichen getrennt sind, werden diese zusammengefuehrt. -- filter class=solr.HiphenatedWordsFilterFactory/ !-- in this example, we will only use synonyms at query time filter class=solr.SynonymFilterFactory synonyms=index_synonyms_de.txt ignoreCase=true expand=false/ -- !-- Case insensitive stop word removal. add enablePositionIncrements=true in both the index and query analyzers to leave a 'gap' for more accurate phrase queries. 
-- filter class=solr.StopFilterFactory ignoreCase=true words=de/stopwords_de.txt enablePositionIncrements=true / !-- sg324 -- filter class=solr.WordDelimiterFilterFactory generateWordParts=1 generateNumberParts=1 catenateWords=1 catenateNumbers=0 catenateAll=0 splitOnCaseChange=1/ filter class=solr.LowerCaseFilterFactory/ filter class=solr.SnowballPorterFilterFactory language=German protected=de/protwords_de.txt/ /analyzer analyzer type=query tokenizer class=solr.WhitespaceTokenizerFactory/ filter class=solr.SynonymFilterFactory synonyms=de/synonyms_de.txt ignoreCase=true expand=true/ filter class=solr.StopFilterFactory ignoreCase=true words=de/stopwords_de.txt enablePositionIncrements=true / filter class=solr.WordDelimiterFilterFactory generateWordParts=1 generateNumberParts=1 catenateWords=1 catenateNumbers=0 catenateAll=0 splitOnCaseChange=1/ filter class=solr.LowerCaseFilterFactory/ filter class=solr.SnowballPorterFilterFactory language=German protected=de/protwords_de.txt/ /analyzer /fieldType
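To sketch the PatternReplaceCharFilter idea mentioned above (the regex is illustrative only and would need testing against real data), the query analyzer could strip intra-word hyphens before tokenization:

<analyzer type="query">
  <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="(\w+)-(\w+)" replacement="$1$2"/>
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <!-- remaining filters as in the existing query analyzer -->
</analyzer>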
Re: strange results with query and hyphened words
Hi Markus, I was facing the same problem a few days ago and found an explanation in the mail archive that clarifies my question regarding the usage of Solr's WordDelimiterFilterFactory: http://markmail.org/message/qoby6kneedtwd42h Best, Sascha markus.rietz...@rzf.fin-nrw.de wrote: i am wondering why a search term with hyphen doesn't match. my search term is prof-auskunft. in WordDelimiterFilterFactory i have catenateWords, so my understanding is that profi-auskunft would search for profiauskunft. when i use the analyse panel in solr admi i see that profi-auskunft matches a term profiauskunft. the analyse will show Query Analyzer WhitespaceTokenizerFactory profi-auskunft SynonymFilterFactory profi-auskunft StopFilterFactory profi-auskunft WordDelimiterFilterFactory term position 1 2 term text profi auskunft profiauskunft term type wordword word source start,end0,5 6,14 0,15 LowerCaseFilterFactory SnowballPorterFilterFactory why is auskunft and profiauskunft in one column. how do they get searched? when i search profiauskunft i have 230 hits, when i now search for profi-auskunft i do get less hits. when i call the search with debugQuery=on i see body:profi (auskunft profiauskunft) what does this query mean? profi and auskunft or profiauskunft? fieldType name=text_de class=solr.TextField positionIncrementGap=100 analyzer type=index charFilter class=solr.HTMLStripCharFilterFactory / tokenizer class=solr.WhitespaceTokenizerFactory/ !-- sg324 bei wortern die durch - und weitere leerzeichen getrennt sind, werden diese zusammengefuehrt. -- filter class=solr.HiphenatedWordsFilterFactory/ !-- in this example, we will only use synonyms at query time filter class=solr.SynonymFilterFactory synonyms=index_synonyms_de.txt ignoreCase=true expand=false/ -- !-- Case insensitive stop word removal. add enablePositionIncrements=true in both the index and query analyzers to leave a 'gap' for more accurate phrase queries. -- filter class=solr.StopFilterFactory ignoreCase=true words=de/stopwords_de.txt enablePositionIncrements=true / !-- sg324 -- filter class=solr.WordDelimiterFilterFactory generateWordParts=1 generateNumberParts=1 catenateWords=1 catenateNumbers=0 catenateAll=0 splitOnCaseChange=1/ filter class=solr.LowerCaseFilterFactory/ filter class=solr.SnowballPorterFilterFactory language=German protected=de/protwords_de.txt/ /analyzer analyzer type=query tokenizer class=solr.WhitespaceTokenizerFactory/ filter class=solr.SynonymFilterFactory synonyms=de/synonyms_de.txt ignoreCase=true expand=true/ filter class=solr.StopFilterFactory ignoreCase=true words=de/stopwords_de.txt enablePositionIncrements=true / filter class=solr.WordDelimiterFilterFactory generateWordParts=1 generateNumberParts=1 catenateWords=1 catenateNumbers=0 catenateAll=0 splitOnCaseChange=1/ filter class=solr.LowerCaseFilterFactory/ filter class=solr.SnowballPorterFilterFactory language=German protected=de/protwords_de.txt/ /analyzer /fieldType
Re: sort by field length
Hi Erick, Erick Erickson wrote: Ah, I may have misunderstood, I somehow got it in my mind you were talking about the length of each term (as in string length). But if you're looking at the field length as the count of terms, that's another question, sorry for the confusion... I have to ask, though, why you want to sort this way? The relevance calculations already factor in both term frequency and field length. What's the use-case for sorting by field length given the above? It's not a real world use-case -- I just want to get a better understanding of the data I'm indexing (therefore, performance is neglectable). In my current use case, you can think of the field length as an indicator of data quality (i.e., the longer the field content, the worse the quality is). Being able to sort the field data in order of decreasing length would allow me to investigate exceptional data items that are not appropriately handled by my curation process. Best, Sascha Best Erick On Tue, May 25, 2010 at 3:40 AM, Sascha Szottsz...@zib.de wrote: Hi Erick, Erick Erickson wrote: Are you sure you want to recompute the length when sorting? It's the classic time/space tradeoff, but I'd suggest that when your index is big enough to make taking up some more space a problem, it's far too big to spend the cycles calculating each term length for sorting purposes considering you may be sorting all the terms in your index worst-case. Good point, thank you for the clarification. I thought that Lucene internally stores the field length (e.g., in order to compute the relevance) and getting this information at query time requires only a simple lookup. -Sascha But you could consider payloads for storing the length, although that would still be redundant... Best Erick On Mon, May 24, 2010 at 8:30 AM, Sascha Szottsz...@zib.de wrote: Hi folks, is it possible to sort by field length without having to (redundantly) save the length information in a seperate index field? At first, I thought to accomplish this using a function query, but I couldn't find an appropriate one. Thanks in advance, Sascha
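If the redundant field turns out to be acceptable after all, a rough sketch of that route (the field name is made up, the token count has to be computed by the indexing client since Solr will not fill it in by itself, and the int type is assumed to be one of the sortable integer types from the example schema):

  <!-- schema.xml: extra sortable integer field holding the number of tokens in title -->
  <field name="title_token_count" type="int" indexed="true" stored="true"/>

A request could then simply use sort=title_token_count desc to bring the longest (and, in this scenario, lowest-quality) field values to the top.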
Re: Faceted search not working?
Hi Birger, Birger Lie wrote: I don't think the boolean fields are mapped to on and off :) You can use true and on interchangeably. -Sascha -birger -Original Message- From: Ilya Sterin [mailto:ster...@gmail.com] Sent: 24. mai 2010 23:11 To: solr-user@lucene.apache.org Subject: Faceted search not working? I'm trying to perform a faceted search without any luck. The result set doesn't return any facet information... http://localhost:8080/solr/select/?q=title:*&facet=on&facet.field=title I'm getting the result set, but no facet information is present. Is there something else that needs to happen to turn faceting on? I'm using the latest Solr 1.4 release. Data is indexed from the database using the dataimporter. Thanks. Ilya Sterin
Re: sort by field length
Hi Erick, Erick Erickson wrote: Are you sure you want to recompute the length when sorting? It's the classic time/space tradeoff, but I'd suggest that when your index is big enough to make taking up some more space a problem, it's far too big to spend the cycles calculating each term length for sorting purposes considering you may be sorting all the terms in your index worst-case. Good point, thank you for the clarification. I thought that Lucene internally stores the field length (e.g., in order to compute the relevance) and getting this information at query time requires only a simple lookup. -Sascha But you could consider payloads for storing the length, although that would still be redundant... Best Erick On Mon, May 24, 2010 at 8:30 AM, Sascha Szottsz...@zib.de wrote: Hi folks, is it possible to sort by field length without having to (redundantly) save the length information in a seperate index field? At first, I thought to accomplish this using a function query, but I couldn't find an appropriate one. Thanks in advance, Sascha
Re: Highlighting is not happening
Hi, to accomplish that, use the highlighting parameters hl.simple.pre and hl.simple.post. By the way, there are a plenty of other parameters that affect highlighting. Take a look at: http://wiki.apache.org/solr/HighlightingParameters -Sascha Doddamani, Prakash wrote: Hey, I thought the Highlights would happen in the field of the documents returned from SOLR J But it gives new list of Highlighting at below, sorry for the confusion I was wondering is there a way that the fields returned itself contains bold characters Eg : if searched for query doc str field name=onereturned response which contains bquery/b should be bold/str /doc Regards Prakash -Original Message- From: Sascha Szott [mailto:sz...@zib.de] Sent: Monday, May 24, 2010 10:55 PM To: solr-user@lucene.apache.org Subject: Re: Highlighting is not happening Hi Prakash, can you provide 1. the definition of the relevant field 2. your query 3. the definition of the relevant request handler 4. a field value that is stored in your index and should be highlighted -Sascha Doddamani, Prakash wrote: Thanks Sascha, The type for fields for which I am searching are all text , and I am using solr.TextField fieldType name=text class=solr.TextField positionIncrementGap=100 analyzer type=index tokenizer class=solr.WhitespaceTokenizerFactory/ !-- in this example, we will only use synonyms at query time filter class=solr.SynonymFilterFactory synonyms=index_synonyms.txt ignoreCase=true expand=false/ -- !-- Case insensitive stop word removal. enablePositionIncrements=true ensures that a 'gap' is left to allow for accurate phrase queries. -- filter class=solr.StopFilterFactory ignoreCase=true words=stopwords.txt enablePositionIncrements=true / filter class=solr.WordDelimiterFilterFactory generateWordParts=1 generateNumberParts=1 catenateWords=1 catenateNumbers=1 catenateAll=0 splitOnCaseChange=1/ filter class=solr.LowerCaseFilterFactory/ filter class=solr.EnglishPorterFilterFactory protected=protwords.txt/ filter class=solr.RemoveDuplicatesTokenFilterFactory/ /analyzer analyzer type=query tokenizer class=solr.WhitespaceTokenizerFactory/ filter class=solr.SynonymFilterFactory synonyms=synonyms.txt ignoreCase=true expand=true/ filter class=solr.StopFilterFactory ignoreCase=true words=stopwords.txt/ filter class=solr.WordDelimiterFilterFactory generateWordParts=1 generateNumberParts=1 catenateWords=0 catenateNumbers=0 catenateAll=0 splitOnCaseChange=1/ filter class=solr.LowerCaseFilterFactory/ filter class=solr.EnglishPorterFilterFactory protected=protwords.txt/ filter class=solr.RemoveDuplicatesTokenFilterFactory/ /analyzer /fieldType Regards Prakash -Original Message- From: Sascha Szott [mailto:sz...@zib.de] Sent: Monday, May 24, 2010 10:29 PM To: solr-user@lucene.apache.org Subject: Re: Highlighting is not happening Hi Prakash, more importantly, check the field type and its associated analyzer. In case you use a non-tokenized type (e.g., string), highlighting will not appear if only a partial field match exists (only exact matches, i.e. the query coincides with the field value, will be highlighted). If that's not your intent, you should at least define an tokenizer for the field type. Best, Sascha Doddamani, Prakash wrote: Hey Daren, Yes the fields for which I am searching are stored and indexed, also they are returned from the query, Also it is not coming, if the entire search keyword is part of the field. 
Thanks Prakash -Original Message- From: dar...@ontrenet.com [mailto:dar...@ontrenet.com] Sent: Monday, May 24, 2010 9:32 PM To: solr-user@lucene.apache.org Subject: Re: Highlighting is not happening Check that the field you are highlighting on is stored. It won't work otherwise. Now, this also means that the field is returned from the query. For large text fields to be highlighted only, this means the entire text is returned for each result. There is a pending feature to address this, that allows you to tell Solr to NOT return a specific field (to avoid unecessary transfer of large text fields in this scenario). Darren Hi I am using dismax request handler, I wanted to highlight the search field, So added str name=hltrue/str I was expecting like if I search for keyword Akon resultant docs wherever the Akon is available is bold. But I am not seeing them getting bold, could some one tell me the real path where I should tune If I pass explicitly the hl=true does not work I have added the request handler requestHandler name=dismax class=solr.SearchHandler lst name=defaults str name=defTypedismax/str str name=echoParamsexplicit/str
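To make the hl.simple.pre / hl.simple.post advice concrete: in the quoted dismax handler it is enough to enable the commented-out lines, for example as below. The <b> markup is only an example, and note that the highlighted snippets are still returned in the separate highlighting section of the response, not inside the stored field values themselves.

  <requestHandler name="dismax" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="hl">true</str>
      <!-- wrap each matched term in <b>...</b> within the returned snippets -->
      <str name="hl.simple.pre"><![CDATA[<b>]]></str>
      <str name="hl.simple.post"><![CDATA[</b>]]></str>
    </lst>
  </requestHandler>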
Re: Faceted search not working?
Hi, please note that the FacetComponent is one of the six search components that are automatically associated with solr.SearchHandler (the same holds for the QueryComponent). Another note: by using name="components" all default components are replaced by the components you explicitly list (i.e., QueryComponent and FacetComponent in your example). To avoid this, use name="last-components" instead. -Sascha Jean-Sebastien Vachon wrote: Is the FacetComponent loaded at all? <requestHandler name="standard" class="solr.SearchHandler" default="true"> <arr name="components"> <str>query</str> <str>facet</str> </arr> </requestHandler> On 2010-05-25, at 3:32 AM, Sascha Szott wrote: Hi Birger, Birger Lie wrote: I don't think the boolean fields are mapped to on and off :) You can use true and on interchangeably. -Sascha -birger -Original Message- From: Ilya Sterin [mailto:ster...@gmail.com] Sent: 24. mai 2010 23:11 To: solr-user@lucene.apache.org Subject: Faceted search not working? I'm trying to perform a faceted search without any luck. The result set doesn't return any facet information... http://localhost:8080/solr/select/?q=title:*&facet=on&facet.field=title I'm getting the result set, but no facet information is present. Is there something else that needs to happen to turn faceting on? I'm using the latest Solr 1.4 release. Data is indexed from the database using the dataimporter. Thanks. Ilya Sterin
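Spelled out, the difference between the two attributes looks like this (the custom component name below is purely hypothetical). Since the FacetComponent already belongs to the default chain, the cleanest fix in the quoted handler is simply to drop the components array; last-components is only needed for genuinely additional components.

  <!-- replaces ALL six default search components; only query and facet survive -->
  <requestHandler name="standard" class="solr.SearchHandler" default="true">
    <arr name="components">
      <str>query</str>
      <str>facet</str>
    </arr>
  </requestHandler>

  <!-- keeps the default chain intact and appends an extra component at the end -->
  <requestHandler name="standard" class="solr.SearchHandler" default="true">
    <arr name="last-components">
      <str>myCustomComponent</str>
    </arr>
  </requestHandler>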
sort by field length
Hi folks, is it possible to sort by field length without having to (redundantly) save the length information in a separate index field? At first, I thought to accomplish this with a function query, but I couldn't find an appropriate one. Thanks in advance, Sascha
Re: Highlighting is not happening
Hi Prakash, more importantly, check the field type and its associated analyzer. In case you use a non-tokenized type (e.g., string), highlighting will not appear if only a partial field match exists (only exact matches, i.e. the query coincides with the field value, will be highlighted). If that's not your intent, you should at least define an tokenizer for the field type. Best, Sascha Doddamani, Prakash wrote: Hey Daren, Yes the fields for which I am searching are stored and indexed, also they are returned from the query, Also it is not coming, if the entire search keyword is part of the field. Thanks Prakash -Original Message- From: dar...@ontrenet.com [mailto:dar...@ontrenet.com] Sent: Monday, May 24, 2010 9:32 PM To: solr-user@lucene.apache.org Subject: Re: Highlighting is not happening Check that the field you are highlighting on is stored. It won't work otherwise. Now, this also means that the field is returned from the query. For large text fields to be highlighted only, this means the entire text is returned for each result. There is a pending feature to address this, that allows you to tell Solr to NOT return a specific field (to avoid unecessary transfer of large text fields in this scenario). Darren Hi I am using dismax request handler, I wanted to highlight the search field, So added str name=hltrue/str I was expecting like if I search for keyword Akon resultant docs wherever the Akon is available is bold. But I am not seeing them getting bold, could some one tell me the real path where I should tune If I pass explicitly the hl=true does not work I have added the request handler requestHandler name=dismax class=solr.SearchHandler lst name=defaults str name=defTypedismax/str str name=echoParamsexplicit/str float name=tie0.01/float str name=qf name^20.0 coming^5 playing^4 keywords^0.1 /str str name=bf rord(isclassic)^0.5 ord(listeners)^0.3 /str str name=*,score name, coming, playing, keywords, score /str str name=mm 2lt;-1 5lt;-2 6lt;90% /str int name=ps100/int str name=q.alt*:*/str !-- example highlighter config, enable per-query with hl=true -- str name=hltrue/str !--str name=hl.simple.preb/str str name=hl.simple.post/b/str -- !-- for this field, we want no fragmenting, just highlighting -- str name=f.name.hl.fragsize0/str !-- instructs Solr to return the field itself if no query terms are found -- !--str name=f.name.hl.alternateFieldname/str -- str name=f.text.hl.fragmenterregex/str !-- defined below -- /lst /requestHandler regards prakash
Re: Highlighting is not happening
Hi Prakash, can you provide 1. the definition of the relevant field 2. your query 3. the definition of the relevant request handler 4. a field value that is stored in your index and should be highlighted -Sascha Doddamani, Prakash wrote: Thanks Sascha, The type for fields for which I am searching are all text , and I am using solr.TextField fieldType name=text class=solr.TextField positionIncrementGap=100 analyzer type=index tokenizer class=solr.WhitespaceTokenizerFactory/ !-- in this example, we will only use synonyms at query time filter class=solr.SynonymFilterFactory synonyms=index_synonyms.txt ignoreCase=true expand=false/ -- !-- Case insensitive stop word removal. enablePositionIncrements=true ensures that a 'gap' is left to allow for accurate phrase queries. -- filter class=solr.StopFilterFactory ignoreCase=true words=stopwords.txt enablePositionIncrements=true / filter class=solr.WordDelimiterFilterFactory generateWordParts=1 generateNumberParts=1 catenateWords=1 catenateNumbers=1 catenateAll=0 splitOnCaseChange=1/ filter class=solr.LowerCaseFilterFactory/ filter class=solr.EnglishPorterFilterFactory protected=protwords.txt/ filter class=solr.RemoveDuplicatesTokenFilterFactory/ /analyzer analyzer type=query tokenizer class=solr.WhitespaceTokenizerFactory/ filter class=solr.SynonymFilterFactory synonyms=synonyms.txt ignoreCase=true expand=true/ filter class=solr.StopFilterFactory ignoreCase=true words=stopwords.txt/ filter class=solr.WordDelimiterFilterFactory generateWordParts=1 generateNumberParts=1 catenateWords=0 catenateNumbers=0 catenateAll=0 splitOnCaseChange=1/ filter class=solr.LowerCaseFilterFactory/ filter class=solr.EnglishPorterFilterFactory protected=protwords.txt/ filter class=solr.RemoveDuplicatesTokenFilterFactory/ /analyzer /fieldType Regards Prakash -Original Message- From: Sascha Szott [mailto:sz...@zib.de] Sent: Monday, May 24, 2010 10:29 PM To: solr-user@lucene.apache.org Subject: Re: Highlighting is not happening Hi Prakash, more importantly, check the field type and its associated analyzer. In case you use a non-tokenized type (e.g., string), highlighting will not appear if only a partial field match exists (only exact matches, i.e. the query coincides with the field value, will be highlighted). If that's not your intent, you should at least define an tokenizer for the field type. Best, Sascha Doddamani, Prakash wrote: Hey Daren, Yes the fields for which I am searching are stored and indexed, also they are returned from the query, Also it is not coming, if the entire search keyword is part of the field. Thanks Prakash -Original Message- From: dar...@ontrenet.com [mailto:dar...@ontrenet.com] Sent: Monday, May 24, 2010 9:32 PM To: solr-user@lucene.apache.org Subject: Re: Highlighting is not happening Check that the field you are highlighting on is stored. It won't work otherwise. Now, this also means that the field is returned from the query. For large text fields to be highlighted only, this means the entire text is returned for each result. There is a pending feature to address this, that allows you to tell Solr to NOT return a specific field (to avoid unecessary transfer of large text fields in this scenario). Darren Hi I am using dismax request handler, I wanted to highlight the search field, So added str name=hltrue/str I was expecting like if I search for keyword Akon resultant docs wherever the Akon is available is bold. 
But I am not seeing them getting bold, could some one tell me the real path where I should tune If I pass explicitly the hl=true does not work I have added the request handler requestHandler name=dismax class=solr.SearchHandler lst name=defaults str name=defTypedismax/str str name=echoParamsexplicit/str float name=tie0.01/float str name=qf name^20.0 coming^5 playing^4 keywords^0.1 /str str name=bf rord(isclassic)^0.5 ord(listeners)^0.3 /str str name=*,score name, coming, playing, keywords, score /str str name=mm 2lt;-1 5lt;-2 6lt;90% /str int name=ps100/int str name=q.alt*:*/str !-- example highlighter config, enable per-query with hl=true -- str name=hltrue/str !--str name=hl.simple.preb/str str name=hl.simple.post/b/str -- !-- for this field, we want no fragmenting, just highlighting -- str name=f.name.hl.fragsize0/str !-- instructs Solr to return the field itself if no query terms are found -- !--str name=f.name.hl.alternateFieldname/str -- str name=f.text.hl.fragmenterregex/str !-- defined below
Re: Faceted search not working?
Hi Ilya, Ilya Sterin wrote: I'm trying to perform a faceted search without any luck. The result set doesn't return any facet information... http://localhost:8080/solr/select/?q=title:*&facet=on&facet.field=title I'm getting the result set, but no facet information is present. Is there something else that needs to happen to turn faceting on? No. What does http://localhost:8080/solr/select/?q=title:*&fl=title&wt=xml return? -Sascha
Wildcard queries
Hi folks, what's the idea behind the fact that no text analysis (e.g. lowercasing) is performed on wildcarded search terms? In my context this behaviour seems counter-intuitive (I guess that's the case in the majority of applications), and my application needs to lowercase every input term before sending the HTTP request to my Solr server. Would it be easy to disable this behaviour in Solr (1.5)? I would like to see a config parameter (per field type) that allows disabling this odd behaviour if needed. To ensure backward compatibility, the odd behaviour would remain the default. Am I missing any drawbacks? Best, Sascha
Re: Wildcard queries
Hi Robert, thanks, you're absolutely right. I should better refine my initial question to: What's the idea behind the fact that no *lowercasing* is performed on wildcarded search terms if the field in question contains a LowercaseFilter in its associated field type definition? -Sascha Robert Muir wrote: we can use stemming as an example: lets say your query is c?ns?st?nt?y how will this match consistently, which the porter stemmer transforms to 'consistent'. furthermore, note that i replaced the vowels with ?'s here. The porter stemmer doesnt just rip stuff off the end, but attempts to guess syllables as part of the process, so it cannot possibly work. the only way it would work in this situation would be if you formed permutations of all the possible words this wildcard would match, and then did analysis on each form, and searched on all stems. but, this is impossible, since the * operator allows an infinite language. On Fri, May 21, 2010 at 10:11 AM, Sascha Szottsz...@zib.de wrote: Hi folks, what's the idea behind the fact that no text analysis (e.g. lowercasing) is performed on wildcarded search terms? In my context this behaviour seems to be counter-intuitive (I guess that's the case in the majority of applications) and my application needs to lowercase any input term before sending the HTTP request to my Solr server. Would it be easy to disable this behaviour in Solr (1.5)? I would like to see a config parameter (per field type) that allows to disable this odd behaviour if needed. To ensure backward compatibility the odd behaviour should be the default anymore. Am I missing any drawbacks? Best, Sascha
Re: Autosuggest
Hi, maybe you would like to have a look at solr.ShingleFilterFactory [1] to expand your autosuggest to more than one term. -Sascha [1] http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ShingleFilterFactory Blargy wrote: Thanks for your help and especially your analyzer.. probably saved me a full-import or two :)
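A minimal sketch of what such an autosuggest field type could look like (the type name and parameter values are just an assumption):

  <fieldType name="text_autosuggest" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <!-- additionally emit word n-grams (shingles) of up to 3 words,
           e.g. "apache solr" and "apache solr server" besides the single terms -->
      <filter class="solr.ShingleFilterFactory" maxShingleSize="3" outputUnigrams="true"/>
    </analyzer>
  </fieldType>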
Re: How to tell which field matched?
Hi, I'm not sure if debugQuery=on is a feasible solution in a productive environment, as generating such extra information requires a reasonable amount of computation. -Sascha Jon Baer wrote: Does the standard debug component (?debugQuery=on) give you what you need? http://wiki.apache.org/solr/SolrRelevancyFAQ#Why_does_id:archangel_come_before_id:hawkgirl_when_querying_for_.22wings.22 - Jon On May 14, 2010, at 4:03 PM, Tim Garton wrote: All, I've searched around for help with something we are trying to do and haven't come across much. We are running solr 1.4. Here is a summary of the issue we are facing: A simplified example of our schema is something like this: field name=id type=string indexed=true stored=true required=true / field name=title type=text indexed=true stored=true required=true / field name=date_posted type=tdate indexed=true stored=true / field name=supplement_title type=text indexed=true stored=true multiValued=true / field name=supplement_pdf_url type=text indexed=true stored=true multiValued=true / field name=supplement_pdf_text type=text indexed=true stored=true multiValued=true / When someone does a search we search across the title, supplement_title, and supplement_pdf_text fields. When we get our results, we would like to be able to tell which field the search matched and if it's a multiValued field, which of the multiple values matched. This is so that we can display results similar to: Example Title Example Supplement Title Example Supplement Title 2 (your search matched this document) Example Supplement Title 3 Example Title 2 Example Supplement Title 4 Example Supplement Title 5 Example Supplement Title 6 (your search matched this document) etc. How would you recommend doing this? Is there some way to get solr to tell us which field matched, including multiValued fields? As a workaround we have been using highlighting to tell which field matched, but it doesn't get us what we want for multiValued fields and there is a significant cost to enabling the highlighting. Should we design our schema in some other fashion to achieve these results? Thanks. -Tim
Re: Solr Schema Question
Hi Serdar, take a look at Solr's DataImportHandler: http://wiki.apache.org/solr/DataImportHandler Best, Sascha Serdar Sahin wrote: Hi, I am rather new to Solr and have a question. We have around 200.000 txt files which are placed into the file cloud. The file path is something similar to this: file/97/8f/840/fa4-1.txt file/a6/9d/ab0/ca2-2.txt etc. and we also store the metadata (like title, description, tags etc) about these files in the mysql server. So, what I want to do is to index title, description, tags and other data from mysql, and also get the txt file from file server, and link them as one record for searching, but I could not figure out how to automatize this process. I can give the path from the sql query like, Select id, title, description, file_path, and then solr can use this path to retrieve txt file, but I don't know whether is it possible or not. What is the best way to index these files with their tag title and description without coding in Java (Perl is ok). These txt files are large, between 100kb-10mb, so the last option is to store them in the database. Thanks, Serdar
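A rough, untested data-config.xml sketch of how the two sources could be combined: a JDBC entity for the metadata and a nested PlainTextEntityProcessor entity that reads the referenced txt file. All table, column and path names below are made up and would have to be adapted.

  <dataConfig>
    <dataSource name="db" driver="com.mysql.jdbc.Driver"
                url="jdbc:mysql://localhost/metadata" user="solr" password="secret"/>
    <dataSource name="fs" type="FileDataSource" encoding="UTF-8"/>
    <document>
      <entity name="doc" dataSource="db"
              query="SELECT id, title, description, tags, file_path FROM documents">
        <field column="id" name="id"/>
        <field column="title" name="title"/>
        <field column="description" name="description"/>
        <field column="tags" name="tags"/>
        <!-- read the full text of the referenced file into a single field -->
        <entity name="content" dataSource="fs" processor="PlainTextEntityProcessor"
                url="/data/filecloud/${doc.file_path}">
          <field column="plainText" name="fulltext"/>
        </entity>
      </entity>
    </document>
  </dataConfig>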
Re: StreamingUpdateSolrServer hangs
Hi Yonik, Yonik Seeley wrote: Stephen, were you running stock Solr 1.4, or did you apply any of the SolrJ patches? I'm trying to figure out if anyone still has any problems, or if this was fixed with SOLR-1711: I'm using the latest trunk version (rev. 934846) and constantly running into the same problem. I'm using StreamingUpdateSolrServer with 3 treads and a queue size of 20 (not really knowing if this configuration is optimal). My multi-threaded application indexes 200k data items (bibliographic metadata in Dublin Core format) and constantly hangs after running for some time. Below you can find the thread dump of one of my index threads (after the app hangs all dumps are the same) thread 19 prio=10 tid=0x7fe8c0415800 nid=0x277d waiting on condition [0x42d05000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for 0x7fe8cdcb7598 (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925) at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:254) at org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer.request(StreamingUpdateSolrServer.java:216) at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105) at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:64) at de.kobv.ked.index.SolrIndexWriter.addIndexDocument(SolrIndexWriter.java:29) at de.kobv.ked.index.SolrIndexWriter.addIndexDocument(SolrIndexWriter.java:10) at de.kobv.ked.index.AbstractIndexThread.addIndexDocument(AbstractIndexThread.java:59) at de.kobv.ked.rss.RssThread.indiziere(RssThread.java:30) at de.kobv.ked.rss.RssThread.run(RssThread.java:58) and of the three SUSS threads: pool-1-thread-3 prio=10 tid=0x7fe8c7b7f000 nid=0x2780 in Object.wait() [0x409ac000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on 0x7fe8cdcb6f10 (a org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$ConnectionPool) at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager.doGetConnection(MultiThreadedHttpConnectionManager.java:518) - locked 0x7fe8cdcb6f10 (a org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$ConnectionPool) at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager.getConnectionWithTimeout(MultiThreadedHttpConnectionManager.java:416) at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:153) at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397) at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323) at org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer$Runner.run(StreamingUpdateSolrServer.java:153) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:619) pool-1-thread-2 prio=10 tid=0x7fe8c7afa000 nid=0x277f in Object.wait() [0x40209000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on 0x7fe8cdcb6f10 (a org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$ConnectionPool) at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager.doGetConnection(MultiThreadedHttpConnectionManager.java:518) - locked 
0x7fe8cdcb6f10 (a org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$ConnectionPool) at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager.getConnectionWithTimeout(MultiThreadedHttpConnectionManager.java:416) at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:153) at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397) at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323) at org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer$Runner.run(StreamingUpdateSolrServer.java:153) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:619) pool-1-thread-1 prio=10 tid=0x7fe8c79f2800 nid=0x277e in Object.wait() [0x42e06000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on 0x7fe8cdcb6f10 (a org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$ConnectionPool) at
Re: StreamingUpdateSolrServer hangs
Hi Yonik, thanks for your fast reply. Yonik Seeley wrote: Thanks for the report Sascha. So after the hang, it never recovers? Some amount of hanging could be visible if there was a commit on the Solr server or something else to cause the solr requests to block for a while... but it should return to normal on it's own... In my case the whole application hangs and never recovers (CPU utilization goes down to near 0%). Interestingly, the problem reproducibly occurs only if SUSS is created with *more than 2* threads. Looking at the stack trace, it looks like threads are blocked waiting to get an http connection. I forgot to mention that my index app has exclusive access to the Solr instance. Therefore, concurrent searches against the same Solr instance while indexing are excluded. I'm traveling all next week, but I'll open a JIRA issue for this now. Thank you! Anything that would help us reproduce this is much appreciated. Are there any other who have experienced the same problem? -Sascha On Fri, Apr 16, 2010 at 8:57 AM, Sascha Szottsz...@zib.de wrote: Hi Yonik, Yonik Seeley wrote: Stephen, were you running stock Solr 1.4, or did you apply any of the SolrJ patches? I'm trying to figure out if anyone still has any problems, or if this was fixed with SOLR-1711: I'm using the latest trunk version (rev. 934846) and constantly running into the same problem. I'm using StreamingUpdateSolrServer with 3 treads and a queue size of 20 (not really knowing if this configuration is optimal). My multi-threaded application indexes 200k data items (bibliographic metadata in Dublin Core format) and constantly hangs after running for some time. Below you can find the thread dump of one of my index threads (after the app hangs all dumps are the same) thread 19 prio=10 tid=0x7fe8c0415800 nid=0x277d waiting on condition [0x42d05000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for0x7fe8cdcb7598 (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925) at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:254) at org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer.request(StreamingUpdateSolrServer.java:216) at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105) at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:64) at de.kobv.ked.index.SolrIndexWriter.addIndexDocument(SolrIndexWriter.java:29) at de.kobv.ked.index.SolrIndexWriter.addIndexDocument(SolrIndexWriter.java:10) at de.kobv.ked.index.AbstractIndexThread.addIndexDocument(AbstractIndexThread.java:59) at de.kobv.ked.rss.RssThread.indiziere(RssThread.java:30) at de.kobv.ked.rss.RssThread.run(RssThread.java:58) and of the three SUSS threads: pool-1-thread-3 prio=10 tid=0x7fe8c7b7f000 nid=0x2780 in Object.wait() [0x409ac000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on0x7fe8cdcb6f10 (a org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$ConnectionPool) at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager.doGetConnection(MultiThreadedHttpConnectionManager.java:518) - locked0x7fe8cdcb6f10 (a org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$ConnectionPool) at 
org.apache.commons.httpclient.MultiThreadedHttpConnectionManager.getConnectionWithTimeout(MultiThreadedHttpConnectionManager.java:416) at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:153) at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397) at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323) at org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer$Runner.run(StreamingUpdateSolrServer.java:153) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:619) pool-1-thread-2 prio=10 tid=0x7fe8c7afa000 nid=0x277f in Object.wait() [0x40209000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on0x7fe8cdcb6f10 (a org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$ConnectionPool) at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager.doGetConnection(MultiThreadedHttpConnectionManager.java:518) - locked0x7fe8cdcb6f10 (a
Re: Deploying Solr 1.3 in JBoss 5
Hi Luca, could you add a note to the Wiki page [1]. Thanks! -Sascha [1] http://wiki.apache.org/solr/SolrJBoss Luca Molteni wrote: Bye the way, I finally solved it. To deploy solr 1.3 in jboss 5, you simply have to remove xercesImpl-2.8.1.jar xml-apis-1.3.03.jar From the WEB-INF/lib folder of solr.war Solr will use the lib provided by jboss 5. Thank you again. L.M. On 3 February 2010 10:38, Luca Moltenivoloth...@gmail.com wrote: Apparently, that worked! I've never realized that the order of the elements in XML is significant, nice to see. As always, problems leads to other problems, so now I'm facing with a Xerces ClassCastException with JDK 6. org.jboss.xb.binding.JBossXBRuntimeException: Failed to create a new SAX parser at org.jboss.xb.binding.UnmarshallerFactory$UnmarshallerFactoryImpl.newUnmarshaller(UnmarshallerFactory.java:100) at org.jboss.web.tomcat.service.deployers.JBossContextConfig.processContextConfig(JBossContextConfig.java:549) at org.jboss.web.tomcat.service.deployers.JBossContextConfig.init(JBossContextConfig.java:536) at org.apache.catalina.startup.ContextConfig.lifecycleEvent(ContextConfig.java:279) at org.apache.catalina.util.LifecycleSupport.fireLifecycleEvent(LifecycleSupport.java:117) at org.apache.catalina.core.StandardContext.init(StandardContext.java:5436) at org.apache.catalina.core.StandardContext.start(StandardContext.java:4148) at org.jboss.web.tomcat.service.deployers.TomcatDeployment.performDeployInternal(TomcatDeployment.java:310) at org.jboss.web.tomcat.service.deployers.TomcatDeployment.performDeploy(TomcatDeployment.java:142) at org.jboss.web.deployers.AbstractWarDeployment.start(AbstractWarDeployment.java:461) at org.jboss.web.deployers.WebModule.startModule(WebModule.java:118) at org.jboss.web.deployers.WebModule.start(WebModule.java:97) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.jboss.mx.interceptor.ReflectedDispatcher.invoke(ReflectedDispatcher.java:157) at org.jboss.mx.server.Invocation.dispatch(Invocation.java:96) at org.jboss.mx.server.Invocation.invoke(Invocation.java:88) at org.jboss.mx.server.AbstractMBeanInvoker.invoke(AbstractMBeanInvoker.java:264) at org.jboss.mx.server.MBeanServerImpl.invoke(MBeanServerImpl.java:668) at org.jboss.system.microcontainer.ServiceProxy.invoke(ServiceProxy.java:206) at $Proxy38.start(Unknown Source) at org.jboss.system.microcontainer.StartStopLifecycleAction.installAction(StartStopLifecycleAction.java:42) at org.jboss.system.microcontainer.StartStopLifecycleAction.installAction(StartStopLifecycleAction.java:37) at org.jboss.dependency.plugins.action.SimpleControllerContextAction.simpleInstallAction(SimpleControllerContextAction.java:62) at org.jboss.dependency.plugins.action.AccessControllerContextAction.install(AccessControllerContextAction.java:71) at org.jboss.dependency.plugins.AbstractControllerContextActions.install(AbstractControllerContextActions.java:51) at org.jboss.dependency.plugins.AbstractControllerContext.install(AbstractControllerContext.java:348) at org.jboss.system.microcontainer.ServiceControllerContext.install(ServiceControllerContext.java:297) at org.jboss.dependency.plugins.AbstractController.install(AbstractController.java:1633) at org.jboss.dependency.plugins.AbstractController.incrementState(AbstractController.java:935) at 
org.jboss.dependency.plugins.AbstractController.resolveContexts(AbstractController.java:1083) at org.jboss.dependency.plugins.AbstractController.resolveContexts(AbstractController.java:985) at org.jboss.dependency.plugins.AbstractController.change(AbstractController.java:823) at org.jboss.dependency.plugins.AbstractController.change(AbstractController.java:553) at org.jboss.system.ServiceController.doChange(ServiceController.java:688) at org.jboss.system.ServiceController.start(ServiceController.java:460) at org.jboss.system.deployers.ServiceDeployer.start(ServiceDeployer.java:163) at org.jboss.system.deployers.ServiceDeployer.deploy(ServiceDeployer.java:99) at org.jboss.system.deployers.ServiceDeployer.deploy(ServiceDeployer.java:46) at org.jboss.deployers.spi.deployer.helpers.AbstractSimpleRealDeployer.internalDeploy(AbstractSimpleRealDeployer.java:62) at org.jboss.deployers.spi.deployer.helpers.AbstractRealDeployer.deploy(AbstractRealDeployer.java:50) at org.jboss.deployers.plugins.deployers.DeployerWrapper.deploy(DeployerWrapper.java:171)
Re: (default) maximum chars per field
markus.rietz...@rzf.fin-nrw.de wrote: ok, i was looking for all kinds of max parameters but somehow didn't see maxFieldLength. this is a global parameter, right? can it be defined on a per-field basis? It's a global parameter counting the maximum number of tokens(!) - not the number of characters or bytes - per field. If a field's content exceeds that number, the remaining tokens are truncated without any notice. -Sascha global would be enough at the moment. thank you -Original Message- From: Shalin Shekhar Mangar [mailto:shalinman...@gmail.com] Sent: Friday, 5 February 2010 11:35 To: solr-user@lucene.apache.org Subject: Re: (default) maximum chars per field On Fri, Feb 5, 2010 at 3:56 PM, markus.rietz...@rzf.fin-nrw.de wrote: hi, what is the default maximum char size per field? i found a maxChars parameter for copyField but i don't think that this is what i am looking for. we have indexed some documents via tika/solrcell. only the beginning of these documents can be searched. where can i define the maximum size of a document/field that will be indexed? at the moment we do the updates via xml upload. is there a max size for this xml? in solrconfig.xml i have found multipartUploadLimitInKB=2048000, which means 2 GB would be the maximum size to post. that would be enough... Increase maxFieldLength in your solrconfig.xml. The default is 10KB. -- Regards, Shalin Shekhar Mangar.
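For reference, the setting sits in the indexDefaults section of solrconfig.xml; raising it to Integer.MAX_VALUE effectively disables the truncation:

  <indexDefaults>
    <!-- maximum number of tokens indexed per field; tokens beyond this limit are silently dropped -->
    <maxFieldLength>2147483647</maxFieldLength>
  </indexDefaults>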
Re: java.lang.NullPointerException with MySQL DataImportHandler
Hi, can you post * the output of MySQL's describe command for all tables/views referenced in your DIH configuration * the DIH configuration file (i.e., data-config.xml) * the schema definition (i.e., schema.xml) -Sascha Jean-Michel Philippon-Nadeau wrote: Hi, It is my first install of Solr. The setup has been pretty straightforward and yet, the performance is very impressive. I am running into an issue with my MySQL DataImportHandler. I've followed the quick-start in order to write the necessary config and so far everything seemed to work. However, I am missing some fields in my index. I've switched all fields to stored=true temporarily in my schema to troubleshoot the issue. I only have 3 fields listed in search results while I should have 8. Could this be caused by ampersands or illegal entities in my database? How can I see if DIH is importing correctly all my rows into the index? Follows is the warning I have in my catalina.log. Thank you very much, Jean-Michel === Feb 2, 2010 12:21:07 AM org.apache.solr.handler.dataimport.SolrWriter upload WARNING: Error creating document : SolrInputDocument[{keywords=keywords(1.0)={Dolce}, name=name(1.0)={Dolce amp; Gabbana Damp;G Neckties designer Tie for men 543}, productID=productID(1.0)={220213}}] java.lang.NullPointerException at org.apache.lucene.util.StringHelper.intern(StringHelper.java:36) at org.apache.lucene.document.Field.init(Field.java:341) at org.apache.lucene.document.Field.init(Field.java:305) at org.apache.solr.schema.FieldType.createField(FieldType.java:210) at org.apache.solr.schema.SchemaField.createField(SchemaField.java:94) at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:246) at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:60) at org.apache.solr.handler.dataimport.SolrWriter.upload(SolrWriter.java:75) at org.apache.solr.handler.dataimport.DataImportHandler $1.upload(DataImportHandler.java:292) at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:392) at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:242) at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:180) at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:331) at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:389) at org.apache.solr.handler.dataimport.DataImporter $1.run(DataImporter.java:370)
Re: Deploying Solr 1.3 in JBoss 5
Hi, I'm not sure if that's a Solr issue. However, what happens if you set env-entry-value to C:/mypath/solr instead of ${solr.home.myhome}? -Sascha Am 02.02.2010 15:20, schrieb Luca Molteni: Hello list, I'm having some problem deploying solr to JBoss 5. The problem is with environment variables: Following this page of the wiki: http://wiki.apache.org/solr/SolrJBoss I've added to the web.xml of WEB-INF of solr env-entry env-entry-namesolr/home/env-entry-name env-entry-typejava.lang.String/env-entry-type env-entry-value${solr.home.myhome}/env-entry-value /env-entry Since I'm using lots of instances of solr in the same container. This variable should be expanded by jboss itself in a path using properties-services.xml: attribute name=Properties solr.home.myhome=C:/mypath/solr /attribute Unfortunately, during deployment of the solr application, it gives me this error: Caused by: org.jboss.xb.binding.JBossXBException: Failed to parse source: The content of element type env-entry must match (description?,env-entry-name,env-entry-value?,env-entry-type). @ vfsfile:/C:/pathtojboss/server/solrrepo/deploy/Solrrepo/solr-mysolr.war/WEB-INF/web.xml[146,14] at org.jboss.xb.binding.parser.sax.SaxJBossXBParser.parse(SaxJBossXBParser.java:203) ... 33 more Caused by: org.xml.sax.SAXException: The content of element type env-entry must match (description?,env-entry-name,env-entry-value?,env-entry-type). @ vfsfile:/C:/pathtojboss/server/solrrepo/deploy/Solrrepo/solr-mysolr.war/WEB-INF/web.xml[146,14] at org.jboss.xb.binding.parser.sax.SaxJBossXBParser$MetaDataErrorHandler.error(SaxJBossXBParser.java:426) Notice that the same .war and properties-services.xml works flawlessly in JBoss 4.2.3 Any ideas? Thank you very much. L.M. -- Sascha Szott Kooperativer Bibliotheksverbund Berlin-Brandenburg (KOBV) c/o Konrad-Zuse-Zentrum fuer Informationstechnik Berlin (ZIB) Takustr. 7, D-14195 Berlin Zimmer 4357 Telefon: (030) 841 85 - 457 Telefax: (030) 841 85 - 269 E-Mail: sz...@zib.de WWW: http://www.kobv.de
Re: java.lang.NullPointerException with MySQL DataImportHandler
Hi, since some of the fields used in your DIH configuration aren't mandatory (e.g., keywords and tags are defined as nullable in your db table schema), add a default value to all optional fields in your schema configuration (e.g., default = ). Note, that Solr does not understand the db-related concept of null values. Solr's log output SolrInputDocument[{keywords=keywords(1.0)={Dolce}, name=name(1.0)={Dolce amp; Gabbana Damp;G Neckties designer Tie for men 543}, productID=productID(1.0)={220213}}] indicates that there aren't any tags or descriptions stored for the item with productId 220213. Since no default value is specified, Solr raises an error when creating the index document. -Sascha Jean-Michel Philippon-Nadeau wrote: Hi, Thanks for the reply. On Tue, 2010-02-02 at 16:57 +0100, Sascha Szott wrote: * the output of MySQL's describe command for all tables/views referenced in your DIH configuration mysql describe products; ++--+--+-+-++ | Field | Type | Null | Key | Default | Extra | ++--+--+-+-++ | productID | int(10) unsigned | NO | PRI | NULL| auto_increment | | skuCode| varchar(320) | YES | MUL | NULL| | | upcCode| varchar(320) | YES | MUL | NULL| | | name | varchar(320) | NO | | NULL| | | description| text | NO | | NULL| | | keywords | text | YES | | NULL| | | disqusThreadID | varchar(50) | NO | | NULL| | | tags | text | YES | | NULL| | | createdOn | int(10) unsigned | NO | | NULL| | | lastUpdated| int(10) unsigned | NO | | NULL| | | imageURL | varchar(320) | YES | | NULL| | | inStock| tinyint(1) | YES | MUL | 1 | | | active | tinyint(1) | YES | | 1 | | ++--+--+-+-++ 13 rows in set (0.00 sec) mysql describe product_soldby_vendor; +-+--+--+-+-+---+ | Field | Type | Null | Key | Default | Extra | +-+--+--+-+-+---+ | productID | int(10) unsigned | NO | MUL | NULL| | | productVendorID | int(10) unsigned | NO | MUL | NULL| | | price | double | NO | | NULL| | | currency| varchar(5) | NO | | NULL| | | buyURL | varchar(320) | NO | | NULL| | +-+--+--+-+-+---+ 5 rows in set (0.00 sec) mysql describe products_vendors_subcategories; ++--+--+-+-++ | Field | Type | Null | Key | Default | Extra | ++--+--+-+-++ | productVendorSubcategoryID | int(10) unsigned | NO | PRI | NULL| auto_increment | | productVendorCategoryID| int(10) unsigned | NO | | NULL| | | labelEnglish | varchar(320) | NO | | NULL| | | labelFrench| varchar(320) | NO | | NULL| | ++--+--+-+-++ 4 rows in set (0.00 sec) mysql describe products_vendors_categories; +-+--+--+-+-++ | Field | Type | Null | Key | Default | Extra | +-+--+--+-+-++ | productVendorCategoryID | int(10) unsigned | NO | PRI | NULL| auto_increment | | labelEnglish| varchar(320) | NO | | NULL| | | labelFrench | varchar(320) | NO | | NULL| | +-+--+--+-+-++ 3 rows in set (0.00 sec) mysql describe product_vendor_in_subcategory; +---+--+--+-+-+---+ | Field | Type | Null | Key | Default | Extra | +---+--+--+-+-+---+ | productVendorID | int(10) unsigned | NO | MUL | NULL| | | productCategoryID | int(10) unsigned | NO | MUL | NULL| | +---+--+--+-+-+---+ 2 rows in set (0.00 sec) mysql describe products_vendors_countries; ++--+--+-+-++ | Field | Type | Null | Key | Default | Extra
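Concretely, the keywords and tags columns mentioned above would get an empty string as default in schema.xml, for example as below (the field type is just a guess):

  <!-- empty default so that the index document can be created even if the DB value is NULL -->
  <field name="keywords" type="text" indexed="true" stored="true" default=""/>
  <field name="tags" type="text" indexed="true" stored="true" default=""/>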
Re: Deploying Solr 1.3 in JBoss 5
Luca Molteni wrote: Actually, if I hard-code the value, it gives me the same error... interesting. According to the error message: The content of element type env-entry must match (description?,env-entry-name,env-entry-value?,env-entry-type) Maybe it helps to change the order of elements within env-entry (env-entry-value before env-entry-type)? -Sascha On 2 February 2010 17:14, Sascha Szottsz...@zib.de wrote: Hi, I'm not sure if that's a Solr issue. However, what happens if you set env-entry-value to C:/mypath/solr instead of ${solr.home.myhome}? -Sascha Am 02.02.2010 15:20, schrieb Luca Molteni: Hello list, I'm having some problem deploying solr to JBoss 5. The problem is with environment variables: Following this page of the wiki: http://wiki.apache.org/solr/SolrJBoss I've added to the web.xml of WEB-INF of solr env-entry env-entry-namesolr/home/env-entry-name env-entry-typejava.lang.String/env-entry-type env-entry-value${solr.home.myhome}/env-entry-value /env-entry Since I'm using lots of instances of solr in the same container. This variable should be expanded by jboss itself in a path using properties-services.xml: attribute name=Properties solr.home.myhome=C:/mypath/solr /attribute Unfortunately, during deployment of the solr application, it gives me this error: Caused by: org.jboss.xb.binding.JBossXBException: Failed to parse source: The content of element type env-entry must match (description?,env-entry-name,env-entry-value?,env-entry-type). @ vfsfile:/C:/pathtojboss/server/solrrepo/deploy/Solrrepo/solr-mysolr.war/WEB-INF/web.xml[146,14] at org.jboss.xb.binding.parser.sax.SaxJBossXBParser.parse(SaxJBossXBParser.java:203) ... 33 more Caused by: org.xml.sax.SAXException: The content of element type env-entry must match (description?,env-entry-name,env-entry-value?,env-entry-type). @ vfsfile:/C:/pathtojboss/server/solrrepo/deploy/Solrrepo/solr-mysolr.war/WEB-INF/web.xml[146,14] at org.jboss.xb.binding.parser.sax.SaxJBossXBParser$MetaDataErrorHandler.error(SaxJBossXBParser.java:426) Notice that the same .war and properties-services.xml works flawlessly in JBoss 4.2.3 Any ideas? Thank you very much. L.M.
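Written out, the element order that the quoted error message demands (env-entry-value before env-entry-type) would be:

  <env-entry>
    <env-entry-name>solr/home</env-entry-name>
    <!-- value must precede type according to the DTD cited in the error -->
    <env-entry-value>${solr.home.myhome}</env-entry-value>
    <env-entry-type>java.lang.String</env-entry-type>
  </env-entry>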
Re: How to display Highlight with VelocityResponseWriter?
Hi Qiuyan, Thanks a lot. It works now. When i added the line #set($hl = $response.highlighting) i got the highlighting. But i wonder if there's any document that describes the usage of that. I mean i didn't know the name of those methods. Actually i just managed to guess it. Solritas (aka VelocityResponseWriter) binds a number of objects into a so called VelocityContext (consult [1] for a complete list). You can think of a map that allows you to access objects by symbolic names, e.g., an instance of QueryResponse is stored under response (that's why you write $response in your template). Since $response is an instance of QueryResponse you can call all methods on it the API [2] provides. Furthermore, Velocity incorporates a JavaBean-like introspection mechanism that lets you write $response.highlighting instead of $response.getHighlighting() (only a bit of syntactic sugar). -Sascha [1] http://wiki.apache.org/solr/VelocityResponseWriter#line-93 [2] http://lucene.apache.org/solr/api/org/apache/solr/client/solrj/response/QueryResponse.html Quoting Sascha Szott sz...@zib.de: Qiuyan, with highlight can also be displayed in the web gui. I've added bool name=hltrue/bool into the standard responseHandler and it already works, i.e without velocity. But the same line doesn't take effect in itas. Should i configure anything else? Thanks in advance. First of all, just a few notes on the /itas request handler in your solrconfig.xml: 1. The entry arr name=components strhighlight/str /arr is obsolete, since the highlighting component is a default search component [1]. 2. Note that since you didn't specify a value for hl.fl highlighting will only affect the fields listed inside of qf. 3. Why did you override the default value of hl.fragmenter? In most cases the default fragmenting algorithm (gap) works fine - and maybe in yours as well? To make sure all your hl related settings are correct, can you post an xml output (change the wt parameter to xml) for a search with highlighted results. And finally, can you post the vtl code snippet that should produce the highlighted output. -Sascha [1] http://wiki.apache.org/solr/SearchComponent
Re: How to display Highlight with VelocityResponseWriter?
Qiuyan, with highlight can also be displayed in the web gui. I've added bool name=hltrue/bool into the standard responseHandler and it already works, i.e without velocity. But the same line doesn't take effect in itas. Should i configure anything else? Thanks in advance. First of all, just a few notes on the /itas request handler in your solrconfig.xml: 1. The entry arr name=components strhighlight/str /arr is obsolete, since the highlighting component is a default search component [1]. 2. Note that since you didn't specify a value for hl.fl highlighting will only affect the fields listed inside of qf. 3. Why did you override the default value of hl.fragmenter? In most cases the default fragmenting algorithm (gap) works fine - and maybe in yours as well? To make sure all your hl related settings are correct, can you post an xml output (change the wt parameter to xml) for a search with highlighted results. And finally, can you post the vtl code snippet that should produce the highlighted output. -Sascha [1] http://wiki.apache.org/solr/SearchComponent
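Putting the three notes together, a trimmed /itas handler could look like the sketch below. The qf and hl.fl field names are placeholders, and the Velocity-specific defaults are only sketched from memory, so treat them as an assumption rather than the exact configuration.

  <requestHandler name="/itas" class="solr.SearchHandler">
    <!-- no <arr name="components"> needed: highlighting is part of the default component chain -->
    <lst name="defaults">
      <str name="defType">dismax</str>
      <str name="qf">title^10 text</str>
      <str name="wt">velocity</str>
      <str name="v.template">browse</str>
      <str name="hl">true</str>
      <!-- without hl.fl, only the fields listed in qf are highlighted -->
      <str name="hl.fl">title text</str>
    </lst>
  </requestHandler>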
Re: solrJ and spell check queries
Hi, Jay Fisher wrote: I'm trying to find a way to formulate the following query in solrJ. This is the only way I can get the desired result, but I can't figure out how to get solrJ to generate the same query string. It always generates a URL that starts with select and I need it to start with spell. If there is an alternative URL string that will work, please let me know. http://solr-server/spell/?indent=on&q=shert&wt=json&spellcheck=true&spellcheck.collate=true In case you hook the SpellCheckComponent directly into the standard request handler, i.e., /select, http://solr-server/select?indent=on&q=shert&wt=json&spellcheck=true&spellcheck.collate=true should work. -Sascha
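What "hooking the SpellCheckComponent into /select" looks like in solrconfig.xml, roughly (the spellcheck field and index directory follow the stock example config and may differ in your setup); with this in place, the /select URL above with spellcheck=true works and SolrJ needs no special request path:

  <searchComponent name="spellcheck" class="solr.SpellCheckComponent">
    <lst name="spellchecker">
      <str name="name">default</str>
      <str name="field">spell</str>
      <str name="spellcheckIndexDir">./spellchecker</str>
    </lst>
  </searchComponent>

  <requestHandler name="standard" class="solr.SearchHandler" default="true">
    <!-- append the spellcheck component to the default chain of /select -->
    <arr name="last-components">
      <str>spellcheck</str>
    </arr>
  </requestHandler>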
Re: how to do a Parent/Child Mapping using entities
Hi, Thanks Sascha for your post, but i find it interresting, but in my case i don't want to use an additionnal field, i want to be able with the same schema to do a simple query like : q=res_url:some url, and a query like the other one; You could easily write your own query parser (QParserPlugin, in Solr's terminology) that internally translates queries like q = res_url:url AND res_rank:rank into q = res_ranked_url:rank url thus hiding the res_ranked_url field from the user/client. I'm not sure, but maybe it's possible to utilize the order of values within the multi-valued field res_url directly in the newly created parser. This seems like the cleanest solution to me. -Sascha in other word; is there any solution to make two or more multivalued fields in the same document linked with each other, e.g: in this result: -result name=response numFound=1 start=0 -doc str name=id1/str str name=keywordKey1/str -arr name=res_url strurl1/str strurl2/str strurl3/str strurl4/str /arr -arr name=res_rank str1/str str2/str str3/str str4/str /arr /doc /result i would like to make solr understand that for this document, value:url1 of res_url field is linked to value:1 of res_rank field, and all of them are linked to the commen field keyword. I think that i should use a custom field analyser or some thing like that; but i don't know what to do. but thanks for all; and any supplied help will be lovable. Sascha Szott wrote: Hi, you could create an additional index field res_ranked_url that contains the concatenated value of an url and its corresponding rank, e.g., res_rank + + res_url Then, q=res_ranked_url:1 url1 retrieves all documents with url1 as the first url. A drawback of this workaround is that you have to use a phrase query thus preventing wildcard searches for urls. -Sascha Hello everybody, i would like to know how to create index supporting a parent/child mapping and then querying the child to get the results. in other words; imagine that we have a database containing 2 tables:Keyword[id(int), value(string)] and Result[id(int), res_url(text), res_text(tex), res_date(date), res_rank(int)] For indexing, i used the DataImportHandler to import data and it works well, and my query response seems good:(q=*:*) (imagine that we have only this to keywords and their results) ?xml version=1.0 encoding=UTF-8 ? -response -lst name=responseHeader int name=status0/int int name=QTime0/int -lst name=params str name=q*:*/str /lst /lst -result name=response numFound=2 start=0 -doc str name=id1/str str name=keywordKey1/str -arr name=res_url strurl1/str strurl2/str strurl3/str strurl4/str /arr -arr name=res_rank str1/str str2/str str3/str str4/str /arr /doc -doc str name=id2/str str name=keywordKey2/str -arr name=res_url strurl1/str strurl5/str strurl8/str strurl7/str /arr -arr name=res_rank str1/str str2/str str3/str str4/str /arr /doc /result /response but the problem is when i tape a query kind of this:q=res_url:url2 AND res_rank:1 and this to say that i want to search for the keywords in which the url (url2) is ranked at the first position, i have a result like this: ?xml version=1.0 encoding=UTF-8 ? 
-response -lst name=responseHeader int name=status0/int int name=QTime0/int -lst name=params str name=qres_url:url2 AND res_rank:1/str /lst /lst -result name=response numFound=1 start=0 -doc str name=id1/str str name=keywordKey1/str -arr name=res_url strurl1/str strurl2/str strurl3/str strurl4/str /arr -arr name=res_rank str1/str str2/str str3/str str4/str /arr /doc /result /response But this is not true; because the url present in the 1st position in the results of the keyword key1 is url1 and not url2. So what i want to say is : is there any solution to make the values of the multivalued fields linked; so in our case we can see that the previous result say that: - url1 is present in 1st position of key1 results - url2 is present in 2nd position of key1 results - url3 is present in 3rd position of key1 results - url4 is present in 4th position of key1 results and i would like that solr consider this when executing queries. Any helps please; and thanks for all :)
Re: how to do a Parent/Child Mapping using entities
Hi, you could create an additional index field res_ranked_url that contains the concatenated value of an url and its corresponding rank, e.g., res_rank + + res_url Then, q=res_ranked_url:1 url1 retrieves all documents with url1 as the first url. A drawback of this workaround is that you have to use a phrase query thus preventing wildcard searches for urls. -Sascha Hello everybody, i would like to know how to create index supporting a parent/child mapping and then querying the child to get the results. in other words; imagine that we have a database containing 2 tables:Keyword[id(int), value(string)] and Result[id(int), res_url(text), res_text(tex), res_date(date), res_rank(int)] For indexing, i used the DataImportHandler to import data and it works well, and my query response seems good:(q=*:*) (imagine that we have only this to keywords and their results) ?xml version=1.0 encoding=UTF-8 ? -response -lst name=responseHeader int name=status0/int int name=QTime0/int -lst name=params str name=q*:*/str /lst /lst -result name=response numFound=2 start=0 -doc str name=id1/str str name=keywordKey1/str -arr name=res_url strurl1/str strurl2/str strurl3/str strurl4/str /arr -arr name=res_rank str1/str str2/str str3/str str4/str /arr /doc -doc str name=id2/str str name=keywordKey2/str -arr name=res_url strurl1/str strurl5/str strurl8/str strurl7/str /arr -arr name=res_rank str1/str str2/str str3/str str4/str /arr /doc /result /response but the problem is when i tape a query kind of this:q=res_url:url2 AND res_rank:1 and this to say that i want to search for the keywords in which the url (url2) is ranked at the first position, i have a result like this: ?xml version=1.0 encoding=UTF-8 ? -response -lst name=responseHeader int name=status0/int int name=QTime0/int -lst name=params str name=qres_url:url2 AND res_rank:1/str /lst /lst -result name=response numFound=1 start=0 -doc str name=id1/str str name=keywordKey1/str -arr name=res_url strurl1/str strurl2/str strurl3/str strurl4/str /arr -arr name=res_rank str1/str str2/str str3/str str4/str /arr /doc /result /response But this is not true; because the url present in the 1st position in the results of the keyword key1 is url1 and not url2. So what i want to say is : is there any solution to make the values of the multivalued fields linked; so in our case we can see that the previous result say that: - url1 is present in 1st position of key1 results - url2 is present in 2nd position of key1 results - url3 is present in 3rd position of key1 results - url4 is present in 4th position of key1 results and i would like that solr consider this when executing queries. Any helps please; and thanks for all :)
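To illustrate the workaround with the combined field: if the documents were pushed with SolrJ instead of the DataImportHandler used in this thread, the rank/url pairing could be denormalized at index time roughly like this (a sketch only; the client URL is a placeholder and the field names are taken from the example above):

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class RankedUrlIndexer {
    public static void main(String[] args) throws Exception {
        try (SolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build()) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "1");
            doc.addField("keyword", "Key1");
            String[] urls = {"url1", "url2", "url3", "url4"};
            for (int rank = 1; rank <= urls.length; rank++) {
                doc.addField("res_url", urls[rank - 1]);
                doc.addField("res_rank", rank);
                // combined field that ties each url to its rank, e.g. "1 url1";
                // a phrase query q=res_ranked_url:"1 url1" then matches only
                // documents where url1 is ranked first
                doc.addField("res_ranked_url", rank + " " + urls[rank - 1]);
            }
            client.add(doc);
            client.commit();
        }
    }
}

With DIH the same concatenation could presumably be done directly in the SQL select or in a transformer.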
Re: Optimize not having any effect on my index
Hi Aleksander, Aleksander Stensby wrote: So I tried with curl: curl http://server:8983/solr/update --data-binary '<optimize/>' -H 'Content-type:text/xml; charset=utf-8' No difference here either... Am I doing anything wrong? Do I need to issue a commit after the optimize? Did you restart the Solr server instance after the optimize operation was completed? BTW: You could initiate the optimization operation by POSTing optimize=true directly, i.e., curl http://server:8983/solr/update --form-string optimize=true -Sascha
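For completeness, the same optimize can also be triggered from SolrJ. A minimal sketch, assuming a reasonably current SolrJ client (around Solr 1.4 the class was CommonsHttpSolrServer, but the optimize() call looks the same); the URL and core name are placeholders:

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class OptimizeExample {
    public static void main(String[] args) throws Exception {
        try (SolrClient client = new HttpSolrClient.Builder("http://server:8983/solr/mycore").build()) {
            // merges the index down to a single segment and commits the change
            client.optimize();
        }
    }
}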
Re: Exception from Spellchecker
Hi Rafael, Rafael Pappert wrote: I'm trying to enable the spellchecker in my Solr 1.4.0 (running with Tomcat 6 on Debian), but I always get the following exception when I try to open http://localhost:8080/spell?: The spellcheck=true pair is missing in your request. Try http://localhost:8080/spell?q=...&spellcheck=true -Sascha
RE: search on tomcat server
Hi Jill, just to make sure your index contains at least one document, what is the output of http://localhost:8080/solr/select?q=*:*&debugQuery=true&echoParams=all Best, Sascha
Jill Han wrote: In fact, I just followed the instructions titled "Tomcat On Windows". Here are the updates on my computer:
1. -Dsolr.solr.home=C:\solr\example
2. changed dataDir to <dataDir>C:\solr\example\data</dataDir> in solrconfig.xml at C:\solr\example\conf
3. created solr.xml at C:\Tomcat 5.5\conf\Catalina\localhost:
<?xml version="1.0" encoding="utf-8"?>
<Context docBase="c:/solr/example/apache-solr-1.3.0.war" debug="0" crossContext="true">
  <Environment name="solr/home" type="java.lang.String" value="c:/solr/example" override="true"/>
</Context>
I restarted Tomcat, went to http://localhost:8080/solr/admin/, entered "video" in the Query String field, and got:
<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">0</int>
    <lst name="params">
      <str name="rows">10</str>
      <str name="start">0</str>
      <str name="indent">on</str>
      <str name="q">video</str>
      <str name="version">2.2</str>
    </lst>
  </lst>
  <result name="response" numFound="0" start="0"/>
</response>
My questions are: 1. Is the setting correct? 2. Where does Solr start to search for words entered in the Query String field? 3. How can I make the result page look like a general search result page, i.e., "not found", or, if found, a URL, instead of returning XML? Thanks a lot for your help, Jill
-----Original Message----- From: William Pierce [mailto:evalsi...@hotmail.com] Sent: Friday, December 04, 2009 12:56 PM To: solr-user@lucene.apache.org Subject: Re: search on tomcat server Have you gone through the Solr Tomcat wiki? http://wiki.apache.org/solr/SolrTomcat I found this very helpful when I did our Solr installation on Tomcat. - Bill
-- From: Jill Han jill@alverno.edu Sent: Friday, December 04, 2009 8:54 AM To: solr-user@lucene.apache.org Subject: RE: search on tomcat server I went through all the links on http://wiki.apache.org/solr/#Search_and_Indexing and still have no clue how to proceed. 1. Do I have to do some implementation in order to get Solr to search documents on the Tomcat server? 2. If I have files, such as .doc, .docx, .pdf, .jsp, .html, etc. under Windows XP in c:/tomcat/webapps/test1, /webapps/test2, what should I do to make Solr search those directories? 3. Since I am using Tomcat instead of Jetty, is there any demo that shows the Solr search features and real search results? Thanks, Jill
-----Original Message----- From: Shalin Shekhar Mangar [mailto:shalinman...@gmail.com] Sent: Monday, November 30, 2009 10:40 AM To: solr-user@lucene.apache.org Subject: Re: search on tomcat server On Mon, Nov 30, 2009 at 9:55 PM, Jill Han jill@alverno.edu wrote: I got Solr running on the Tomcat server, http://localhost:8080/solr/admin/ After I enter a search word, such as "solr", and hit the Search button, it goes to http://localhost:8080/solr/select/?q=solr&version=2.2&start=0&rows=10&indent=on and displays:
<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">0</int>
    <lst name="params">
      <str name="rows">10</str>
      <str name="start">0</str>
      <str name="indent">on</str>
      <str name="q">solr</str>
      <str name="version">2.2</str>
    </lst>
  </lst>
  <result name="response" numFound="0" start="0"/>
</response>
My question is, what is the next step to search files on the Tomcat server? Looks like you have not added any documents to Solr. See the Indexing Documents section at http://wiki.apache.org/solr/#Search_and_Indexing -- Regards, Shalin Shekhar Mangar.
How to instruct MoreLikeThisHandler to sort results
Hi Folks, is there any way to instruct the MoreLikeThisHandler to sort its results? I noticed that the MLT handler recognizes faceting parameters, among others, but it ignores the sort parameter. Best, Sascha
Re: Hierarchical xml
Pooja, have a look at Solr's DataImportHandler. The XPathEntityProcessor [1] should suit your needs. Best, Sascha [1] http://wiki.apache.org/solr/DataImportHandler#XPathEntityProcessor Pooja Verlani wrote: Hi, I want to index an XML document like the following:
<officer>
  <name>John</name>
  <dob>1979-29-17T28:14:48Z</dob>
  <collegeGroup>
    <college>
      <name>ABC College</name>
      <year>1998</year>
    </college>
    <college>
      <name>PQRS College</name>
      <year>2001</year>
    </college>
    <college>
      <name>XYZ College</name>
      <year>2003</year>
    </college>
  </collegeGroup>
</officer>
I am not able to judge what the schema should look like. Also, if I flatten such an XML document and make college name and year multi-valued, like this:
<college_name>ABC College, PQRS College, XYZ College</college_name>
<college_year>1998,2001,2003</college_year>
then I can't make a correspondence between ABC College and the year 1998. In case someone has an efficient way out, do share. Thanks in anticipation. Regards, Pooja
Re: Indexing file content with custom field
Piero, it sounds like you're looking for an integration of Solr Cell and Solr's DIH facility -- a feature that isn't implemented yet (but the issue is already addressed in SOLR-1358). As a workaround, you could store the extracted contents in plain text files (either by using Solr Cell or Apache Tika directly, which is under the hood of Solr Cell). Afterwards, you could use DIH's XPathEntityProcessor (to read the metadata in your XML files) in conjunction with DIH's PlainTextEntityProcessor (to read the previously created text files). Another workaround would be to pass the metadata as literal parameters along with the /update/extract request, as described in [1]. This would require you to write a small program that constructs and sends the appropriate POST requests by parsing your XML metadata files. Best, Sascha [1] http://wiki.apache.org/solr/ExtractingRequestHandler#Literals Rodolico Piero wrote: Hi, I need to index the contents of a file (doc, pdf, etc.) together with a set of custom metadata specified in XML, like a standard request to Solr. From the documentation I can extract the contents of a file with the request /update/extract (Tika) and index metadata with a second request to /update by passing the XML. How do I do it all in a single request (without using curl, but using an HTTP Java library or SolrJ)? For example (although I know that this is not correct):
<add>
  <doc>
    <field name="id">...</field>
    <field name="myfield-1">...</field>
    <field name="myfield-n">...</field>
    <field name="content">content of the extracted file (text)</field>
  </doc>
</add>
So I can search it either by metadata or by full text on the content. Sorry for my English ... Thanks a lot. Piero
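To sketch the second workaround (literal parameters) in SolrJ: the file goes to /update/extract as a content stream and each metadata value is passed as a literal.* request parameter. This is only an illustration, assuming a reasonably current SolrJ client; the file name, field names, values, and URL are placeholders taken from the example above:

import java.io.File;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

public class ExtractWithLiterals {
    public static void main(String[] args) throws Exception {
        try (SolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build()) {
            ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
            // the binary file whose text Tika should extract
            req.addFile(new File("document.pdf"), "application/pdf");
            // metadata parsed from the XML file, passed as literal.* parameters
            req.setParam("literal.id", "42");
            req.setParam("literal.myfield-1", "some metadata value");
            req.setParam("commit", "true");
            client.request(req);
        }
    }
}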
[Solved] Re: VelocityResponseWriter/Solritas character encoding issue
Hi Erik, I've finally solved the problem. Unfortunately, the parameter v.contentType was not described in the Solr wiki (I've fixed that now). The point is, you must specify (in your solrconfig.xml) <str name="v.contentType">text/xml;charset=UTF-8</str> in order to receive correctly UTF-8-encoded HTML. That's it! Best, Sascha Erik Hatcher wrote: Sascha, can you give me a test document that causes an issue? (Maybe send me a Solr XML document in private e-mail.) I'll see what I can do once I can see the issue first hand. Erik On Nov 18, 2009, at 2:48 PM, Sascha Szott wrote: Hi, I've played around with Solr's VelocityResponseWriter (which is indeed a very useful feature for rapid prototyping). I've realized that Velocity uses ISO-8859-1 as its default character encoding. I've changed this setting to UTF-8 in my velocity.properties file (inside the conf directory), i.e., input.encoding=UTF-8 and output.encoding=UTF-8, and checked that the settings were successfully loaded. Within the main Velocity template, browse.vm, the character encoding is set to UTF-8 as well, i.e., <meta http-equiv="content-type" content="text/html; charset=UTF-8"/>. After starting Solr (which is deployed in a Tomcat 6 server on an Ubuntu machine), I ran into some character encoding problems. Due to the change of input.encoding to UTF-8, no problems occur when non-ASCII characters are present in the query string, e.g. German umlauts. But unfortunately, something is wrong with the encoding of characters in the HTML page that is generated by the VelocityResponseWriter. The non-ASCII characters aren't displayed properly (for example, Firefox prints a black diamond with a white question mark). If I manually set the encoding to ISO-8859-1, the non-ASCII characters are displayed correctly. Does anybody have a clue? Thanks in advance, Sascha
VelocityResponseWriter/Solritas character encoding issue
Hi, I've played around with Solr's VelocityResponseWriter (which is indeed a very useful feature for rapid prototyping). I've realized that Velocity uses ISO-8859-1 as its default character encoding. I've changed this setting to UTF-8 in my velocity.properties file (inside the conf directory), i.e., input.encoding=UTF-8 and output.encoding=UTF-8, and checked that the settings were successfully loaded. Within the main Velocity template, browse.vm, the character encoding is set to UTF-8 as well, i.e., <meta http-equiv="content-type" content="text/html; charset=UTF-8"/>. After starting Solr (which is deployed in a Tomcat 6 server on an Ubuntu machine), I ran into some character encoding problems. Due to the change of input.encoding to UTF-8, no problems occur when non-ASCII characters are present in the query string, e.g. German umlauts. But unfortunately, something is wrong with the encoding of characters in the HTML page that is generated by the VelocityResponseWriter. The non-ASCII characters aren't displayed properly (for example, Firefox prints a black diamond with a white question mark). If I manually set the encoding to ISO-8859-1, the non-ASCII characters are displayed correctly. Does anybody have a clue? Thanks in advance, Sascha
Re: VelocityResponseWriter/Solritas character encoding issue
Hi Erik, Erik Hatcher wrote: Can you give me a test document that causes an issue? (maybe send me a Solr XML document in private e-mail). I'll see what I can do once I can see the issue first hand. Thank you! Just try the utf8-example.xml file in the exampledoc directory. After having indexed the document, the output of the script test_utf8.sh suggests to me that everything works correctly: Solr server is up. HTTP GET is accepting UTF-8 HTTP POST is accepting UTF-8 HTTP POST does not default to UTF-8 HTTP GET is accepting UTF-8 beyond the basic multilingual plane HTTP POST is accepting UTF-8 beyond the basic multilingual plane HTTP POST + URL params is accepting UTF-8 beyond the basic multilingual If I'm using the standard QueryResponseWriter and the query q=umlauts, the responding xml page contains properly printed non-ASCII characters. The same query against the VelocityResponseWriter returns a lot of Unicode replacement characters (u+FFFD) instead. -Sascha On Nov 18, 2009, at 2:48 PM, Sascha Szott wrote: Hi, I've played around with Solr's VelocityResponseWriter (which is indeed a very useful feature for rapid prototyping). I've realized that Velocity uses ISO-8859-1 as default character encoding. I've changed this setting to UTF-8 in my velocity.properties file (inside the conf directory), i.e., input.encoding=UTF-8 output.encoding=UTF-8 and checked that the settings were successfully loaded. Within the main Velocity template, browse.vm, the character encoding is set to UTF-8 as well, i.e., meta http-equiv=content-type content=text/html; charset=UTF-8/ After starting Solr (which is deployed in a Tomcat 6 server on a Ubuntu machine), I ran into some character encoding problems. Due to the change of input.encoding to UTF-8, no problems occur when non-ASCII characters are presend in the query string, e.g. german umlauts. But unfortunately, something is wrong with the encoding of characters in the html page that is generated by VelocityResponseWriter. The non-ASCII characters aren't displayed properly (for example, FF prints a black diamond with a white question mark). If I manually set the encoding to ISO-8859-1, the non-ASCII characters are displayed correctly. Does anybody have a clue? Thanks in advance, Sascha
Re: Indexing multiple documents in Solr/SolrCell
Kerwin, Kerwin wrote: Our approach is similar to what you have mentioned in the JIRA issue, except that we have all metadata in the XML and not in the database. I am therefore using a custom XmlUpdateRequestHandler to parse the XML and then calling Tika from within the XML loader to parse the content. Until now this seems to work. When and in which Solr version do you expect the JIRA issue to be addressed? That's a good question. Since I'm not a Solr committer, I cannot give any estimate on when it will be released (hopefully in Solr 1.5). -Sascha On Mon, Nov 16, 2009 at 5:02 PM, Sascha Szott sz...@zib.de wrote: Hi, the problem you've described -- an integration of DataImportHandler (to traverse the XML file and get the document URLs) and Solr Cell (to extract content afterwards) -- is already addressed in issue SOLR-1358 (https://issues.apache.org/jira/browse/SOLR-1358). Best, Sascha Kerwin wrote: Hi, I am new to this forum and would like to know if the functionality described below has been developed or exists in Solr. If it does not exist, is it a good idea and can I contribute it? We need to index multiple documents with different formats, so we use Solr with Tika (Solr Cell). Question: Can you index both metadata and content for multiple documents iteratively in Solr? For example, I have an XML file with metadata and links to the documents' content. There are many documents in this XML file and I would like to index them all without firing multiple URLs. Example of the XML:
<add>
  <doc>
    <field name="id">34122</field>
    <field name="author">Michael</field>
    <field name="size">3MB</field>
    <field name="URL">URL of the document</field>
  </doc>
  <doc> ... </doc>   <!-- doc 2 -->
  ...
  <doc> ... </doc>   <!-- doc N -->
</add>
I need to index all these documents by sending this XML in a single request. The collection of documents to be indexed could be on a file system. I have altered the Solr code to be able to do this, but is there an already existing feature?
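As a rough illustration of the "calling Tika from within the XML loader" part: the Tika facade class can extract plain text from most common file formats in a single call. This is only a sketch under the assumption that Tika is on the classpath; it is not the actual code discussed in the thread, and the class and method names below are made up:

import java.io.File;
import org.apache.tika.Tika;

public class ContentExtractor {
    // extracts the plain text of one referenced document so it can be indexed
    // alongside the metadata taken from the XML file
    public static String extractText(String pathFromXml) throws Exception {
        Tika tika = new Tika();   // auto-detects the file format (PDF, DOC, HTML, ...)
        return tika.parseToString(new File(pathFromXml));
    }
}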
Re: Indexing multiple documents in Solr/SolrCell
Hi, the problem you've described -- an integration of DataImportHandler (to traverse the XML file and get the document URLs) and Solr Cell (to extract content afterwards) -- is already addressed in issue SOLR-1358 (https://issues.apache.org/jira/browse/SOLR-1358). Best, Sascha Kerwin wrote: Hi, I am new to this forum and would like to know if the functionality described below has been developed or exists in Solr. If it does not exist, is it a good idea and can I contribute it? We need to index multiple documents with different formats, so we use Solr with Tika (Solr Cell). Question: Can you index both metadata and content for multiple documents iteratively in Solr? For example, I have an XML file with metadata and links to the documents' content. There are many documents in this XML file and I would like to index them all without firing multiple URLs. Example of the XML:
<add>
  <doc>
    <field name="id">34122</field>
    <field name="author">Michael</field>
    <field name="size">3MB</field>
    <field name="URL">URL of the document</field>
  </doc>
  <doc> ... </doc>   <!-- doc 2 -->
  ...
  <doc> ... </doc>   <!-- doc N -->
</add>
I need to index all these documents by sending this XML in a single request. The collection of documents to be indexed could be on a file system. I have altered the Solr code to be able to do this, but is there an already existing feature?
Re: [DIH] blocking import operation
Noble Paul wrote: Yes , open an issue . This is a trivial change I've opened JIRA issue SOLR-1554. -Sascha On Thu, Nov 12, 2009 at 5:08 AM, Sascha Szott sz...@zib.de wrote: Noble, Noble Paul wrote: DIH imports are really long running. There is a good chance that the connection times out or breaks in between. Yes, you're right, I missed that point (in my case imports take no longer than a minute). how about a callback? Thanks for the hint. There was a discussion on adding a callback url to DIH a month ago, but it seems that no issue was raised. So, up to now its only possible to implement an appropriate Solr EventListener. Should we open an issue for supporting callback urls? Best, Sascha On Tue, Nov 10, 2009 at 12:12 AM, Sascha Szott sz...@zib.de wrote: Hi all, currently, DIH's import operation(s) only works asynchronously. Therefore, after submitting an import request, DIH returns immediately, while the import process (in case a large amount of data needs to be indexed) continues asynchronously behind the scenes. So, what is the recommended way to check if the import process has already finished? Or still better, is there any method / workaround that will block the import operation's caller until the operation has finished? In my application, the DIH receives some URL parameters which are used for determining the database name that is used within data-config.xml, e.g. http://localhost:8983/solr/dataimport?command=full-importdbname=foo Since only one DIH, /dataimport, is defined, but several database needs to be indexed, it is required to issue this command several times, e.g. http://localhost:8983/solr/dataimport?command=full-importdbname=foo ... wait until /dataimport?command=status says Indexing completed (but without using a loop that checks it again and again) ... http://localhost:8983/solr/dataimport?command=full-importdbname=barclean=false A suitable solution, at least IMHO, would be to have an additional DIH parameter which determines whether the import call is blocking on non-blocking, the default. As far as I see, this could be accomplished since Solr can execute more than one import operation at a time (it starts a new thread for each). Perhaps, my question is somehow related to the discussion [1] on ParallelDataImportHandler. Best, Sascha [1] http://www.lucidimagination.com/search/document/a9b26ade46466ee -- - Noble Paul | Principal Engineer| AOL | http://aol.com
Re: [DIH] concurrent requests to DIH
Hi Avlesh, Avlesh Singh wrote: 1. Is it considered as good practice to set up several DIH request handlers, one for each possible parameter value? Nothing wrong with this. My assumption is that you want to do this to speed up indexing. Each DIH instance would block all others, once a Lucene commit for the former is performed. Thanks for this clarification. 2. In case the range of parameter values is broad, it's not convenient to define separate request handlers for each value. But this entails a limitation (as far as I see): It is not possible to fire several request to the same DIH handler (with different parameter values) at the same time. Nope. I had done a similar exercise in my quest to write a ParallelDataImportHandler. This thread might be of interest to you - http://www.lucidimagination.com/search/document/a9b26ade46466ee/queries_regarding_a_paralleldataimporthandler. Though there is a ticket in JIRA, I haven't been able to contribute this back. If you think this is what you need, lemme know. Actually, I've already read this thread. In my opinion, both support for batch processing and multi-threading are important extensions of DIH's current capabilities, though issue SOLR-1352 mainly targets the latter. Is your PDIH implementation able to deal with batch processing right now? Best, Sascha On Thu, Nov 12, 2009 at 6:35 AM, Sascha Szott sz...@zib.de wrote: Hi all, I'm using the DIH in a parameterized way by passing request parameters that are used inside of my data-config. All imports end up in the same index. 1. Is it considered as good practice to set up several DIH request handlers, one for each possible parameter value? 2. In case the range of parameter values is broad, it's not convenient to define separate request handlers for each value. But this entails a limitation (as far as I see): It is not possible to fire several request to the same DIH handler (with different parameter values) at the same time. However, in case several request handlers would be used (as in 1.), concurrent requests (to the different handlers) are possible. So, how to overcome this limitation? Best, Sascha
Re: [DIH] blocking import operation
Noble, Noble Paul wrote: DIH imports are really long running. There is a good chance that the connection times out or breaks in between. Yes, you're right, I missed that point (in my case imports take no longer than a minute). how about a callback? Thanks for the hint. There was a discussion on adding a callback url to DIH a month ago, but it seems that no issue was raised. So, up to now its only possible to implement an appropriate Solr EventListener. Should we open an issue for supporting callback urls? Best, Sascha On Tue, Nov 10, 2009 at 12:12 AM, Sascha Szott sz...@zib.de wrote: Hi all, currently, DIH's import operation(s) only works asynchronously. Therefore, after submitting an import request, DIH returns immediately, while the import process (in case a large amount of data needs to be indexed) continues asynchronously behind the scenes. So, what is the recommended way to check if the import process has already finished? Or still better, is there any method / workaround that will block the import operation's caller until the operation has finished? In my application, the DIH receives some URL parameters which are used for determining the database name that is used within data-config.xml, e.g. http://localhost:8983/solr/dataimport?command=full-importdbname=foo Since only one DIH, /dataimport, is defined, but several database needs to be indexed, it is required to issue this command several times, e.g. http://localhost:8983/solr/dataimport?command=full-importdbname=foo ... wait until /dataimport?command=status says Indexing completed (but without using a loop that checks it again and again) ... http://localhost:8983/solr/dataimport?command=full-importdbname=barclean=false A suitable solution, at least IMHO, would be to have an additional DIH parameter which determines whether the import call is blocking on non-blocking, the default. As far as I see, this could be accomplished since Solr can execute more than one import operation at a time (it starts a new thread for each). Perhaps, my question is somehow related to the discussion [1] on ParallelDataImportHandler. Best, Sascha [1] http://www.lucidimagination.com/search/document/a9b26ade46466ee
[DIH] concurrent requests to DIH
Hi all, I'm using the DIH in a parameterized way by passing request parameters that are used inside my data-config. All imports end up in the same index. 1. Is it considered good practice to set up several DIH request handlers, one for each possible parameter value? 2. In case the range of parameter values is broad, it's not convenient to define separate request handlers for each value. But this entails a limitation (as far as I see): It is not possible to fire several requests to the same DIH handler (with different parameter values) at the same time. However, if several request handlers were used (as in 1.), concurrent requests (to the different handlers) would be possible. So, how can this limitation be overcome? Best, Sascha
[DIH] SqlEntityProcessor does not recognize onError attribute
Hi all, as stated in the Solr wiki, Solr 1.4 allows specifying an onError attribute for *each* entity listed in the data config file (it is considered one of the default attributes). Unfortunately, the SqlEntityProcessor does not recognize the attribute's value -- i.e., in case an SQL exception is thrown somewhere inside the constructor of ResultSetIterator (which is an inner class of JdbcDataSource), Solr's import exits immediately, even though onError is set to continue or skip. Why are database-related exceptions (e.g., a table does not exist, or an error in query syntax occurs) not covered by the onError attribute? In my opinion, use cases exist that would profit from such exception handling inside Solr (for example, in cases where the existence of certain database tables or views is not predictable). Should I raise a JIRA issue about this? -Sascha
Re: [DIH] SqlEntityProcessor does not recognize onError attribute
Hi, Noble Paul നോബിള് नोब्ळ् wrote: On Mon, Nov 9, 2009 at 4:24 PM, Sascha Szott sz...@zib.de wrote: Hi all, as stated in the Solr-WIKI, Solr 1.4 allows it to specify an onError attribute for *each* entity listed in the data config file (it is considered as one of the default attributes). Unfortunately, the SqlEntityProcessor does not recognize the attribute's value -- i.e., in case an SQL exception is thrown somewhere inside the constructor of ResultSetIterators (which is an inner class of JdbcDataSource), Solr's import exits immediately, even though onError is set to continue or skip. Why are database related exceptions (e.g., table does not exists, or an error in query syntax occurs) not being covered by the onError attribute? In my opinion, use cases exist that will profit from such an exception handling inside of Solr (for example, in cases where the existence of certain database tables or views is not predictable). We thought DB errors are not to be ignored because errors such as table does not exist can be really serious. In principle, I agree with you, though I would consider it as a programmer's responsibility to be aware of it (in case he/she sets onError to skip or continue). Should I raise an JIRA-issue about this? Raise an issue it can be fixed I've created issue SOLR-1549. Best, Sascha
[DIH] blocking import operation
Hi all, currently, DIH's import operations only work asynchronously. Therefore, after submitting an import request, DIH returns immediately, while the import process (in case a large amount of data needs to be indexed) continues asynchronously behind the scenes. So, what is the recommended way to check if the import process has already finished? Or, even better, is there any method / workaround that will block the import operation's caller until the operation has finished? In my application, the DIH receives some URL parameters which are used for determining the database name that is used within data-config.xml, e.g. http://localhost:8983/solr/dataimport?command=full-import&dbname=foo Since only one DIH, /dataimport, is defined, but several databases need to be indexed, it is required to issue this command several times, e.g. http://localhost:8983/solr/dataimport?command=full-import&dbname=foo ... wait until /dataimport?command=status says Indexing completed (but without using a loop that checks it again and again) ... http://localhost:8983/solr/dataimport?command=full-import&dbname=bar&clean=false A suitable solution, at least IMHO, would be to have an additional DIH parameter which determines whether the import call is blocking or non-blocking (the default). As far as I see, this could be accomplished since Solr can execute more than one import operation at a time (it starts a new thread for each). Perhaps my question is somehow related to the discussion [1] on ParallelDataImportHandler. Best, Sascha [1] http://www.lucidimagination.com/search/document/a9b26ade46466ee
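Until a blocking mode or a completion callback exists, the usual client-side workaround is exactly the polling loop the post tries to avoid. A rough SolrJ sketch, assuming a reasonably current client; the URL and core name are placeholders, and the exact wording of the DIH status response may differ between versions:

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.QueryRequest;
import org.apache.solr.common.params.ModifiableSolrParams;
import org.apache.solr.common.util.NamedList;

public class WaitForDihImport {
    public static void main(String[] args) throws Exception {
        try (SolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build()) {
            // kick off the import (dbname is the custom parameter from the thread)
            ModifiableSolrParams start = new ModifiableSolrParams();
            start.set("command", "full-import");
            start.set("dbname", "foo");
            QueryRequest startReq = new QueryRequest(start);
            startReq.setPath("/dataimport");
            client.request(startReq);

            // poll the status command until DIH reports that it is idle again
            ModifiableSolrParams status = new ModifiableSolrParams();
            status.set("command", "status");
            QueryRequest statusReq = new QueryRequest(status);
            statusReq.setPath("/dataimport");
            String state;
            do {
                Thread.sleep(5000);
                NamedList<Object> rsp = client.request(statusReq);
                state = (String) rsp.get("status");
            } while (!"idle".equalsIgnoreCase(state));
        }
    }
}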
Re: How to use DataImportHandler with ExtractingRequestHandler?
Hi Khai, a few weeks ago I was facing the same problem. In my case, this workaround helped (assuming you're using Solr 1.3): For each row, extract the content from the corresponding PDF file using a parser library of your choice (I suggest Apache PDFBox, or Apache Tika in case you need to process other file types as well), put it between <foo><![CDATA[ and ]]></foo>, and store it in a text file. To keep the relationship between a file and its corresponding database row, use the primary key as the file name. Within data-config.xml, use the XPathEntityProcessor as follows (replace dbRow and primaryKey accordingly):
<entity name="pdfcontent" processor="XPathEntityProcessor" forEach="/foo" url="${dbRow.primaryKey}.xml">
  <field column="pdftext" xpath="/foo" />
</entity>
And, by the way, in Solr 1.4 you do not have to put your content between XML tags: use the PlainTextEntityProcessor instead of the XPathEntityProcessor. Best, Sascha Khai Doan wrote: Hi all, my name is Khai. I have a table in a relational database. I have successfully used DataImportHandler to import this data into Apache Solr. However, one of the columns stores the location of a PDF file. How can I configure DataImportHandler to use ExtractingRequestHandler to extract the content of the PDF? Thanks! Khai Doan
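The extraction step itself could look roughly like this with Apache PDFBox (a sketch assuming the PDFBox 2.x API -- PDDocument.load was replaced by Loader.loadPDF in 3.x; the paths and the primary key 4711 are made-up placeholders):

import java.io.File;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class PdfToCdataXml {
    public static void main(String[] args) throws Exception {
        try (PDDocument pdf = PDDocument.load(new File("/data/pdfs/4711.pdf"))) {
            String text = new PDFTextStripper().getText(pdf);
            // wrap the extracted text in a dummy root element so that the
            // XPathEntityProcessor (xpath="/foo") can pick it up; naively
            // defuse any "]]>" inside the text so the CDATA section stays valid
            String xml = "<foo><![CDATA[" + text.replace("]]>", "]] >") + "]]></foo>";
            Files.write(Paths.get("/data/xml/4711.xml"), xml.getBytes(StandardCharsets.UTF_8));
        }
    }
}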
Building documents using content residing both in database tables and text files
Hello, is it possible (and if it is, how can I accomplish it) to configure DIH to build up index documents by using content that resides in different data sources? Here is an example scenario: Let's assume we have a table T with two columns, ID (which is the primary key of T) and TITLE. Furthermore, each record in T is assigned a directory containing text files that were generated from PDF documents using Tika. A directory name is built from the ID of the record in T associated with that directory, e.g. all text files associated with a record with id = 101 are stored in directory 101. Is there a way to configure DIH such that it uses ID, TITLE, and the content of all related text files when building a document (the documents should have three fields: id, title, and text)? Furthermore, as you may have noticed, a second question arises naturally: Will there be any integration of Solr Cell and DIH in an upcoming release, so that it would be possible to use the PDF documents directly instead of the extracted text files that were generated outside of Solr? Best, Sascha
Re: Building documents using content residing both in database tables and text files
Hi Noble, Noble Paul wrote: isn't it possible to do this by having two datasources (one JDBC and another File) and two entities? The outer entity can read from a DB and the inner entity can read from a file. Yes, it is. Here's my db-data-config.xml file:
<!-- definition of data sources -->
<dataSource name="ds.database" driver="..." url="..." user="..." password="..." />
<dataSource name="ds.filesystem" type="FileDataSource" />
<!-- building the document using both db and file content (files are stored in /tmp/recordId) -->
<document name="doc">
  <entity name="t" query="select * from t" dataSource="ds.database">
    <field column="id" name="id" />
    <field column="title" name="title" />
    <entity name="dir" processor="FileListEntityProcessor" baseDir="/tmp/${id}" fileName=".*" dataSource="null" rootEntity="false">
      <entity name="file" dataSource="ds.filesystem" processor="XPathEntityProcessor" forEach="/root" url="${dir.fileAbsolutePath}" stream="false">
        <field column="text" xpath="/root" />
      </entity>
    </entity>
  </entity>
</document>
Only one additional adjustment has to be made: Since I'm using Solr 1.3 and it comes without the PlainTextEntityProcessor, I have to transform my plain text files into XML files by surrounding the content with a root element. That's all! On Tue, Aug 11, 2009 at 8:05 PM, Sascha Szott sz...@zib.de wrote: Hello, is it possible (and if it is, how can I accomplish it) to configure DIH to build up index documents by using content that resides in different data sources? Here is an example scenario: Let's assume we have a table T with two columns, ID (which is the primary key of T) and TITLE. Furthermore, each record in T is assigned a directory containing text files that were generated from PDF documents using Tika. A directory name is built from the ID of the record in T associated with that directory, e.g. all text files associated with a record with id = 101 are stored in directory 101. Is there a way to configure DIH such that it uses ID, TITLE, and the content of all related text files when building a document (the documents should have three fields: id, title, and text)? Furthermore, as you may have noticed, a second question arises naturally: Will there be any integration of Solr Cell and DIH in an upcoming release, so that it would be possible to use the PDF documents directly instead of the extracted text files that were generated outside of Solr? This is something I wish to see. But there has been no user request yet. You can raise an issue and it can be looked upon. I've raised issue SOLR-1358. Best, Sascha