Boosting of search results
Hi, I want to boost/block search results, and I don't want to use the field/term boosting of the dismax request handler. I have seen a post that mentions setting a value for the key $docBoost via a transformer, but I am not sure how to use/set a doc boost via a transformer. http://www.nabble.com/Boosting-Code-td22119017.html#a22119017 Please let me know how to use docBoost, or whether there is another way to boost documents. Thanks, Prerna
importing lots of db data. specially formatted. what is the fastest approach?
Hi folks, I have around 50k documents that are reindexed now and then. The question is what would be the fastest approach to all this. The data is just text, ~20 fields or so. It comes from a database but is first specially formatted into a form suitable for passing to Solr. Currently an XML post is used, but I have the feeling this is not optimal speed-wise when it comes to bulk import/reindex. I see http://wiki.apache.org/solr/DataImportHandler but fail to see how to feed it this specially formatted data so Solr can make use of it. Are there any real examples or articles on how to use this?
Re: Boosting Code
Hi, I have to boost documents. Can someone help me understand how we can implement docBoost via a transformer? Thanks, Prerna

Marc Sturlese wrote:
If you mean at indexing time, you set a field boost via data-config.xml. That boost is parsed from there and set on the Lucene document going through DocBuilder.java, SolrInputDocument.java and DocumentBuilder.java. In case you want to set a full-document boost (not just a field boost) you can do it by setting a value for the key $docBoost via a transformer. That value is set using the same classes (DocBuilder.java, SolrInputDocument.java and DocumentBuilder.java).

dabboo wrote:
Hi, Can anyone please tell me where I can find the actual logic/implementation of field boosting in Solr? I am looking for the classes. Thanks, Amit Garg
Re: importing lots of db data. specially formatted. what is the fastest approach?
If each field from the db goes to a separate field in Solr as-is, then it is very simple. If you need to split/join fields before feeding them into Solr fields, you may need to apply transformers. An example of how your db fields look and how you wish them to look in Solr would be helpful.

--
Noble Paul | Principal Engineer | AOL | http://aol.com
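For the simple as-is case, a minimal data-config.xml sketch might look like the following (the JDBC URL, table and column names are made up for illustration):

  <dataConfig>
    <dataSource driver="com.mysql.jdbc.Driver"
                url="jdbc:mysql://localhost/mydb" user="user" password="pass"/>
    <document>
      <entity name="item" query="select id, name, description from item">
        <!-- each db column maps straight onto a schema field -->
        <field column="id" name="id"/>
        <field column="name" name="name"/>
        <field column="description" name="description"/>
      </entity>
    </document>
  </dataConfig>

With the handler registered in solrconfig.xml, a full reindex is then just a request to /dataimport?command=full-import.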
DIH example explanation
Hi, I am looking at the Slashdot example and I am having a hard time understanding the following from the wiki:
==
You can use this feature for indexing from REST APIs such as rss/atom feeds, XML data feeds, other Solr servers or even well-formed xhtml documents. Our XPath support has its limitations (no wildcards, only full paths etc.) but we have tried to make sure that common use-cases are covered, and since it's based on a streaming parser, it is extremely fast and consumes a constant amount of memory even for large XMLs. It does not support namespaces, but it can handle XMLs with namespaces. When you provide the xpath, just drop the namespace and give the rest (e.g. if the tag is 'dc:subject' the mapping should just contain 'subject'). Easy, isn't it? And you didn't need to write one line of code! Enjoy
==
How does dc:subject become the field subject, and why is its mapping xpath=/RDF/item/subject? What is the secret? I am trying to index Atom files and I need to understand the above because I have namespaces and am not sure how to proceed. Are there any Atom examples anywhere? Thanks again for any clarification. Anton
Re: Boosting Code
public Map<String, Object> transformRow(Map<String, Object> row, Context ctx) {
    row.put("$docBoost", 3445);
    return row;
}

--
Noble Paul | Principal Engineer | AOL | http://aol.com
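Filling that out into a complete class and wiring it into data-config.xml might look like the sketch below; the package, class, entity and column names are invented for illustration, and the class extends DIH's org.apache.solr.handler.dataimport.Transformer:

  package com.example;

  import java.util.Map;
  import org.apache.solr.handler.dataimport.Context;
  import org.apache.solr.handler.dataimport.Transformer;

  // Sets a full-document boost: DIH treats the special key $docBoost
  // as the boost for the resulting Lucene document.
  public class DocBoostTransformer extends Transformer {
      @Override
      public Object transformRow(Map<String, Object> row, Context ctx) {
          row.put("$docBoost", 3445);
          return row;
      }
  }

and in data-config.xml:

  <entity name="item" query="select id, name from item"
          transformer="com.example.DocBoostTransformer">
    <field column="id" name="id"/>
    <field column="name" name="name"/>
  </entity>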
Re: All in one index, or multiple indexes?
Keep in mind that every time a commit is done, all the caches are thrown away. If updates for each of these indexes happen at different times, then the caches get invalidated on each commit, so in that case a smaller index helps.

On Wed, Jul 8, 2009 at 4:55 PM, Tim Sell trs...@gmail.com wrote:
Hi, I am wondering if it is common to have just one very large index, or multiple smaller indexes specialized for different content types. We currently have multiple smaller indexes, although one of them is much larger than the others. We are considering merging them, to allow the convenience of searching across multiple types at once and getting them back in one list. The largest of the current indexes has a couple of types that belong together; it has just one text field, which is usually quite short and is similar to product names (words like The matter). Another index I would merge with this one has multiple text fields (also quite short). We of course would still like to be able to get specific types. Is filtering on just one type a big performance hit compared to querying it from its own index? Bear in mind all these indexes run on the same machine (we replicate them all to three machines and do load balancing). There are a number of considerations. From an application standpoint, when querying across all types we may split the results out into the separate types anyway once we have the list back. If we always do this, is it silly to have them in one index rather than query multiple indexes at once? Are multiple http requests less significant than the time to post-split the results? In some ways it is easier to maintain a single index, although it has felt easier to optimize the results for the type of content if they are in separate indexes. My main concern about putting it all in one index is that we'll make it harder to work with. We will definitely want to filter on types sometimes, and if we go with a mashed-up index I'd prefer not to maintain separate specialized indexes as well. Any thoughts? ~Tim.

--
Noble Paul | Principal Engineer | AOL | http://aol.com
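For what it's worth, restricting a combined index to one type is normally done with a filter query, which is cached separately from the main query. A sketch, assuming a hypothetical 'type' field on every document:

  http://localhost:8983/solr/select?q=the+matter&fq=type:product

After the first request, the fq=type:product filter is served from the filter cache, so the per-query cost of filtering on type is usually small.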
Re: DIH example explanation
The point is that the namespace is ignored while DIH reads the XML. So just use the part after the colon (:) in your xpath expressions and it should just work.

--
Noble Paul | Principal Engineer | AOL | http://aol.com
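As a concrete sketch for an Atom feed inside data-config.xml (the feed URL and field names are made up), an element such as dc:subject is mapped with the namespace prefix simply dropped:

  <dataSource type="HttpDataSource"/>
  <document>
    <entity name="feed"
            processor="XPathEntityProcessor"
            url="http://example.com/feed.atom"
            forEach="/feed/entry">
      <field column="title" xpath="/feed/entry/title"/>
      <!-- the source element is dc:subject; the dc: prefix is dropped -->
      <field column="subject" xpath="/feed/entry/subject"/>
    </entity>
  </document>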
US/UK/CA/AU English support
Hi,
1) Out of US/UK/CA/AU, which English does Solr support?
2) PhoneticFilterFactory performs search for similar-sounding words. For example, a search on carat will give results for carat, caret and carrat. I also observed that PhoneticFilterFactory supports linguistic variation for US/UK/CA/AU. For example, a search on Optimize gives results for optimise and optimize.
Question: Does PhoneticFilterFactory support all characters/words of the linguistic variations for US/UK/CA/AU, or will linguistic search for US/UK/CA/AU be a subset of phonetic search?
Please suggest. Thanks, Prerna
Re: Word frequency count in the index
Hi Grant, thanks for your reply. I have one more doubt: if I use Luke's request handler in Solr for this, are the top terms I get sorted by term frequency or by highest document frequency? I would like to get terms that occur the most within a document, where those documents also form a good percentage of the total index. Kindly reply if any other option, straightforward or more elaborate, is available. Thank you, Pooja

On Thu, Jul 16, 2009 at 4:05 PM, Grant Ingersoll gsing...@apache.org wrote:
In the trunk version, the TermsComponent should give you this: http://wiki.apache.org/solr/TermsComponent. Also, you can use the LukeRequestHandler to get the top words in each field. Alternatively, you may just want to point Luke at your index.

On Jul 16, 2009, at 6:29 AM, Pooja Verlani wrote:
Hi, Is there any way in Solr to know the count of each word indexed in Solr? I want to find out the different word frequencies to figure out 'application-specific stop words'. Please let me know if it's possible. Thank you, Regards, Pooja

--
Grant Ingersoll http://www.lucidimagination.com/
Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene: http://www.lucidimagination.com/search
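On the question above: both the LukeRequestHandler top terms and the TermsComponent report document frequency (the number of documents containing a term), not the total number of occurrences. A sketch of a TermsComponent request, assuming a trunk build with a /terms handler registered in solrconfig.xml and a field named text:

  http://localhost:8983/solr/terms?terms.fl=text&terms.limit=25

This returns the 25 terms of the text field with the highest document frequency, which is usually what you want for spotting candidate stop words.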
Best approach to multiple languages
Hi, We have a dataset that contains productname, category and descriptions. The descriptions can be in one or more different languages. What would be the recommended way of indexing these? My initial thought is to index each description as a separate field and append the language identifier to the field name; for example, three fields named description_en, description_de, description_fr. Is this the best approach or is there a better way? Regards, Andrew McCombe
Re: Best approach to multiple languages
Hi, We have such a case... we don't want to search all of those languages at once, just one of them. So we took the approach of a different index for each language. From what I know it also helps not to skew the relevance statistics - you know, how much an index is used etc. If you dig in the mailing list, this has been discussed quite a few times.
Re: importing lots of db data. specially formatted. what is the fastest approach?
As Noble has already said, transforming content before indexing is a very common requirement. DataImportHandler's Transformer lets you achieve this. Read up on it here - http://wiki.apache.org/solr/DataImportHandler#head-a6916b30b5d7605a990fb03c4ff461b3736496a9
Cheers, Avlesh
Re: Behaviour when we get more than 1 million hits
Hi, There is this particular scenario where I want to search for a product and I get a million records which will be given for further processing. Regards, Raakhi

On Mon, Jul 13, 2009 at 7:33 PM, Erick Erickson erickerick...@gmail.com wrote:
It depends (tm) on what you try to do with the results. You really need to give us some more details on what you want to *do* with 1,000,000 hits before any meaningful response is possible. Best, Erick

On Mon, Jul 13, 2009 at 8:47 AM, Rakhi Khatwani rkhatw...@gmail.com wrote:
Hi, While using Solr, what would the behaviour be like if we perform a search and get more than one million hits? Regards, Raakhi
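If the full result set really has to be handed off for processing, the usual pattern is to page through it in batches rather than request a million rows in one response, e.g. (query, host and batch size are placeholders):

  http://localhost:8983/solr/select?q=product&fl=id&start=0&rows=1000
  http://localhost:8983/solr/select?q=product&fl=id&start=1000&rows=1000
  ...

Keeping fl down to the fields you actually need reduces the cost of materializing each batch.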
[ApacheCon US] Travel Assistance
The Travel Assistance Committee is taking in applications from those wanting to attend ApacheCon US 2009 (Oakland), which takes place between the 2nd and 6th November 2009. The Travel Assistance Committee is looking for people who would like to attend ApacheCon US 2009 but may need some financial support in order to get there. There are limited places available, and all applications will be scored on their individual merit. Applications are open to all open source developers who feel that their attendance would benefit themselves, their project(s), the ASF and open source in general. Financial assistance is available for flights, accommodation, subsistence and conference fees, either in full or in part, depending on circumstances. It is intended that all our ApacheCon events are covered, so it may be prudent for those in Europe and/or Asia to wait until an event closer to them comes up - you are all welcome to apply for ApacheCon US of course, but there should be compelling reasons for you to attend an event further from your home location for your application to be considered above those closer to the event location. More information can be found on the main Apache website at http://www.apache.org/travel/index.html - where you will also find a link to the online application and details for submitting. Applications for travel assistance will open on 27th July 2009 and close on 17th August 2009. Good luck to all those who apply. Regards, The Travel Assistance Committee
Re: DIH example explanation
:) Thank you Paul - and it works! I have one more stupid question about the wiki: url (required): The URL used to invoke the REST API. (Can be templatized). How do you templatize the URL? My URLs are being updated all the time by an external program, i.e. the list of Atom sites is a text file. So should I use some form of transformer to process it? Any hint? Thanks. Anton
Re: importing lots of db data. specially formatted. what is the fastest approach?
Well yes, transformation is required. But the data comes from multiple tables etc. - it's not like getting one row from a table, possibly transforming it, and using it. I am thinking of perhaps creating some tables (views) that will have the data ready/flattened and then simply feeding that in, because I'm not sure how much flexibility a transformer will give me. Java is not my number 1 language either :) Thanks for the suggestions - will take a look there.
Synonyms from index
Hi, Is there a possible way to generate synonyms from the index? I have an index where lots of searchable terms turn out to have synonyms, and users use different synonyms too. If not, then the only way is to learn from the query logs and click logs, but in case something exists, please share. Regards, Pooja
Re: DIH example explanation
Any string that is templatized in DIH can have variables like this: ${a.b}. For instance, look at the following: url="http://xyz.com/atom/${dataimporter.request.foo}". If you pass a parameter foo=bar when you invoke the command, the URL invoked becomes http://xyz.com/atom/bar. The variable can come from many places; see this: http://wiki.apache.org/solr/DataImportHandler#head-86408ce7721ea6f9a3f05b12ace8742fd41737d4

--
Noble Paul | Principal Engineer | AOL | http://aol.com
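So, assuming the DataImportHandler is registered at /dataimport, the request-parameter variant of the example above would be invoked like this (foo=bar is a placeholder name/value):

  http://localhost:8983/solr/dataimport?command=full-import&foo=bar

Every request parameter then becomes available inside data-config.xml as ${dataimporter.request.<name>}.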
Re: importing lots of db data. specially formatted. what is the fastest approach?
A transformer can be written in any language - if you are using Java 6, JavaScript support comes out of the box: http://wiki.apache.org/solr/DataImportHandler#head-27fcc2794bd71f7d727104ffc6b99e194bdb6ff9

--
Noble Paul | Principal Engineer | AOL | http://aol.com
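A minimal sketch of such a script transformer (the function, entity and column names are invented; rows arrive as Java Maps, so the JavaScript calls get/put on them):

  <dataConfig>
    <script><![CDATA[
      // join two db columns into a single Solr field
      function joinName(row) {
        row.put('fullName', row.get('firstName') + ' ' + row.get('lastName'));
        return row;
      }
    ]]></script>
    <document>
      <entity name="person" query="select firstName, lastName from person"
              transformer="script:joinName">
        <field column="fullName" name="fullName"/>
      </entity>
    </document>
  </dataConfig>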
Re: importing lots of db data. specially formatted. what is the fastest approach?
You can also do a table join in a SQL select to pick out the fields you want from multiple tables. You may want to use temporary tables during processing. Once you get the data the way you want it, you can use the CSV request handler to load the output of the SQL select. Bill
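Loading such a CSV export is a single HTTP request; a sketch, with the file name as a placeholder:

  curl 'http://localhost:8983/solr/update/csv?commit=true' \
       --data-binary @products.csv \
       -H 'Content-type: text/plain; charset=utf-8'

The first line of the CSV file names the target schema fields, and commit=true makes the documents visible as soon as the load finishes.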
Re: Random Slowness
We can never reproduce the slowness with the same query. As soon as we try to run them again they are fine. I have even tried running the same query the next day and it is fine. All of our requests go through our dismax handler, which is part of why it is so weird. Most queries are fine, but just occasionally they aren't. Additionally, why would the command=details command also go slow? That seems like a server issue. It appears that for fieldValueCache and filterCache we have no evictions, but for queryResultCache and documentCache there are a good number of evictions. How would I lower the evictions to see if that is the problem? Dismax config below:

<requestHandler name="dismax" class="solr.DisMaxRequestHandler" default="true">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <float name="tie">0.5</float>
    <str name="qf">
      productId^10.0 personality^8.0 subCategory^8.0 category^6.0
      productType^5.0 brandName^1.0 realBrandName^1.0 productNameSearch^1.0
      size^1.2 width^1.0 heelHeight^1.0 productDescription^1.0 color^10.0
      price^1.0 attrs^5.0 expandedGender^0.5
    </str>
    <str name="pf">
      attrs^3 brandName^10.0 productNameSearch^8.0 productDescription^2.0
      personality^4.0 subCategory^12.0 category^10.0 productType^8.0
    </str>
    <str name="fl">
      productId, productName, price, originalPrice, brandNameFacet,
      productRating, imageUrl, productUrl, isNew, onSale, styleId
    </str>
    <str name="mm">100%</str>
    <int name="ps">1</int>
    <int name="qs">5</int>
    <str name="q.alt">*:*</str>
    <!-- More like this search parameters -->
    <str name="mlt.fl">
      brandNameFacet,productTypeFacet,productName,categoryFacet,subCategoryFacet,personalityFacet,colorFacet,heelHeight,expandedGender
    </str>
    <int name="mlt.mindf">1</int>
    <int name="mlt.mintf">1</int>
  </lst>
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>

--
Jeff Newburn
Software Engineer, Zappos.com
jnewb...@zappos.com - 702-943-7562

From: Erik Hatcher e...@ehatchersolutions.com
Date: Wed, 22 Jul 2009 00:36:30 -0400
Subject: Re: Random Slowness

On Jul 21, 2009, at 6:52 PM, Jeff Newburn wrote:
We are experiencing random slowness on certain queries. I have been unable to diagnose what the issue is. We are using Solr 1.4 and 99.99% of queries return in under 250 ms. The remaining queries are returning in 2-5 seconds for no apparent reason. There does not seem to be any commonality between the queries. This problem also includes admin system queries. Any help or direction would be much appreciated.

Do you experience the same slow speeds when you manually issue those queries? In other words, is it repeatable? If so, try debugQuery=true and see the component timings and where the time is going. What's the query parsing to? Anything unusually large due to synonym lists or something like that? What about your filter cache - how's it looking when these slow queries take place? Evictions > 0?

params={facet=true&facet.mincount=1&facet.limit=-1&wt=javabin&rows=0&facet.sort=true&start=0&q=shoes&facet.field=colorFacet&facet.field=brandNameFacet&facet.field=heelHeight&facet.field=attrFacet_Style&qt=dismax&fq=productTypeFacet:Shoes&fq=gender:Womens&fq=categoryFacet:Sandals&fq=width:EE&fq=size:10.5&fq=priceFacet:$100.00+and+Under&fq=personalityFacet:Sexy} hits=19 status=0 QTime=3689

What's the config of your dismax handler look like?
Erik
Re: Random Slowness
I haven't read this whole thread, so maybe it's already come up: have you turned on garbage collection logging to see if the JVM is busy cleaning up when you are seeing the slowness? Maybe the JVM is struggling to keep the heap size within a particular limit? //Ed
Re: Random Slowness
Ed, How do I go about enabling the GC logging for Solr?
--
Jeff Newburn
Software Engineer, Zappos.com
jnewb...@zappos.com - 702-943-7562
Re: Random Slowness
On Wed, Jul 22, 2009 at 10:44 AM, Jeff Newburn jnewb...@zappos.com wrote:
How do I go about enabling the GC logging for Solr?

It depends how you are running Solr. You basically want to make sure that when the JVM is started up with the java command, it gets some additional arguments [1]. So for example if you are running Solr using Jetty you would:

  java -verbose:gc -Xloggc:solr_gc.log -jar start.jar

And then poke around in the log looking for garbage collection events that take as long as the pauses you are seeing in your app. I think there are tools that will help you analyze the log files if you need them. If there is a correlation you'll probably want to tune your Solr memory usage with -Xmx and -Xms. Hope this helps. //Ed

[1] http://java.sun.com/javase/7/docs/technotes/tools/windows/java.html
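Putting the pieces together, a fuller invocation might look like this (the heap sizes are placeholders to be tuned for your index and machine):

  java -verbose:gc -XX:+PrintGCDetails -Xloggc:solr_gc.log \
       -Xms1024m -Xmx1024m -jar start.jar

Setting -Xms equal to -Xmx avoids heap-resizing pauses, and the detailed GC log makes it easier to line up long collections with slow queries in the Solr request log.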
Re: solr 1.3.0 and Oracle Fusion Middleware
Let's keep this communication on the list so others can benefit and chime in. What about the filter-dispatched-requests-enabled setting? Perhaps it doesn't use the weblogic.xml file anymore and you'll need to find the new way to configure that setting. From what I can see, that setting will default to true now if you are using a web.xml defined as 2.4 (according to the WebLogic 9 docs). Solr is using 2.3 at the moment - you might try changing the web.xml from 2.3 to 2.4, or figure out how to adjust that setting (filter-dispatched-requests-enabled) with your current container. It will default to true with a web.xml 2.4 for back compat.
- Mark

On Wed, Jul 22, 2009 at 10:35 AM, Hall, David dh...@vermeer.com wrote:
Mark - Thanks for the info. I took a look at the two URLs, and even though it is not true WebLogic - i.e. this is the Oracle OC4J, not the WebLogic Java containers from the pre-Oracle acquisition - I have tried to remove the encoding from the header and created the weblogic.xml, bounced the container and re-tried. However this did not fix the issue. I think this is the correct direction... maybe just a little different. Maybe it needs to be put in the web.xml. (I am not using WebLogic (the Oracle Portal replacement) directly - just the Oracle Java container. I don't know if that makes any difference.) Here are my observations on this issue: when I hit solr/admin I do get a page - it is just missing the pretty stuff. Statistics totally do not work; I get stackoverflow errors in the opmn/log for this container. Below is what I can see from Paros...

solr-admin.css
HTTP/1.1 500 Internal Server Error
Date: Wed, 22 Jul 2009 14:21:22 GMT
Server: Oracle-Application-Server-10g/10.1.3.4.0 Oracle-HTTP-Server
Content-Location: https://testportalapp.vermeer.com/solr/admin/solr-admin.css
Content-Type: text/html
Connection: close

500 Internal Server Error
null java.lang.StackOverflowError
 at java.security.AccessController.doPrivileged(Native Method)
 at java.io.PrintWriter.<init>(PrintWriter.java:77)
 at java.io.PrintWriter.<init>(PrintWriter.java:61)
 at org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:316)
 at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:281)
 at com.evermind.server.http.FileRequestDispatcher.handleWithFilter(FileRequestDispatcher.java:135)
 at com.evermind.server.http.FileRequestDispatcher.unprivileged_forwardInternal(FileRequestDispatcher.java:283)
 at com.evermind.server.http.FileRequestDispatcher.access$100(FileRequestDispatcher.java:29)
 at com.evermind.server.http.FileRequestDispatcher$2.oc4jRun(FileRequestDispatcher.java:254)
 at oracle.oc4j.security.OC4JSecurity.doPrivileged(OC4JSecurity.java:284)
 at com.evermind.server.http.FileRequestDispatcher.forwardInternal(FileRequestDispatcher.java:259)
 at com.evermind.server.http.FileRequestDispatcher.forward(FileRequestDispatcher.java:346)
 at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:273)
 at com.evermind.server.http.FileRequestDispatcher.handleWithFilter(FileRequestDispatcher.java:135)
 at com.evermind.server.http.FileRequestDispatcher.unprivileged_forwardInternal(FileRequestDispatcher.java:283)
 at com.evermind.server.http.FileRequestDispatcher.access$100(FileRequestDispatcher.java:29)
 at com.evermind.server.http.FileRequestDispatcher$2.oc4jRun(FileRequestDispatcher.java:254)
 at oracle.oc4j.security.OC4JSecurity.doPrivileged(OC4JSecurity.java:284)
 at com.evermind.server.http.FileRequestDispatcher.forwardInternal(FileRequestDispatcher.java:259)
 at com.evermind.server.http.FileRequestDispatcher.forward(FileRequestDispatcher.java:346)
 at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:273)
 at com.evermind.server.http.FileRequestDispatcher.handleWithFilter(FileRequestDispatcher.java:135)
 at com.evermind.server.http.FileRequestDispatcher.unprivileged_forwardInternal(FileRequestDispatcher.java:283)
 at com.evermind.server.http.FileRequestDispatcher.access$100(FileRequestDispatcher.java:29)
 at com.evermind.server.http.FileRequestDispatcher$2.oc4jRun(FileRequestDispatcher.java:254)
 at oracle.oc4j.security.OC4JSecurity.doPrivileged(OC4JSecurity.java:284)
 at com.evermind.server.http.FileRequestDispatcher.forwardInternal(FileRequestDispatcher.java:259)
 at com.evermind.server.http.FileRequestDispatcher.forward(FileRequestDispatcher.java:346)
 at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:273)
 at com.evermind.server.http.FileRequestDispatcher.handleWithFilter(FileRequestDispatcher.java:135)
 at com.evermind.server.http.FileRequestDispatcher.unprivileged_forwardInternal(FileRequestDispatcher.java:283)
 at com.evermind.server.http.FileRequestDispatcher.access$100(FileRequestDispatcher.java:29)
 at com.evermind.server.http.FileRequestDispatcher$2.oc4jRun(FileRequestDispatcher.java:254)
 at oracle.oc4j.security.OC4JSecurity.doPrivileged(OC4JSecurity.java:284)
 at com.evermind.server.http.FileRequestDispatcher.forwardInternal(FileRequestDispatcher.java:259)
 at com.evermind.server.http.FileRequestDispatcher.forward(FileRequestDispatcher.java:346)
 at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:273)
 at com.evermind.server.http.FileRequestDispatcher.handleWithFilter(FileRequestDispatcher.java:135)
 at com.evermind.server.http.FileRequestDispatcher.unprivileged_forwardInternal(FileRequestDispatcher.java:283)
 at com.evermind.server.http.FileRequestDispatcher.access$100(FileRequestDispatcher.java:29)

--
Mark
http://www.lucidimagination.com
Re: US/UK/CA/AU English support
On Jul 22, 2009, at 5:09 AM, prerna07 wrote:
Hi, 1) Out of US/UK/CA/AU, which English does Solr support?

Please clarify what you mean by support. The only thing in Solr that is potentially language-dependent is the Tokenizers and TokenFilters, and those are completely pluggable. For tokenization, I'd say all are supported, since all of those languages are whitespace-delimited. For things like stemming and synonyms I'm not sure, but I suspect many of the existing capabilities will work in most cases, which is all one can ever expect no matter the language.

2) PhoneticFilterFactory performs search for similar-sounding words. For example: a search on carat will give results for carat, caret and carrat. I also observed that PhoneticFilterFactory supports linguistic variation for US/UK/CA/AU. For example: a search on Optimize gives results for optimise and optimize. Question: Does PhoneticFilterFactory support all characters/words of the linguistic variations for US/UK/CA/AU, or will linguistic search for US/UK/CA/AU be a subset of phonetic search?

I would think so, but I might suggest either using the Admin analysis capabilities and doing some tests with the various FieldTypes, or automating some more tests by using the AnalysisRequestHandler (or whatever it is called these days).
-Grant

--
Grant Ingersoll http://www.lucidimagination.com/
Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene: http://www.lucidimagination.com/search
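One easy way to run such tests is to define a throwaway field type wired to the phonetic filter and paste sample text into the admin analysis page. A sketch (the type name is arbitrary; DoubleMetaphone is one of the supported encoders):

  <fieldType name="text_phonetic" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <!-- inject=true keeps the original token alongside its phonetic code -->
      <filter class="solr.PhoneticFilterFactory" encoder="DoubleMetaphone" inject="true"/>
    </analyzer>
  </fieldType>

Whether optimise/optimize collide then depends entirely on the encoder producing the same code for both, so it is worth verifying each spelling pair you care about rather than assuming full US/UK/CA/AU coverage.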
Re: Best approach to multiple languages
How do you want to search those descriptions? Do you know the query language going in?

--
Grant Ingersoll http://www.lucidimagination.com/
Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene: http://www.lucidimagination.com/search
Re: Best approach to multiple languages
FWIW, this approach is essentially what we did at the Library of Congress to support multi-lingual fulltext search in the World Digital Library [1] webapp. It seems to have paid off pretty well, since we were able to configure analysis on a per-language basis. In case you are curious, I've attached a copy of our schema.xml to give you an idea of what we did. //Ed

[1] http://www.wdl.org/

<?xml version="1.0" encoding="ISO-8859-15"?>
<schema name="example" version="1.1">
  <!-- Note: there are lots more types available, see original schema.xml for the full picture. -->
  <types>
    <fieldType name="string" class="solr.StrField" omitNorms="true" sortMissingLast="true"/>
    <fieldType name="integer" class="solr.SortableIntField" omitNorms="true"/>
    <fieldType name="date" class="solr.DateField" sortMissingLast="true" omitNorms="true"/>

    <!-- default -->
    <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>

    <fieldType name="suggest_text_eng" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

    <fieldType name="suggest_text_por" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="brazilian-stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

    <fieldType name="suggest_text_fra" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="french-stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

    <fieldType name="suggest_text_spa" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="spanish-stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

    <fieldType name="suggest_text_rus" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="russian-stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

    <!-- Arabic (based on aramorph) -->
    <fieldType name="text_arabic" class="solr.TextField">
      <analyzer type="index">
        <tokenizer class="solr.ArabicTokenizerFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.ArabicTokenizerFactory"/>
      </analyzer>
    </fieldType>

    <!-- ArabicAnalyser => ArabicTokenizer => ArabicStemmer => ArabicGrammaticalFilter -->
    <fieldType name="text_arabic_analyzed" class="solr.TextField">
      <analyzer type="index" class="solr.ArabicAnalyzer"/>
      <analyzer type="query" class="solr.ArabicAnalyzer"/>
    </fieldType>

    <!-- Brazilian (Portuguese) -->
    <fieldType name="text_brazilian" class="solr.TextField">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StandardFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StopFilterFactory
DataImportHandler / Import from DB: one data set comes in multiple rows
Hi all, this is my first post, as I am new to Solr (some Lucene experience). I am trying to load data from an existing datamart into Solr using the DataImportHandler, but in my opinion it is too slow due to the special structure of the datamart I have to use.

Root cause: This datamart uses a row-based approach (pivot) to present its data. It was done this way to allow adding more attributes to a data set without having to change the table structure.

Impact: To use the DataImportHandler, I have to pivot the data to create one row per data set again. Unfortunately, this results in more, and less performant, queries. Moreover, there are sometimes multiple rows for a single attribute, which require separate queries - or more tricky subselects that probably don't speed things up. Here is an example of the relation between DB requests, row fetches and the actual number of documents created:

<lst name="statusMessages">
  <str name="Total Requests made to DataSource">3737</str>
  <str name="Total Rows Fetched">5380</str>
  <str name="Total Documents Skipped">0</str>
  <str name="Full Dump Started">2009-07-22 18:19:06</str>
  <str name="">Indexing completed. Added/Updated: 934 documents. Deleted 0 documents.</str>
  <str name="Committed">2009-07-22 18:22:29</str>
  <str name="Optimized">2009-07-22 18:22:29</str>
  <str name="Time taken">0:3:22.484</str>
</lst>

(Full index creation.) There are about half a million data sets in total. That would require about 30h for indexing? My feeling is that there are far too many row fetches per data set. I am testing on a smaller machine (2GB, Windows :-( ), Tomcat 6 using around 680MB RAM, Java 6. I haven't changed the Lucene configuration (merge factor 10, RAM buffer size 32). Possible solutions?
A) Write my own DataImportHandler?
B) Write my own MultiRowTransformer that accepts several rows as input (not sure this is a valid option)?
C) Approach the DB developers to add a flat table with one data set per row?
D) ...?
If someone would like to share their experiences, that would be great! Thanks a lot! Chantal
--
Chantal Ackermann
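Regarding option C: if the DB developers can't add a flat table, a view that pivots the attribute rows back into one row per data set is often enough, and DIH can then consume it with a single entity query. A sketch in plain SQL - the table and attribute names are invented for illustration:

  SELECT doc_id,
         MAX(CASE WHEN attr_name = 'title'  THEN attr_value END) AS title,
         MAX(CASE WHEN attr_name = 'author' THEN attr_value END) AS author,
         MAX(CASE WHEN attr_name = 'year'   THEN attr_value END) AS year
  FROM   datamart_attributes
  GROUP  BY doc_id;

One query then fetches one row per document instead of one row per attribute, which directly attacks the requests-per-document ratio visible in the status output above.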
Re: Best approach to multiple languages
Hi, We will know the user's language choice before searching. Regards, Andrew
Re: Best approach to multiple languages
On 22.07.2009 at 18:31, Ed Summers wrote:
In case you are curious, I've attached a copy of our schema.xml to give you an idea of what we did.

Thanks for sharing!
--
Olivier Dobberkau
LocalSolr - order of fields on xml response
Hi folks, When I do a query with LocalSolr to get the geo_distance, the order of the XML fields is different from a standard query. It's a simple query, like this: http://myhost.com:8088/solr/core/select?qt=geo&x=-46.01&y=-23.01&radius=15&sort=geo_distance+asc&q=*:* Is this the expected behavior of LocalSolr? Thanks!
--
Daniel Cassiano
http://www.apontador.com.br/
http://www.maplink.com.br/
Re: Best approach to multiple languages
Typically there are three options that people use:
1. Put 'em all in one big field
2. Split fields (as you and others have described) - not sure why no one ever splits on documents, which is viable too, but comes with repeated data
3. Split indexes
For your case, #1 isn't going to work since you want to search language-specifically. I'd likely go with #2, but #3 has its merits too. #3 allows for managing the languages separately (you can update the Spanish document without affecting the English version, and can also take a whole collection offline without affecting the other indexes), which can sometimes be helpful, but the cost is more operational complexity, etc.
-Grant
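A minimal sketch of option #2 in schema.xml, assuming per-language analyzer types (text_en, text_de, text_fr) are defined along the lines of the schema Ed posted earlier in this thread:

  <field name="description_en" type="text_en" indexed="true" stored="true"/>
  <field name="description_de" type="text_de" indexed="true" stored="true"/>
  <field name="description_fr" type="text_fr" indexed="true" stored="true"/>

Since the user's language is known before searching, the application can simply direct the query at the matching field, e.g. q=description_de:(sandalen).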
Re: Synonyms from index
Hi, There is nothing built-in. It might be possible to infer whether two words are synonyms, but that's really not strictly a search thing, so it's not likely to be added to Solr in the near future. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Pooja Verlani pooja.verl...@gmail.com To: solr-user@lucene.apache.org Sent: Wednesday, July 22, 2009 7:41:56 AM Subject: Synonyms from index Hi, Is there a possible way to generate synonyms from the index? I have an index where many searchable terms turn out to have synonyms, and users use different synonyms too. If not, the only way is to learn from the query logs and click logs, but in case there exists one, please share. regards, Pooja
Re: Random Slowness
Or simply attach to the JVM with JConsole and watch the GC from there. You'd have to watch things (logs and JConsole) closely though, and correlate the slow query periods with a GC spike. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Ed Summers e...@pobox.com To: solr-user@lucene.apache.org Sent: Wednesday, July 22, 2009 11:03:08 AM Subject: Re: Random Slowness On Wed, Jul 22, 2009 at 10:44 AM, Jeff Newburn wrote: How do I go about enabling the gc logging for solr? It depends how you are running solr. You basically want to make sure that when the JVM is started up with the java command, it gets some additional arguments [1]. So for example if you are running solr using jetty you would: java -verbose:gc -Xloggc:solr_gc.log -jar start.jar And then poke around in the log looking for garbage collection events that take as long as the pauses you are seeing in your app. I think there are tools that will help you analyze the log files if you need them. If there is a correlation you'll probably want to tune your solr memory usage with -Xmx and -Xms. Hope this helps. //Ed [1] http://java.sun.com/javase/7/docs/technotes/tools/windows/java.html
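If the GC log does show long pauses, the heap flags go on the same command line as the logging flags; the sizes below are placeholders to be tuned against what JConsole shows, not recommendations:

java -verbose:gc -Xloggc:solr_gc.log -Xms512m -Xmx1024m -jar start.jar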
Re: excluding certain terms from facet counts when faceting based on indexed terms of a field
: I am faceting based on the indexed terms of a field by using facet.field. : Is there any way to exclude certain terms from the facet counts? if you're talking about a lot of terms, and they're going to be the same for *all* queries, the best approach is to strip them out when indexing (StopFilterFactory is your friend) -Hoss
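A sketch of what that index-time approach could look like in schema.xml, assuming each facet value should stay a single token so the stopword list matches whole values; the type name and stopword file name are made up for illustration:

<fieldType name="facetString" class="solr.TextField" sortMissingLast="true">
  <analyzer>
    <!-- keep each field value as a single token -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- drop the unwanted values at index time, so they never appear in facet counts -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="facet-stopwords.txt"/>
  </analyzer>
</fieldType>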
RE: solr 1.3.0 and Oracle Fusion Middleware
Thanks for the feedback... I tried to add what is below directly to the web.xml file right after the web-app tag and bounced the OC4J - still the same issue.
<container-descriptor>
  <filter-dispatched-requests-enabled>false</filter-dispatched-requests-enabled>
</container-descriptor>
I checked other applications running on 10.1.3 OC4J and they are also using 2.3 web.xml. Regardless - I tried to change it to 2.4, and now the container starts up without errors (with the above filter statement still in the xml file), but I get a different error - here is the first subset:
SEVERE: java.lang.StackOverflowError
at sun.util.calendar.ZoneInfo.getTransitionIndex(ZoneInfo.java:288)
at sun.util.calendar.ZoneInfo.getOffsets(ZoneInfo.java:238)
at sun.util.calendar.ZoneInfo.getOffsets(ZoneInfo.java:215)
at java.util.GregorianCalendar.computeFields(GregorianCalendar.java:1998)
at java.util.GregorianCalendar.computeFields(GregorianCalendar.java:1970)
at java.util.Calendar.setTimeInMillis(Calendar.java:1066)
at java.util.Calendar.setTime(Calendar.java:1032)
at java.text.SimpleDateFormat.format(SimpleDateFormat.java:785)
at java.text.SimpleDateFormat.format(SimpleDateFormat.java:778)
at java.text.DateFormat.format(DateFormat.java:274)
at java.text.Format.format(Format.java:133)
at java.text.MessageFormat.subformat(MessageFormat.java:1279)
at java.text.MessageFormat.format(MessageFormat.java:787)
at java.util.logging.SimpleFormatter.format(SimpleFormatter.java:50)
at java.util.logging.StreamHandler.publish(StreamHandler.java:179)
at java.util.logging.ConsoleHandler.publish(ConsoleHandler.java:88)
at java.util.logging.Logger.log(Logger.java:428)
at java.util.logging.Logger.doLog(Logger.java:450)
at java.util.logging.Logger.log(Logger.java:473)
at java.util.logging.Logger.severe(Logger.java:960)
at org.apache.solr.common.SolrException.log(SolrException.java:132)
at org.apache.solr.common.SolrException.logOnce(SolrException.java:150)
at org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:319)
From: Mark Miller [mailto:markrmil...@gmail.com] Sent: Wednesday, July 22, 2009 10:04 AM To: Hall, David; solr-user@lucene.apache.org Subject: Re: solr 1.3.0 and Oracle Fusion Middleware Let's keep this communication on the list so others can benefit and chime in. What about the filter-dispatched-requests-enabled setting? Perhaps it doesn't use the weblogic.xml file anymore and you'll need to find the new way to configure that setting? From what I can see, that setting will default to true now if you are using a web.xml defined as 2.4 (according to WebLogic 9 docs). Solr is using 2.3 at the moment - you might try changing the web.xml to 2.4 from 2.3, or figure out how to adjust that setting (filter-dispatched-requests-enabled) with your current container. It will default to true with a web.xml 2.4 for back compat. - Mark On Wed, Jul 22, 2009 at 10:35 AM, Hall, David dh...@vermeer.com wrote: Mark --- Thanks for the info - I took a look at the two URLs, and even though it is not true WebLogic - i.e. this is the Oracle OC4J, not the WebLogic Java containers from the pre-Oracle acquisition - I have tried to remove the encoding from the header and created the weblogic.xml, bounced the container and re-tried. However this did not fix the issue. I think this is the correct direction... maybe just a little different. Maybe it needs to be put in the web.xml. (I am not using WebLogic (Oracle Portal replacement) directly - just the Oracle Java container. I don't know if that makes any difference.)
Here are my observations on this issue: When I hit solr/admin I do get a page - it is just missing the pretty stuff. Statistics totally do not work. I get StackOverflowErrors in the opmn log for this container. Below is what I can see from Paros...
solr-admin.css HTTP/1.1 500 Internal Server Error
Date: Wed, 22 Jul 2009 14:21:22 GMT
Server: Oracle-Application-Server-10g/10.1.3.4.0 Oracle-HTTP-Server
Content-Location: https://testportalapp.vermeer.com/solr/admin/solr-admin.css
Content-Type: text/html
Connection: close
500 Internal Server Error null java.lang.StackOverflowError
at java.security.AccessController.doPrivileged(Native Method)
at java.io.PrintWriter.<init>(PrintWriter.java:77)
at java.io.PrintWriter.<init>(PrintWriter.java:61)
at org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:316)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:281)
at com.evermind.server.http.FileRequestDispatcher.handleWithFilter(FileRequestDispatcher.java:135) at
Re: Storing string field in solr.ExternalFileField type
Hoping the experts chime in if I'm wrong, but as far as I know, while storing a field increases the size of an index, it doesn't have much impact on search speed. You could pretty easily test this by creating the index both ways and firing off some timing queries and comparing, although it would be time consuming... I believe there's some info on the Lucene wiki about this, but my memory isn't what it used to be. Erick On Tue, Jul 21, 2009 at 2:42 PM, Jibo John jiboj...@mac.com wrote: We're in the process of building a log searcher application. In order to reduce the index size to improve the query performance, we're exploring the possibility of having: 1. One field for each log line with 'indexed=true stored=false' that will be used for searching 2. Another field for each log line of type solr.ExternalFileField that will be used only for display purposes. We realized that currently solr.ExternalFileField supports only float type. Is there a way we can override this to support string type? Any issues with this approach? Any ideas are welcome. Thanks, -Jibo
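One rough way to run that comparison with SolrJ: build the index each way, replay the same queries against both, and compare Solr's reported QTime. The query string is a placeholder and 'server' stands for an existing SolrServer; a sketch:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.response.QueryResponse;

// Fire a representative query and read back Solr's own timing in ms;
// repeat over many queries and compare averages between the two index layouts.
SolrQuery q = new SolrQuery("loglines:exception");
QueryResponse rsp = server.query(q);   // 'server' is an existing SolrServer
System.out.println("QTime (ms): " + rsp.getQTime());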
Re: Behaviour when we get more than 1 million hits
That's still not very useful. Additional processing? Where - some client that you return all the data to? In which case SOLR is the least of your concerns; your network speed counts more. At a blind guess I'd worry more about how you're doing your additional processing than solr. Erick On Wed, Jul 22, 2009 at 6:38 AM, Rakhi Khatwani rkhatw...@gmail.com wrote: Hi, There is this particular scenario where I want to search for a product and I get a million records which will be given for further processing. Regards, Raakhi On Mon, Jul 13, 2009 at 7:33 PM, Erick Erickson erickerick...@gmail.com wrote: It depends (tm) on what you try to do with the results. You really need to give us some more details on what you want to *do* with 1,000,000 hits before any meaningful response is possible. Best Erick On Mon, Jul 13, 2009 at 8:47 AM, Rakhi Khatwani rkhatw...@gmail.com wrote: Hi, While using Solr, what would the behaviour be if we perform a search and get more than one million hits? Regards, Raakhi
Re: SolrException - Lock obtain timed out, no leftover locks
My only guess here is that you are using SolrJ in an embedded sense, not via HTTP, and something about the code you have in your MyIndexers class causes two different threads to attempt to create two different cores (or perhaps the same core) using identical data directories at the same time. either that: or maybe there is a bug in the CoreAdmin functionality for creating/opening a new core resulting from improper synchronization. it would help to have the full stack trace of the Lock timed out exception, and to know more details about how exactly your code goes about creating new cores on the fly. : I'm running Solr 1.3.0 in multicore mode and feeding it data from which the : core name is inferred from a specific field. My service extracts the core : name and, if it has not seen it before, issues a create request for that : core before attempting to add the document (via SolrJ). I have a pool of : MyIndexers that run in parallel, taking documents from a queue and adding : them via the add method on the SolrServer instance corresponding to that : core (exactly one per core exists). Each core is in a separate data : directory. My timeouts are set as such: : : <writeLockTimeout>15000</writeLockTimeout> : <commitLockTimeout>25000</commitLockTimeout> : : I remove the index directories, start the server, check that no locks exist, : and generate ~500 documents spread across 5 cores for the MyIndexers to : handle. Each time, I see one or more exceptions with a message like : : Lock_obtain_timed_out_SimpleFSLockmulticoreNewUser3dataindexlucenebd4994617386d14e2c8c29e23bcca719writelock__orgapachelucenestoreLockObtainFailedException_Lock_obtain_timed_out_... : : When the indexers have completed, no lock is left over. There is no : discernible pattern as far as when the exception occurs (ie, it does not : tend to happen on the first or last or any particular document). : : Interestingly, this problem does not happen when I have only a single : MyIndexer, or if I have a pool of MyIndexers and am running in single core : mode. : : I've looked at the other posts from users getting this exception but it : always seemed to be a different case, such as the server having crashed : previously and a lock file being left over. : : -- : View this message in context: http://www.nabble.com/SolrException---Lock-obtain-timed-out%2C-no-leftover-locks-tp24393255p24393255.html : Sent from the Solr - User mailing list archive at Nabble.com. : -Hoss
Re: SolrException - Lock obtain timed out, no leftover locks
Sorry, I thought I had removed this posting. I am running Solr over HTTP, but (as you surmised) I had a concurrency bug. Thanks for the response. Dan hossman wrote: My only guess here is that you are using SolrJ in an embedded sense, not via HTTP, and something about the code you have in your MyIndexers class causes two different threads to attempt to create two different cores (or perhaps the same core) using identical data directories at the same time. either that: or maybe there is a bug in the CoreAdmin functionality for creating/opening a new core resulting from improper synchronization. it would help to have the full stack trace of the Lock timed out exception, and to know more details about how exactly your code goes about creating new cores on the fly. : I'm running Solr 1.3.0 in multicore mode and feeding it data from which the : core name is inferred from a specific field. My service extracts the core : name and, if it has not seen it before, issues a create request for that : core before attempting to add the document (via SolrJ). I have a pool of : MyIndexers that run in parallel, taking documents from a queue and adding : them via the add method on the SolrServer instance corresponding to that : core (exactly one per core exists). Each core is in a separate data : directory. My timeouts are set as such: : : <writeLockTimeout>15000</writeLockTimeout> : <commitLockTimeout>25000</commitLockTimeout> : : I remove the index directories, start the server, check that no locks exist, : and generate ~500 documents spread across 5 cores for the MyIndexers to : handle. Each time, I see one or more exceptions with a message like : : Lock_obtain_timed_out_SimpleFSLockmulticoreNewUser3dataindexlucenebd4994617386d14e2c8c29e23bcca719writelock__orgapachelucenestoreLockObtainFailedException_Lock_obtain_timed_out_... : : When the indexers have completed, no lock is left over. There is no : discernible pattern as far as when the exception occurs (ie, it does not : tend to happen on the first or last or any particular document). : : Interestingly, this problem does not happen when I have only a single : MyIndexer, or if I have a pool of MyIndexers and am running in single core : mode. : : I've looked at the other posts from users getting this exception but it : always seemed to be a different case, such as the server having crashed : previously and a lock file being left over. : : -- : View this message in context: http://www.nabble.com/SolrException---Lock-obtain-timed-out%2C-no-leftover-locks-tp24393255p24393255.html : Sent from the Solr - User mailing list archive at Nabble.com. : -Hoss -- View this message in context: http://www.nabble.com/SolrException---Lock-obtain-timed-out%2C-no-leftover-locks-tp24393255p24616034.html Sent from the Solr - User mailing list archive at Nabble.com.
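For the archives, since others hit this too: a minimal sketch of one way to avoid that kind of bug - exactly one SolrServer per core, created at most once even when indexer threads race on a new core name. The class name, URL and CREATE handling below are illustrative, not Dan's actual code:

import java.util.concurrent.ConcurrentHashMap;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class CoreRegistry {
    private final ConcurrentHashMap<String, SolrServer> servers =
        new ConcurrentHashMap<String, SolrServer>();

    // Returns the single server instance for a core; putIfAbsent ensures
    // threads racing on a new core name agree on exactly one winner.
    public SolrServer serverFor(String core) throws Exception {
        SolrServer s = servers.get(core);
        if (s == null) {
            SolrServer fresh =
                new CommonsHttpSolrServer("http://localhost:8983/solr/" + core);
            SolrServer prev = servers.putIfAbsent(core, fresh);
            if (prev == null) {
                // this thread won the race: issue the CoreAdmin CREATE request
                // here, before any documents are added; a real version would
                // also make losing threads wait until CREATE has completed
                s = fresh;
            } else {
                s = prev;
            }
        }
        return s;
    }
}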
Re: LocalSolr - order of fields on xml response
ya... 'expected', but perhaps not ideal. As is, LocalSolr munges the document on its way out the door to add the distance. When LocalSolr makes it into the source, it will likely use a method like: https://issues.apache.org/jira/browse/SOLR-705 to augment each document with the calculated distance. This will at least have consistent behavior. On Jul 22, 2009, at 10:47 AM, Daniel Cassiano wrote: Hi folks, When I do some query with LocalSolr to get the geo_distance, the order of xml fields is different from a standard query. It's a simple query, like this: http://myhost.com:8088/solr/core/select?qt=geo&x=-46.01&y=-23.01&radius=15&sort=geo_distance%20asc&q=*:* Is this an expected behavior of LocalSolr? Thanks! -- Daniel Cassiano _ http://www.apontador.com.br/ http://www.maplink.com.br/
how to get all the docIds in the search result?
When I use: SolrQuery query = new SolrQuery(); query.set("q", "issn:0002-9505"); query.setRows(10); QueryResponse response = server.query(query); I can only get the 10 ids in the response. How can I get all the docIds in the search result? Thanks.
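Solr only returns as many ids as rows asks for; there is no single call that hands back every id. The usual pattern is to read numFound from the first response and page through with start/rows. A sketch, assuming the unique key field is named 'id' (adjust to your schema), an arbitrary page size, and an existing 'server':

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrDocumentList;

SolrQuery query = new SolrQuery();
query.set("q", "issn:0002-9505");
int pageSize = 100;
query.setRows(pageSize);

long fetched = 0;
long total = Long.MAX_VALUE;             // corrected after the first response
while (fetched < total) {
    query.setStart((int) fetched);
    QueryResponse response = server.query(query);
    SolrDocumentList page = response.getResults();
    total = page.getNumFound();          // total matches, not just this page
    if (page.isEmpty()) break;           // defensive: stop if the index shrank mid-paging
    for (SolrDocument doc : page) {
        System.out.println(doc.getFieldValue("id"));  // assumes unique key field 'id'
    }
    fetched += page.size();
}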
Re: importing lots of db data. specially formated. what is fasted approach?
look at this http://wiki.apache.org/solr/DIHQuickStart#head-532678fa5d0d9b33880abeb4d4995562014f8ef9 to know how to fetch data from multiple tables On Wed, Jul 22, 2009 at 4:57 PM, Julian Davchevj...@drun.net wrote: Well yes, transformation is required. But it's like data coming from multiple tables.. etc. It's not like getting one row from table and possibly transforming it and using it. I am thinking perhaps to create some tables (views) that will have the data ready/flattened and then simply feeding it. Cause not sure how much flexibility transformer will give me. Java not number 1 language either :) Thanks for suggestions. Will get a look there. Avlesh Singh wrote: As Noble has already said, transforming content before indexing a very common requirement. DataImportHandler's Transformer lets you achieve this. Read up on the same here - http://wiki.apache.org/solr/DataImportHandler#head-a6916b30b5d7605a990fb03c4ff461b3736496a9 Cheers Avlesh 2009/7/22 Noble Paul നോബിള് नोब्ळ् noble.p...@corp.aol.com if each field from the db goes to a separate field in solr as-is . Then it is very simple. if you need to split/join fields before feeding it into solr fields you may need to apply transformers an example on how your db field looks like and how you wish it to look like in solr would be helpful On Wed, Jul 22, 2009 at 11:57 AM, Julian Davchevj...@drun.net wrote: Hi folks, I have around 50k documents that are reindexed now and then. Question is what would be the fastest approach to all this. Data is just text ~20fields or so. It comes from database but is first specially formated to get to format suitable for passing in solr. Currently xml post is used but have the feeling this is not optimal for speed wise when it is up to bulk import/reindex. I see http://wiki.apache.org/solr/DataImportHandler but kinda fail to see howto do this specially formated data so solr makes use of it. Are there some real examples,articles on howto use this? -- - Noble Paul | Principal Engineer| AOL | http://aol.com -- - Noble Paul | Principal Engineer| AOL | http://aol.com
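For the multiple-tables part, the shape in data-config.xml is one nested entity per child table - the child query runs once per parent row, and multiple child rows land in a multiValued field. A sketch in the style of the quick-start page; table and column names are placeholders:

<document>
  <!-- one Solr document per row of the parent query -->
  <entity name="item" query="select id, name from item">
    <!-- executed once per parent row; ${item.id} refers to the outer entity -->
    <entity name="feature"
            query="select description from feature where item_id='${item.id}'"/>
  </entity>
</document>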
Re: DataImportHandler / Import from DB : one data set comes in multiple rows
alternately, you can write your own EntityProcessor and just override nextRow(). I guess you can still use the JdbcDataSource On Wed, Jul 22, 2009 at 10:05 PM, Chantal Ackermann chantal.ackerm...@btelligent.de wrote: Hi all, this is my first post, as I am new to SOLR (some Lucene exp). I am trying to load data from an existing datamart into SOLR using the DataImportHandler, but in my opinion it is too slow due to the special structure of the datamart I have to use. Root Cause: This datamart uses a row-based approach (pivot) to present its data. It was done this way to allow adding more attributes to the data set without having to change the table structure. Impact: To use the DataImportHandler, I have to pivot the data to create again one row per data set. Unfortunately, this results in more queries that are also less performant. Moreover, there are sometimes multiple rows for a single attribute, which require separate queries - or trickier subselects that probably don't speed things up. Here is an example of the relation between DB requests, row fetches and the actual number of documents created:
<lst name="statusMessages">
  <str name="Total Requests made to DataSource">3737</str>
  <str name="Total Rows Fetched">5380</str>
  <str name="Total Documents Skipped">0</str>
  <str name="Full Dump Started">2009-07-22 18:19:06</str>
  <str name="">Indexing completed. Added/Updated: 934 documents. Deleted 0 documents.</str>
  <str name="Committed">2009-07-22 18:22:29</str>
  <str name="Optimized">2009-07-22 18:22:29</str>
  <str name="Time taken ">0:3:22.484</str>
</lst>
(Full index creation.) There are about half a million data sets in total. That would require about 30h for indexing? My feeling is that there are far too many row fetches per data set. I am testing it on a smaller machine (2GB, Windows :-( ), Tomcat6 using around 680MB RAM, Java6. I haven't changed the Lucene configuration (merge factor 10, ram buffer size 32). Possible solutions? A) Write my own DataImportHandler? B) Write my own MultiRowTransformer that accepts several rows as input argument (not sure this is a valid option)? C) Approach the DB developers to add a flat table with one data set per row? D) ...? If someone would like to share their experiences, that would be great! Thanks a lot! Chantal -- Chantal Ackermann -- - Noble Paul | Principal Engineer| AOL | http://aol.com
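A rough sketch of that EntityProcessor idea against Chantal's pivoted datamart: keep pulling attribute rows until the data-set key changes, folding them into one returned row. The class name and the SET_ID/ATTR/VAL column names are invented, it assumes the entity query orders rows by SET_ID, and it assumes EntityProcessorBase's getNext() helper (the one SqlEntityProcessor reads the row iterator with) - internals vary by version, so treat this as an outline only:

import java.util.HashMap;
import java.util.Map;
import org.apache.solr.handler.dataimport.EntityProcessorBase;

public class PivotEntityProcessor extends EntityProcessorBase {
    private Map<String, Object> pending;  // first row of the *next* data set

    @Override
    public Map<String, Object> nextRow() {
        // assumption: getNext() yields one raw attribute row per call
        Map<String, Object> row = (pending != null) ? pending : getNext();
        pending = null;
        if (row == null) return null;     // source exhausted: end of entity

        Map<String, Object> doc = new HashMap<String, Object>();
        Object key = row.get("SET_ID");   // rows must arrive sorted by SET_ID
        doc.put("SET_ID", key);
        doc.put((String) row.get("ATTR"), row.get("VAL"));

        // fold attribute rows into this document until the key changes
        Map<String, Object> next;
        while ((next = getNext()) != null) {
            if (!key.equals(next.get("SET_ID"))) {
                pending = next;           // belongs to the following document
                break;
            }
            doc.put((String) next.get("ATTR"), next.get("VAL"));
        }
        return doc;
    }
}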