Re: Hotel Searches
It seems to me like you want to use result grouping by hotel. You'll have to add up the tariffs for each hotel, but that isn't hard.

Upayavira

On Wed, Jan 9, 2013, at 06:08 AM, Harshvardhan Ojha wrote:
Hi Alex,
Thanks for your reply. I saw prices based on date range using multipoints, but that is not my problem. The problem statement for me is pretty simple. Say I have 100 documents, each having tariff as a field.

Doc1
<doc><double name="tariff">2400.0</double></doc>
Doc2
<doc><double name="tariff">2500.0</double></doc>

Now a user's search should give me a total tariff. Desired result:
<doc><double name="tariff">4900.0</double></doc>

And this could be any combination: for 100 docs it is (100*101)/2, i.e. N*(N+1)/2, possible ranges. How can I get these combinations of documents already indexed? Or is there any way to do the calculation at runtime? And how can I enforce the constraint that if any one doc in a range is missing, no result is returned? (If a user asked for the hotel tariff from the 11th to the 13th and I don't have a tariff for the 12th, I shouldn't add up the 11th and 13th only.)
Hope I made my problem very simple.
Regards
Harshvardhan Ojha

-----Original Message-----
From: Alexandre Rafalovitch [mailto:arafa...@gmail.com]
Sent: Tuesday, January 08, 2013 6:12 PM
To: solr-user@lucene.apache.org
Subject: Re: Hotel Searches

Did you look at the conversation thread from 12 Dec 2012 on this list? Just go to the archives and search for 'hotel'. Hopefully that will give you something to work with. If you have any questions after that, come back with more specifics.
Regards,
Alex.
Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)

On Tue, Jan 8, 2013 at 7:18 AM, Harshvardhan Ojha harshvardhan.o...@makemytrip.com wrote:
Sorry for that, we just spoiled that thread, so I posted my question in a fresh thread. The problem is indeed very simple. I have Solr documents which have all the required fields (from the DB), say DOC1, DOC2, DOC3 ... DOCn. Every document holds one night's tariff, and I have 180 nights of tariffs, so a person can search for any combination of these 180 nights. Say a request comes in for the total tariff from the 10th to the 15th of Jan 2013. Now I need the sum of the tariff field of 6 docs. How can I keep this data indexed so as to avoid search-time calculation? There are also other dimensions to this data besides tariff. Hope this makes sense.
Regards
Harshvardhan Ojha

-----Original Message-----
From: Gora Mohanty [mailto:g...@mimirtech.com]
Sent: Tuesday, January 08, 2013 5:37 PM
To: solr-user@lucene.apache.org
Subject: Re: Hotel Searches

On 8 January 2013 17:10, Harshvardhan Ojha harshvardhan.o...@makemytrip.com wrote:
Hi All, Looking into finding a solution for hotel searches based on the below criteria [...]

Didn't you just post this on a separate thread, complete with some nonsensical follow-up from a colleague of yours? Please do not repost the same message over and over again. It is not clear what you are trying to achieve. What is the difference between a city and a hotel in your data? How is a person represented in your documents? Is it by the ID field? Are you looking to cache all possible combinations of ID, city, and start date? If so, to what end? This smells like an XY problem: http://people.apache.org/~hossman/#xyproblem
Regards,
Gora
Re: Hotel Searches
Hi,

maybe I'm thinking too simply again. Nevertheless, here is an idea to solve the question. The basic thought is to get rid of the range query. Have:
- a text field 'vacant_days'. Instead of ISO dates, just simple dates in the form MMdd.
- a dynamic field 'price_*'. You can add the tariff for Jan. 31st into 'price_0131'.

To get the total for, e.g., Feb. 1st to Feb. 3rd, you could query for the days 0201, 0202 and 0203, and calculate the sum of the corresponding price fields:

q=vacant_days:0201 AND vacant_days:0202 AND vacant_days:0203&fl=_val_:sum(price_0201,price_0202,price_0203)

(not tested)

Uwe

On 09.01.2013 07:08, Harshvardhan Ojha wrote:
Hi Alex,
Thanks for your reply. I saw prices based on date range using multipoints, but that is not my problem. The problem statement for me is pretty simple. Say I have 100 documents, each having tariff as a field, and a user's search should give me the total tariff over any combination of them. How can I get these combinations of documents already indexed? Or is there any way to do the calculation at runtime? And how can I enforce the constraint that if any one doc in a range is missing, no result is returned?
[...]
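To make Uwe's sketch concrete, here is roughly what a document and the summing query could look like under that scheme - untested, with hotel_id and the prices invented for illustration, and assuming Solr 4.0+ for the function query in fl:

  <doc>
    <field name="hotel_id">H42</field>
    <field name="vacant_days">0201 0202 0203</field>
    <field name="price_0201">2400.0</field>
    <field name="price_0202">2500.0</field>
    <field name="price_0203">2600.0</field>
  </doc>

  http://localhost:8983/solr/select?q=vacant_days:0201 AND vacant_days:0202 AND vacant_days:0203&fl=hotel_id,total:sum(price_0201,price_0202,price_0203)

A side effect that answers the missing-day constraint from the question: because every requested day must match in vacant_days, a hotel with no tariff for the 12th simply doesn't match a query for the 11th-13th, so no partial sums are returned.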
Solr + Munin, a good plugin?
Dear Solr Users,

Does anyone have a plugin to track the number of requests (/select) by hour/day/week/month/year? I tried the solr_qps plugin, but it's not really good.

Thanks a lot,
Bruno
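Not mentioned in the thread, but worth noting: a Munin plugin mostly just needs a counter to scrape, and Solr exposes a cumulative per-handler request count through its SolrInfoMBean stats. A sketch of the kind of URL such a plugin could poll (host, port and handler name are assumptions):

  http://localhost:8983/solr/admin/mbeans?cat=QUERYHANDLER&key=/select&stats=true&wt=json

The 'requests' stat in the response is cumulative since startup, so a plugin would diff successive samples to derive requests per hour/day/etc.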
[OFFER] Consulting job with search specialists based in Cambridge UK
Hi all,

Hope you don't mind me cluttering up the list with a job offer. We're a team of search specialists based in the UK and we're hiring: http://www.flax.co.uk/hiring/

We're ideally looking for someone with experience of Apache Lucene/Solr development, able to work on a flexible contract basis, probably mainly remotely. We work on search and related applications for a wide variety of clients in the UK and abroad, including major newspapers, recruitment firms, governments and startups. If you're in the UK that's great, but if not it's still worth contacting us. Examples of past work on Lucene/Solr projects would be useful.

Cheers

Charlie

--
Charlie Hull
Flax - Open Source Enterprise Search
tel/fax: +44 (0)8700 118334
mobile: +44 (0)7767 825828
web: www.flax.co.uk
performance improvements on ip look up query
Hi

We are doing a lat/lon lookup query using an IP address. We have a 6.5 million document core with the following structure: start IP block, end IP block, location id, location lat/lon. The field defs are:

<types>
  <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
  <fieldType name="tlong" class="solr.TrieLongField" precisionStep="8" omitNorms="true" positionIncrementGap="0"/>
  <fieldType name="tfloat" class="solr.TrieFloatField" precisionStep="8" omitNorms="true" positionIncrementGap="0"/>
  <fieldType name="tdouble" class="solr.TrieDoubleField" precisionStep="8" omitNorms="true" positionIncrementGap="0"/>
  <fieldType name="location" class="solr.LatLonType" subFieldSuffix="_coordinate"/>
</types>
<fields>
  <field name="startIp" type="string" indexed="true" stored="false" required="true"/>
  <field name="startIpNum" type="tlong" indexed="true" stored="false" required="true"/>
  <field name="endIpNum" type="tlong" indexed="true" stored="false" required="true"/>
  <field name="locId" type="string" indexed="true" stored="true" required="true"/>
  <field name="countryCode" type="string" indexed="true" stored="true" required="false"/>
  <field name="cityName" type="string" indexed="false" stored="true" required="false"/>
  <field name="latLon" type="location" indexed="true" stored="true" required="true"/>
  <field name="latitude" type="string" indexed="false" stored="true" required="true"/>
  <field name="longitude" type="string" indexed="false" stored="true" required="true"/>
  <dynamicField name="*_coordinate" type="tdouble" indexed="true" stored="false"/>
</fields>

The query at the moment is simply a range query:

q=startIpNum:[* TO 180891652] AND endIpNum:[180891652 TO *]

We are seeing a full query cache with a low hit rate (0.2) and a high eviction rate, which makes sense given the use of the IP address in the query. The query time mean is 120. Is there a better way of structuring the core for this use case? I suspect our heap memory settings (1g) are conservative, but I will need to convince our sys admins to change this (they are not ringing any resource alarm bells); it's just that the query is a little slow.
Re: fieldtype for name
Thanks. It isn't necessarily the need to match 'dick' to 'robert', but to search for:
'name surname'
'name, surname'
'surname name'
'surname, name'
and nothing else. I don't need to worry about nicknames or abbreviations of a name, just the above variations. I think I might use text_ws.

On Tue, Jan 8, 2013 at 9:39 PM, Uwe Reh r...@hebis.uni-frankfurt.de wrote:
Hi Michael,
in our index of bibliographic metadata, we see the need for at least three fields:
- name_facet: String as type, because the facet should represent the original inverted format from our data.
- name: TextField for searching. This field is heavily analyzed to match different orders, synonyms, phonetic similarity, German umlauts and other European stuff.
- name_lc: TextField. This field is just mapped to lower case. It's used to boost docs written in the same style as the user's input.
Uwe

On 08.01.2013 15:30, Michael Jones wrote:
Hi, What would be the best fieldtype for a person's name? At the moment I'm using text_general, but if I search for bob smith, some results I get back might be rob thomas, in that it's matched 'ob'. But I only really want results that are 'bob smith', 'bob, smith', 'smith, bob' or 'smith bob'. Thanks
Re: fieldtype for name
Also, I'm allowing users to enter a name with quotes to search for an exact name. So at the moment only "smith, robert" will return any results, whereas *robert smith* will return all variations, including 'smith, herbert'.

On Wed, Jan 9, 2013 at 11:09 AM, Michael Jones michaelj...@gmail.com wrote:
Thanks. It isn't necessarily the need to match 'dick' to 'robert', but to search for: 'name surname', 'name, surname', 'surname name', 'surname, name' and nothing else. I don't need to worry about nicknames or abbreviations of a name, just the above variations. I think I might use text_ws.
[...]
Highlighting: When alternateField does not exist
Hi,

The alternateField and maxAlternateFieldLength params work well, but only as long as the alternate field actually exists for the document. If it does not, highlighting returns nothing. We would like this behavior:
1. Highlight in body if it matches
2. Fall back to a verbatim teaser if it exists
3. If the fallback field does not exist, look for a secondary fallback field

To support this behaviour in a back-compat way, how about allowing a comma-separated list of alternate fields to consider:
hl.alternateField=field1,field2,field3...
where the first existing one is selected.

Or do you have other workarounds for this problem on the Solr side? In this case we cannot control the source DB to make sure the teaser exists.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com
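For context, the existing single-field fallback looks like this (hl.alternateField and hl.maxAlternateFieldLength are standard highlighting parameters; the field names are placeholders):

  hl=true&hl.fl=body&hl.alternateField=teaser&hl.maxAlternateFieldLength=200

With the proposed change, hl.alternateField=teaser,title would fall through to title for documents that have no teaser field.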
RE: Hotel Searches
Hi Uwe,

Thanks for your reply. I think this will solve my problem.

Regards
Harshvardhan Ojha

-----Original Message-----
From: Uwe Reh [mailto:r...@hebis.uni-frankfurt.de]
Sent: Wednesday, January 09, 2013 2:52 PM
To: solr-user@lucene.apache.org
Subject: Re: Hotel Searches

Hi, maybe I'm thinking too simply again. Nevertheless, here is an idea to solve the question. The basic thought is to get rid of the range query. Have:
- a text field 'vacant_days'. Instead of ISO dates, just simple dates in the form MMdd.
- a dynamic field 'price_*'. You can add the tariff for Jan. 31st into 'price_0131'.
To get the total for, e.g., Feb. 1st to Feb. 3rd, you could query for the days 0201, 0202 and 0203 and calculate the sum of the corresponding price fields:
q=vacant_days:0201 AND vacant_days:0202 AND vacant_days:0203&fl=_val_:sum(price_0201,price_0202,price_0203)
(not tested)
Uwe
[...]
Re: fieldtype for name
Hi,

Without seeing the configs I would guess the default query operator might be OR (also check the docs for the mm parameter on the wiki), or there are ngrams involved. The former is more likely.

Otis
Solr & ElasticSearch Support
http://sematext.com/

On Jan 9, 2013 6:16 AM, Michael Jones michaelj...@gmail.com wrote:
Also, I'm allowing users to enter a name with quotes to search for an exact name. So at the moment only "smith, robert" will return any results, whereas *robert smith* will return all variations, including 'smith, herbert'.
[...]
Re: wildcard faceting in solr cloud
I am testing it, and I will upload it after that.

./Zahoor
HBase Musings

On 09-Jan-2013, at 2:55 AM, Upayavira u...@odoko.co.uk wrote:
Have you uploaded a patch to JIRA???
Upayavira

On Tue, Jan 8, 2013, at 07:57 PM, jmozah wrote:
Hmm. Fixed it. Did a similar thing as SOLR-247 for distributed search. Basically modified the FacetInfo method of FacetComponent.java to make it work. :-)
./zahoor

On 08-Jan-2013, at 9:35 PM, jmozah jmo...@gmail.com wrote:
I can try to bump it for distributed search... Some pointer on where to start would be helpful... Could SOLR-2894 be a good place to start looking at this?
./Zahoor

On 08-Jan-2013, at 9:27 PM, Michael Ryan mr...@moreover.com wrote:
I'd guess that the patch simply doesn't implement it for distributed searches. The code for distributed facets is quite a bit more complicated, and I don't see it touched in this patch.
-Michael

-----Original Message-----
From: jmozah [mailto:jmo...@gmail.com]
Sent: Tuesday, January 08, 2013 10:51 AM
To: solr-user@lucene.apache.org
Subject: wildcard faceting in solr cloud

Hi
I am performing wildcard faceting using the patch in SOLR-247 on Solr 4.0. It works like a charm in a single instance, but it does not work in distributed mode. Am I missing something?
./zahoor
RE: Highlighting: When alternateField does not exist
Hi,

That should be fairly easy to do in alternateField() in DefaultSolrHighlighter. We made a small change there to support globs in alternateField.

Cheers,

-----Original message-----
From: Jan Høydahl jan@cominvent.com
Sent: Wed 09-Jan-2013 12:44
To: solr-user@lucene.apache.org
Subject: Highlighting: When alternateField does not exist

Hi, The alternateField and maxAlternateFieldLength params work well, but only as long as the alternate field actually exists for the document. If it does not, highlighting returns nothing.
[...]
Re: fieldtype for name
Hi,

My schema file is here: http://pastebin.com/ArY7xVUJ

Query (name:'ian paisley') returns ~3000 results.
Query (name:'paisley, ian') returns ~250 results - that is how the name is stored, so it is returning just the results for that person. I need all variations to return 250 results.
Query (name:*ian paisley*) returns ~8000 results - but that's acceptable, as I know it has a wildcard.

Thanks

On Wed, Jan 9, 2013 at 12:56 PM, Otis Gospodnetic otis.gospodne...@gmail.com wrote:
Hi, Without seeing the configs I would guess the default query operator might be OR (also check the docs for the mm parameter on the wiki), or there are ngrams involved. The former is more likely.
Otis
Solr & ElasticSearch Support
http://sematext.com/
[...]
Re: fieldtype for name
Try q=name:(ian paisley)&q.op=AND

Does that work better for you? It would also match Ian James Paisley, but not Ian Jackson.

Upayavira

On Wed, Jan 9, 2013, at 01:30 PM, Michael Jones wrote:
Hi, My schema file is here: http://pastebin.com/ArY7xVUJ
Query (name:'ian paisley') returns ~3000 results.
Query (name:'paisley, ian') returns ~250 results - that is how the name is stored, so it is returning just the results for that person. I need all variations to return 250 results.
[...]
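For the variations Michael listed ('name surname', 'name, surname', etc.), one way to make them all produce the same terms is to strip commas before whitespace tokenization. A sketch of such a field type - untested, with the type name made up for illustration:

  <fieldType name="text_name" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <!-- turn "paisley, ian" into "paisley ian" before tokenizing -->
      <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="," replacement=" "/>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

Combined with q.op=AND as above, 'ian paisley' and 'paisley, ian' then match the same documents.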
Restore hot backup
Hi,

Is it possible to restore an old backup without shutting down Solr?

Regards,
Sergio
Re: fieldtype for name
Brilliant! Thank you!

On Wed, Jan 9, 2013 at 1:37 PM, Upayavira u...@odoko.co.uk wrote:
Try q=name:(ian paisley)&q.op=AND
Performance issue with group.ngroups=true
Hi,

I have a performance issue with the group.ngroups=true parameter. I have an index with 100k documents (small documents, 1-10 documents per group, grouped on a string field). If I run a query like q=*:*...&group.ngroups=true I get a 4s response time vs 50ms without the ngroups parameter.

Is there a workaround for this problem?

Mickael
CoreAdmin STATUS performance
Hi All,

I have a client app that uses SolrJ and which needs to collect the names (and just the names) of all loaded cores. I have about 380 Solr cores on a single Solr server (net indices size is about 220GB). Running the STATUS action takes about 800ms - that seems a bit too long, given my requirements. So here are my questions:
1) Is there any way to get _only_ the core name of all cores?
2) Why does the STATUS request take such a long time, and is there a way to improve its performance?

Thanks,
Shahar.
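For reference, STATUS can also be scoped to one named core, which avoids gathering index statistics for all 380 in a single call (host, port and core name below are placeholders):

  http://localhost:8983/solr/admin/cores?action=STATUS&core=core0&wt=json

Later Solr releases also accept an indexInfo=false parameter on STATUS to skip the per-core index details (size, numDocs, etc.), which is typically what makes the call slow - worth checking whether your build supports it.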
massive memory consumption of grouping feature
Hello,

we are upgrading Solr from 1.3 to 4.0. In Solr 1.3 we used the SOLR-236 patch for grouping/field collapsing. We did not have a memory issue with the field collapsing feature in our 1.3 version. However, we do now. The query looks something like this:

http://localhost:8983/solr/select?fl=*,score&group.ngroups=true&group.limit=-1&group.field=someGroupingField&group=true&fq=someField:someValue&fq=anotherField:anotherValue&wt=xml&fq=thirdField:[0+TO+1]&rows=3

As you can see, the q parameter is empty, but it does not make a difference if I query for q=someValue+anotherValue. The result returns:

<int name="matches">3772</int>
<int name="ngroups">2175</int>

We have a memory consumption of about 4G. What causes this massive memory consumption? How can it be reduced?

Regards,
Claas
Re: Clean Up Aged Index Using DeletionPolicy
Hey Shawn,

Thanks a lot for your detailed explanation of deletionPolicy. Although it's frustrating that Solr doesn't support the function I need, I'm really glad you pointed it out so that I can move on. What I'm thinking now is adding a new field for the time a document is indexed, so a simple range query can delete the aged documents I want to remove to maintain my disk space.

Thanks again,
Hao
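A sketch of that approach - untested, with the field name indexedAt invented for illustration: a date field that defaults to the indexing time, plus a periodic delete-by-query.

In schema.xml:

  <field name="indexedAt" type="date" indexed="true" stored="true" default="NOW"/>

Then, for example once a day, remove everything older than 30 days:

  curl 'http://localhost:8983/solr/update?commit=true' -H 'Content-Type: text/xml' \
    --data-binary '<delete><query>indexedAt:[* TO NOW-30DAYS]</query></delete>'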
SolrCloud - shard distribution
Hi,

Simple question, I hope. Using the nightly build of 4.1 from yesterday (Jan 8, 2013), I started 6 Solr nodes. I issued the following command to create a collection with 3 shards and a replication factor of 2, so a total of 6 shard replicas:

curl 'http://localhost:11000/solr/admin/collections?action=CREATE&name=consumer1&numShards=3&replicationFactor=2'

The end result was the following shard distribution:
shard1 - node #13, #15 (with #13 as leader)
shard2 - node #15, #16 (with #15 as leader)
shard3 - node #11, #16 (with #11 as leader)

Since I am using the default value of 1 for 'maxShardsPerNode', I was surprised to see that Solr created two shards on instance #16. I expected that each Solr node (there are 6) would be assigned one shard from the collection. Is this a bug or expected behavior?

Thanks,
James
Re: performance improvements on ip look up query
Hi Otis

The cache was a modest 4096, with a hit rate of 0.23 after a 24hr period. We doubled it and the hit rate went to 0.25. Our interpretation is that the IP is pretty much a cache-busting value, and that cache size is not at play here. The q param is just

startIpNum:[* TO 180891652] AND endIpNum:[180891652 TO *]

so again our interpretation is that it gets little reuse. Could we re-formulate the query to be more performant?

On 9 January 2013 12:56, Otis Gospodnetic otis.gospodne...@gmail.com wrote:
Hi, Maybe your cache is too small? How big is it and does the hit rate change if you make it bigger? Do any parts of the query repeat a lot? Maybe there is room for fq.
Otis
Solr & ElasticSearch Support
http://sematext.com/
[...]
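One possible reformulation, untested: since each IP yields a unique range, caching those clauses is wasted work. Moving them into non-cached filter queries ({!cache=false} is available from Solr 3.4 onward) keeps the caches free for clauses that actually repeat:

  q=*:*&fq={!cache=false cost=100}startIpNum:[* TO 180891652]&fq={!cache=false cost=100}endIpNum:[180891652 TO *]

Whether this improves the mean query time would need measuring; it mainly stops the cache churn described above.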
Re: Restore hot backup
If you are in multicore mode, you can stop a core, move the backed-up files into place, and restart/recreate the core. That would have the effect you desire. You may well be able to get away with swapping out the files and reloading the core, but the above would be safer. Best make sure you're not indexing or committing to the core at the time you do this.

Upayavira

On Wed, Jan 9, 2013, at 01:48 PM, marotosg wrote:
Hi, Is it possible to restore an old backup without shutting down Solr?
Regards,
Sergio
Re: SolrCloud - shard distribution
I just tried this. I started 6 nodes with collection1 spread across two shards. Looked at the admin->cloud->graph view and everything looked right and green.

Next, I copied and pasted your command and refreshed the cloud graph view. I see a new collection called consumer1 - all of its nodes are green and the collection consists of 3 shards. Each shard has 1 leader and 1 replica, each hosted by a different Solr instance.

In other words, it seemed to work for me.

- Mark

On Jan 9, 2013, at 10:58 AM, James Thomas jtho...@camstar.com wrote:
Hi, Simple question, I hope. Using the nightly build of 4.1 from yesterday (Jan 8, 2013), I started 6 Solr nodes. I issued the following command to create a collection with 3 shards and a replication factor of 2.
[...]
Re: DIH fails after processing roughly 10million records
On 1/8/2013 11:19 PM, vijeshnair wrote:
Yes Shawn, the batchSize is -1 only, and I also have the mergeScheduler exactly as you mentioned. When I had this problem in Solr 3.4, I did extensive googling, gathered many of the tweaks and tunings from different blogs and forums, and configured the 4.0 instance accordingly. My next full run is scheduled for this weekend; I will try with a higher mysql wait_timeout value and update you on the outcome.

With maxThreadCount at 1 and maxMergeCount at 6, I was able to complete a full-import with no problems. All mysql (5.1.61) server-side timeouts are at their defaults - they don't show up in my.cnf and I haven't tweaked them anywhere else either. A full import for me consists of six simultaneous imports into six Solr cores, each of which is over 12 million rows. It takes three hours, and each of those six imports creates a 16GB index on Solr 4.1-SNAPSHOT, 22GB on Solr 3.5.0. There is a seventh import as well, but it only does a few hundred thousand rows. That one finishes before any major merging takes place.

Thanks,
Shawn
Re: Performance issue with group.ngroups=true
group.ngroups=true is always going to be somewhat expensive, but in your case it seems more expensive than I would expect. You should check that you have enough Java JVM heap to hold more of the index and to avoid excessive GCs.

-- Jack Krupansky

-----Original Message-----
From: Mickael Magniez
Sent: Wednesday, January 09, 2013 10:09 AM
To: solr-user@lucene.apache.org
Subject: Performance issue with group.ngroups=true

Hi, I have a performance issue with the group.ngroups=true parameter. I have an index with 100k documents (small documents, 1-10 documents per group, grouped on a string field). If I run a query like q=*:*...&group.ngroups=true I get a 4s response time vs 50ms without the ngroups parameter. Is there a workaround for this problem?
Mickael
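To act on that suggestion: heap size and GC logging are set on the JVM command line when starting Solr. A minimal sketch for a stock Jetty start - the sizes are illustrative assumptions, not recommendations:

  java -Xms2g -Xmx4g -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -jar start.jar

Long pauses in the GC log while ngroups queries run would support the heap theory.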
RE: SolrJ DirectXmlRequest
I also don't know what's creating them. Maybe Solr, but also maybe Tomcat, maybe Apache Commons. I could change java.io.tmpdir to one with more space, but the problem is that many of the temp files end up permanent, so eventually it would still run out of space. I also considered setting the tmpdir to /dev/null, but that would defeat the purpose of whatever is writing those files in the first place. I could periodically clean up the tmpdir myself, but that feels the hackiest.

Is it fairly common to send XML to Solr this way from a remote host? If it is, then that would lead me to believe Solr and its libraries aren't causing it, and I should inspect Tomcat. I'm using Tomcat 7.

Ryan

________________________________________
From: Otis Gospodnetic [otis.gospodne...@gmail.com]
Sent: Tuesday, January 08, 2013 7:29 PM
To: solr-user@lucene.apache.org
Subject: Re: SolrJ DirectXmlRequest

Hi Ryan,
I'm not sure what is creating those upload files - something in Solr? Or Tomcat? Why not specify a different temp dir via a system property command line parameter?
Otis
Solr & ElasticSearch Support
http://sematext.com/

On Jan 8, 2013 12:17 PM, Ryan Josal rjo...@rim.com wrote:
I have encountered an issue where using DirectXmlRequest to index data on a remote host results in eventually running out of temp disk space in the java.io.tmpdir directory. This occurs when I process a sufficiently large batch of files. About 30% of the temporary files end up permanent. The filenames look like: upload__2341cdae_13c02829b77__7ffd_00029003.tmp. Has anyone else had this happen before? The relevant code is:

DirectXmlRequest up = new DirectXmlRequest("/update", xml);
up.process(solr);

where `xml` is a String containing Solr-formatted XML, and `solr` is the SolrServer. When disk space is eventually exhausted, this is the error message that is repeatedly seen on the master host:

2013-01-07 19:22:16,911 [http-bio-8090-exec-2657] [] ERROR org.apache.solr.servlet.SolrDispatchFilter [] - org.apache.commons.fileupload.FileUploadBase$IOFileUploadException: Processing of multipart/form-data request failed. No space left on device
  at org.apache.commons.fileupload.FileUploadBase.parseRequest(FileUploadBase.java:367)
  at org.apache.commons.fileupload.servlet.ServletFileUpload.parseRequest(ServletFileUpload.java:126)
  at org.apache.solr.servlet.MultipartRequestParser.parseParamsAndFillStreams(SolrRequestParsers.java:344)
  at org.apache.solr.servlet.StandardRequestParser.parseParamsAndFillStreams(SolrRequestParsers.java:397)
  at org.apache.solr.servlet.SolrRequestParsers.parse(SolrRequestParsers.java:115)
  at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:244)
  at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
  ... truncated stack trace

I am running Solr 3.6 on an Ubuntu 12.04 server. I am considering working around this by pulling out as much as I can from XMLLoader into my client and processing the XML myself into SolrInputDocuments for indexing, but this is certainly not ideal.
Ryan
Re: DIH fails after processing roughly 10million records
On 1/9/2013 9:41 AM, Shawn Heisey wrote:
With maxThreadCount at 1 and maxMergeCount at 6, I was able to complete a full-import with no problems. All mysql (5.1.61) server-side timeouts are at their defaults. [...]

Full timeout info:

mysql> SHOW SESSION VARIABLES LIKE '%timeout%';
+----------------------------+-------+
| Variable_name              | Value |
+----------------------------+-------+
| connect_timeout            | 10    |
| delayed_insert_timeout     | 300   |
| innodb_lock_wait_timeout   | 50    |
| innodb_rollback_on_timeout | OFF   |
| interactive_timeout        | 28800 |
| net_read_timeout           | 30    |
| net_write_timeout          | 60    |
| slave_net_timeout          | 3600  |
| table_lock_wait_timeout    | 50    |
| wait_timeout               | 28800 |
+----------------------------+-------+
10 rows in set (0.00 sec)
Re: Is there faceting with Solr 4 spatial?
Erick,

Alex asked about Solr 4 spatial, and his use case requires it because he's got multi-valued spatial fields (multiple business office locations per document). So the Solr 3 spatial solution you posted won't cut it.

Alex,

You can do this in Solr 4.0. Use one facet.query per circle (i.e. distance ring away from the center). Here's an example with just one facet.query:

http://localhost:8983/solr/collection1/select?q=*%3A*&wt=xml&facet=true&facet.query=geo:%22Intersects%28Circle%2845.15,-93.85%20d=0.045%29%29%22

That facet.query without URL escaping is:

geo:"Intersects(Circle(45.15,-93.85 d=0.045))"

That's a 5km ring. Repeat such facet queries for larger rings. Each bigger circle will of course encompass the smaller circle(s) before it. I suspect it's more useful to the user to see a facet count based on all businesses within each threshold distance, versus having counts exclude the ring before. But if you really want that, you'll have to do it yourself by simply subtracting one facet count from the previous smaller ring. And to generate the filter query if they click it, you'd then need a NOT clause for the smaller ring. Ex:

fq=geo:"Intersects(Circle(45.15,-93.85 d=0.09))" NOT geo:"Intersects(Circle(45.15,-93.85 d=0.045))"

I'm aware it's a bit verbose. In Solr 4.1 I've already committed a change to allow use of {!geofilt}, which will make the syntax shorter, allowing sharing of the pt reference and kilometer-based distances instead of degrees. I'm collaborating with Ryan McKinley on https://issues.apache.org/jira/browse/SOLR-4242 "A better spatial query parser", including conversations off-list, but feel free to participate via commenting.

~ David Smiley

On 1/8/13 7:33 PM, Erick Erickson erickerick...@gmail.com wrote:
For facets, doesn't

http://localhost:8983/solr/select?wt=json&indent=true&fl=name,store&q=*:*&facet=on&facet.query={!frange l=0 u=3}geodist(store,45.15,-93.85)&facet.query={!frange l=3.001 u=4}geodist(store,45.15,-93.85)&facet.query={!frange l=4.001 u=5}geodist(store,45.15,-93.85)

work? (from http://wiki.apache.org/solr/SpatialSearch#How_to_facet_by_distance)
Although I also confess to being really unfamiliar with all things geodist...
Best
Erick

On Tue, Jan 8, 2013 at 4:02 AM, Alexandre Rafalovitch arafa...@gmail.com wrote:
Hello, I am trying to understand the new Solr 4 spatial type and what it can do. The use case is to have companies that have multiple offices, for which I indexed locations. I then want to do a 'radar' style set of ranges/facets, so I can say: show me everything within 100k, within 300k, etc.
[...]
Re: Convert Complex Lucene Query to SolrQuery
Thanks Otis and Jack for your responses. We are trying to use an EmbeddedSolrServer with a Solr query as follows:

EmbeddedSolrServer server = new EmbeddedSolrServer(coreContainer, "");
SolrQuery solrQuery = new SolrQuery(luceneQuery.toString()); // Here luceneQuery is a dismax query with additional filters
QueryResponse rsp = server.query(solrQuery);

The toString method does not give us good results, and server.query(solrQuery) fails. As Otis suggested, we are going to take a closer look at LuceneQueryParser.

Thanks,
Jagdish

On Tue, Jan 8, 2013 at 9:41 PM, Jack Krupansky j...@basetechnology.com wrote:
How complex? Does it use any of the more advanced Query types or detailed options that are not supported in the Solr query syntax? What specific problems did you have?
-- Jack Krupansky

-----Original Message-----
From: Jagdish Nomula
Sent: Tuesday, January 08, 2013 9:13 PM
To: solr-user@lucene.apache.org
Subject: Convert Complex Lucene Query to SolrQuery

Hello Solr Users, I am trying to convert a complex Lucene query to a SolrQuery to use in an EmbeddedSolrServer instance. I have tried the regular toString method without success. Is there any suggested method to do this? Greatly appreciate the response.
Thanks,
[...]

--
Jagadish Nomula - Senior Manager Search
Simply Hired, Inc.
370 San Aleso Ave, Ste. 200
Sunnyvale, CA 94085
simplyhired.com
newbie questions about cache stats query perf
Sorry, I did search for an answer but didn't find an applicable one. I'm currently stuck on 1.4.1 (running in Tomcat 6 on 64-bit Linux) for the time being...

When I see stats like this:

name: documentCache
class: org.apache.solr.search.LRUCache
version: 1.0
description: LRU Cache(maxSize=512, initialSize=512)
lookups: 0
hits: 0
hitratio: 0.00
inserts: 0
evictions: 0
size: 0
warmupTime: 0
cumulative_lookups: 8158
cumulative_hits: 685
cumulative_hitratio: 0.08
cumulative_inserts: 7473
cumulative_evictions: 3023

I don't understand lookups vs. cumulative_lookups, etc. I _do_ understand that a hit ratio of 0.08 isn't a very good one.

Something I definitely find strange is that I've allocated 4G of RAM to the Java heap, but Solr consistently remains around 1.7G. I'm trying to give it all the RAM I can spare (I could go higher, but it's not even using what I'm giving it) to make it faster. The index takes up roughly 25GB on disk, and indexing is very fast (well, nothing we're complaining about anyway). We're trying to figure out why queries against the default document content are slow (15-30 seconds for only a few million total documents). mergeFactor=3, if that helps.

So if anyone could point me to someplace that defines what these stats mean, and if anyone has any immediate tips/tricks/recommendations for increasing query performance (and whether this documentCache is a good candidate to be increased substantially), I would very much appreciate it.

-AJ
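One part of this has a definite answer: the plain stats (lookups, hits, ...) are per-searcher and reset whenever a new searcher is opened (e.g. after a commit), while the cumulative_* stats span searcher lifetimes - which is why lookups can be 0 while cumulative_lookups is 8158. The cache itself is sized in solrconfig.xml; a sketch with larger, purely illustrative numbers:

  <documentCache
    class="solr.LRUCache"
    size="4096"
    initialSize="1024"
    autowarmCount="0"/>

(documentCache cannot be autowarmed, since internal document ids change between searchers.)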
RE: SolrCloud - shard distribution
Thanks for the quick reply Mark.

I tried all kinds of variations; I could not get all 6 nodes to participate. So I downloaded the source code and took a look at OverseerCollectionProcessor.java. I think my result is as-coded. Line 251 has this loop:

for (int i = 1; i <= numSlices; i++) {
  for (int j = 1; j <= repFactor; j++) {
    String nodeName = nodeList.get(((i - 1) + (j - 1)) % nodeList.size());

So for my inputs, numSlices=3 and repFactor=2, and the logic here will choose the same node for these two slices:
--- slice1, rep2 (i=2,j=1) ==> chooses node[1]
--- slice2, rep1 (i=1,j=2) ==> chooses node[1]

BTW, I did notice the comment in the code:
// we need to look at every node and see how many cores it serves
// add our new cores to existing nodes serving the least number of cores
// but (for now) require that each core goes on a distinct node.
// TODO: add smarter options that look at the current number of cores per node?
// for now we just go random

Thanks,
James

-----Original Message-----
From: Mark Miller [mailto:markrmil...@gmail.com]
Sent: Wednesday, January 09, 2013 11:30 AM
To: solr-user@lucene.apache.org
Subject: Re: SolrCloud - shard distribution

I just tried this. I started 6 nodes with collection1 spread across two shards. Looked at the admin->cloud->graph view and everything looked right and green. [...] In other words, it seemed to work for me.
- Mark
[...]
Re: SOLR '0' Status: Communication Error
I forgot to mention: when I add documents to Solr, I add them in batches of 50. Because my table has a lot of records, I have to do it in batches due to memory constraints. The 'Communication error' occurs only for some batches; for other batches, documents get added properly. I am also including the stack trace, just in case it helps:

'0' Status: Communication Error
#0 C:\wamp\www\nist\application\library\SolrPhpClient\Apache\Solr\Service.php(672): Apache_Solr_Service->_sendRawPost('http://129.107', '<add allowDups=...')
#1 C:\wamp\www\nist\application\library\SolrPhpClient\Apache\Solr\Service.php(736): Apache_Solr_Service->add('<add allowDups=...')
#2 C:\wamp\www\nist\application\library\Nist\Console\NistSolrIndex.php(106): Apache_Solr_Service->addDocuments(Array)
#3 C:\wamp\www\nist\application\library\Nist\Console\CrawlUNT.php(346): Nist_Console_NistSolrIndex->createIndex()
#4 C:\wamp\www\nist\application\library\Nist\Console\CrawlUNT.php(89): Nist_Console_CrawlUNT->CrawlParseAndIndexProfiles()
#5 C:\wamp\www\nist\application\Bootstrap.php(107): Nist_Console_CrawlUNT->run(Object(Zend_Console_Getopt))
#6 C:\wamp\www\nist\application\Bootstrap.php(78): Bootstrap->_runConsoleApp()
#7 C:\wamp\www\dkumar\mentis-libs\Zend\Application.php(366): Bootstrap->run()
#8 C:\wamp\www\nist\index.php(37): Zend_Application->run()
Re: SOLR '0' Status: Communication Error
On 1/9/2013 11:48 AM, ddineshkumar wrote:
I forgot to mention: when I add documents to Solr, I add them in batches of 50. The 'Communication error' occurs only for some batches; for other batches, documents get added properly. [...]

If it sometimes works and sometimes you get the communication error, then I would guess that you are running into long garbage collection pauses on your Solr server that make Solr unresponsive long enough for the next update to time out. Garbage collection tuning is an art form with a million different styles. You could try increasing the PHP client timeouts.

Thanks,
Shawn
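A note on that last suggestion: the stock SolrPhpClient transport is stream-based, so its read timeout should follow PHP's default_socket_timeout - that is an assumption about which transport this setup uses. A minimal sketch to ride out longer GC pauses:

  <?php
  // allow up to 120s for Solr to answer before reporting a communication error
  ini_set('default_socket_timeout', 120);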
RE: SolrCloud - shard distribution
Oops, small copy-paste error. Had my i's and j's backwards. Should be:
--- slice1, rep2 (i=1,j=2) ==> chooses node[1]
--- slice2, rep1 (i=2,j=1) ==> chooses node[1]

-----Original Message-----
From: James Thomas [mailto:jtho...@camstar.com]
Sent: Wednesday, January 09, 2013 1:39 PM
To: solr-user@lucene.apache.org
Subject: RE: SolrCloud - shard distribution

Thanks for the quick reply Mark. I tried all kinds of variations; I could not get all 6 nodes to participate. So I downloaded the source code and took a look at OverseerCollectionProcessor.java. I think my result is as-coded.
[...]
Re: defaultOperator in schema.xml
Hello! You should set the q.op parameter in your request handler configuration in solrconfig.xml instead of using the default operator from schema.xml.

-- Regards, Rafał Kuć Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - ElasticSearch

I'm testing out Solr 4.0. I got the sample schema working, so now I'm converting my existing schema (from a Solr 3.4 instance), but I'm confused as to what to do for the defaultOperator setting on the solrQueryParser element. For my existing Solr, I have:

<solrQueryParser defaultOperator="AND"/>

The 4.0 schema says that the defaultOperator attribute is deprecated, and seems to suggest that I just pass it along in my queries. Is there no way I can set it to AND by default somewhere else? I don't control all the applications that use my Solr indexes, and I want to ensure that they operate with AND and not OR. Thanks! -- Chris
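For clients you do control, the same parameter can also be pinned per request; a minimal SolrJ sketch (the URL and query below are placeholders), though for applications you don't control, the request-handler default in solrconfig.xml that Rafał describes remains the right home for it:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class DefaultOperatorExample {
    public static void main(String[] args) throws Exception {
        // Placeholder URL; adjust for your deployment.
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr");
        SolrQuery query = new SolrQuery("title:solr title:lucene");
        // Force AND semantics for this request, overriding any schema default.
        query.set("q.op", "AND");
        QueryResponse rsp = solr.query(query);
        System.out.println(rsp.getResults().getNumFound() + " docs matched both terms");
    }
}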
Re: SolrJ DirectXmlRequest
Hi Ryan, One typically uses a Solr client library to talk to Solr instead of sending raw XML. For example, if your application is written in Java then you would use SolrJ. Otis -- Solr ElasticSearch Support http://sematext.com/

On Wed, Jan 9, 2013 at 12:03 PM, Ryan Josal rjo...@rim.com wrote: I also don't know what's creating them. Maybe Solr, but also maybe Tomcat, maybe Apache Commons. I could change java.io.tmpdir to one with more space, but the problem is that many of the temp files end up permanent, so eventually it would still run out of space. I also considered setting the tmpdir to /dev/null, but that would defeat the purpose of whatever is writing those files in the first place. I could periodically clean up the tmpdir myself, but that feels the hackiest. Is it fairly common to send XML to Solr this way from a remote host? If it is, then that would lead me to believe Solr and its libraries aren't causing it, and I should inspect Tomcat. I'm using Tomcat 7. Ryan

From: Otis Gospodnetic [otis.gospodne...@gmail.com] Sent: Tuesday, January 08, 2013 7:29 PM To: solr-user@lucene.apache.org Subject: Re: SolrJ DirectXmlRequest

Hi Ryan, I'm not sure what is creating those upload files - something in Solr? Or Tomcat? Why not specify a different temp dir via a system property command line parameter? Otis Solr ElasticSearch Support http://sematext.com/

On Jan 8, 2013 12:17 PM, Ryan Josal rjo...@rim.com wrote: I have encountered an issue where using DirectXmlRequest to index data on a remote host results in eventually running out of temp disk space in the java.io.tmpdir directory. This occurs when I process a sufficiently large batch of files. About 30% of the temporary files end up permanent. The filenames look like: upload__2341cdae_13c02829b77__7ffd_00029003.tmp. Has anyone else had this happen before? The relevant code is:

DirectXmlRequest up = new DirectXmlRequest("/update", xml);
up.process(solr);

where `xml` is a String containing Solr-formatted XML, and `solr` is the SolrServer. When disk space is eventually exhausted, this is the error message that is repeatedly seen on the master host:

2013-01-07 19:22:16,911 [http-bio-8090-exec-2657] [] ERROR org.apache.solr.servlet.SolrDispatchFilter [] - org.apache.commons.fileupload.FileUploadBase$IOFileUploadException: Processing of multipart/form-data request failed. No space left on device
at org.apache.commons.fileupload.FileUploadBase.parseRequest(FileUploadBase.java:367)
at org.apache.commons.fileupload.servlet.ServletFileUpload.parseRequest(ServletFileUpload.java:126)
at org.apache.solr.servlet.MultipartRequestParser.parseParamsAndFillStreams(SolrRequestParsers.java:344)
at org.apache.solr.servlet.StandardRequestParser.parseParamsAndFillStreams(SolrRequestParsers.java:397)
at org.apache.solr.servlet.SolrRequestParsers.parse(SolrRequestParsers.java:115)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:244)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
... truncated stack trace

I am running Solr 3.6 on an Ubuntu 12.04 server. I am considering working around this by pulling out as much as I can from XMLLoader into my client, and processing the XML myself into SolrInputDocuments for indexing, but this is certainly not ideal.
Ryan
Re: Convert Complex Lucene Query to SolrQuery
Aha. I think the problem here is the assumption that .toString() on a Lucene query will give you a string that can then be re-parsed into the proper query, and that is currently not the case. But if you start with the raw query like the one you would use with the Lucene QP, you should be fine. Can you replace:

new SolrQuery(luceneQuery.toString());

with:

new SolrQuery("Your Raw Query String Here")

Otis -- Solr ElasticSearch Support http://sematext.com/

On Wed, Jan 9, 2013 at 12:33 PM, Jagdish Nomula jagd...@simplyhired.com wrote: Thanks Otis and Jack for your responses. We are trying to use EmbeddedSolrServer with a Solr query as follows:

EmbeddedSolrServer server = new EmbeddedSolrServer(coreContainer, "");
SolrQuery solrQuery = new SolrQuery(luceneQuery.toString()); // Here luceneQuery is a dismax query with additional filters
QueryResponse rsp = server.query(solrQuery);

The toString method does not give us good results and server.query(solrQuery) fails. As Otis has suggested, we are going to take a look at the Lucene QueryParser more closely. Thanks, Jagdish

On Tue, Jan 8, 2013 at 9:41 PM, Jack Krupansky j...@basetechnology.com wrote: How complex? Does it use any of the more advanced Query Types or detailed options that are not supported in the Solr query syntax? What specific problems did you have? -- Jack Krupansky

-Original Message- From: Jagdish Nomula Sent: Tuesday, January 08, 2013 9:13 PM To: solr-user@lucene.apache.org Subject: Convert Complex Lucene Query to SolrQuery

Hello Solr Users, I am trying to convert a complex Lucene query to a SolrQuery to use it in an EmbeddedSolrServer instance. I have tried the regular toString method without success. Is there any suggested method to do this? Greatly appreciate the response. Thanks, -- Jagadish Nomula - Senior Manager Search, Simply Hired, Inc., 370 San Aleso Ave, Ste. 200, Sunnyvale, CA 94085, simplyhired.com
Re: newbie questions about cache stats query perf
Hi, In your Solr version there is a notion of a Searcher being opened and reopened. Every time that happens, those non-cumulative stats reset. The cumulative_ stats just don't reset, so you have numbers from when the whole Solr instance started, not just from the last time a Searcher opened.

Your cache is small, which is why you have evictions, which is partially why you have a low hit rate, which is partially why your queries are slow. But 15-30 seconds is crazy high, so I am sure there are other issues. Note that you should *not* give Solr/Tomcat all the RAM you can spare - leave it to the OS to use for index caching. If you don't have issues with a full heap (OOMs or crazy GCing) with, say, -Xmx2g, then use that.

Plug: http://sematext.com/spm/solr-performance-monitoring/index.html

Otis -- Solr ElasticSearch Support http://sematext.com/

On Wed, Jan 9, 2013 at 12:56 PM, AJ Weber awe...@comcast.net wrote: Sorry, I did search for an answer, but didn't find an applicable one. I'm currently stuck on 1.4.1 (running in Tomcat 6 on 64bit Linux) for the time being... When I see stats like this:

name: documentCache
class: org.apache.solr.search.LRUCache
version: 1.0
description: LRU Cache(maxSize=512, initialSize=512)
lookups : 0
hits : 0
hitratio : 0.00
inserts : 0
evictions : 0
size : 0
warmupTime : 0
cumulative_lookups : 8158
cumulative_hits : 685
cumulative_hitratio : 0.08
cumulative_inserts : 7473
cumulative_evictions : 3023

I don't understand lookups vs. cumulative_lookups, etc. I _do_ understand that a hit ratio of 0.08 isn't a very good one. Something I definitely find strange is that I've allocated 4G of RAM to the Java heap, but Solr consistently remains around 1.7G. I'm trying to give it all the RAM I can spare (I could go higher, but it's not even using what I'm giving it) to make it faster. The index takes up roughly 25GB on disk, and indexing is very fast (well, nothing we're complaining about anyway). We're trying to figure out why queries against the default document content are slow (15-30 seconds for only a few million total documents). mergeFactor=3, if that helps. So if anyone could point me to someplace that defines what these stats mean, and if anyone has any immediate tips/tricks/recommendations for increasing query performance (and whether this documentCache is a good candidate to be increased substantially), I would very much appreciate it. -AJ
Re: unittest fail (sometimes) for float field search
Hi, It is not Eclipse related, nor codec related. There were two issues. I had a wrong configuration of NumericConfig:

new NumericConfig(4, NumberFormat.getNumberInstance(), NumericType.FLOAT)

I changed that to:

new NumericConfig(4, NumberFormat.getNumberInstance(Locale.US), NumericType.FLOAT)

And the second problem was that I used the default float with precisionStep=0; however, NumericRangeQuery requires a precision step >= 1. I tried all steps 1-8, and it worked only if the precision step of the field and of the NumericConfig are the same (for range queries).

roman

On Tue, Jan 8, 2013 at 7:34 PM, Roman Chyla roman.ch...@gmail.com wrote: The test checks we are properly getting/indexing data - we index a database and fetch parts of the documents separately from MongoDB. You can look at the file here: https://github.com/romanchyla/montysolr/blob/3c18312b325874bdecefceb9df63096b2cf20ca2/contrib/adsabs/src/test/org/apache/solr/update/TestAdsDataImport.java But your comment made me run the tests on the command line, and I am seeing I can't make it fail (it fails only inside Eclipse). Sorry, I should have tried that myself, but I am so used to running unit tests inside Eclipse it didn't occur to me... I'll try to find out what is going on... thanks, roman

On Tue, Jan 8, 2013 at 6:53 PM, Chris Hostetter hossman_luc...@fucit.org wrote: : apparently, it fails also with @SuppressCodecs("Lucene3x") what exactly is the test failure message? When you run tests that use the Lucene test framework, any failure should include information about the random seed used to run the test -- that random seed affects things like the codec used, the directoryfactory used, etc... Can you confirm whether the test reliably passes/fails consistently when you reuse the same seed? Can you elaborate more on what exactly your test does? ... we probably need to see the entire test to make sense of why you might get inconsistent failures. -Hoss
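To make Roman's two fixes concrete, here is a rough sketch of the pieces that must agree - the precisionStep on the indexed field and the one in the NumericConfig handed to the flexible query parser, plus the Locale.US pin. The field name and class layout are illustrative only; the classes are from Lucene 4.x:

import java.text.NumberFormat;
import java.util.Locale;

import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.FieldType.NumericType;
import org.apache.lucene.document.FloatField;
import org.apache.lucene.queryparser.flexible.standard.config.NumericConfig;

public class MatchingPrecisionSteps {
    // The step shared by indexing and query parsing; any value >= 1 works,
    // as long as both sides agree (per the range-query behavior above).
    static final int STEP = 4;

    // Index side: a float field with an explicit precisionStep.
    static FloatField makeField(float value) {
        FieldType ft = new FieldType(FloatField.TYPE_STORED);
        ft.setNumericPrecisionStep(STEP);
        ft.freeze();
        return new FloatField("price", value, ft); // "price" is a made-up name
    }

    // Query side: a NumericConfig pinned to Locale.US so "2400.0" parses the
    // same way on every machine, with the same precisionStep as the field.
    static NumericConfig makeConfig() {
        return new NumericConfig(STEP, NumberFormat.getNumberInstance(Locale.US), NumericType.FLOAT);
    }
}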
Re: Pause and resume indexing on SolR 4 for backups
Are you sure a commit didn't happen in between? Also, a background merge might have happened. As to using a backup, you are right: just stop Solr, put the snapshot into index/data, and restart.

This was mentioned before but seems not to have gotten any attention: can't you use the ReplicationHandler by just going to a URL like this?

http://host:8080/solr/replication?command=backup&location=/home/jboss/backup

The 2nd edition Lucene in Action book describes a way to take hot backups without stopping your IndexWriter (pp. 374ff), and it appears that ReplicationHandler uses a similar strategy if I'm reading the code correctly (Solr 3.6.1; I guess v4 is the same). It'd be great if someone more knowledgeable could confirm that you can use the ReplicationHandler to take hot backups. I'm surprised to see such a long thread about starting/stopping index jobs when there is such an easy answer. Or am I mistaken and at risk of corrupt backups if I use it?

Thanks, Paul -- _ Pulchritudo splendor veritatis.
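Programmatically, that call is nothing more than an HTTP GET against the handler; a minimal Java sketch, where host, port, and backup location are the placeholders from the URL above:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class TriggerBackup {
    public static void main(String[] args) throws Exception {
        // Placeholder host and location; the handler snapshots the index
        // without pausing the IndexWriter, per the discussion above.
        URL url = new URL("http://localhost:8080/solr/replication"
            + "?command=backup&location=/home/jboss/backup");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()));
        String line;
        while ((line = in.readLine()) != null) {
            System.out.println(line); // the handler replies with a short status response
        }
        in.close();
    }
}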
Re: Pause and resume indexing on SolR 4 for backups
Hi Paul, Hot backup is OK. There was a thread on this topic yesterday and the day before. But you should always try running from a backup regardless of what anyone says here, because if you have to do that one day, you want to know you have verified it :)

Otis -- Solr ElasticSearch Support http://sematext.com/

On Wed, Jan 9, 2013 at 3:12 PM, Paul Jungwirth p...@illuminatedcomputing.com wrote: Are you sure a commit didn't happen between? Also, a background merge might have happened. As to using a backup, you are right, just stop solr, put the snapshot into index/data, and restart. This was mentioned before but seems not to have gotten any attention: can't you use the ReplicationHandler by just going to a URL like this?: http://host:8080/solr/replication?command=backup&location=/home/jboss/backup The 2nd edition Lucene in Action book describes a way to take hot backups without stopping your IndexWriter (pp. 374ff), and it appears that ReplicationHandler uses a similar strategy if I'm reading the code correctly (Solr 3.6.1; I guess v4 is the same). It'd be great if someone more knowledgeable could confirm that you can use the ReplicationHandler to take hot backups. I'm surprised to see such a long thread about starting/stopping index jobs when there is such an easy answer. Or am I mistaken and at risk of corrupt backups if I use it? Thanks, Paul -- _ Pulchritudo splendor veritatis.
Re: Clean Up Aged Index Using DeletionPolicy
Just to satisfy my curiosity - are you looking to have TTL for documents or for indices? The former: https://issues.apache.org/jira/browse/SOLR-3874 The latter: no issue that I know of; typically managed by the application. Otis -- Solr ElasticSearch Support http://sematext.com/

On Wed, Jan 9, 2013 at 10:57 AM, hyrax hao.w...@selerityfinancial.com wrote: Hey Shawn, Thanks a lot for your detailed explanation of deletionPolicy. Although it's frustrating that Solr doesn't support the function I need, I'm really glad that you pointed it out so that I can move on. What I'm thinking now is adding a new field for the time a document is indexed, so a simple range query can delete the aged documents I want to remove to maintain my disk space. Thanks again, Hao
Re: SOLR '0' Status: Communication Error
Thanks Shawn. I tried increasing the following timeouts in PHP: max_execution_time, max_input_time, and default_socket_timeout. But I still get 'Communication error'. Please let me know if I have to change any other timeout in PHP.
Re: Clean Up Aged Index Using DeletionPolicy
Exactly what I want. For a simple scenario: say I indexed a batch of documents 20 days ago and they have been searchable via Solr. Now, after those 20 days, you can't search them anymore because they are deleted automatically by Solr. Thanks, Hao
Re: Clean Up Aged Index Using DeletionPolicy
Options:
1. Run delete by query every N hours/days to purge old docs
2. Create daily indices and drop them every H hours/days to get rid of all old docs

The TTL support for option 1 would probably be implemented with delete by query. The drawback of 1 compared to 2 is that you will pay the price when Lucene merges segments with lots of deleted docs; 2 is cheaper. Otis -- Solr ElasticSearch Support http://sematext.com/

On Wed, Jan 9, 2013 at 3:44 PM, hyrax hao.w...@selerityfinancial.com wrote: Exactly what I want. For a simple scenario: Index a batch of documents 20 days ago and they are searchable via Solr. After say 20 days, you can't search them anymore because they are deleted automatically by Solr. Thanks, Hao
Re: Clean Up Aged Index Using DeletionPolicy
Solr does not delete anything automatically. Add a timestamp field when you index. Use delete by query to delete everything older than 20 days. wunder

On Jan 9, 2013, at 12:44 PM, hyrax wrote: Exactly what I want. For a simple scenario: Index a batch of documents 20 days ago and they are searchable via Solr. After say 20 days, you can't search them anymore because they are deleted automatically by Solr. Thanks, Hao
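For what it's worth, a minimal SolrJ sketch of Walter's suggestion, assuming a date field populated at index time (the name indexed_at is made up; Solr's date math evaluates NOW-20DAYS on the server):

import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class PurgeOldDocs {
    public static void main(String[] args) throws Exception {
        // Placeholder URL; run this on whatever schedule suits you.
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr");
        // Delete everything whose (hypothetical) indexed_at timestamp is
        // older than 20 days.
        solr.deleteByQuery("indexed_at:[* TO NOW-20DAYS]");
        solr.commit();
    }
}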
Re: Pause and resume indexing on SolR 4 for backups
Yes, I agree about making sure the backups actually work, whatever the approach. Thanks for your reply and all you've contributed to the Solr/Lucene community. The Lucene in Action book has been a huge help to me. Paul On Wed, Jan 9, 2013 at 12:16 PM, Otis Gospodnetic otis.gospodne...@gmail.com wrote: Hi Paul, Hot backup is OK. There was a thread on this topic yesterday and the day before. But you should always try running from backup regardless of what anyone says here, because if you have to do that one day you want to know you verified it :) Otis -- Solr ElasticSearch Support http://sematext.com/ On Wed, Jan 9, 2013 at 3:12 PM, Paul Jungwirth p...@illuminatedcomputing.comwrote: Are you sure a commit didn't happen between? Also, a background merge might have happened. As to using a backup, you are right, just stop solr, put the snapshot into index/data, and restart. This was mentioned before but seems not to have gotten any attention: can't you use the ReplicationHandler by just going to a URL like this?: http://host:8080/solr/replication?command=backuplocation=/home/jboss/backup The 2nd edition Lucene in Action book describes a way to take hot backups without stopping your IndexWriter (pp. 374ff), and it appears that ReplicationHandler uses a similar strategy if I'm reading the code correctly (Solr 3.6.1; I guess v4 is the same). It'd be great if someone more knowledgeable could confirm that you can use the ReplicationHandler to take hot backups. I'm surprised to see such a long thread about starting/stopping index jobs when there is such an easy answer. Or am I mistaken and at risk of corrupt backups if I use it? Thanks, Paul -- _ Pulchritudo splendor veritatis. -- _ Pulchritudo splendor veritatis.
performing a boolean query (OR) with a large number of terms
hello, environment: Solr 3.5. I have a requirement to perform a boolean query (like the example below) with a large number of terms; the number of terms could be 15 or possibly larger. After looking over several threads and the Smiley book, I think I just have to include the parens and string all of the terms together with ORs. I just want to make sure that I am not missing anything. Is there a better or more efficient way of doing this?

http://server:port/dir/core1/select?qt=modelItemNoSearch&q=itemModelNoExactMatchStr:%285-100-NGRT7%20OR%205-10-10MS7%20OR%20404%29&rows=30&debugQuery=on&rows=40

thx mark
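Nothing is wrong with the big OR, and building it from SolrJ keeps the escaping honest; a small sketch using the field and terms from the example URL above (the escaping call is optional but wise for terms containing reserved characters like the hyphens here):

import java.util.Arrays;
import java.util.List;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.util.ClientUtils;

public class BigOrQuery {
    public static void main(String[] args) {
        List<String> models = Arrays.asList("5-100-NGRT7", "5-10-10MS7", "404");
        StringBuilder q = new StringBuilder("itemModelNoExactMatchStr:(");
        for (int i = 0; i < models.size(); i++) {
            if (i > 0) q.append(" OR ");
            // Escape reserved query characters (+, -, :, etc.) in each term.
            q.append(ClientUtils.escapeQueryChars(models.get(i)));
        }
        q.append(")");
        SolrQuery query = new SolrQuery(q.toString());
        query.setRows(40);
        System.out.println(query); // inspect the generated parameters
    }
}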
RE: SolrJ DirectXmlRequest
Thanks Otis, DirectXmlRequest is part of the SolrJ library, so I guess that means it is not commonly used. My use case is that I'm applying an XSLT to the raw XML on the client side, instead of leaving that up to the Solr master (although even if I applied the XSLT on the Solr server, I'd still use DirectXmlRequest to get the raw XML there). This does lead me to the idea that parsing the XML without the XSLT is probably better than copying some of XMLLoader to parse Solr XML as a workaround, and might be a good idea to do anyway.

I've done some research and I'm fairly confident that the Apache commons-fileupload library is responsible for the temp files. There's an explanation of how files are cleaned up at http://commons.apache.org/fileupload/using.html in the Resource cleanup section. I have observed that forcing a garbage collection over JMX results in all temporary files being purged. This implies that many of the java.io.File objects are moving to old gen in the heap and survive long enough (only a few minutes in my case) to use up all tmp disk space. I think this can probably be solved by GC tuning, or, failing that, by introducing a (less desirable) System.gc() somewhere in the updateRequestProcessorChain. Thanks for your help, and hopefully this will be useful if someone else runs into a similar problem. Ryan

From: Otis Gospodnetic [otis.gospodne...@gmail.com] Sent: Wednesday, January 09, 2013 11:53 AM To: solr-user@lucene.apache.org Subject: Re: SolrJ DirectXmlRequest

Hi Ryan, One typically uses a Solr client library to talk to Solr instead of sending raw XML. For example, if your application is written in Java then you would use SolrJ. Otis -- Solr ElasticSearch Support http://sematext.com/

On Wed, Jan 9, 2013 at 12:03 PM, Ryan Josal rjo...@rim.com wrote: I also don't know what's creating them. Maybe Solr, but also maybe Tomcat, maybe Apache Commons. I could change java.io.tmpdir to one with more space, but the problem is that many of the temp files end up permanent, so eventually it would still run out of space. I also considered setting the tmpdir to /dev/null, but that would defeat the purpose of whatever is writing those files in the first place. I could periodically clean up the tmpdir myself, but that feels the hackiest. Is it fairly common to send XML to Solr this way from a remote host? If it is, then that would lead me to believe Solr and its libraries aren't causing it, and I should inspect Tomcat. I'm using Tomcat 7. Ryan

From: Otis Gospodnetic [otis.gospodne...@gmail.com] Sent: Tuesday, January 08, 2013 7:29 PM To: solr-user@lucene.apache.org Subject: Re: SolrJ DirectXmlRequest

Hi Ryan, I'm not sure what is creating those upload files - something in Solr? Or Tomcat? Why not specify a different temp dir via a system property command line parameter? Otis Solr ElasticSearch Support http://sematext.com/

On Jan 8, 2013 12:17 PM, Ryan Josal rjo...@rim.com wrote: I have encountered an issue where using DirectXmlRequest to index data on a remote host results in eventually running out of temp disk space in the java.io.tmpdir directory. This occurs when I process a sufficiently large batch of files. About 30% of the temporary files end up permanent. The filenames look like: upload__2341cdae_13c02829b77__7ffd_00029003.tmp. Has anyone else had this happen before? The relevant code is:

DirectXmlRequest up = new DirectXmlRequest("/update", xml);
up.process(solr);

where `xml` is a String containing Solr-formatted XML, and `solr` is the SolrServer.
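If GC tuning doesn't pan out, the periodic cleanup Ryan called hacky can at least be narrowed to the known filename pattern; a rough, hypothetical sketch (the 30-minute age threshold and 5-minute sweep interval are assumptions, and it must run inside a long-lived process):

import java.io.File;
import java.util.Timer;
import java.util.TimerTask;

public class TmpUploadSweeper {
    // Call once from a long-lived process (e.g. a webapp lifecycle listener).
    public static void start() {
        final File tmpDir = new File(System.getProperty("java.io.tmpdir"));
        final long maxAgeMs = 30 * 60 * 1000L; // assumed safe age: 30 minutes
        Timer timer = new Timer("tmp-upload-sweeper", true);
        timer.schedule(new TimerTask() {
            @Override public void run() {
                File[] files = tmpDir.listFiles();
                if (files == null) return;
                long cutoff = System.currentTimeMillis() - maxAgeMs;
                for (File f : files) {
                    // Only touch the commons-fileupload pattern from the thread.
                    if (f.getName().startsWith("upload_") && f.getName().endsWith(".tmp")
                            && f.lastModified() < cutoff) {
                        f.delete();
                    }
                }
            }
        }, 0, 5 * 60 * 1000L); // sweep every five minutes
    }
}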
When disk space is eventually exhausted, this is the error message that is repeatedly seen on the master host:

2013-01-07 19:22:16,911 [http-bio-8090-exec-2657] [] ERROR org.apache.solr.servlet.SolrDispatchFilter [] - org.apache.commons.fileupload.FileUploadBase$IOFileUploadException: Processing of multipart/form-data request failed. No space left on device
at org.apache.commons.fileupload.FileUploadBase.parseRequest(FileUploadBase.java:367)
at org.apache.commons.fileupload.servlet.ServletFileUpload.parseRequest(ServletFileUpload.java:126)
at org.apache.solr.servlet.MultipartRequestParser.parseParamsAndFillStreams(SolrRequestParsers.java:344)
at org.apache.solr.servlet.StandardRequestParser.parseParamsAndFillStreams(SolrRequestParsers.java:397)
at org.apache.solr.servlet.SolrRequestParsers.parse(SolrRequestParsers.java:115)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:244)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
... truncated stack trace

I am running Solr 3.6 on an Ubuntu 12.04 server. I am considering working around this by pulling out as
Re: Pause and resume indexing on SolR 4 for backups
The point was as much about how to use a backup as how to make one in the first place. The replication handler can handle spitting out a backup, but there's no straightforward way to tell Solr to switch to another set of index files instead. You'd have to do clever stuff with the CoreAdminHandler, I reckon. Upayavira

On Wed, Jan 9, 2013, at 09:27 PM, Paul Jungwirth wrote: Yes, I agree about making sure the backups actually work, whatever the approach. Thanks for your reply and all you've contributed to the Solr/Lucene community. The Lucene in Action book has been a huge help to me. Paul

On Wed, Jan 9, 2013 at 12:16 PM, Otis Gospodnetic otis.gospodne...@gmail.com wrote: Hi Paul, Hot backup is OK. There was a thread on this topic yesterday and the day before. But you should always try running from backup regardless of what anyone says here, because if you have to do that one day you want to know you verified it :) Otis -- Solr ElasticSearch Support http://sematext.com/

On Wed, Jan 9, 2013 at 3:12 PM, Paul Jungwirth p...@illuminatedcomputing.com wrote: Are you sure a commit didn't happen between? Also, a background merge might have happened. As to using a backup, you are right, just stop solr, put the snapshot into index/data, and restart. This was mentioned before but seems not to have gotten any attention: can't you use the ReplicationHandler by just going to a URL like this?: http://host:8080/solr/replication?command=backup&location=/home/jboss/backup The 2nd edition Lucene in Action book describes a way to take hot backups without stopping your IndexWriter (pp. 374ff), and it appears that ReplicationHandler uses a similar strategy if I'm reading the code correctly (Solr 3.6.1; I guess v4 is the same). It'd be great if someone more knowledgeable could confirm that you can use the ReplicationHandler to take hot backups. I'm surprised to see such a long thread about starting/stopping index jobs when there is such an easy answer. Or am I mistaken and at risk of corrupt backups if I use it? Thanks, Paul -- _ Pulchritudo splendor veritatis. -- _ Pulchritudo splendor veritatis.
SOLR/Velocity Test Cases
Hi, I'm trying to write some tests based on SolrTestCaseJ4 that test using velocity in SOLR. I found VelocityResponseWriterTest.java, but this does not test that. In fact it has a todo to do what I want to do. Anyone have an example out there? I just need to check if velocity is loaded with my configuration. Any help is appreciated.
Re: How to run many MoreLikeThis request efficiently?
Any comments on this? Thanks very much in advance! 2013/1/9 Yandong Yao yydz...@gmail.com Hi Solr Guru, I have two set of documents in one SolrCore, each set has about 1M documents with different document type, say 'type1' and 'type2'. Many documents in first set are very similar with 1 or 2 documents in the second set, What I want to get is: for each document in set 2, return the most similar document in set 1 using either 'MoreLikeThisHandler' or 'MoreLikeThisComponent'. Currently I use following code to get the result, while it will send far too many request to Solr server serially. Is there any way to enhance this besides using multi-threading? Thanks very much! for each document in set 2 whose type is 'type2' run MoreLikeThis request against Solr server and get the most similar document end. Regards, Yandong
what is difference between 4.1 and 5.x
Just curious as to what the difference is between 4.1 and 5.0, i.e. is 4.1 a maintenance branch for what is currently 4.0, or are they very different designs/architectures?
Re: CoreAdmin STATUS performance
On 1/9/2013 10:38 AM, Shahar Davidson wrote: Hi All, I have a client app that uses SolrJ and which requires to collect the names (and just the names) of all loaded cores. I have about 380 Solr Cores on a single Solr server (net indices size is about 220GB). Running the STATUS action takes about 800ms - that seems a bit too long, given my requirements. So here are my questions: 1) Is there any way to get _only_ the core Name of all cores? If you have access to the filesystem, you could just read solr.xml where all cores are listed.
Re: what is difference between 4.1 and 5.x
On 1/9/2013 5:11 PM, solr-user wrote: just curious as to what the difference is between 4.1 and 5.0 i.e. is 4.1 a maintenance branch for what is currently 4.0 or are they very different designs/architectures There are several code branches in the SVN repository. I'll talk about three of them here. The first is lucene_solr_4_0, which is the branch that 4.0.0 was released from. The second is called branch_4x, which is the 4.x development branch. This includes a version number of 4.1 right now. The third branch isn't really a branch - it's the main development area, called trunk. The trunk currently includes a version number of 5.0. Very soon now, a lucene_solr_4_1 branch will be created from which version 4.1 will get released. When that happens, branch_4x will get renumbered to 4.2. At some point in the future, trunk will be copied to another branch called branch_5x, and then trunk will have its internal version number changed to 6.0. New development happens on both branch_4x and trunk. Right now, both development trees are actually very similar - most of the changes that have happened in the last few months have been made to both. Eventually, someone will come up with a major design overhaul that won't be appropriate to include in branch_4x. That kind of change will only get put into trunk. Thanks, Shawn
Re: CoreAdmin STATUS performance
On 1/9/2013 8:38 AM, Shahar Davidson wrote: I have a client app that uses SolrJ and which requires to collect the names (and just the names) of all loaded cores. I have about 380 Solr Cores on a single Solr server (net indices size is about 220GB). Running the STATUS action takes about 800ms - that seems a bit too long, given my requirements. So here are my questions: 1) Is there any way to get _only_ the core Name of all cores? 2) Why does the STATUS request take such a long time and is there a way to improve its performance? I'm curious why 800 milliseconds isn't fast enough. How often do you actually need to gather this information? If you are incorporating it into something that will get accessed a lot (such as a status servlet page), put a minimum interval capability into the part of the program that contacts solr. If it's been less than that minimum interval (5-10 seconds could be a recommended starting point) since the last time the information was gathered, just use the previously stored response rather than make a new request. I have used this approach in a homegrown status servlet written with SolrJ. I have been trying to come up with a way to generalize the paradigm so it can be incorporated directly into a future SolrJ version. Thanks, Shawn
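A minimal sketch of that throttling idea with SolrJ's CoreAdminRequest; the 10-second interval is the arbitrary starting point suggested above, and the server is assumed to point at the Solr root URL:

import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.request.CoreAdminRequest;
import org.apache.solr.client.solrj.response.CoreAdminResponse;
import org.apache.solr.common.params.CoreAdminParams.CoreAdminAction;

public class CachedCoreNames {
    private final SolrServer server;
    private final long minIntervalMs = 10000L; // ~10s floor, as suggested
    private long lastFetch = 0L;
    private List<String> cachedNames = new ArrayList<String>();

    public CachedCoreNames(SolrServer adminServer) {
        this.server = adminServer; // e.g. an HttpSolrServer at http://host:port/solr
    }

    public synchronized List<String> getCoreNames() throws Exception {
        long now = System.currentTimeMillis();
        if (now - lastFetch < minIntervalMs && !cachedNames.isEmpty()) {
            return cachedNames; // reuse the previous STATUS response
        }
        CoreAdminRequest req = new CoreAdminRequest();
        req.setAction(CoreAdminAction.STATUS);
        CoreAdminResponse rsp = req.process(server);
        List<String> names = new ArrayList<String>();
        for (int i = 0; i < rsp.getCoreStatus().size(); i++) {
            names.add(rsp.getCoreStatus().getName(i)); // each key is a core name
        }
        cachedNames = names;
        lastFetch = now;
        return cachedNames;
    }
}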
Re: SOLR/Velocity Test Cases
Marcos - I just happen to be tinkering with VrW over the last few days (to get some big improvements across the board with it and the /browse UI into Solr 5.0, and maybe eventually 4.x too), so I whipped up such a test case just now. Here's the short and sweet version:

public void testVelocityResponseWriterRegistered() {
  QueryResponseWriter writer = h.getCore().getQueryResponseWriter("velocity");
  assertTrue("VrW registered check", writer instanceof VelocityResponseWriter);
}

This required that I put in the test solrconfig.xml:

<queryResponseWriter name="velocity" class="solr.VelocityResponseWriter"/>

(which was not there before, as it wasn't needed for the direct VrW test that already was there). I added another test too, to check a template from the conf/velocity directory being rendered, like this:

public void testSolrResourceLoaderTemplate() throws Exception {
  assertEquals("0", h.query(req("q", "*:*", "wt", "velocity", "v.template", "test")));
}

And I added a conf/velocity/test.vm file with just this in it:

$response.response.response.numFound

So there ya go... I'll commit these in hopefully the near future along with the other related stuff. I'm curious - what are you using VrW for? Erik

On Jan 9, 2013, at 17:43, Marcos Mendez wrote: Hi, I'm trying to write some tests based on SolrTestCaseJ4 that test using velocity in SOLR. I found VelocityResponseWriterTest.java, but this does not test that. In fact it has a todo to do what I want to do. Anyone have an example out there? I just need to check if velocity is loaded with my configuration. Any help is appreciated.
Schema Field Names i18n
Anyone have experience with internationalizing the field names in the Solr schema, so users in different languages can specify fields in their own language? My first thought would be to create a custom search component or query parser that would convert localized field names back to the English names in the schema, but I haven't dived in too deep yet. Any input would be greatly appreciated. Thanks, Daryl
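Before diving into a full query-parser plugin, a client-side rewrite may be enough; here is a rough, hypothetical sketch of the mapping idea Daryl describes - the French field names and the regex are illustrative only, and a server-side variant would do the same substitution inside a custom QParserPlugin:

import java.util.HashMap;
import java.util.Map;

public class FieldNameTranslator {
    // Hypothetical locale-specific aliases -> real schema field names.
    private static final Map<String, String> FR = new HashMap<String, String>();
    static {
        FR.put("titre", "title");
        FR.put("auteur", "author");
    }

    /** Rewrite e.g. "titre:solr AND auteur:hatcher" into schema field names. */
    public static String translate(String userQuery) {
        String q = userQuery;
        for (Map.Entry<String, String> e : FR.entrySet()) {
            // \b keeps us from rewriting substrings of longer field names.
            q = q.replaceAll("\\b" + e.getKey() + ":", e.getValue() + ":");
        }
        return q;
    }
}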
Re: How to run many MoreLikeThis request efficiently?
Patience, young Yandong :) Multi-threading *in your application* is the way to go. Alternatively, one could write a custom SearchComponent that is called once and inside of which the whole work is done after just one call to it. This component could then write the output somewhere, like in a new index since making a blocking call to it may time out. Otis Solr ElasticSearch Support http://sematext.com/ On Jan 9, 2013 6:07 PM, Yandong Yao yydz...@gmail.com wrote: Any comments on this? Thanks very much in advance! 2013/1/9 Yandong Yao yydz...@gmail.com Hi Solr Guru, I have two set of documents in one SolrCore, each set has about 1M documents with different document type, say 'type1' and 'type2'. Many documents in first set are very similar with 1 or 2 documents in the second set, What I want to get is: for each document in set 2, return the most similar document in set 1 using either 'MoreLikeThisHandler' or 'MoreLikeThisComponent'. Currently I use following code to get the result, while it will send far too many request to Solr server serially. Is there any way to enhance this besides using multi-threading? Thanks very much! for each document in set 2 whose type is 'type2' run MoreLikeThis request against Solr server and get the most similar document end. Regards, Yandong
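A hedged sketch of the multi-threaded route with SolrJ against the MoreLikeThisHandler - the handler name, similarity field, type field, and pool size are all assumptions to adapt to the actual schema:

import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class ParallelMlt {
    public static void main(String[] args) throws Exception {
        // HttpSolrServer is thread-safe, so one instance can be shared.
        final HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr");
        ExecutorService pool = Executors.newFixedThreadPool(8); // assumed size

        List<String> type2Ids = Arrays.asList("doc-a", "doc-b"); // placeholder ids of 'type2' docs
        for (final String id : type2Ids) {
            pool.submit(new Runnable() {
                public void run() {
                    try {
                        SolrQuery q = new SolrQuery("id:" + id);
                        q.setRequestHandler("/mlt");   // MoreLikeThisHandler
                        q.set("mlt.fl", "content");    // assumed similarity field
                        q.set("fq", "type:type1");     // only match docs from set 1
                        q.setRows(1);                  // most similar doc only
                        System.out.println(id + " -> " + solr.query(q).getResults());
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}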
Re: SOLR/Velocity Test Cases
And to add a little to this, since it looked ugly below, the $response.response.response.numFound thing is something I'm going to improve to make it leaner and cleaner to get at the actual result set and other response structures. $response is the actual SolrQueryResponse, and navigating that down to numFound through NamedLists and so on is pretty ridiculous looking. Erik On Jan 9, 2013, at 19:54 , Erik Hatcher wrote: Marcos - I just happen to be tinkering with VrW over the last few days (to get some big improvements across the board with it and the /browse UI into Solr 5.0, and maybe eventually 4.x too), so I whipped up such a test case just now. Here's the short and sweet version: public void testVelocityResponseWriterRegistered() { QueryResponseWriter writer = h.getCore().getQueryResponseWriter(velocity); assertTrue(VrW registered check, writer instanceof VelocityResponseWriter); } This required that I put in the test solrconfig.xml queryResponseWriter name=velocity class=solr.VelocityResponseWriter/ (which was not there before, as it wasn't needed for the direct VrW test that already was there). I added another test too, to check a template from the conf/velocity directory being rendered like this: public void testSolrResourceLoaderTemplate() throws Exception { assertEquals(0, h.query(req(q,*:*, wt,velocity,v.template,test))); } And I added a conf/velocity/test.vm file with just this in it: $response.response.response.numFound So there ya go... I'll commit these in hopefully the near future along with the other related stuff. I'm curious - what are you using VrW for? Erik On Jan 9, 2013, at 17:43 , Marcos Mendez wrote: Hi, I'm trying to write some tests based on SolrTestCaseJ4 that test using velocity in SOLR. I found VelocityResponseWriterTest.java, but this does not test that. In fact it has a todo to do what I want to do. Anyone have an example out there? I just need to check if velocity is loaded with my configuration. Any help is appreciated.
Re: DIH fails after processing roughly 10million records
At this scale, your indexing job is prone to break in various ways. If you want this to be reliable, it should be able to restart in the middle of an upload, rather than starting over.

On 01/08/2013 10:19 PM, vijeshnair wrote: Yes Shawn, the batchSize is -1 only, and I also have the mergeScheduler exactly as you mentioned. When I had this problem in Solr 3.4, I did extensive googling, gathered many of the tweaks and tunings from different blogs and forums, and configured the 4.0 instance with them. My next full run is scheduled for this weekend; I will try with a higher MySQL wait_timeout value and update you on the outcome.
Re: SolrCloud - Query performance degrades with multiple servers
Hi Yonik, Could you merge this feature into the 4.0 branch? We tried to use 4.1; it did solve the CPU spike, but we ran into other issues. As we are very tight on schedule, it would be very beneficial if you could merge this feature into the 4.0 branch. Let me know. Thanks
Re: SolrCloud graph status is out of date
It may be able to do that because it's forwarding requests to other nodes that are up? Would be good to dig into the logs to see if you can narrow in on the reason for the recovery_failed. - Mark On Jan 9, 2013, at 8:52 PM, Zeng Lames lezhi.z...@gmail.com wrote: Hi , we meet below strange case in production environment. from the Solr Admin Console - Cloud - Graph, we can find that one node is in recovery_failed status. but at the same time, we found that the recovery_failed node can server query/update request normally. any idea about it? thanks! -- Best Wishes! Lames
Re: SolrCloud - Query performance degrades with multiple servers
On 1/9/2013 7:01 PM, sausarkar wrote: Hi Yonik, Could you merger this feature with 4.0 branch, We tried to use 4.1 it did solve the CPU spike but we did get other issues. As we are very tight on schedule so it would very beneficial if you could merge this feature with 4.0 branch. 4.1 *is* the next release after 4.0. At this point, with 4.1 close to release, there will not be a 4.0.1. Thanks, Shawn
Setting up new SolrCloud - need some guidance
I have a lot of experience with Solr, starting with 1.4.0 and currently running 3.5.0 in production. I am working on a 4.1 upgrade, but I have not touched SolrCloud at all. I now need to set up a brand new Solr deployment to replace a custom Lucene system, and due to the way the client works, SolrCloud is going to be the only reasonable way to have redundancy. I am planning to have two Solr servers (each also running standalone zookeeper) plus a third low-end machine that will complete the zookeeper ensemble. I'm planning to set it up with numShards=1, replica 2. It will need to support several different collections. Although it's possible that those collections will all use the same schema and config at first, it's likely that they will diverge before too long. What would be the best practice for setting up zookeeper for this? Would I use multiple zk chroots, or put everything into one? I've been trying to figure this out on my own, without much luck. Can anyone share some known good ZK/SolrCloud configs? What gotchas am I likely to run into? The existing config that I've come up with for this system heavily uses xinclude in solrconfig.xml. Is it possible to use xinclude when the config files are in zookeeper, or will I have to re-combine it? Thanks, Shawn
Re: Setting up new SolrCloud - need some guidance
I'd put everything into one. You can upload different named sets of config files and point collections either to the same sets or different sets. You can really think about it the same way you would setting up a single node with multiple cores. The main difference is that it's easier to share sets of config files across collections if you want to. You don't need to at all though. I'm not sure if xinclude works with zk, but I don't think it does. - Mark On Jan 9, 2013, at 10:31 PM, Shawn Heisey s...@elyograg.org wrote: I have a lot of experience with Solr, starting with 1.4.0 and currently running 3.5.0 in production. I am working on a 4.1 upgrade, but I have not touched SolrCloud at all. I now need to set up a brand new Solr deployment to replace a custom Lucene system, and due to the way the client works, SolrCloud is going to be the only reasonable way to have redundancy. I am planning to have two Solr servers (each also running standalone zookeeper) plus a third low-end machine that will complete the zookeeper ensemble. I'm planning to set it up with numShards=1, replica 2. It will need to support several different collections. Although it's possible that those collections will all use the same schema and config at first, it's likely that they will diverge before too long. What would be the best practice for setting up zookeeper for this? Would I use multiple zk chroots, or put everything into one? I've been trying to figure this out on my own, without much luck. Can anyone share some known good ZK/SolrCloud configs? What gotchas am I likely to run into? The existing config that I've come up with for this system heavily uses xinclude in solrconfig.xml. Is it possible to use xinclude when the config files are in zookeeper, or will I have to re-combine it? Thanks, Shawn
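With everything under one chroot, the SolrJ side stays simple too; a small sketch where the ZooKeeper hosts, ports, chroot path, and collection name are placeholders matching the three-node ensemble described above:

import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class CloudClientExample {
    public static void main(String[] args) throws Exception {
        // One chroot (/solr) shared by every collection; the client reads
        // cluster state from ZooKeeper and routes to live nodes itself.
        CloudSolrServer solr = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181/solr");
        solr.setDefaultCollection("collection1"); // switch per collection as needed
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "1");
        solr.add(doc);
        solr.commit();
        solr.shutdown();
    }
}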
Re: How to run many MoreLikeThis request efficiently?
Hi Otis, Really appreciate your help on this!! Will go with multi-thread firstly, and then provide a custom component when performance is not good enough. Regards, Yandong 2013/1/10 Otis Gospodnetic otis.gospodne...@gmail.com Patience, young Yandong :) Multi-threading *in your application* is the way to go. Alternatively, one could write a custom SearchComponent that is called once and inside of which the whole work is done after just one call to it. This component could then write the output somewhere, like in a new index since making a blocking call to it may time out. Otis Solr ElasticSearch Support http://sematext.com/ On Jan 9, 2013 6:07 PM, Yandong Yao yydz...@gmail.com wrote: Any comments on this? Thanks very much in advance! 2013/1/9 Yandong Yao yydz...@gmail.com Hi Solr Guru, I have two set of documents in one SolrCore, each set has about 1M documents with different document type, say 'type1' and 'type2'. Many documents in first set are very similar with 1 or 2 documents in the second set, What I want to get is: for each document in set 2, return the most similar document in set 1 using either 'MoreLikeThisHandler' or 'MoreLikeThisComponent'. Currently I use following code to get the result, while it will send far too many request to Solr server serially. Is there any way to enhance this besides using multi-threading? Thanks very much! for each document in set 2 whose type is 'type2' run MoreLikeThis request against Solr server and get the most similar document end. Regards, Yandong
Re: SolrCloud graph status is out of date
Thanks Mark, will dig further into the logs. There is another related problem: we have collections with 3 shards (2 nodes per shard), and the collection has about 1000 records in it. Unfortunately, after the leader is down, the replica node fails to become the leader. The detail is: after the leader node goes down, the replica node tries to become the new leader, but it says:

ShardLeaderElectionContext.runLeaderProcess(131) - Running the leader process.
ShardLeaderElectionContext.shouldIBeLeader(331) - Checking if I should try and be the leader.
ShardLeaderElectionContext.shouldIBeLeader(339) - My last published State was Active, it's okay to be the leader.
ShardLeaderElectionContext.runLeaderProcess(164) - I may be the new leader - try and sync
SyncStrategy.sync(89) - Sync replicas to http://localhost:8486/solr/exception/
PeerSync.sync(182) - PeerSync: core=exception url=http://localhost:8486/solr START replicas=[http://localhost:8483/solr/exception/] nUpdates=100
PeerSync.sync(250) - PeerSync: core=exception url=http://localhost:8486/solr DONE. We have no versions. sync failed.
SyncStrategy.log(114) - Sync Failed
ShardLeaderElectionContext.rejoinLeaderElection(311) - There is a better leader candidate than us - going back into recovery
DefaultSolrCoreState.doRecovery(214) - Running recovery - first canceling any ongoing recovery

After that, it tries to recover from the leader node, which is already down; then recovery + failed + recovery, over and over. Is it related to SOLR-3939 and SOLR-3940? But the index data isn't empty.

On Thu, Jan 10, 2013 at 10:09 AM, Mark Miller markrmil...@gmail.com wrote: It may be able to do that because it's forwarding requests to other nodes that are up? Would be good to dig into the logs to see if you can narrow in on the reason for the recovery_failed. - Mark

On Jan 9, 2013, at 8:52 PM, Zeng Lames lezhi.z...@gmail.com wrote: Hi, we meet the below strange case in our production environment. From the Solr Admin Console - Cloud - Graph, we can find that one node is in recovery_failed status, but at the same time, we found that the recovery_failed node can serve query/update requests normally. Any idea about it? thanks! -- Best Wishes! Lames

-- Best Wishes! Lames