Re: Which Tokeniser (and/or filter)
Apologies if things were a little vague. Given the example snippet to index (numbered to show the searches that need to match)...

1: i am a sales-manager in here
2: using asp.net and .net daily
3: working in design.
4: using something called sage 200. and i'm fluent
5: german sausages.
6: busy AE dept earning £10,000 annually

... all with newlines in place. We need to be able to match...

1. sales
1. sales manager
1. sales-manager
2. .net
2. asp.net
3. design
4. sage 200
6. AE
6. £10,000

...but NOT match "fluent german" across lines 4 and 5, since there's a newline between them when indexed, but not when searched.

Don't the filters (WDF in this case) create multiple tokens, so that splitting on the period in "asp.net" would create tokens for asp, asp., asp.net, .net and net?

Cheers,
Rob

--
IntelCompute
Web Design and Online Marketing
http://www.intelcompute.com

-----Original Message-----
From: Chris Hostetter hossman_luc...@fucit.org
Reply-to: solr-user@lucene.apache.org
To: solr-user@lucene.apache.org
Subject: Re: Which Tokeniser (and/or filter)
Date: Tue, 7 Feb 2012 15:02:36 -0800 (PST)

: This all seems a bit too much work for such a real-world scenario?

You haven't really told us what your scenario is. You said you want to split tokens on whitespace, full-stop (aka: period) and comma only, but then in response to some suggestions you added comments about other things that you never mentioned previously...

1) Evidently you don't want the "." in "foo.net" to cause a split in tokens?
2) Evidently you not only want token splits on newlines, but also position gaps to prevent phrases matching across newlines.

...these are kind of important details that affect the suggestions people might give you. Can you please provide some concrete examples of the types of data you have, the types of queries you want them to match, and the types of queries you *don't* want to match?

-Hoss
Re: is there any practice to load index into RAM to accelerate solr performance?
This is true with Lucene as it stands. It would be much faster if there were a specialized in-memory index such as is typically used with high-performance search engines.

On Tue, Feb 7, 2012 at 9:50 PM, Lance Norskog goks...@gmail.com wrote:

Experience has shown that it is much faster to run Solr with a small amount of memory and let the rest of the RAM be used by the operating system disk cache. That is, the OS is very good at keeping the right disk blocks in memory, much better than Solr. How much RAM is in the server, and how much RAM does the JVM get? How big are the documents, and how large is the term index for your searches? How many documents do you get with each search? And do you use filter queries? These are very powerful at limiting searches.

2012/2/7 James ljatreey...@163.com:

Is there any practice for loading the index into RAM to accelerate Solr performance? There are about 100 million documents overall, and search time is around 100 ms. I am looking for some way to speed up Solr's response time. I have seen the suggestion of using SSDs, but SSDs also cost a lot, so I want to know whether there is some way to load the index files into RAM, keep the RAM index and the disk index synchronized, and then search against the RAM index.

--
Lance Norskog
goks...@gmail.com
Re: Re: is there any practice to load index into RAM to accelerate solr performance?
But Solr itself does not have an in-memory index, am I right?

At 2012-02-08 16:17:49, Ted Dunning ted.dunn...@gmail.com wrote:

This is true with Lucene as it stands. It would be much faster if there were a specialized in-memory index such as is typically used with high-performance search engines.
Re: is there any practice to load index into RAM to accelerate solr performance?
A start may be to use a RAM disk for that. Mount it as a normal disk and have the index files stored there. Have a read here: http://en.wikipedia.org/wiki/RAM_disk

Cheers,
Patrick

2012/2/8 Ted Dunning ted.dunn...@gmail.com

This is true with Lucene as it stands. It would be much faster if there were a specialized in-memory index such as is typically used with high-performance search engines.

--
Patrick Plaatje
Senior Consultant
http://www.nmobile.nl/
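For reference, a minimal sketch of what pointing Solr at a RAM disk could look like (the mount point is illustrative): in solrconfig.xml,

  <dataDir>/mnt/ramdisk/solr/data</dataDir>

Keep in mind that the contents of a RAM disk are lost on reboot, so the on-disk index would still need to be the master copy, with something syncing the two.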
Re: is there any practice to load index into RAM to accelerate solr performance?
Hi,

This talk has some interesting details on setting up a Lucene index in RAM: http://www.lucidimagination.com/devzone/events/conferences/revolution/2011/lucene-yelp

Would be great to hear your findings!

Dmitry

2012/2/8 James ljatreey...@163.com

Is there any practice for loading the index into RAM to accelerate Solr performance? There are about 100 million documents overall, and search time is around 100 ms. I am looking for some way to speed up Solr's response time. I have seen the suggestion of using SSDs, but SSDs also cost a lot, so I want to know whether there is some way to load the index files into RAM, keep the RAM index and the disk index synchronized, and then search against the RAM index.
Query in starting solr 3.5
Hi,

I am using Solr 3.5. I moved the data import handler files from Solr 1.4 (which I used previously) to the new Solr. When I tried to start Solr 3.5, I got the following message in my log:

WARNING: XML parse warning in "solrres:/dataimport.xml", line 2, column 95: Include operation failed, reverting to fallback. Resource error reading file as XML (href='solr/conf/solrconfig_master.xml'). Reason: Can't find resource 'solr/conf/solrconfig_master.xml' in classpath or '/solr/apache-solr-3.5.0/example/multicore/core1/conf/', cwd=/solr/apache-solr-3.5.0/example

The partial content of the dataimport file that I used in Solr 1.4 is as follows:

<xi:include href="solr/conf/solrconfig_master.xml" xmlns:xi="http://www.w3.org/2001/XInclude">
  <xi:fallback>
    <xi:include href="/solr/apache-solr-3.5.0/example/multicore/IncludeFile/File1.xml"/>
    <xi:include href="/solr/apache-solr-3.5.0/example/multicore/IncludeFile/File2.xml"/>
    <xi:include href="/solr/apache-solr3.5.0/example/multicore/IncludeFile/File3"/>
  </xi:fallback>
</xi:include>

The three files given in the fallback tag are present in that location. Does Solr 3.5 support fallback? Can someone please suggest a solution?

Also, I got the following warning in my log while starting Solr 3.5:

WARNING: the luceneMatchVersion is not specified, defaulting to LUCENE_24 emulation. You should at some point declare and reindex to at least 3.0, because 2.4 emulation is deprecated and will be removed in 4.0. This parameter will be mandatory in 4.0.

The solution I found after googling is to apply a patch. Is there any option other than applying this patch to overcome the warning? Which is the best option? Kindly help me out. Thanks in advance.

--
View this message in context: http://lucene.472066.n3.nabble.com/Query-in-starting-solr-3-5-tp3725372p3725372.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: is there any practice to load index into RAM to accelerate solr performance?
On 08/02/2012 09:17, Ted Dunning wrote:

This is true with Lucene as it stands. It would be much faster if there were a specialized in-memory index such as is typically used with high performance search engines.

This could be implemented in Lucene trunk as a Codec. The challenge, though, is to come up with the right data structures. There has been some interesting research on optimizations for in-memory inverted indexes, but it usually involves changing the query evaluation algorithms as well. For reference:

http://digbib.ubka.uni-karlsruhe.de/volltexte/documents/1202502
http://www.siam.org/proceedings/alenex/2008/alx08_01transierf.pdf
http://research.google.com/pubs/archive/37365.pdf

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web; Embedded Unix, System Integration
http://www.sigram.com
Contact: info at sigram dot com
Re: Improving performance for SOLR geo queries?
Hi Erick,

if we're not doing geo searches, we filter by location tags that we attach to places. This is simply a hierarchical regional ID, which is simple to filter for, but much less flexible. We use that on the web a lot, but not on mobile, where we want to perform searches in arbitrary radii around arbitrary positions. For those location-tag kinds of queries, the average time spent in SOLR is 43 msec (I'm looking at the New Relic snapshot of the last 12 hours). I disabled our optimization again just yesterday, so for the bbox queries we're now at an average of 220 ms (same time window). That's a 5-fold increase in response time, and in peak hours it's worse than that.

I've also found a blog post from 3 years ago which outlines the inner workings of SOLR spatial indexing and searching: http://www.searchworkings.org/blog/-/blogs/23842

From that it seems as if SOLR already performs, during the index step, an optimization similar to the one we had in mind, so if I understand correctly, it doesn't even search over all records, only those that were mapped to the grid box identified during indexing.

What I would love to see is the suggested way to perform a geo query in SOLR, considering that they're so difficult to cache and expensive to run. Is the best approach to restrict the candidate set as much as possible using cheap filter queries, so that SOLR merely has to do the geo search against these subsets? How does the query planner work here? I see there's a cost attached to a filter query, but one can only set it when cache is set to false? Are cached geo queries executed last when there are cheaper filter queries to cut down on documents?

If you have a real-world practical setup to share, one that performs well in a production environment serving requests in the millions per day, that would be great.

I'd love to contribute documentation, by the way; if you knew me you'd know I'm an avid open source contributor and actually run several open source projects myself. But tell me, how can I possibly contribute answers to questions I don't have an answer to? That's why I'm here, remember :) So please, these kinds of snippy replies are not helping anyone.

Thanks
-Matthias

On Tue, Feb 7, 2012 at 3:06 PM, Erick Erickson erickerick...@gmail.com wrote:

So the obvious question is what is your performance like without the distance filters? Without that knowledge, we have no clue whether the modifications you've made had any hope of speeding up your response times. As for the docs, any improvements you'd like to contribute would be happily received.

Best
Erick

2012/2/6 Matthias Käppler matth...@qype.com:

Hi,

we need to perform fast geo lookups on an index of ~13M places, and we're running into performance problems here with SOLR. We haven't done a lot of query optimization / SOLR tuning up until now, so there's probably a lot we're missing. I was wondering if you could give me some feedback on the way we do things, whether they make sense, and especially why a supposed optimization we implemented recently seems to have no effect, when we actually thought it would help a lot.

What we do is this: our API is built on a Rails stack and talks to SOLR via a Ruby wrapper. We have a few filters that almost always apply, which we put in filter queries. Filter cache hit rate is excellent, about 97%, and cache size caps at 10k filters (max size is 32k, but it never seems to reach that many, probably because we replicate / delta update every few minutes). Still, geo queries are slow, about 250-500 msec on average. We send them with cache=false, so as not to flood the fq cache and cause undesirable evictions.

Now our idea was this: while the actual geo queries are poorly cacheable, we could clearly identify geographical regions which are more often queried than others (naturally, since we're a user-driven service). Therefore, we dynamically partition Earth into a static grid of overlapping boxes, where the grid size (the distance of the nodes) depends on the maximum allowed search radius. That way, for every user query, we can always identify a single bounding box that covers it. This larger bounding box (200 km edge length) we send to SOLR as a cached filter query, along with the actual user query, which is still sent uncached. E.g., a user asks for places within 10 km of 49.14839,8.5691; then what we send to SOLR is something like this:

fq={!bbox cache=false d=10 sfield=location_ll pt=49.14839,8.5691}
fq={!bbox cache=true d=100.0 sfield=location_ll pt=49.4684836290799,8.31165802979391}   <- this one we derive automatically

That way SOLR would intersect the two filters and return the same results as when only looking at the smaller bounding box, but keep the larger box in cache and speed up subsequent geo queries in the same regions. Or so we thought; unfortunately this approach did not help query execution times get better, at all.
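As a side note on the cost question above: if I read SOLR-2429 right, from Solr 3.4 on a cache=false filter query accepts a cost local param, and cheaper non-cached filters are evaluated first. A sketch of the intended ordering could look like this (parameter values are illustrative, not a tested setup):

fq={!bbox cache=true d=100.0 sfield=location_ll pt=49.4684836290799,8.31165802979391}
fq={!bbox cache=false cost=100 d=10 sfield=location_ll pt=49.14839,8.5691}

Here the cached 200 km box (and any other cheap filters) would cut the candidate set down before the expensive, uncached 10 km filter runs.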
Re: URI Encoding with Solr and Weblogic
Hi,

I found a solution to it. Adding the WebLogic server argument -Dfile.encoding=UTF-8 did not affect the encoding. Only a change to the .war file's weblogic.xml and redeployment of the modified .war solved it. I added the following to the weblogic.xml:

<charset-params>
  <input-charset>
    <resource-path>*</resource-path>
    <java-charset-name>UTF-8</java-charset-name>
  </input-charset>
</charset-params>

Would it make sense to include this in the shipped weblogic.xml file?

Best,
Elisabeth

On 07.02.2012 23:12, Elisabeth Adler wrote:

Hi,

I am trying to get Solr 3.3.0 to process Arabic search requests using its admin interface. I have successfully managed to set it up on Tomcat using the URIEncoding attribute, but fail miserably on WebLogic 10. Invoking the URL http://localhost:7012/solr/select/?q=تهنئة returns the XML below:

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">0</int>
    <lst name="params">
      <str name="q">تÙÙئة</str>
    </lst>
  </lst>
  <result name="response" numFound="0" start="0"/>
</response>

The search term is just gibberish. Running the query through Luke or Tomcat returns the expected result and renders the search term correctly. I have tried to change the URI encoding and JVM default encoding by setting the following start-up arguments in WebLogic: -Dfile.encoding=UTF-8 -Dweblogic.http.URIDecodeEncoding=UTF-8. I can see them being set through Solr's admin interface. They don't have any impact though.

I am running out of ideas on how to get this working. Any thoughts and pointers are much appreciated.

Thanks,
Elisabeth
How to reindex about 10Mio. docs
Hello folks,

I want to reindex about 10 million docs from one Solr (1.4.1) to another Solr (1.4.1). I changed my schema.xml (field types sint to slong), so standard replication would fail. What is the fastest and smartest way to manage this?

This here sounds great (SolrEntityProcessor): http://www.searchworkings.org/blog/-/blogs/importing-data-from-another-solr
But would it work with Solr 1.4.1?

Best Regards
Vadim
Re: How to reindex about 10Mio. docs
I want to reindex about 10 million docs from one Solr (1.4.1) to another Solr (1.4.1). I changed my schema.xml (field types sint to slong), so standard replication would fail. What is the fastest and smartest way to manage this? This here sounds great (SolrEntityProcessor): http://www.searchworkings.org/blog/-/blogs/importing-data-from-another-solr But would it work with Solr 1.4.1?

SolrEntityProcessor is not available in 1.4.1. I would dump the stored fields into a comma-separated file and use http://wiki.apache.org/solr/UpdateCSV to feed it into the new Solr instance.
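For reference, loading such a dump could then look something like this (URL and file name are illustrative):

curl 'http://localhost:8983/solr/update/csv?commit=true' --data-binary @export.csv -H 'Content-type:text/plain; charset=utf-8'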
Re: is there any practice to load index into RAM to accelerate solr performance?
I concur with this. As long as the index segment files are cached in the OS file cache, performance is about as good as it gets. Pulling segment files into RAM inside the JVM process may actually be slower, given Lucene's existing data structures and algorithms for reading segment file data.

If you have a very large index (much bigger than available RAM), then it will only be slow when accessing the disk for uncached segment files. In that case you might consider sharding the index across more than one server and using distributed searching (possibly SolrCloud, etc.). How large is your index in GB?

You can also try making the index files smaller by removing indexed/stored fields you don't need, compressing large stored fields, etc. Also maybe turn off storing norms, term frequencies, positions, vectors and the like if you don't need them.

On Feb 8, 2012, at 3:17 AM, Ted Dunning wrote:

This is true with Lucene as it stands. It would be much faster if there were a specialized in-memory index such as is typically used with high performance search engines.
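For reference, a minimal sketch of what turning those options off could look like in schema.xml (field name and type are illustrative, and note that omitting positions breaks phrase queries on that field):

<field name="body" type="text" indexed="true" stored="true"
       omitNorms="true" omitTermFreqAndPositions="true"
       termVectors="false"/>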
Re: How to reindex about 10Mio. docs
Hi Ahmet,

thanks for the quick response :) I'd already thought the same... and it will be a pain to export and import this huge doc-set as CSV. Is there another solution?

Regards
Vadim

2012/2/8 Ahmet Arslan iori...@yahoo.com:

SolrEntityProcessor is not available in 1.4.1. I would dump the stored fields into a comma-separated file and use http://wiki.apache.org/solr/UpdateCSV to feed it into the new Solr instance.
usage of /etc/jetty.xml when debugging Solr in Eclipse
Hi,

I am following http://www.lucidimagination.com/devzone/technical-articles/setting-apache-solr-eclipse in order to be able to debug Solr in Eclipse. I got it working fine.

Now, I usually use ./etc/jetty.xml to set the logging configuration. When starting Jetty in Eclipse I don't see any log files created, so I guessed jetty.xml is not being used. So I added it to the RunJetty Advanced configuration (Additional jetty.xml), but in that case something goes wrong, as I get a 'java.net.BindException: Address already in use: JVM_Bind' error, as if something is started twice.

So my question is: can jetty.xml be used while debugging in Eclipse? If so, how? I would like to use the same configuration I use when I am just changing xml stuff in Solr and starting with 'java -jar start.jar'.

Thanks in advance

--
View this message in context: http://lucene.472066.n3.nabble.com/usage-of-etc-jetty-xml-when-debugging-Solr-in-Eclipse-tp3725588p3725588.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: How to reindex about 10Mio. docs
Another problem appeared ;) How can I export my docs in CSV format? In Solr 3.1+ I can use the query param wt=csv, but in Solr 1.4.1?

Best Regards
Vadim

2012/2/8 Vadim Kisselmann v.kisselm...@googlemail.com:
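For what it's worth, since 1.4.1 has no CSV response writer, one rough option is to page through the index with SolrJ and write the CSV yourself. A minimal sketch (field names and URL are illustrative, all exported fields are assumed to be stored, and deep start/rows paging gets slow on 10M docs):

import java.io.PrintWriter;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrDocumentList;

public class CsvExport {
    public static void main(String[] args) throws Exception {
        String[] fields = { "id", "name", "price" }; // adjust to your schema
        CommonsHttpSolrServer server =
            new CommonsHttpSolrServer("http://localhost:8983/solr");
        PrintWriter out = new PrintWriter("export.csv", "UTF-8");
        int rows = 1000;
        long start = 0, numFound = Long.MAX_VALUE;
        while (start < numFound) {
            // page through the whole index, rows docs at a time
            SolrQuery q = new SolrQuery("*:*");
            q.setStart((int) start);
            q.setRows(rows);
            SolrDocumentList docs = server.query(q).getResults();
            numFound = docs.getNumFound();
            for (SolrDocument doc : docs) {
                StringBuilder line = new StringBuilder();
                for (int i = 0; i < fields.length; i++) {
                    if (i > 0) line.append(',');
                    Object v = doc.getFieldValue(fields[i]);
                    // quote each value and escape embedded quotes
                    String s = (v == null) ? "" : v.toString().replace("\"", "\"\"");
                    line.append('"').append(s).append('"');
                }
                out.println(line);
            }
            start += rows;
        }
        out.close();
    }
}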
Custom Document Clustering and Mahout Integration
Hi all,

I am trying to write a custom document clustering component that should take all the docs in a commit and cluster them. Solr version: 3.5.0. Main class:

public class KMeansClusteringEngine extends DocumentClusteringEngine implements SolrEventListener

I added a newSearcher event listener, and that works as expected. But when is the document clustering called? I have two functions of DocumentClusteringEngine in my custom code, but when do they get called? The wiki page says to add clustering.collection=true, but I am not sure, as my guess is that document clustering is in no way related to search.

public NamedList cluster(SolrParams params)
public NamedList cluster(DocSet docSet, SolrParams solrParams)

Note: I am actually trying to integrate Solr 3.5 with Mahout 0.5 for incremental clustering (i.e. mapping new docs to an existing cluster to avoid complete re-clustering), basing my work on this GitHub code: https://github.com/gsingers/ApacheCon2010/blob/master/src/main/java/com/grantingersoll/intell/clustering/KMeansClusteringEngine.java

I would love to get some support from you.

--
Regards,
S.Selvam
http://knackforge.com
Re: struggling with solr.WordDelimiterFilterFactory and periods . or dots
Hmmm, seems OK. Did you re-index after any schema changes?

You'll learn to love admin/analysis for questions like this; that page should show you what the actual tokenization results are. Make sure to tick the verbose check boxes.

Best
Erick

On Tue, Feb 7, 2012 at 10:52 PM, geeky2 gee...@hotmail.com wrote:

hello all,

i am struggling with getting solr.WordDelimiterFilterFactory to behave as indicated in the solr book (Smiley) on page 54. the example in the book reads like this: "Here is an example exercising all options: WiFi-802.11b to Wi, Fi, WiFi, 802, 11, 80211, b, WiFi80211b".

essentially, i have the same requirement with embedded periods and need to return a successful search on a field even if the user does NOT enter the period. i have a field, itemNo, that can contain periods. example content in the itemNo field: B12.0123.

when the user searches on this field, they need to be able to enter an itemNo without the period and still find the item. example: user enters B120123 and a document is returned with B12.0123. unfortunately, the search will NOT return the appropriate document if the user enters B120123. however, the search does work if the user enters "B12 0123" (a space in place of the period).

can someone help me understand what is missing from my configuration? this is snipped from my schema.xml file:

<fields>
  ...
  <field name="itemNo" type="text" indexed="true" stored="true"/>
  ...
</fields>

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

--
View this message in context: http://lucene.472066.n3.nabble.com/struggling-with-solr-WordDelimiterFilterFactory-and-periods-or-dots-tp3724822p3724822.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Which Tokeniser (and/or filter)
Yes, the WordDelimiterFilter creates multiple tokens. But that has nothing to do with the multiValued suggestion. You can get exactly what you want by:

1) setting multiValued="true" in your schema file and re-indexing. Say positionIncrementGap is set to 100.

2) when you index, adding the field once for each sentence, so your doc looks something like:

<doc>
  <field name="sentences">i am a sales-manager in here</field>
  <field name="sentences">using asp.net and .net daily</field>
  ...
</doc>

3) searching like "sales manager"~100

Best
Erick

On Wed, Feb 8, 2012 at 3:05 AM, Rob Brown r...@intelcompute.com wrote:
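For reference, a minimal schema sketch of that setup (field and type names are illustrative, and the analysis chain is cut down to the essentials):

<fieldType name="text_sentences" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="sentences" type="text_sentences" indexed="true" stored="true" multiValued="true"/>

The gap of 100 positions between the values is what keeps a sloppy phrase query like "sales manager"~99 from matching across two sentences; as far as I understand, the slop has to stay below the positionIncrementGap for that guarantee to hold.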
Re: Fields not indexed?
What does your schema for the fields look like?

On Wed, Feb 8, 2012 at 2:41 PM, Radu Toev radut...@gmail.com wrote:

Hi,

I am really new to Solr, so I apologize if the question is a little off. I was playing with the DataImportHandler and tried to index a table in a MS SQL database. I configured my datasource with the necessary parameters and added three fields with column (uppercase) and name:

<field column="ID" name="machineId"/>
<field column="SERIAL" name="machineSerial"/>
<field column="IVK" name="machineIvk"/>

The full-import command seems to have completed successfully, and I see that the number of documents processed is the same as the number of entries in my table. However, when I try to run a *:* query from the admin console, I only get responses in the form:

<doc>
  <float name="score">1.0</float>
  <str name="id">1</str>
</doc>

I'm not sure how to get to the bottom of this. Thanks.

--
Regards,
Dmitry Kan
Re: Fields not indexed?
The schema.xml is the default file that comes with Solr 3.5; I didn't change anything there.

On Wed, Feb 8, 2012 at 2:45 PM, Dmitry Kan dmitry@gmail.com wrote:

What does your schema for the fields look like?
Re: Fields not indexed?
Well, you should add these fields in schema.xml, otherwise Solr won't know about them.

On Wed, Feb 8, 2012 at 2:48 PM, Radu Toev radut...@gmail.com wrote:

The schema.xml is the default file that comes with Solr 3.5; I didn't change anything there.

--
Regards,
Dmitry Kan
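For reference, a minimal sketch of what those declarations could look like in schema.xml (the field types here are assumptions; pick whatever matches your data):

<field name="machineId" type="string" indexed="true" stored="true"/>
<field name="machineSerial" type="string" indexed="true" stored="true"/>
<field name="machineIvk" type="string" indexed="true" stored="true"/>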
Re: Fields not indexed?
I just realized that as I pushed the send button :P Thanks, I'll have a look.

On Wed, Feb 8, 2012 at 2:58 PM, Dmitry Kan dmitry@gmail.com wrote:

Well, you should add these fields in schema.xml, otherwise Solr won't know about them.
Re: usage of /etc/jetty.xml when debugging Solr in Eclipse
Hi,

run-jetty-run issue #9:

"... In the VM Arguments of your launch configuration set -Drjrxml=./jetty.xml. If jetty.xml is in the root of your project it will be used (you can also use a fully qualified path name). The UI port, context and WebApp dir are ignored, since you can define them in jetty.xml. Note: You still have to specify a valid WebApp dir because there are other checks that the plugin performs. ..."

Or you can start Solr with Jetty as usual and then connect Eclipse to the running process.

Regards

Am 08.02.2012 12:24, schrieb jmlucjav:

So my question is: can jetty.xml be used while debugging in Eclipse? If so, how? I would like to use the same configuration I use when I am just changing xml stuff in Solr and starting with 'java -jar start.jar'.
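For the second option, a minimal sketch of attaching a debugger (the port is arbitrary): start Solr with the standard JPDA options,

java -Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=8000 -jar start.jar

then create a "Remote Java Application" debug configuration in Eclipse pointing at localhost:8000. This way the usual ./etc/jetty.xml is picked up, because Jetty is started the normal way.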
Re: struggling with solr.WordDelimiterFilterFactory and periods . or dots
hello,

thank you for the reply. yes, i did re-index after the changes to the schema.

also, thank you for the direction on using the analyzer, but i am not sure if i am interpreting the feedback from the analyzer correctly. here is what i did:

in the Field value (Index) box i placed this: BP2.1UAA
in the Field value (Query) box i placed this: BP21UAA

then after hitting the Analyze button i see the following.

Under Index Analyzer for org.apache.solr.analysis.WordDelimiterFilterFactory {splitOnCaseChange=1, generateNumberParts=1, catenateWords=1, luceneMatchVersion=LUCENE_33, generateWordParts=1, catenateAll=1, catenateNumbers=1} i see:

position: 1 2 3 4
term text: BP, 2, 1, UAA, plus the catenated terms 21 and BP21UAA

Under Query Analyzer for org.apache.solr.analysis.WordDelimiterFilterFactory {same options} i see:

position: 1 2 3
term text: BP, 21, UAA, plus the catenated term BP21UAA

the above information leads me to believe that i should have BP21UAA as an indexed term generated from the BP2.1UAA value coming from the database. also, the query analysis leads me to believe that i should find a document when i search on BP21UAA in the itemNo field.

do i have this correct, or am i missing something here? i am still unable to get a hit when i search on BP21UAA in the itemNo field.

thank you,
mark

--
View this message in context: http://lucene.472066.n3.nabble.com/struggling-with-solr-WordDelimiterFilterFactory-and-periods-or-dots-tp3724822p3726021.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Which Tokeniser (and/or filter)
Thanks Erick, I didn't get confused with multiple tokens vs multiValued :)

Before I go ahead and re-index 4m docs, and believe me I'm using the analysis page like a madman! What do I need to configure to have the following both indexed with and without the dots?

.net
sales manager.
£12.50

Currently...

<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" splitOnCaseChange="1" splitOnNumerics="1" types="wdftypes.txt"/>

...with nothing specific in wdftypes.txt for full-stops.

Should there also be any difference when quoting my searches? The analysis page seems to just drop the quotes, but surely actual calls don't do this?

---
IntelCompute
Web Design and Local Online Marketing
http://www.intelcompute.com

On Wed, 8 Feb 2012 07:38:42 -0500, Erick Erickson erickerick...@gmail.com wrote:
Re: is there any practice to load index into RAM to accelerate solr performance?
Add this as well: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.155.5030

On Wed, Feb 8, 2012 at 1:56 AM, Andrzej Bialecki a...@getopt.org wrote:

This could be implemented in Lucene trunk as a Codec. The challenge, though, is to come up with the right data structures. There has been some interesting research on optimizations for in-memory inverted indexes, but it usually involves changing the query evaluation algorithms as well. For reference:

http://digbib.ubka.uni-karlsruhe.de/volltexte/documents/1202502
http://www.siam.org/proceedings/alenex/2008/alx08_01transierf.pdf
http://research.google.com/pubs/archive/37365.pdf
How to identify the field with highest score in dismax
Hi,

According to the Solr documentation, the dismax score is calculated with the formula: (score of matching clause with the highest score) + (tie parameter) * (scores of any other matching clauses). Is there a way to identify the field on which the matching clause score is the highest?

For example, suppose I have the following document:

<doc>
  <str name="Name">Ford Mustang Coupe Cabrio</str>
  <str name="Details">Ford Mustang is a great car</str>
</doc>

and the following dismax query:

defType=dismax&qf=Name^10+Details^1&q=Ford+Mustang+Ford+Mustang

and receive the document with the score 5.6. Is there a way to find out whether the score comes from the match on the Name field or from the match on the Details field?

Thanks in advance!

--
View this message in context: http://lucene.472066.n3.nabble.com/How-to-identify-the-field-with-highest-score-in-dismax-tp3726297p3726297.html
Sent from the Solr - User mailing list archive at Nabble.com.
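One way to inspect this (a sketch, not a full answer): add debugQuery=on to the request, e.g.

.../select?defType=dismax&qf=Name^10+Details^1&q=Ford+Mustang&debugQuery=on

The "explain" section of the debug output then breaks the score down per clause, with a "max of:" node (or "max plus ... times others of:" when tie is set) listing each field's contribution, so you can read off which field produced the winning score. As far as I know there is no cheaper switch that returns just the winning field name.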
Sorting solrdocumentlist object after querying
Hi all,

I want to sort a SolrDocumentList after it has been queried and obtained from QueryResponse.getResults(). The reason is that I have a SolrDocumentList obtained after querying, via QueryResponse.getResults(), and I have added a few docs to it. Now I want to sort this SolrDocumentList on the same fields I sorted the query on, before I modified the list.

Please advise on any alternatives; sample code will be appreciated a lot if this is not possible. It is an emergency.

--
View this message in context: http://lucene.472066.n3.nabble.com/Sorting-solrdocumentlist-object-after-querying-tp3726303p3726303.html
Sent from the Solr - User mailing list archive at Nabble.com.
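A minimal sketch of one way to do this (the sort field is illustrative; it assumes the field is stored, single-valued, and its values are mutually Comparable): SolrDocumentList extends ArrayList<SolrDocument>, so it can be sorted in place like any other list.

import java.util.Collections;
import java.util.Comparator;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrDocumentList;

public class DocListSorter {
    // sorts docs in place by the given stored field
    public static void sortByField(SolrDocumentList docs,
                                   final String field, final boolean asc) {
        Collections.sort(docs, new Comparator<SolrDocument>() {
            @SuppressWarnings("unchecked")
            public int compare(SolrDocument d1, SolrDocument d2) {
                Comparable<Object> v1 = (Comparable<Object>) d1.getFieldValue(field);
                Comparable<Object> v2 = (Comparable<Object>) d2.getFieldValue(field);
                if (v1 == null) return (v2 == null) ? 0 : -1; // nulls first
                if (v2 == null) return 1;
                return asc ? v1.compareTo(v2) : v2.compareTo(v1);
            }
        });
    }
}

Usage would be something like: DocListSorter.sortByField(queryResponse.getResults(), "price", true); after adding your extra docs to the list.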
Wildcard ? issue?
Sorry for the inaccurate title. I have 3 fields (dc_title, dc_title_unicode, dc_title_unicode_full) containing the same value:

<title xmlns="http://www.tei-c.org/ns/1.0">cal.lígraf</title>

and these fields are configured accordingly:

<fieldType name="xml" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
</fieldType>

<fieldType name="xml_unicode" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>

<fieldType name="xml_unicode_full" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>

And finally my search configuration:

<requestHandler name="dictionary" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">all</str>
    <str name="defType">edismax</str>
    <str name="mm">2&lt;-25%</str>
    <str name="qf">dc_title_unicode_full^2 dc_title_unicode^2 dc_title</str>
    <int name="rows">10</int>
    <str name="spellcheck.onlyMorePopular">true</str>
    <str name="spellcheck.extendedResults">false</str>
    <str name="spellcheck.count">1</str>
  </lst>
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>

I am trying to match the field with various search phrases (that are valid). These are the results:

#  search phrase  match?  comment
1  cal.lígra?     yes
2  cal.ligra?     no      changed í to i
3  cal.ligraf     yes
4  calligra?      no

The problem is attempt #2 failing to match the data. #3 works, replacing ? with f.

One more thing: if * is used instead of ?, other data is matched, such as cal.lígrafia, but not cal.lígraf...

Also, I have spotted a logic mismatch in the debug parsedQuery field:

cal·lígraf:  +DisjunctionMaxQuery((dc_title:calligraf^2.0 | dc_title_unicode:cal·lígraf^3.0 | dc_title_unicode_full:cal·lígraf^3.0))
cal·lígra?:  +DisjunctionMaxQuery((dc_title:cal·lígra?^2.0 | dc_title_unicode:cal·lígra?^3.0 | dc_title_unicode_full:cal·lígra?^3.0))

Shouldn't the second be calligra? instead?

Environment:
Tomcat 7.0.25 (request encoding UTF-8)
Solr 3.5.0
Java 7 Oracle
Ubuntu 11.10

--
Regards!
Dalius Sidlauskas
Re: Wildcard ? issue?
If you cannot read this mail easily, check this ticket, which is a copy: https://issues.apache.org/jira/browse/SOLR-3106

Regards!
Dalius Sidlauskas

On 08/02/12 15:44, Dalius Sidlauskas wrote:
Re: Wildcard ? issue?
Hi Dalius,

If you haven't already tried it, check http://localhost:8983/solr/admin/analysis.jsp (enable verbose output for both the index and query field values) for your queries and see which filters/tokenizers are being applied.

Hope it helps!

-param

On 2/8/12 10:48 AM, Dalius Sidlauskas dalius.sidlaus...@semantico.com wrote:

If you cannot read this mail easily, check this ticket, which is a copy: https://issues.apache.org/jira/browse/SOLR-3106
Re: Wildcard ? issue?
I have already tried this, and it did not help, because it does not highlight matches if a wildcard is used. The field configuration turns the data into:

dc_title: calligraf
dc_title_unicode: cal·lígraf
dc_title_unicode_full: cal·lígraf

The debug parsedquery says:

[Search for cal·ligraf]  +DisjunctionMaxQuery((dc_title:calligraf | dc_title_unicode:cal·ligraf^2.0 | dc_title_unicode_full:cal·ligraf^2.0))
[Search for cal·ligra?]  +DisjunctionMaxQuery((dc_title:cal·ligra? | dc_title_unicode:cal·ligra?^2.0 | dc_title_unicode_full:cal·ligra?^2.0))

Why is the dc_title field handled differently? The analysis looks fine:

Index Analyzer:
  org.apache.solr.analysis.HTMLStripCharFilterFactory: text cal·lígraf
  org.apache.solr.analysis.PatternReplaceCharFilterFactory {replacement=, pattern=-, maxBlockChars=1, blockDelimiters=}: text cal·lígraf
  org.apache.solr.analysis.WhitespaceTokenizerFactory: position 1, term text cal·lígraf, startOffset 43, endOffset 53
  org.apache.solr.analysis.ICUFoldingFilterFactory: position 1, term text calligraf, startOffset 43, endOffset 53

Query Analyzer:
  org.apache.solr.analysis.WhitespaceTokenizerFactory: position 1, term text cal·ligra?, startOffset 0, endOffset 10
  org.apache.solr.analysis.ICUFoldingFilterFactory: position 1, term text calligra?, startOffset 0, endOffset 10

Is this a Solr or Lucene bug?

Regards!
Dalius Sidlauskas

On 08/02/12 16:03, Sethi, Parampreet wrote:

Hi Dalius,

If you haven't already tried it, check http://localhost:8983/solr/admin/analysis.jsp (enable verbose output for both the index and query field values) for your queries and see which filters/tokenizers are being applied.
Re: struggling with solr.WordDelimiterFilterFactory and periods . or dots
Hmmm, that all looks correct; from the output you pasted, I'd expect you to be finding the doc. So, next things:

Add debugQuery=on to your query and look at the debug information after the list of documents, particularly the parsedQuery bit. Are you searching against the fields you think you are? If you don't specify a field, Solr uses the default defined in schema.xml.

Next, look at your actual index using either Luke or the TermsComponent to see what's actually *in* your index, rather than what you *think* is there. I can't tell you how many times I've made the wrong assumptions.

My guess would be that you aren't searching the fields you think you are...

Best
Erick

On Wed, Feb 8, 2012 at 9:06 AM, geeky2 gee...@hotmail.com wrote:
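For reference, the two checks could look something like this (host, port and core path are illustrative, and the /terms handler has to be registered in solrconfig.xml):

http://localhost:8983/solr/select?q=itemNo:BP21UAA&debugQuery=on
http://localhost:8983/solr/terms?terms.fl=itemNo&terms.prefix=BP2

The first shows how the query was actually parsed; the second lists the raw terms stored in the itemNo field.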
Re: Which Tokeniser (and/or filter)
You'll probably have to index them in separate fields to get what you want. The question is always whether it's worth it, is the use-case really well served by having a variant that keeps dots and things? But that's always more a question for your product manager

Best Erick

On Wed, Feb 8, 2012 at 9:23 AM, Robert Brown r...@intelcompute.com wrote:

Thanks Erick, I didn't get confused with multiple tokens vs multiValued :) Before I go ahead and re-index 4m docs (and believe me, I'm using the analysis page like a mad-man!), what do I need to configure to have the following both indexed with and without the dots...

.net
sales manager.
£12.50

Currently...

<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" splitOnCaseChange="1" splitOnNumerics="1" types="wdftypes.txt"/>

with nothing specific in wdftypes.txt for full-stops. Should there also be any difference when quoting my searches? The analysis page seems to just drop the quotes, but surely actual calls don't do this?

--- IntelCompute Web Design Local Online Marketing http://www.intelcompute.com

On Wed, 8 Feb 2012 07:38:42 -0500, Erick Erickson erickerick...@gmail.com wrote:

Yes, WDDF creates multiple tokens. But that has nothing to do with the multiValued suggestion. You can get exactly what you want by:

1) setting multiValued="true" in your schema file and re-indexing. Say positionIncrementGap is set to 100.
2) when you index, add the field for each sentence, so your doc looks something like:

<doc>
  <field name="sentences">i am a sales-manager in here</field>
  <field name="sentences">using asp.net and .net daily</field>
  ...
</doc>

3) search like "sales manager"~100
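To make the multiValued suggestion concrete, a minimal schema sketch (field and type names are illustrative, and the analysis chain is just an example): the positionIncrementGap of 100 between values is what stops phrase/slop queries from matching across sentence boundaries, as long as the slop stays below the gap:

<fieldType name="text_sentences" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
<field name="sentences" type="text_sentences" indexed="true" stored="true" multiValued="true"/>

With one sentence per value, "fluent german"~50 cannot match across the value boundary (the gap is 100 positions), while "sales manager"~50 still matches inside a single sentence.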
Re: Which Tokeniser (and/or filter)
Attempting to reproduce legacy behaviour (I know!) of simple SQL substring searching, with and without phrases. I feel simply NGram'ing 4m CVs may be pushing it?

--- IntelCompute Web Design Local Online Marketing http://www.intelcompute.com

On Wed, 8 Feb 2012 11:27:24 -0500, Erick Erickson erickerick...@gmail.com wrote:

You'll probably have to index them in separate fields to get what you want. The question is always whether it's worth it, is the use-case really well served by having a variant that keeps dots and things? But that's always more a question for your product manager

Best Erick
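If the legacy substring-search behaviour really is a hard requirement, the NGram trade-off looks roughly like the sketch below - the min/max gram sizes are illustrative, and index size grows quickly with maxGramSize, which is exactly the cost being worried about here:

<fieldType name="text_substring" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="15"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

Every 3- to 15-character substring of every token becomes an indexed term, which is why NGram'ing 4m CVs is indeed likely to be pushing it.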
Re: struggling with solr.WordDelimiterFilterFactory and periods . or dots
hello, thanks for sticking with me on this ... very frustrating. ok - i did perform the query with the debug params using two scenarios:

1) a successful search (where i insert the period / dot in to the itemNo field) and the search returns a document.

itemNo:BP2.1UAA

http://hfsthssolr1.intra.searshc.com:8180/solrpartscat/core1/select/?q=itemNo%3ABP2.1UAA&version=2.2&start=0&rows=10&indent=on&debugQuery=on

results from debug:

<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">1</int>
    <lst name="params">
      <str name="indent">on</str>
      <str name="rows">10</str>
      <str name="version">2.2</str>
      <str name="debugQuery">on</str>
      <str name="start">0</str>
      <str name="q">itemNo:BP2.1UAA</str>
    </lst>
  </lst>
  <result name="response" numFound="1" start="0">
    <doc>
      <arr name="brands"><str>PHILIPS</str></arr>
      <str name="groupId">0333500</str>
      <str name="id">0333500,1549 ,BP2.1UAA </str>
      <str name="itemDesc">PLASMA TELEVISION</str>
      <str name="itemNo">BP2.1UAA </str>
      <int name="itemType">2</int>
      <arr name="model"><str>BP2.1UAA </str></arr>
      <arr name="productType"><str>Plasma Television^</str></arr>
      <int name="rankNo">0</int>
      <str name="supplierId">1549 </str>
    </doc>
  </result>
  <lst name="debug">
    <str name="rawquerystring">itemNo:BP2.1UAA</str>
    <str name="querystring">itemNo:BP2.1UAA</str>
    <str name="parsedquery">MultiPhraseQuery(itemNo:"bp 2 (1 21) (uaa bp21uaa)")</str>
    <str name="parsedquery_toString">itemNo:"bp 2 (1 21) (uaa bp21uaa)"</str>
    <lst name="explain">
      <str name="0333500,1549 ,BP2.1UAA ">
        22.539911 = (MATCH) weight(itemNo:"bp 2 (1 21) (uaa bp21uaa)" in 134993), product of:
          0.9994 = queryWeight(itemNo:"bp 2 (1 21) (uaa bp21uaa)"), product of:
            45.079826 = idf(itemNo: bp=829 2=29303 1=43943 21=6716 uaa=32 bp21uaa=1)
            0.02218287 = queryNorm
          22.539913 = (MATCH) fieldWeight(itemNo:"bp 2 (1 21) (uaa bp21uaa)" in 134993), product of:
            1.0 = tf(phraseFreq=1.0)
            45.079826 = idf(itemNo: bp=829 2=29303 1=43943 21=6716 uaa=32 bp21uaa=1)
            0.5 = fieldNorm(field=itemNo, doc=134993)
      </str>
    </lst>
    <str name="QParser">LuceneQParser</str>
    <lst name="timing">
      <double name="time">1.0</double>
      <lst name="prepare">
        <double name="time">0.0</double>
        <lst name="org.apache.solr.handler.component.QueryComponent"><double name="time">0.0</double></lst>
        <lst name="org.apache.solr.handler.component.FacetComponent"><double name="time">0.0</double></lst>
        <lst name="org.apache.solr.handler.component.MoreLikeThisComponent"><double name="time">0.0</double></lst>
        <lst name="org.apache.solr.handler.component.HighlightComponent"><double name="time">0.0</double></lst>
        <lst name="org.apache.solr.handler.component.StatsComponent"><double name="time">0.0</double></lst>
        <lst name="org.apache.solr.handler.component.DebugComponent"><double name="time">0.0</double></lst>
      </lst>
      <lst name="process">
        <double name="time">1.0</double>
        <lst name="org.apache.solr.handler.component.QueryComponent"><double name="time">1.0</double></lst>
        <lst name="org.apache.solr.handler.component.FacetComponent"><double name="time">0.0</double></lst>
        <lst name="org.apache.solr.handler.component.MoreLikeThisComponent"><double name="time">0.0</double></lst>
        <lst name="org.apache.solr.handler.component.HighlightComponent"><double name="time">0.0</double></lst>
        <lst name="org.apache.solr.handler.component.StatsComponent"><double name="time">0.0</double></lst>
        <lst name="org.apache.solr.handler.component.DebugComponent"><double name="time">0.0</double></lst>
      </lst>
    </lst>
  </lst>
</response>

2) a NON-successful search (where i do NOT insert a period / dot) in to the itemNo field and the search does NOT return a document.

itemNo:BP21UAA

http://hfsthssolr1.intra.searshc.com:8180/solrpartscat/core1/select/?q=itemNo%3ABP21UAA&version=2.2&start=0&rows=10&indent=on&debugQuery=on

<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">1</int>
    <lst name="params">
      <str name="indent">on</str>
      <str name="rows">10</str>
      <str name="version">2.2</str>
      <str name="debugQuery">on</str>
      <str name="start">0</str>
      <str name="q">itemNo:BP21UAA</str>
    </lst>
  </lst>
  <result name="response" numFound="0" start="0"/>
  <lst name="debug">
    <str name="rawquerystring">itemNo:BP21UAA</str>
    <str name="querystring">itemNo:BP21UAA</str>
    <str name="parsedquery">MultiPhraseQuery(itemNo:"bp 21 (uaa bp21uaa)")</str>
    <str name="parsedquery_toString">itemNo:"bp 21 (uaa bp21uaa)"</str>
    <lst name="explain"/>
    <str name="QParser">LuceneQParser</str>
    <lst name="timing">
      <double name="time">1.0</double>
      <lst name="prepare">
        <double name="time">1.0</double>
        <lst name="org.apache.solr.handler.component.QueryComponent"><double name="time">1.0</double></lst>
        <lst name="org.apache.solr.handler.component.FacetComponent"><double name="time">0.0</double></lst>
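One common way out of this trap - a sketch, not a drop-in fix, since it depends on how itemNo is queried elsewhere - is to keep the aggressive catenation at index time but only emit the catenated token at query time, so BP21UAA parses to the single term bp21uaa instead of the MultiPhraseQuery "bp 21 (uaa bp21uaa)":

<fieldType name="itemNoText" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- no word/number parts generated at query time: only the catenated form is emitted -->
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" catenateAll="1" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

With this, both BP2.1UAA and BP21UAA reduce to bp21uaa at query time, and bp21uaa is in the index thanks to catenateAll on the index side.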
Re: Wildcard ? issue?
I have already tried this and it did not help, because it does not highlight matches if a wild-card is used. The field configuration turns the data to:

This writeup should explain your scenario: http://wiki.apache.org/solr/MultitermQueryAnalysis
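The writeup boils down to giving the field a third analyzer that is applied to wildcard/prefix terms, which otherwise skip normal query analysis. A sketch, assuming an upgrade to a Solr version that has multiterm analysis (it was added after 3.5), with illustrative names:

<fieldType name="text_icu" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
  <!-- applied to wildcard/prefix/fuzzy terms, so cal.lígra? folds to cal.ligra? -->
  <analyzer type="multiterm">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
</fieldType>

With the multiterm analyzer in place, cal.lígra? is folded before the wildcard runs, so it lines up with the folded terms in the index.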
Re: solr cloud concepts
On Feb 8, 2012, at 10:31 AM, Adeel Qureshi wrote:

I have been using solr for a while and have recently started getting into solrcloud .. i am a bit confused with some of the concepts ..

1. what exactly is the relationship between a collection and the core .. can a core have multiple collections in it .. in this case all collections within this core will have the same schema .. and i am assuming all instances of collections within the core can be deployed on different solr nodes to achieve distributed search .. or is it the other way around where a collection can have multiple cores

Currently, a core basically equals a replica of the index. So you might have a collection called collection1 - let's say it's 2 shards and each shard has a single replica:

Collection1
  shard1: replica1, replica2
  shard2: replica1, replica2

Each of those replicas is a core. So a collection has multiple cores basically. Also, each of those cores can be on a different machine. So yes, you have distributed indexing and distributed search.

2. at some places it has been pointed out that solrcloud doesnt actually support replication .. but in the solrcloud wiki the second example is supposed to be for replication .. so does solrcloud at this point support automatic replication where as you add more servers it automatically uses the additional servers as replicas

SolrCloud doesn't support the old style Solr replication concept. It does, however, handle replication - it's just all pretty much automatic and behind the scenes - e.g. all the information about Solr replication in the wiki documentation for previous versions of Solr is really not applicable. We now achieve replica copies by sending documents to each shard one document at a time so that we can support near realtime search. The old style replication is only used in recovery, or when you start a new replica machine and it has to 'catch up' to the other replicas.

I have a few more questions but I wanted to get these basic ones out of the way first .. I would appreciate any response.

Fire away.

Thanks Adeel

- Mark Miller lucidimagination.com
Re: Sorting solrdocumentlist object after querying
I want to sort a SolrDocumentList after it has been queried and obtained from QueryResponse.getResults(). The reason is i have a SolrDocumentList obtained after querying using QueryResponse.getResults() and i have added a few docs to it. Now i want to sort this SolrDocumentList based on the same fields i did the querying with before i modified this SolrDocumentList.

QueryResponse.getResults() will return rows many documents. Can't you sort them (plus your injected documents) on your own?
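A minimal client-side sketch of that suggestion, in SolrJ - SolrDocumentList extends ArrayList<SolrDocument>, so a plain Collections.sort works; the field name "title" and the single String sort key are assumptions for illustration:

import java.util.Collections;
import java.util.Comparator;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrDocumentList;

public class ClientSideSort {
    // docs came from queryResponse.getResults(), then had extra documents added
    public static void sortByTitle(SolrDocumentList docs) {
        Collections.sort(docs, new Comparator<SolrDocument>() {
            public int compare(SolrDocument a, SolrDocument b) {
                String left = (String) a.getFieldValue("title");
                String right = (String) b.getFieldValue("title");
                return left.compareTo(right);
            }
        });
    }
}

For multiple fields, chain the comparisons: compare the first field, and only fall through to the next field on a tie.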
Re: solr cloud concepts
Hi Adeel, I just started looking into SolrCloud and had some of the same questions. I wrote a blog with the understanding I gained so far, maybe it will help you: http://outerthought.org/blog/491-ot.html

Regards, Bruno.

-- Bruno Dumon Outerthought http://outerthought.org/
Re: How to reindex about 10Mio. docs
Vadim, Would using xslt output help? Otis

Performance Monitoring SaaS for Solr - http://sematext.com/spm/solr-performance-monitoring/index.html

From: Vadim Kisselmann v.kisselm...@googlemail.com To: solr-user@lucene.apache.org Sent: Wednesday, February 8, 2012 7:09 AM Subject: Re: How to reindex about 10Mio. docs

Another problem appeared ;) how can i export my docs in csv-format? In Solr 3.1+ i can use the query-param wt=csv, but in Solr 1.4.1? Best Regards Vadim

2012/2/8 Vadim Kisselmann v.kisselm...@googlemail.com: Hi Ahmet, thanks for the quick response :) I've already thought the same... And it will be a pain to export and import this huge doc-set as CSV. Do i have another solution? Regards Vadim

2012/2/8 Ahmet Arslan iori...@yahoo.com: i want to reindex about 10Mio. docs from one Solr (1.4.1) to another Solr (1.4.1). I changed my schema.xml (field types sint to slong), so standard replication would fail. what is the fastest and smartest way to manage this? this here sounds great (EntityProcessor): http://www.searchworkings.org/blog/-/blogs/importing-data-from-another-solr But would it work with Solr 1.4.1?

SolrEntityProcessor is not available in 1.4.1. I would dump stored fields into a comma separated file, and use http://wiki.apache.org/solr/UpdateCSV to feed into the new solr instance.
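Since Otis mentions xslt output: Solr 1.4 does ship the XSLTResponseWriter, so one hedged option is a stylesheet like the sketch below, saved as conf/xslt/csv.xsl and queried with wt=xslt&tr=csv.xsl. The field names id and title are placeholders - you would list your own stored fields, and mind the escaping of commas inside values:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <!-- emit one CSV row per result document -->
  <xsl:template match="/">
    <xsl:for-each select="response/result/doc">
      <xsl:value-of select="str[@name='id']"/>
      <xsl:text>,</xsl:text>
      <xsl:value-of select="str[@name='title']"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>

Combined with a large rows value (or paging via start), this gives a CSV-ish dump that UpdateCSV on the target 1.4.1 instance can ingest.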
Re: Using UUID for uniqueId
Anderson, I would say that this is highly unlikely, but you would need to pay attention to how they are generated. This would be a good place to start: http://en.wikipedia.org/wiki/Universally_unique_identifier

Cheers François

On Feb 8, 2012, at 1:31 PM, Anderson vasconcelos wrote:

Hi all, If i use a UUID as the uniqueId, will i have problems in the future if i break my index into shards? Could the UUID generation produce the same UUID on different machines? Thanks
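For reference, Solr has built-in support for this since 1.4 - a minimal schema sketch (the field name is illustrative); a version-4 random UUID is generated server-side when the field value is the literal NEW:

<fieldType name="uuid" class="solr.UUIDField" indexed="true"/>
<field name="id" type="uuid" indexed="true" stored="true" default="NEW"/>
<uniqueKey>id</uniqueKey>

With 122 random bits per identifier, two shards generating ids independently are not a realistic collision risk.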
Thank you all
All, It appears my attempt at using solr for the application I support is about to fail. I'm personally and professionally disappointed, but I wanted to say Many Thanks to those of you who have provided so much help to so many on this list. In the right hands and in the right environments, it has so much potential. You all have shown the collective knowledge and cooperation it takes to bring that potential to fruition. I wish I'd been able to pick up on the right details of the toolset to be able to make this work. Best of luck to you all!

Tim Hibbs

On 2/7/2012 2:53 PM, Tim Hibbs wrote:

Hi, all... I have a small problem retrieving the full set of query responses I need and would appreciate any help. I have a query string as follows:

+((Title:"sales") (+Title:sales) (TOC:"sales") (+TOC:sales) (Keywords:"sales") (+Keywords:sales) (text:"sales") (+text:sales) (sales)) +(RepType:"WRO Revenue Services") +(ContentType:SOP ContentType:"Key Concept") -(Topics:Backup)

The query is intended to be:

MUST have at least one of:
- exact phrase in field Title
- all of the phrase words in field Title
- exact phrase in field TOC
- all of the phrase words in field TOC
- exact phrase in field Keywords
- all of the phrase words in field Keywords
- exact phrase in field text
- all of the phrase words in field text
- any of the phrase words in field text

MUST have "WRO Revenue Services" in field RepType

MUST have at least one of:
- "SOP" in field ContentType
- "Key Concept" in field ContentType

MUST NOT have "Backup" in field Topics

It's almost working, but it misses a couple of items that contain a single occurrence of the word "sale" in an indexed field. The indexed field containing that single occurrence is named UrlContent. In schema.xml, UrlContent is defined as:

<field name="UrlContent" type="text" indexed="true" stored="false" required="false" omitNorms="false"/>

Copyfields are as follows:

<copyField source="Title" dest="text"/>
<copyField source="Keywords" dest="text"/>
<copyField source="TOC" dest="text"/>
<copyField source="Overview" dest="text"/>
<copyField source="UrlContent" dest="text"/>

Thanks, Tim Hibbs
solr/tomcat performance.
Hi, I'm running solr+tomcat with the following configuration: I have 16 slaves, which are queried by an aggregator, while the aggregator is queried by the users. My slaveUrls variable in solr.xml (on the aggregator) looks like:

<property name="slaveUrls" value="host01/slave01,host02/slave02,host03/slave03,...,host16/slave16" />

I'm running it on a linux machine (not dedicated, there are some other 'heavy' processes) with 16 quad CPUs and 66GB RAM. I ran some tests and saw that when I made 400 concurrent requests to the aggregator, the host stopped responding until I restarted Tomcat. I tried to 'play' with the tomcat/java configuration a little, but it didn't help me much, and the main issues were memory usage and timeouts. Currently I'm using the following settings:

Java: -Xms256m -Xmx8192m

I tried to tweak the -XX:MinHeapFreeRatio setting, but from what I could see no memory was returned to the OS.

Tomcat:

<Executor name="HTTPThreadPool" namePrefix="HTTPThread-" maxThreads="8000" minSpareThreads="4000"/>
<Connector executor="HTTPThreadPool" port="8080" protocol="HTTP/1.1" redirectPort="8443" URIEncoding="UTF-8" maxHttpHeaderSize="8388608" enableLookups="false" acceptCount="100" connectionTimeout="1" />

Assuming I'll have ~1000 requests/second made to the aggregator, over how many aggregators should I balance the load? Or maybe I can achieve better performance only by tweaking the current system? Any help/advice will be appreciated, Thanks!
Index Start Question
Please forgive me if this is a dumb question. I've never dealt with SOLR before, and I'm being asked to determine from the logs when a SOLR index is kicked off (it is a Windows server). The TOMCAT service runs continually, so no love there. In parsing the logs, I think org.apache.solr.core.SolrResourceLoader init is the indicator, since org.apache.solr.core.SolrCore execute seems to occur even when I know an index has not been started. Any advice you could give me would be wonderful. Best, --Chase

Chase Hoffman Infrastructure Systems Administrator, Performance Technologies The Advisory Board Company 512-681-2190 direct | 512-609-1150 fax hoffm...@advisory.com | www.advisory.com
SolrCloud is in trunk.
For those that are interested and have not noticed, the latest work on SolrCloud and distributed indexing is now in trunk. SolrCloud is our name for a new set of distributed capabilities that improve upon the old style distributed search and index based replication. It provides for high availability and fault tolerance while allowing for near realtime search and an interface that matches what you are used to with previous versions of Solr. We are looking to release this in the next 4.0 release, and any feedback early users can provide will be very useful. So if you have an interest in these types of features, please take the latest trunk build for a spin and provide some feedback. There is still a lot more planned, so feel free to chime in on what you would like to see - this is essentially the end of stage one. You can read more about what we have done on the wiki: http://wiki.apache.org/solr/SolrCloud Also, a couple blog posts I recently saw pop up: http://blog.sematext.com/2012/02/01/solrcloud-distributed-realtime-search http://outerthought.org/blog/491-ot.html I'll contribute my own blog post as well when I get a chance, but there should be a fair amount of info there to get you started if you are interested. Thanks, - Mark Miller lucidimagination.com
Re: Using UUID for uniqueId
Thanks

2012/2/8 François Schiettecatte fschietteca...@gmail.com:
Re: SolrCloud is in trunk.
Good job on this work. A monumental effort.

On Wed, 8 Feb 2012 16:41:13 -0500, Mark Miller markrmil...@gmail.com wrote:
Re: Improving performance for SOLR geo queries?
Hi Matthias- I'm trying to understand how you have your data indexed so we can give reasonable direction. What field type are you using for your locations? Is it using the solr spatial field types? What do you see when you look at the debug information from debugQuery=true? From my experience, there is no single best practice for spatial queries -- it will depend on your data density and distribution. You may also want to look at: http://code.google.com/p/lucene-spatial-playground/ but note this is off lucene trunk -- the geohash queries are super fast though

ryan

2012/2/8 Matthias Käppler matth...@qype.com:

Hi Erick, if we're not doing geo searches, we filter by location tags that we attach to places. This is simply a hierarchical regional id, which is simple to filter for, but much less flexible. We use that on Web a lot, but not on mobile, where we want to perform searches in arbitrary radii around arbitrary positions. For those location tag kind of queries, the average time spent in SOLR is 43msec (I'm looking at the New Relic snapshot of the last 12 hours). I have disabled our optimization again just yesterday, so for the bbox queries we're now at an avg of 220ms (same time window). That's a 5 fold increase in response time, and in peak hours it's worse than that.

I've also found a blog post from 3 years ago which outlines the inner workings of the SOLR spatial indexing and searching: http://www.searchworkings.org/blog/-/blogs/23842 From that it seems as if SOLR already performs a similar optimization we had in mind during the index step, so if I understand correctly, it doesn't even search over all records, only those that were mapped to the grid box identified during indexing.

What I would love to see is what the suggested way is to perform a geo query on SOLR, considering that they're so difficult to cache and expensive to run. Is the best approach to restrict the candidate set as much as possible using cheap filter queries, so that SOLR merely has to do the geo search against these subsets? How does the query planner work here? I see there's a cost attached to a filter query, but one can only set it when cache is set to false? Are cached geo queries executed last when there are cheaper filter queries to cut down on documents?

If you have a real world practical setup to share, one that performs well in a production environment that serves requests in the millions per day, that would be great. I'd love to contribute documentation by the way, if you knew me you'd know I'm an avid open source contributor and actually run several open source projects myself. But tell me, how can I possibly contribute answers to questions I don't have an answer to? That's why I'm here, remember :) So please, these kinds of snippy replies are not helping anyone.

Thanks -Matthias

On Tue, Feb 7, 2012 at 3:06 PM, Erick Erickson erickerick...@gmail.com wrote: So the obvious question is what is your performance like without the distance filters? Without that knowledge, we have no clue whether the modifications you've made had any hope of speeding up your response times As for the docs, any improvements you'd like to contribute would be happily received Best Erick

2012/2/6 Matthias Käppler matth...@qype.com: Hi, we need to perform fast geo lookups on an index of ~13M places, and were running into performance problems here with SOLR. We haven't done a lot of query optimization / SOLR tuning up until now so there's probably a lot of things we're missing.
I was wondering if you could give me some feedback on the way we do things, whether they make sense, and especially why a supposed optimization we implemented recently seems to have no effect, when we actually thought it would help a lot. What we do is this: our API is built on a Rails stack and talks to SOLR via a Ruby wrapper. We have a few filters that almost always apply, which we put in filter queries. Filter cache hit rate is excellent, about 97%, and cache size caps at 10k filters (max size is 32k, but it never seems to reach that many, probably because we replicate / delta update every few minutes). Still, geo queries are slow, about 250-500msec on average. We send them with cache=false, so as to not flood the fq cache and cause undesirable evictions. Now our idea was this: while the actual geo queries are poorly cacheable, we could clearly identify geographical regions which are more often queried than others (naturally, since we're a user driven service). Therefore, we dynamically partition Earth into a static grid of overlapping boxes, where the grid size (the distance of the nodes) depends on the maximum allowed search radius. That way, for every user query, we would always be able to identify a single bounding box that covers it. This larger bounding box (200km edge length) we would send to SOLR as a cached filter query, along with the actual user query
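The cost/cache knobs mentioned above can be combined with geofilt roughly like this - a sketch, with the field name, point and radius made up, and assuming a Solr version where {!geofilt} and non-cached, cost-ordered filters are available:

fq=region_tag:de_berlin
&fq={!geofilt cache=false cost=100 sfield=location pt=52.52,13.40 d=20}

The cheap, highly cacheable filter runs first; the expensive geo filter is not cached and, given the higher cost, is applied last against the already-reduced candidate set.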
Re: solr cloud concepts
okay so after reading Bruno's blog post .. let's add slice to the mix as well .. so we have got collections, cores, shards, partitions and slices :) .. The whole point with cores is to be able to have different schemas on the same solr server instance. So how does that change with collections .. maybe an example might help .. if I want to set up a solrcloud cluster with 2 cores (different schemas) .. with each core having 2 shards (I'm assuming shards are really partitions here, across multiple nodes in the cluster) .. with one shard being the replica..

On Wed, Feb 8, 2012 at 11:35 AM, Mark Miller markrmil...@gmail.com wrote:
linking documents in solr
hi, I have a question around documents linking in solr and want to know if its possible. lets say i have a set of blogs and their authors that i want to index seperately. is it possible to link a document describing a blog to another document describing an author? if yes, can i search for blogs with filters on attributes of the author? if yes, if i update an attribute of an author (by its id), then will the search results reflect the updated attribute(s)? thanks
Re: solr cloud concepts
On Feb 8, 2012, at 5:26 PM, Adeel Qureshi wrote:

okay so after reading Bruno's blog post .. let's add slice to the mix as well .. so we have got collections, cores, shards, partitions and slices :) ..

Yeah - heh - this has bugged me, but we have not really all come down on agreement of terminology here. I was a fan of using shard for each node and slice for partition. Another couple of committers wanted partitions rather than slice. Another says slice in code, shard for both in terminology and use context... I'd even go for shards as partitions and replicas for every node in a shard. But those fine points are still settling ;)

The whole point with cores is to be able to have different schemas on the same solr server instance. So how does that change with collections .. maybe an example might help .. if I want to set up a solrcloud cluster with 2 cores (different schemas) .. with each core having 2 shards (I'm assuming shards are really partitions here, across multiple nodes in the cluster) .. with one shard being the replica..

So this would mean you want to create 2 collections. Think of a collection as a bunch of SolrCores that all share the same schema and config. So you would start up 2 nodes set to one collection, and with numShards=1 that will give you one shard hosted by two identical SolrCores, giving you a replication factor of two. The full index will be in each of the two SolrCores. Then if you start another two nodes and specify a different collection name, you will get the same thing, but distinct from your first collection (although, if both collections have compatible schema/config you can still search across them).

- Mark Miller lucidimagination.com
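To make that concrete, the SolrCloud wiki's own two-node example boils down to something like this (ports and paths are the wiki defaults; a sketch, not a recipe):

# node 1: embedded ZooKeeper, bootstrap the config set, 1 shard
cd example
java -Dbootstrap_confdir=./solr/conf -Dcollection.configName=myconf -DzkRun -DnumShards=1 -jar start.jar

# node 2: joins the same ZooKeeper and becomes the second replica of shard1
cd example2
java -Djetty.port=7574 -DzkHost=localhost:9983 -jar start.jar

Since numShards=1, the second node does not create a second shard - it becomes another replica of the same one, which is the replication-factor-of-two setup described above.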
Re: Improving performance for SOLR geo queries?
I compared locallucene to spatial search and saw a performance degradation, even using geohash queries, though perhaps I indexed things wrong? Locallucene across 6 machines handles 150 queries per second fine, but using geofilt and geohash I got lots of timeouts even when I was doing only 50 queries per second. Has anybody done a formal comparison of locallucene with spatial search and LatLonType, PointType and geohash?

On 2/8/12 2:20 PM, Ryan McKinley ryan...@gmail.com wrote:
Re: usage of /etc/jetty.xml when debugging Solr in Eclipse
yes, I am using https://github.com/alexwinston/RunJettyRun which is apparently a fork of the original project that originated in the need to use a jetty.xml. So I am already setting an additional jetty.xml; this can be done in the Run configuration, no need to use a -D param. But as I mentioned, solr does not start cleanly if I do that. So I wanted to understand what role /etc/jetty.xml plays
- when solr is started via 'java -jar start.jar'
- when started with RunJettyRun in eclipse.
Re: solr cloud concepts
Mark, is the recommendation now to have each solr instance be a separate core in solr cloud? I had thought that the core name was by default the collection name? Or are you saying that although they have the same name they are separate because they are in different JVMs?

On Wednesday, February 8, 2012, Mark Miller markrmil...@gmail.com wrote:
Re: solr cloud concepts
On Feb 8, 2012, at 9:36 PM, Jamie Johnson wrote:

Mark, is the recommendation now to have each solr instance be a separate core in solr cloud? I had thought that the core name was by default the collection name? Or are you saying that although they have the same name they are separate because they are in different JVMs?

By default, the collection name is set to the core name. This is really just for convenience when you are getting started. It gives you a default collection name of collection1 because the default SolrCore name is collection1, and each SolrCore on each instance is addressable as /solr/collection1. You can certainly have core names be whatever you want and explicitly pass each core's collection. In that case, the URL for each would be different - though I think there is an open JIRA issue about making that nicer - so that you can look up the right core even if you pass the collection name or something.

- Mark Miller lucidimagination.com
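For completeness, a sketch of how that can look in solr.xml on the trunk of the time - a core can be told explicitly which collection and shard it belongs to (all names here are illustrative):

<cores adminPath="/admin/cores">
  <core name="core1" instanceDir="core1" collection="collection1" shard="shard1"/>
  <core name="core2" instanceDir="core2" collection="collection2" shard="shard1"/>
</cores>

Here the two cores share a JVM but live in different collections, and each is addressed by its own core name in the URL.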
Re: multiple cores in a single instance vs multiple instances with single core
On Feb 8, 2012, at 9:52 PM, Jamie Johnson wrote:

In solr cloud what is a better approach / use of resources: having multiple cores on a single instance or multiple instances with a single core? What are the benefits and drawbacks of each?

It depends I suppose. If you are talking about on a single machine, I'd prefer using multiple cores over multiple Solr instances. I think it's just easier to manage. You have to be sensible about that though - if all the replicas for a shard are on the same machine, in the same instance, as different cores, you don't have a lot of room for error - if that box goes down, goodbye. But you can certainly mix and match instances and cores.

One interesting thing you can do is a poor man's micro-sharding - put a few shards per machine - then later when you add more nodes to your cluster, you can bring up a core on one of the new machines, it will catch up, then you could unload that core on the original machine and replicas. Then start up any other new nodes to add replicas for the moved shard. Roughly and/or something like that anyway - I haven't thought it through thoroughly, but Yonik has brought it up before, and it seems pretty easily doable.

- Mark Miller lucidimagination.com
Re: multiple cores in a single instance vs multiple instances with single core
Thanks Mark, in regards to failover I completely agree. I am wondering more about performance and memory usage if the indexes are large, and whether the separate Java instances under heavy load would be more or less performant. Currently we deploy a single core per instance but deploy multiple instances per machine.

On Wednesday, February 8, 2012, Mark Miller markrmil...@gmail.com wrote:
Re: solr cloud concepts
Thanks for the explanation. It makes sense but I am hoping that you can clarify things a bit more .. so now it sounds like in solrcloud the concept of cores has changed a bit .. as you explained, for me to have 2 cores with different schemas I will need 2 different collections .. and one good thing about solrcores was that you could create new ones with the coreadmin api or http calls .. to create new collections it's not that automated, right .. secondly, if collections represent what used to be a solrcore, then once i have a collection .. why would i ever want to add multiple cores to it .. i mean i am trying to think of a reason why it would make sense to do that.

Thanks

On Wed, Feb 8, 2012 at 4:41 PM, Mark Miller markrmil...@gmail.com wrote:
Re: Sorting solrdocumentlist object after querying
No, that sorting is based on multiple fields. Basically i want to sort them like a GROUP BY statement in SQL, based on a few fields, with many loops to go through. The problem is that i have, say, 1,000,000 solr docs after injecting my few solr docs, and then i want to group these solr docs by some fields and then take 20 records for paging. So i need some shortcut for that.

-- Kashif Khan. http://www.kashifkhan.in

On Wed, Feb 8, 2012 at 11:07 PM, iorixxx wrote:

QueryResponse.getResults() will return rows many documents. Can't you sort them (plus your injected documents) on your own?
How do i do group by in solr with multiple shards?
Hi all, I have tried group by in solr with multiple shards but it does not work. Basically i want to do a simple GROUP BY statement, like in SQL, in solr with multiple shards. Please suggest how i can do this, as it is not currently supported OOB by solr. Thanks and regards, Kashif Khan
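For reference, the single-shard grouping syntax that does exist OOB (field collapsing, Solr 3.3+) looks like this - a sketch only, with an illustrative field, since as noted the distributed case is the missing piece:

/select?q=*:*&group=true&group.field=category&group.limit=1&rows=20

Each of the 20 returned groups is the GROUP BY bucket for one value of category; group.limit controls how many documents come back per bucket.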
Re: How to identify the field with highest score in dismax
Hello, Have you tried to specify debugQuery=on and look into the explain section? Though it's not really performant, I propose to start from it anyway. Regards

On Wed, Feb 8, 2012 at 7:32 PM, crisfromnova crisfromn...@gmail.com wrote:

Hi, According to the solr documentation, the dismax score is calculated with the formula:

(score of matching clause with the highest score) + ((tie parameter) * (scores of any other matching clauses))

Is there a way to identify the field on which the matching clause score is the highest? For example, suppose I have the following document:

<doc>
  <str name="Name">Ford Mustang Coupe Cabrio</str>
  <str name="Details">Ford Mustang is a great car</str>
</doc>

and the following dismax query:

defType=dismax&qf=Name^10+Details^1&q=Ford+Mustang+Ford+Mustang

and receive the document with the score 5.6. Is there a way to find out if the score is for the matching on the Name field or for the matching on the Details field?

Thanks in advance!

-- Sincerely yours Mikhail Khludnev Lucid Certified Apache Lucene/Solr Developer Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com