cores vs indices
Can someone provide me with a succinct definition of what a Solr core is? Is there a one-to-one relationship of cores to Solr indices, or can you have multiple indices per core? Cheers, Daniel
Re: cores vs indices
Hi Daniel, Yes, there is a one-to-one relationship between Solr indices and cores. The one-to-many relationship comes when you look at the relationship between cores and Tomcat/Jetty webapp instances. This gives you the ability to clone, add and swap cores around. See the following for core manipulation functions: http://wiki.apache.org/solr/CoreAdmin Regards, Dave
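For reference, cloning and swapping cores is done through CoreAdmin HTTP calls; a hypothetical SWAP between two cores (host, port and core names are placeholders) might look like:

```
http://localhost:8983/solr/admin/cores?action=SWAP&core=core0&other=core1
```

After the swap, requests that previously hit core0 are served by the index that was core1, and vice versa.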
Can Master push data to slave
Hi, I am using Solr 1.4 and doing a replication process where my slave pulls data from the master. I have 2 questions: a. Can the master push data to the slave? b. How can I make sure that a lock file is not created during replication? Please help. Thanks, Pawan
string cut-off filter?
Hi list, is there a string cut-off filter to limit the length of a KeywordTokenized string? So the string should not be dropped, only limited to a certain length. Regards Bernd
Scoring using POJO/SolrJ
Hi, I am using the SolrJ client library with a POJO using the @Field annotation to index documents and to retrieve documents from the index. I retrieve the documents from the index like so: List<Item> beans = response.getBeans(Item.class). Now in order to add the scores to the beans, I added a field called score with the @Field annotation, and the scores were then returned when I read from the index. But now when I am indexing, I get the error: ERROR:unknown field 'score'. I guess this is because it expects score to be defined in my schema. Now I am thinking that if I define this field in my schema, then rather than returning the document scores it might just go ahead and return actual values for the field (null if I don't add a value). How can I get around this problem? Many thanks.
how to enable MMapDirectory in solr 1.4?
hi all, I read the Apache Solr 3.1 release notes today and found that MMapDirectory is now the default implementation on 64-bit systems. I am currently using Solr 1.4 with a 64-bit JVM on Linux. How can I use MMapDirectory? Will it improve performance?
Multiplexing TokenFilter for multi-language?
Sorry if this has already been discussed, but I have already spent a couple of days googling in vain. The problem: - documents in multiple languages (us, de, fr, es). - language is known (a team of editors determines the language manually, and users are asked to specify a language option for searching). My intended approach: - one index. - a multiplexing token filter, a MultilingualSnowballFilterFactory that instantiates a Snowball stemmer for the appropriate language. - language is a facet, to get rid of cross-language ambiguities with multiple languages mixed in the same field. The problem is how to communicate the language to the MultilingualSnowballFilterFactory. Once the language is known, instantiating the Snowball stemmer for the right language is easy. I have a working version attached below. My solution: - append the language as the first token for the FilterFactory to pick up, e.g. "es This is a Spanish document." - this would mean I need to duplicate the fields: an original version for storing, and a version with the language marker prepended for indexing, e.g. description (indexed=false, stored=true), description_i (indexed=true, stored=false). Is there a better way? Many thanks in advance. Yee http://lucene.472066.n3.nabble.com/file/n3235341/MultilingualSnowballFilterFactory.java MultilingualSnowballFilterFactory.java -- View this message in context: http://lucene.472066.n3.nabble.com/Multiplexing-TokenFilter-for-multi-language-tp3235341p3235341.html Sent from the Solr - User mailing list archive at Nabble.com.
RE: how to enable MMapDirectory in solr 1.4?
If you want to try MMapDirectory with Solr 1.4, then copy the class org.apache.solr.core.MMapDirectoryFactory from 3.x or trunk, and either add it to the .war file (you can just add it under src/java and re-package the war), or put it in its own .jar file in the lib directory under solr_home. Then, in solrconfig.xml, add this entry under the root config element: <directoryFactory class="org.apache.solr.core.MMapDirectoryFactory"/> I'm not sure if MMapDirectory will perform better for you on Linux than NIOFSDir. I'm pretty sure in trunk/4.0 it's the default for Windows and maybe Solaris. On Windows, there is a definite advantage to using MMapDirectory on a 64-bit system. James Dyer E-Commerce Systems Ingram Content Group (615) 213-4311
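A sketch of the resulting solrconfig.xml fragment, with the entry placed directly under the root element as advised above (the surrounding elements are illustrative):

```xml
<config>
  <!-- ... other configuration ... -->
  <!-- use the memory-mapped directory implementation (class copied from 3.x/trunk) -->
  <directoryFactory class="org.apache.solr.core.MMapDirectoryFactory"/>
</config>
```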
PositionIncrement gap and multi-valued fields.
Hello! I have a doubt about the behaviour of searching over field types that have positionIncrementGap defined. For example, suppose that: 1. We have a field called test defined as multi-valued and whitespace-tokenized. 2. The index has a single document with test values: <str>TEST1</str> <str>AAA BBB</str> <str>CCC DDD</str> <str>EEE FFF</str> <str>TEST2</str> I read that positionIncrementGap defines the virtual space between the last token of one field instance and the first token of the next instance (source: http://lucene.472066.n3.nabble.com/positionIncrementGap-in-schema-xml-td488338.html). When it says "last token of one field instance", does that mean the last token of the first entry of the multi-valued content? In our example above that would be TEST1. Anyway, I've been doing some tests modifying the positionIncrementGap value with high values and low values. Can anybody explain to me in detail what implications a higher or lower value has on the Solr scoring algorithm? I would like to understand how this value affects matching results in fields and also the final score calculation (maybe more gap implies more spaces and a worse score when the value matches, etc.). Thank you for reading this far!
Re: Weighted facet strings
One kind of hacky way to accomplish some of those tasks involves creating a lot more Solr fields. (This kind of 'de-normalization' is often the answer to how to make Solr do something.) Facet fields are ordinarily not tokenized or normalized at all, but that doesn't work very well for matching query terms. So if you want actual queries to match on these categories, you probably want an additional field that is tokenized/analyzed. If you want to boost different category assignments differently, you probably want _multiple_ additional tokenized/analyzed fields. So for instance, create separate analyzed fields for each category 'weight', perhaps using the default 'text' analysis type: category_text_weight_1, category_text_weight_2, etc. Then use dismax to query, include all those category_text_* fields in the 'qf', and boost the higher weight ones more than the lower weight ones. That will handle a number of your use cases, but not all of them. Your first two cases are the most problematic: "filter: category=some_category_name, query: *:* - Results should be scored by the above mentioned weight" So Solr doesn't really work like that. Normally a filter does not affect the scoring of the actual results _at all_. But if you change the query to: fq=category:some_category q=some_category defType=dismax qf=category_text_weight_1, category_text_weight_2^10, category_text_weight_3^20 THEN, with the multiple analyzed category_text_weight_* fields as described above, I think it should do what you want. You may have to play with exactly what boost to give to each field. But your second use case is still tricky. Solr doesn't really do exactly what you ask, but by using this method I think you can figure out hacky ways to accomplish it. I'm not sure if it will solve all of your use cases, but maybe this will give you a start to figuring it out. On 8/5/2011 6:55 AM, Michael Lorz wrote: Hi all, I have documents which are (manually) tagged with categories. 
Each category-document relation has a weight between 1 and 5: 5: document fits perfectly in this category, ... 1: document may be considered as belonging to this category. I would now like to use this information with Solr. At the moment, I don't use the weight at all: <field name="category" type="string" indexed="true" stored="true" multiValued="true"/> Both the category as well as the document body are specified as query fields (<str name="qf"> in solrconfig.xml). What I would like is the following: - filter: category=some_category_name, query: *:* - results should be scored by the above mentioned weight - filter: category=some_category_name, query: some_keyword - results should be scored by a combination of the score of 'some_keyword' and the above mentioned weight - filter: none, query: some_category_name - documents with category 'some_category_name' should be found as well as documents which contain the term 'some_category_name'; results should be scored by a combination of the score of 'some_keyword' and the above mentioned weight. Do you have any ideas how this could be done? Thanks in advance Michi
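Assembled from the advice in the reply above, a hypothetical dismax request (field names and boosts are placeholders to be tuned) might look like:

```
q=some_category
&fq=category:some_category
&defType=dismax
&qf=category_text_weight_1 category_text_weight_2^10 category_text_weight_3^20
```

The fq restricts results to the category; the boosted qf fields let documents with higher-weight category assignments score higher.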
Re: how to enable MMapDirectory in solr 1.4?
We patched our 1.4.1 build with SOLR-1969 (https://issues.apache.org/jira/browse/SOLR-1969, making MMapDirectory configurable) and realized a 64% search performance boost on our Linux hosts.
solr-ruby: Error undefined method `closed?' for nil:NilClass
Hi, I have seen some of these errors come through from time to time. It looks like: /usr/lib/ruby/1.8/net/http.rb:1060:in `request'\n/usr/lib/ruby/1.8/net/http.rb:845:in `post' /usr/lib/ruby/gems/1.8/gems/solr-ruby-0.0.8/lib/solr/connection.rb:158:in `post' /usr/lib/ruby/gems/1.8/gems/solr-ruby-0.0.8/lib/solr/connection.rb:151:in `send' /usr/lib/ruby/gems/1.8/gems/solr-ruby-0.0.8/lib/solr/connection.rb:174:in `create_and_send_query' /usr/lib/ruby/gems/1.8/gems/solr-ruby-0.0.8/lib/solr/connection.rb:92:in `query' It is as if the http object has gone away. Would it be good to create a new one inside of the connection or is something more serious going on? ubuntu 10.04 passenger 3.0.8 rails 2.3.11 -- Regards, Ian Connor
Re: solr-ruby: Error undefined method `closed?' for nil:NilClass
Ian - What does your code that uses solr-ruby look like? Solr::Connection is lightweight, so you could just construct a new one for each request. Are you keeping an instance around? Erik
edismax configuration
Hello all Can someone direct me to a link with config info in order to allow use of the edismax QueryHandler? Mark
is it possible to do a sort without query?
I am trying to list some data based on a function I run, specifically termfreq(post_text,'indie music'), and I am unable to do it without passing data to the q parameter. Is it possible to get a sorted list without searching for any terms?
Test failures on lucene_solr_3_3 and branch_3x
I've got a consistent test failure on Solr source code checked out from svn. The same thing happens with 3.3 and branch_3x. I have information saved from the failures on branch_3x, which I have gotten to fail about a dozen times in a row. It fails on a test called TestSqlEntityProcessorDelta, part of the dataimporthandler tests. It is consistently reproducible, in a shorter timeframe than normal, with the following command line: ant test -Dtestcase=TestSqlEntityProcessorDelta Comprehensive ant output here, from a full test run: http://pastebin.com/eyAt8Qg8 Platform information: [root@idxst0-a solr]# uname -a Linux idxst0-a 2.6.18-238.12.1.el5.centos.plusxen #1 SMP Wed Jun 1 11:57:54 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux [root@idxst0-a solr]# cat /etc/redhat-release CentOS release 5.6 (Final) [root@idxst0-a solr]# java -version java version 1.6.0_26 Java(TM) SE Runtime Environment (build 1.6.0_26-b03) Java HotSpot(TM) 64-Bit Server VM (build 20.1-b02, mixed mode) [root@idxst0-a yum.repos.d]# yum repolist Loaded plugins: fastestmirror, protectbase Loading mirror speeds from cached hostfile * addons: mirror.san.fastserv.com * base: mirrors.tummy.com * centosplus: mirror.san.fastserv.com * contrib: mirror.san.fastserv.com * epel: mirrors.xmission.com * extras: mirrors.xmission.com * jpackage-generic: jpackage.netmindz.net * jpackage-generic-nonfree: www.mirrorservice.org * jpackage-generic-nonfree-updates: www.mirrorservice.org * jpackage-generic-updates: jpackage.netmindz.net * jpackage-rhel: jpackage.netmindz.net * jpackage-rhel-updates: jpackage.netmindz.net * rpmforge: fr2.rpmfind.net * updates: mirrors.tummy.com
Re: is it possible to do a sort without query?
You can use the standard query parser and pass q=*:* -- *Alexei Martchenko* | *CEO* | Superdownloads ale...@superdownloads.com.br | ale...@martchenko.com.br | (11) 5083.1018/5080.3535/5080.3533
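For example, a match-all request sorted by a function query might look like this (URL, field name and term are assumptions; the space in the sort clause must be URL-encoded):

```
http://localhost:8983/solr/select?q=*:*&sort=termfreq(post_text,'indie')+desc&rows=100
```

Every document matches q=*:*; the sort parameter then orders them by the function's value without requiring any search terms.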
bug in termfreq? was Re: is it possible to do a sort without query?
Alexei, thank you, that does seem to work. My sort results seem to be totally wrong though; I'm not sure if it's because of my sort function or something else. My query consists of: sort=termfreq(all_lists_text,'indie+music') desc&q=*:*&rows=100 And I get back 4571232 hits. All the results don't have the phrase "indie music" anywhere in their data. Does termfreq not support phrases? If not, how can I sort specifically by termfreq of a phrase? -- - sent from my mobile 6176064373
solr 3.1, not indexing entire document?
hi, i have my solr field "text" configured as per earlier discussion:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

and for debugging purposes i am storing the text field as well, so:

<field name="text" type="text" indexed="true" stored="true"/>

now when i do a search against a document that i KNOW has a certain phrase, in this case "official handbook of the Federal Government", my query returns:

<result name="response" numFound="0" start="0" maxScore="0.0"/>
<lst name="debug">
  <str name="rawquerystring">id:062085.1 AND text:"official handbook of the Federal Government"</str>
  <str name="querystring">id:062085.1 AND text:"official handbook of the Federal Government"</str>
  <str name="parsedquery">+id:062085.1 +PhraseQuery(text:"official handbook of the federal government")</str>
  <str name="parsedquery_toString">+id:062085.1 +text:"official handbook of the federal government"</str>
</lst>

i get 0 results. when i search just for that id, i get the result: way way at the end sure enough is the string http://qihealing.net/doc.txt output. is there a document size limit, or is it the fact that im sending to solr using solrj and it's too large?
Re: solr 3.1, not indexing entire document?
Check your maxFieldLength setting.
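For context: maxFieldLength in solrconfig.xml caps how many tokens are indexed per field, and the default of 10000 silently truncates large documents. A sketch of raising it (the value shown is just an example of "effectively unlimited"):

```xml
<!-- in solrconfig.xml; raise the per-field token cap so large documents are indexed fully -->
<maxFieldLength>2147483647</maxFieldLength>
```

After changing it, the affected documents must be re-indexed for the already-truncated fields to be fixed.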
Re: bug in termfreq? was Re: is it possible to do a sort without query?
On 8/8/2011 4:34 PM, Jason Toy wrote: Alexei, thank you, that does seem to work. My sort results seem to be totally wrong though; I'm not sure if it's because of my sort function or something else. My query consists of: sort=termfreq(all_lists_text,'indie+music') desc&q=*:*&rows=100 And I get back 4571232 hits. That would be the total number of docs, I guess, since your query is *:*, i.e. "find everything". All the results don't have the phrase "indie music" anywhere in their data. You are only sorting on the termfreq of "indie music"; you are not querying for documents that contain it.
Re: bug in termfreq? was Re: is it possible to do a sort without query?
Alexei, thank you, that does seem to work. My sort results seem to be totally wrong though; I'm not sure if it's because of my sort function or something else. My query consists of: sort=termfreq(all_lists_text,'indie+music') desc&q=*:*&rows=100 And I get back 4571232 hits. That's normal, you issue a catch-all query. Sorting should work but... All the results don't have the phrase "indie music" anywhere in their data. Does termfreq not support phrases? No, it is TERM frequency, and "indie music" is not one term. I don't know how this function parses your input, but it might not understand your + escape and think it's one term consisting of exactly that. If not, how can I sort specifically by termfreq of a phrase? You cannot. What you can do is index multiple terms as one term using the shingle filter. Take care, it can significantly increase your index size and number of unique terms.
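A sketch of a field type using the shingle filter so that adjacent word pairs such as "indie music" are indexed as single terms (the field type name and analysis chain are illustrative):

```xml
<fieldType name="text_shingle" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- emit word pairs ("indie music") alongside the single words -->
    <filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="true"/>
  </analyzer>
</fieldType>
```

With a field of this type, termfreq(field,'indie music') can match the shingled term, at the cost of a larger index and more unique terms.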
Re: edismax configuration
http://wiki.apache.org/solr/CommonQueryParameters#defType
Re: edismax configuration
Got it. Thank you. I thought this was going to be much more difficult than it actually was. Mark
Re: PivotFaceting in solr 3.3
As far as I know, there isn't a patch for pivot faceting for 3.x. It'd require extracting the code from trunk and porting it. Perhaps as easy as applying the diff from the pivot commit on trunk to the 3.x codebase? (But probably not quite that easy.) Erik On Aug 3, 2011, at 00:58, Isha Garg wrote: Hi Pranav, I know pivot faceting is a feature in Solr 4.0, but what I want to know is whether there is any patch that can make pivot faceting possible in Solr 3.3. Thanks! Isha On Wednesday 03 August 2011 10:23 AM, Pranav Prakash wrote: From what I know, this is a feature in Solr 4.0 marked as SOLR-792 in JIRA. Is this what you are looking for? https://issues.apache.org/jira/browse/SOLR-792 *Pranav Prakash* temet nosce Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com | Google http://www.google.com/profiles/pranny On Wed, Aug 3, 2011 at 10:16, Isha Garg isha.g...@orkash.com wrote: Hi All! Can anyone tell me which patch I should apply to Solr 3.3 to enable pivot faceting in it? Thanks in advance! Isha garg
Re: string cut-off filter?
Hi Bernd, I also searched for such a filter but did not find one. Best regards Karsten P.S. I am now using this filter:

public class CutMaxLengthFilter extends TokenFilter {

  public static final int DEFAULT_MAXLENGTH = 15;

  private final int maxLength;
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

  public CutMaxLengthFilter(TokenStream in) {
    this(in, DEFAULT_MAXLENGTH);
  }

  public CutMaxLengthFilter(TokenStream in, int maxLength) {
    super(in);
    this.maxLength = maxLength;
  }

  @Override
  public final boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    int length = termAtt.length();
    if (maxLength > 0 && length > maxLength) {
      termAtt.setLength(maxLength);
    }
    return true;
  }
}

with this factory:

public class CutMaxLengthFilterFactory extends BaseTokenFilterFactory {

  private int maxLength;

  @Override
  public void init(Map<String, String> args) {
    super.init(args);
    maxLength = getInt("maxLength", CutMaxLengthFilter.DEFAULT_MAXLENGTH);
  }

  public TokenStream create(TokenStream input) {
    return new CutMaxLengthFilter(input, maxLength);
  }
}
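Wiring the custom factory into schema.xml might look like this (the field type name and package name are hypothetical):

```xml
<fieldType name="string_cut" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- truncate the single keyword token to at most 15 characters -->
    <filter class="com.example.CutMaxLengthFilterFactory" maxLength="15"/>
  </analyzer>
</fieldType>
```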
Re: Dispatching a query to multiple different cores
You could use Solr's distributed search (shards parameter) capability to do this. However, if you've got somewhat different schemas, that isn't necessarily going to work properly. Perhaps unify your schemas in order to facilitate this using Solr's distributed search feature? Erik On Aug 3, 2011, at 05:22, Ahmed Boubaker wrote: Hello there! I have a multicore Solr with 6 different simple cores and somewhat different schemas, and I defined another meta core which I would like to be a dispatcher: the requests are sent to the simple cores and the results are aggregated before being sent back to the user. Any ideas or hints on how I can achieve this? I am wondering whether writing a custom SearchComponent or a custom SearchHandler are good entry points? Is it possible to access other SolrCores which are in the same container as the meta core? Many thanks for your help. Boubaker
Re: solr 3.1, not indexing entire document?
that was it... thanks. obviously the document is well over 2 MB.
Re: string cut-off filter?
There is none indeed, except using copyField and maxChars. Could you perhaps come up with some regex that replaces the group of chars beyond the desired limit with an empty string? That would fit in a pattern replace char filter.
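A sketch of that idea (the 15-character limit and field type name are examples); the char filter trims the raw input before the keyword tokenizer ever sees it:

```xml
<fieldType name="string_trunc" class="solr.TextField">
  <analyzer>
    <!-- capture the first 15 characters and replace the whole value with that prefix -->
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="^(.{15}).*$" replacement="$1"/>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
</fieldType>
```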
Can Solr with the StatsComponent analyze 20+ million files?
Hi, Currently we are in the process of figuring out how to deal with millions of CSV files containing weather data (20+ million files). Each file is about 500 bytes in size. We want to calculate statistics on fields read from the files. For example, the standard deviation of wind speed across all 20+ million files. Processing speed isn't an important issue; the analysis routine can run for days if needed. The StatsComponent (http://wiki.apache.org/solr/StatsComponent) for Solr appears to be able to calculate the statistics we are interested in. Will the StatsComponent in Solr do what we need with minimal configuration? Can the StatsComponent be used on only a subset of the data? For example, only looking at data from certain months? Are there other free programs out there that can parse and analyze 20+ million files? We are still very new to Solr and really appreciate all your help. Thanks, Fred
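For reference, a StatsComponent request limited to a subset of documents via a filter query might look like this (URL and field names are assumptions):

```
http://localhost:8983/solr/select?q=*:*&fq=month:07&stats=true&stats.field=wind_speed&rows=0
```

The response would then include min, max, mean, stddev, etc. for wind_speed computed over only the documents matching the filter.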
Re: bug in termfreq? was Re: is it possible to do a sort without query?
Aren't dismax queries able to search for phrases using the default index (which is what I am using)? If I can already do phrase searches, I don't understand why I would need to reindex to be able to access phrases from a function.
Example Solr Config on EC2
I'm looking for some examples of how to setup Solr on EC2. The configuration I'm looking for would have multiple nodes for redundancy. I've tested in-house with a single master and slave with replication running in Tomcat on Windows Server 2003, but even if I have multiple slaves the single master is a single point of failure. Any suggestions or example configurations? The project I'm working on is a .NET setup, so ideally I'd like to keep this search cluster on Windows Server, even though I prefer Linux. Matthew Shields Owner BeanTown Host - Web Hosting, Domain Names, Dedicated Servers, Colocation, Managed Services www.beantownhost.com www.sysadminvalley.com www.jeeprally.com
Re: Can Solr with the StatsComponent analyze 20+ million files?
This does not seem well matched to Solr. Solr and Lucene are optimized to show the best few matches, not every match. I'd use Hadoop for this. Or MarkLogic, if you'd like to talk about that off-list. wunder Lead Engineer, MarkLogic On Aug 8, 2011, at 1:59 PM, Fred Smith wrote: Hi, Currently we are in the process of figuring out how to deal with millions of CSV files containing weather data (20+ million files). Each file is about 500 bytes in size. We want to calculate statistics on fields read from the file. For example, the standard deviation of wind speed across all 20+ million files. Processing speed isn't an important issue. The analysis routine can run for days, if needed. The StatsComponent (http://wiki.apache.org/solr/StatsComponent) for Solr appears to be able to calculate the statistics we are interested in. Will the StatsComponent in Solr do what we need with minimal configuration? Can the StatsComponent only be used on a subset of the data? For example, only look at data from certain months? Are there other free programs out there that can parse and analyze 20+ million files? We are still very new to Solr and really appreciate all your help. Thanks, Fred
Re: Dispatching a query to multiple different cores
However, if you unify your schemas to do this, I'd consider whether you really want separate cores/shards in the first place. If you want to search over all of them together, what are your reasons to put them in separate Solr indexes in the first place? Ordinarily, if you want to search over them all together, the best place to start is putting them in the same Solr index. Then, the distribution/sharding feature is generally your next step, only if you have so many documents that you need to shard for performance reasons. That is the intended use case of the distribution/sharding feature. On 8/8/2011 4:54 PM, Erik Hatcher wrote: You could use Solr's distributed (shards parameter) capability to do this. However, if you've got somewhat different schemas that isn't necessarily going to work properly. Perhaps unify your schemas in order to facilitate this using Solr's distributed search feature? Erik On Aug 3, 2011, at 05:22, Ahmed Boubaker wrote: Hello there! I have a multicore Solr with 6 different simple cores and somewhat different schemas, and I defined another meta core which I would like to be a dispatcher: the requests are sent to the simple cores and the results are aggregated before being sent back to the user. Any ideas or hints on how I can achieve this? I am wondering whether writing a custom SearchComponent or a custom SearchHandler are good entry points? Is it possible to access other SolrCores which are in the same container as the meta core? Many thanks for your help. Boubaker
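For reference, a distributed request is just the normal select URL plus a shards parameter listing the cores to query (host names and core names below are illustrative):

```
http://host1:8983/solr/core1/select?q=title:bla
    &shards=host1:8983/solr/core1,host2:8983/solr/core2,host3:8983/solr/core3
```

Each listed shard must be able to answer the fields used in the query, which is why a unified schema matters here.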
Re: Can Solr with the StatsComponent analyze 20+ million files?
Hi, Currently we are in the process of figuring out how to deal with millions of CSV files containing weather data (20+ million files). Each file is about 500 bytes in size. We want to calculate statistics on fields read from the file. For example, the standard deviation of wind speed across all 20+ million files. Processing speed isn't an important issue. The analysis routine can run for days, if needed. The StatsComponent (http://wiki.apache.org/solr/StatsComponent) for Solr appears to be able to calculate the statistics we are interested in. Will the StatsComponent in Solr do what we need with minimal configuration? Can the StatsComponent only be used on a subset of the data? For example, only look at data from certain months? If I remember correctly, it cannot. Are there other free programs out there that can parse and analyze 20+ million files? Yes, if analyzing data like yours is all you do (not search, that's Solr's power) then you're most likely much better off not using Solr and writing map/reduce programs for Apache Hadoop instead; it will analyze huge amounts of data. Hadoop can be quite difficult to start with, so you can use the excellent Apache CouchDB database, which supports map/reduce as well. CouchDB is much easier to begin with. You can transform a sample of your data to the JSON format, install CouchDB, load your data, and write a simple map/reduce function, all in 8 hours. Loading and processing all the data will take a bit longer. Cheers We are still very new to Solr and really appreciate all your help. Thanks, Fred
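Whatever tool ends up running the job, the aggregation itself is easy to express as a streaming computation. A minimal plain-Python sketch (not CouchDB- or Hadoop-specific; the file pattern and the wind_speed column name are made-up examples) using Welford's online algorithm, which needs only constant memory no matter how many files it reads:

```python
import csv
import glob
import math

def update(state, x):
    """One step of Welford's online algorithm; O(1) memory per statistic."""
    count, mean, m2 = state
    count += 1
    delta = x - mean
    mean += delta / count
    m2 += delta * (x - mean)
    return count, mean, m2

def finalize(state):
    """Return (mean, population standard deviation) for the values seen."""
    count, mean, m2 = state
    return mean, math.sqrt(m2 / count)

def stddev_over_files(pattern, field):
    # Stream every CSV file matching `pattern`, one row at a time, so
    # 20+ million small files is just a long loop, not a memory problem.
    state = (0, 0.0, 0.0)
    for path in glob.glob(pattern):
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                state = update(state, float(row[field]))
    return finalize(state)

# e.g. stddev_over_files("/data/weather/*.csv", "wind_speed")
```

The same update/finalize pair maps directly onto a map/reduce formulation: map emits (count, mean, m2) triples, reduce merges them.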
Re: Example Solr Config on EC2
On 8/8/2011 5:03 PM, Matt Shields wrote: I'm looking for some examples of how to setup Solr on EC2. The configuration I'm looking for would have multiple nodes for redundancy. I've tested in-house with a single master and slave with replication running in Tomcat on Windows Server 2003, but even if I have multiple slaves the single master is a single point of failure. Any suggestions or example configurations? This article describes various configurations: http://www.lucidimagination.com/content/scaling-lucene-and-solr#d0e410
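For reference, the master/slave setup being discussed is configured through the ReplicationHandler in solrconfig.xml; a minimal sketch (host name, poll interval and conf file list are illustrative):

```xml
<!-- solrconfig.xml on the master -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
    <str name="confFiles">schema.xml,stopwords.txt</str>
  </lst>
</requestHandler>

<!-- solrconfig.xml on each slave -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://master-host:8983/solr/replication</str>
    <str name="pollInterval">00:00:60</str>
  </lst>
</requestHandler>
```

This only covers slave redundancy; making the master itself redundant still requires something on top, such as a standby master plus a load balancer or manual promotion.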
Re: csv responsewriter and numfound
Great question. But how would that get returned in the response? It is a drag that the header is lost when results are written in CSV, but there really isn't an obvious spot for that information to be returned. Erik On Aug 4, 2011, at 01:52 , Pooja Verlani wrote: Hi, Is there anyway to get numFound from csv response format? Some parameter? Or shall I change the code for csvResponseWriter for this? Thanks, Pooja
Re: bug in termfreq? was Re: is it possible to do a sort without query?
Dismax queries can. But sort=termfreq(all_lists_text,'indie+music') is not using dismax. Apparently the termfreq function can not? I am not familiar with the termfreq function. To understand why you'd need to reindex, you might want to read up on how Lucene actually works, to get a basic understanding of how different indexing choices affect what is possible at query time. Lucene In Action is a pretty good book. On 8/8/2011 5:02 PM, Jason Toy wrote: Are not Dismax queries able to search for phrases using the default index (which is what I am using)? If I can already do phrase searches, I don't understand why I would need to reindex to be able to access phrases from a function. On Mon, Aug 8, 2011 at 1:49 PM, Markus Jelsma markus.jel...@openindex.io wrote: Alexei, thank you, that does seem to work. My sort results seem to be totally wrong though, I'm not sure if it's because of my sort function or something else. My query consists of: sort=termfreq(all_lists_text,'indie+music')+desc&q=*:*&rows=100 And I get back 4571232 hits. That's normal, you issue a catch-all query. Sorting should work but.. All the results don't have the phrase indie music anywhere in their data. Does termfreq not support phrases? No, it is TERM frequency and indie music is not one term. I don't know how this function parses your input but it might not understand your + escape and think it's one term consisting of exactly that. If not, how can I sort specifically by termfreq of a phrase? You cannot. What you can do is index multiple terms as one term using the shingle filter. Take care, it can significantly increase your index size and number of unique terms. On Mon, Aug 8, 2011 at 1:08 PM, Alexei Martchenko ale...@superdownloads.com.br wrote: You can use the standard query parser and pass q=*:* 2011/8/8 Jason Toy jason...@gmail.com I am trying to list some data based on a function I run, specifically termfreq(post_text,'indie music') and I am unable to do it without passing in data to the q parameter. 
Is it possible to get a sorted list without searching for any terms? -- *Alexei Martchenko* | *CEO* | Superdownloads ale...@superdownloads.com.br | ale...@martchenko.com.br | (11) 5083.1018/5080.3535/5080.3533
Re: bug in termfreq? was Re: is it possible to do a sort without query?
Are not Dismax queries able to search for phrases using the default index (which is what I am using)? If I can already do phrase searches, I don't understand why I would need to reindex to be able to access phrases from a function. Executing a Lucene phrase query is not the same as term frequency (phrase != term). A phrase consists of multiple terms, and Lucene has an inverted term index, not an inverted phrase index (unless you index your data that way). On Mon, Aug 8, 2011 at 1:49 PM, Markus Jelsma markus.jel...@openindex.io wrote: Alexei, thank you, that does seem to work. My sort results seem to be totally wrong though, I'm not sure if it's because of my sort function or something else. My query consists of: sort=termfreq(all_lists_text,'indie+music')+desc&q=*:*&rows=100 And I get back 4571232 hits. That's normal, you issue a catch-all query. Sorting should work but.. All the results don't have the phrase indie music anywhere in their data. Does termfreq not support phrases? No, it is TERM frequency and indie music is not one term. I don't know how this function parses your input but it might not understand your + escape and think it's one term consisting of exactly that. If not, how can I sort specifically by termfreq of a phrase? You cannot. What you can do is index multiple terms as one term using the shingle filter. Take care, it can significantly increase your index size and number of unique terms. On Mon, Aug 8, 2011 at 1:08 PM, Alexei Martchenko ale...@superdownloads.com.br wrote: You can use the standard query parser and pass q=*:* 2011/8/8 Jason Toy jason...@gmail.com I am trying to list some data based on a function I run, specifically termfreq(post_text,'indie music') and I am unable to do it without passing in data to the q parameter. Is it possible to get a sorted list without searching for any terms? -- *Alexei Martchenko* | *CEO* | Superdownloads ale...@superdownloads.com.br | ale...@martchenko.com.br | (11) 5083.1018/5080.3535/5080.3533
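The shingle approach mentioned in the reply would look roughly like this in schema.xml (the field type name and surrounding analyzer choices are illustrative, not from the thread):

```xml
<!-- Indexes adjacent word pairs as single terms, so "indie music"
     becomes one indexed term and termfreq(field,'indie music') can match it. -->
<fieldType name="text_shingle" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ShingleFilterFactory"
            maxShingleSize="2" outputUnigrams="true"/>
  </analyzer>
</fieldType>
```

As the reply warns, shingling inflates the number of unique terms, so it is best applied to one extra copy of the field rather than everywhere.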
Re: csv responsewriter and numfound
On Mon, Aug 8, 2011 at 5:12 PM, Erik Hatcher erik.hatc...@gmail.com wrote: Great question. But how would that get returned in the response? It is a drag that the header is lost when results are written in CSV, but there really isn't an obvious spot for that information to be returned. I guess a comment would be one option. -Yonik http://www.lucidimagination.com
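Until something like that comment exists, one workaround is to pay for an extra cheap request: ask for zero rows in a format that still carries the response header, then fetch the CSV (standard select parameters; paths are illustrative):

```
/solr/select?q=<query>&rows=0&wt=json     → read numFound from the response
/solr/select?q=<query>&rows=1000&wt=csv   → fetch the actual rows
```

The rows=0 request skips document retrieval entirely, so the overhead is small compared to the CSV fetch itself.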
Re: Can Solr with the StatsComponent analyze 20+ million files?
On 8/8/2011 5:10 PM, Markus Jelsma wrote: Will the StatsComponent in Solr do what we need with minimal configuration? Can the StatsComponent only be used on a subset of the data? For example, only look at data from certain months? If I remember correctly, it cannot. Well, if you index things properly, you could apply an fq limiting to only certain months, and then use StatsComponent on top. But I'd agree with others that Solr is probably not the best tool for this job. Solr's primary area of competency is text indexing and text search, not mathematical calculation. If you need a whole lot of text indexing and a little bit of math too, you might be able to get StatsComponent to do what you need, although you'll probably run into some tricky parts because this isn't really Solr's focus. But if you need a whole bunch of math and no text indexing at all -- use a tool that has math rather than text search as its prime area of competency/focus; don't make things hard for yourself by using the wrong tool for the job. (StatsComponent, incidentally, performs not-so-great on very large result sets, depending on what you ask it for.)
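For completeness, the filtered-stats request described above would look something like this (the month and wind_speed field names assume such fields were indexed):

```
/solr/select?q=*:*&rows=0
    &fq=month:2011-06
    &stats=true&stats.field=wind_speed
```

The stats section of the response then reports count, sum, mean, stddev, etc. over just the documents matching the filter.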
Re: bug in termfreq? was Re: is it possible to do a sort without query?
Dismax queries can. But sort=termfreq(all_lists_text,'indie+music') is not using dismax. Apparently the termfreq function can not? I am not familiar with the termfreq function. It simply returns the TF of the given _term_ as it is indexed for the current document. Sorting on TF like this seems strange, as by default queries are already sorted that way since TF plays a big role in the final score. To understand why you'd need to reindex, you might want to read up on how Lucene actually works, to get a basic understanding of how different indexing choices affect what is possible at query time. Lucene In Action is a pretty good book. On 8/8/2011 5:02 PM, Jason Toy wrote: Are not Dismax queries able to search for phrases using the default index (which is what I am using)? If I can already do phrase searches, I don't understand why I would need to reindex to be able to access phrases from a function. On Mon, Aug 8, 2011 at 1:49 PM, Markus Jelsma markus.jel...@openindex.io wrote: Alexei, thank you, that does seem to work. My sort results seem to be totally wrong though, I'm not sure if it's because of my sort function or something else. My query consists of: sort=termfreq(all_lists_text,'indie+music')+desc&q=*:*&rows=100 And I get back 4571232 hits. That's normal, you issue a catch-all query. Sorting should work but.. All the results don't have the phrase indie music anywhere in their data. Does termfreq not support phrases? No, it is TERM frequency and indie music is not one term. I don't know how this function parses your input but it might not understand your + escape and think it's one term consisting of exactly that. If not, how can I sort specifically by termfreq of a phrase? You cannot. What you can do is index multiple terms as one term using the shingle filter. Take care, it can significantly increase your index size and number of unique terms. 
On Mon, Aug 8, 2011 at 1:08 PM, Alexei Martchenko ale...@superdownloads.com.br wrote: You can use the standard query parser and pass q=*:* 2011/8/8 Jason Toy jason...@gmail.com I am trying to list some data based on a function I run, specifically termfreq(post_text,'indie music') and I am unable to do it without passing in data to the q parameter. Is it possible to get a sorted list without searching for any terms? -- *Alexei Martchenko* | *CEO* | Superdownloads ale...@superdownloads.com.br | ale...@martchenko.com.br | (11) 5083.1018/5080.3535/5080.3533
Re: Can Master push data to slave
Hi, Hi I am using Solr 1.4. and doing a replication process where my slave is pulling data from Master. I have 2 questions a. Can Master push data to slave Not in current versions. Not sure about exotic patches for this. b. How to make sure that lock file is not created while replication What do you mean? Please help thanks Pawan
Re: Example Solr Config on EC2
Matthew, Here's another resource: http://www.lucidimagination.com/blog/2010/02/01/solr-shines-through-the-cloud-lucidworks-solr-on-ec2/ Michael Bohlig Lucid Imagination - Original Message From: Matt Shields m...@mattshields.org To: solr-user@lucene.apache.org Sent: Mon, August 8, 2011 2:03:20 PM Subject: Example Solr Config on EC2 I'm looking for some examples of how to setup Solr on EC2. The configuration I'm looking for would have multiple nodes for redundancy. I've tested in-house with a single master and slave with replication running in Tomcat on Windows Server 2003, but even if I have multiple slaves the single master is a single point of failure. Any suggestions or example configurations? The project I'm working on is a .NET setup, so ideally I'd like to keep this search cluster on Windows Server, even though I prefer Linux. Matthew Shields Owner BeanTown Host - Web Hosting, Domain Names, Dedicated Servers, Colocation, Managed Services www.beantownhost.com www.sysadminvalley.com www.jeeprally.com
Re: Can Solr with the StatsComponent analyze 20+ million files?
Thank you Walter, Markus and Jonathan for your fast responses and help! We will be looking into CouchDB (and Hadoop if necessary) to process our data. Thanks again, Fred
Re: Is anobdy using lotsofcores feature in production?
Hi Shalin, Does this mean that if I apply the patch mentioned at the link below, Solr still does not support lots of cores? https://issues.apache.org/jira/browse/SOLR-1293 Are you saying this is just a concept and the patch is not an implementation? We are planning to use lots of cores in our commerce system to separate products for each client in search and provide customization for each client. So could you please let us know if this is feasible if we want to create around 500 cores and have around 8-10 load-balancing Solr slaves? Please let us know. Based on your feedback our approach will be decided. Thanks Regards, Umesh On Mon, Jul 25, 2011 at 3:36 AM, Markus Jelsma-2 [via Lucene] ml-node+3196893-77535491-416...@n3.nabble.com wrote: No, I missed something and interpreted the question as using a lot of cores. LotsOfCores does not exist as a feature. It is just a write-up, some jira issues and a couple of patches. Did I miss something? On Sun, Jul 24, 2011 at 8:26 PM, Markus Jelsma [hidden email] wrote: It works fine but you should keep an eye on additional overhead, cores `stealing` too much CPU from others, trouble with cores that merge segments stealing I/O, and of course RAM. It can also result in quite a high number of open file descriptors. There are more, but these seem most common to me. Hi, Is anybody using the lots-of-cores feature in production? Is this feature scalable? I have around 1000 cores and want to use this feature. Will there be any issue in production? http://wiki.apache.org/solr/LotsOfCores Thanks, Umesh -- View this message in context: http://lucene.472066.n3.nabble.com/Is-anobdy-using-lotsofcores-feature-in-production-tp3193798p3193798.html Sent from the Solr - User mailing list archive at Nabble.com. 
Re: Can Master push data to slave
You could configure a PostCommit event listener on the master which would send an HTTP fetchindex request to the slave that you want to carry out the replication - see http://wiki.apache.org/solr/SolrReplication#HTTP_API But why do you want the master to push to the slave? -Simon On Mon, Aug 8, 2011 at 5:26 PM, Markus Jelsma markus.jel...@openindex.io wrote: Hi, Hi I am using Solr 1.4. and doing a replication process where my slave is pulling data from Master. I have 2 questions a. Can Master push data to slave Not in current versions. Not sure about exotic patches for this. b. How to make sure that lock file is not created while replication What do you mean? Please help thanks Pawan
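A sketch of that approach (the script path and slave host are made up; RunExecutableListener and the fetchindex command are the standard mechanisms documented on the wiki page above):

```xml
<!-- solrconfig.xml on the master: run a script after every commit -->
<updateHandler class="solr.DirectUpdateHandler2">
  <listener event="postCommit" class="solr.RunExecutableListener">
    <str name="exe">/opt/solr/bin/notify-slave.sh</str>
    <str name="dir">.</str>
    <bool name="wait">false</bool>
  </listener>
</updateHandler>
```

where notify-slave.sh would do something like `curl 'http://slave-host:8983/solr/replication?command=fetchindex'`, telling the slave to pull immediately instead of waiting for its poll interval.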
Re: Same id on two shards
Only one should be returned, but it's non-deterministic. See http://wiki.apache.org/solr/DistributedSearch#Distributed_Searching_Limitations -Simon On Sat, Aug 6, 2011 at 6:27 AM, Pooja Verlani pooja.verl...@gmail.com wrote: Hi, We have a multicore solr with 6 cores. We merge the results using shards parameter or distrib handler. I have a problem, I might post one document on one of the cores and then post it after some days on another core, as I have a time-sliced multicore setup! The question is if I retrieve a document which is posted on both the shards, will solr return me only one document or both. And if only one document will be return, which one? Regards, Pooja
Re: bug in termfreq? was Re: is it possible to do a sort without query?
I am trying to test out and compare different sorts and scoring. When I use dismax to search for indie music with: qf=all_lists_text&q=indie+music&defType=dismax&rows=100 I see some results that seem irrelevant, meaning in the top results I see only 1 or 2 mentions of indie music, but when I look further down the list I do see other docs that have more occurrences of indie music. So I want to test by comparing the different queries versus seeing a list of docs ranked specifically by the count of occurrences of the phrase indie music On Mon, Aug 8, 2011 at 2:19 PM, Markus Jelsma markus.jel...@openindex.io wrote: Dismax queries can. But sort=termfreq(all_lists_text,'indie+music') is not using dismax. Apparently the termfreq function can not? I am not familiar with the termfreq function. It simply returns the TF of the given _term_ as it is indexed for the current document. Sorting on TF like this seems strange, as by default queries are already sorted that way since TF plays a big role in the final score. To understand why you'd need to reindex, you might want to read up on how Lucene actually works, to get a basic understanding of how different indexing choices affect what is possible at query time. Lucene In Action is a pretty good book. On 8/8/2011 5:02 PM, Jason Toy wrote: Are not Dismax queries able to search for phrases using the default index (which is what I am using)? If I can already do phrase searches, I don't understand why I would need to reindex to be able to access phrases from a function. On Mon, Aug 8, 2011 at 1:49 PM, Markus Jelsma markus.jel...@openindex.io wrote: Alexei, thank you, that does seem to work. My sort results seem to be totally wrong though, I'm not sure if it's because of my sort function or something else. My query consists of: sort=termfreq(all_lists_text,'indie+music')+desc&q=*:*&rows=100 And I get back 4571232 hits. That's normal, you issue a catch-all query. Sorting should work but.. 
All the results don't have the phrase indie music anywhere in their data. Does termfreq not support phrases? No, it is TERM frequency and indie music is not one term. I don't know how this function parses your input but it might not understand your + escape and think it's one term consisting of exactly that. If not, how can I sort specifically by termfreq of a phrase? You cannot. What you can do is index multiple terms as one term using the shingle filter. Take care, it can significantly increase your index size and number of unique terms. On Mon, Aug 8, 2011 at 1:08 PM, Alexei Martchenko ale...@superdownloads.com.br wrote: You can use the standard query parser and pass q=*:* 2011/8/8 Jason Toy jason...@gmail.com I am trying to list some data based on a function I run, specifically termfreq(post_text,'indie music') and I am unable to do it without passing in data to the q parameter. Is it possible to get a sorted list without searching for any terms? -- *Alexei Martchenko* | *CEO* | Superdownloads ale...@superdownloads.com.br | ale...@martchenko.com.br | (11) 5083.1018/5080.3535/5080.3533 -- - sent from my mobile 6176064373
Re: bug in termfreq? was Re: is it possible to do a sort without query?
If you want to understand and debug the scoring, you can use debugQuery=true to see how different documents score. Most of the time docs with both terms are on top of the result set unless norms are interfering. To understand this, you should check the Solr relevancy wiki, but the Lucene docs are much better although very low level. http://wiki.apache.org/solr/SolrRelevancyCookbook http://lucene.apache.org/java/3_1_0/api/core/org/apache/lucene/search/Similarity.html Your question is more a relevance question than one about the termfreq function. To be short, don't use those kinds of functions if you don't yet understand similarity as described in the Lucene docs. I am trying to test out and compare different sorts and scoring. When I use dismax to search for indie music with: qf=all_lists_text&q=indie+music&defType=dismax&rows=100 I see some results that seem irrelevant, meaning in the top results I see only 1 or 2 mentions of indie music, but when I look further down the list I do see other docs that have more occurrences of indie music. So I want to test by comparing the different queries versus seeing a list of docs ranked specifically by the count of occurrences of the phrase indie music On Mon, Aug 8, 2011 at 2:19 PM, Markus Jelsma markus.jel...@openindex.io wrote: Dismax queries can. But sort=termfreq(all_lists_text,'indie+music') is not using dismax. Apparently the termfreq function can not? I am not familiar with the termfreq function. It simply returns the TF of the given _term_ as it is indexed for the current document. Sorting on TF like this seems strange, as by default queries are already sorted that way since TF plays a big role in the final score. To understand why you'd need to reindex, you might want to read up on how Lucene actually works, to get a basic understanding of how different indexing choices affect what is possible at query time. Lucene In Action is a pretty good book. 
On 8/8/2011 5:02 PM, Jason Toy wrote: Are not Dismax queries able to search for phrases using the default index (which is what I am using)? If I can already do phrase searches, I don't understand why I would need to reindex to be able to access phrases from a function. On Mon, Aug 8, 2011 at 1:49 PM, Markus Jelsma markus.jel...@openindex.io wrote: Alexei, thank you, that does seem to work. My sort results seem to be totally wrong though, I'm not sure if it's because of my sort function or something else. My query consists of: sort=termfreq(all_lists_text,'indie+music')+desc&q=*:*&rows=100 And I get back 4571232 hits. That's normal, you issue a catch-all query. Sorting should work but.. All the results don't have the phrase indie music anywhere in their data. Does termfreq not support phrases? No, it is TERM frequency and indie music is not one term. I don't know how this function parses your input but it might not understand your + escape and think it's one term consisting of exactly that. If not, how can I sort specifically by termfreq of a phrase? You cannot. What you can do is index multiple terms as one term using the shingle filter. Take care, it can significantly increase your index size and number of unique terms. On Mon, Aug 8, 2011 at 1:08 PM, Alexei Martchenko ale...@superdownloads.com.br wrote: You can use the standard query parser and pass q=*:* 2011/8/8 Jason Toy jason...@gmail.com I am trying to list some data based on a function I run, specifically termfreq(post_text,'indie music') and I am unable to do it without passing in data to the q parameter. Is it possible to get a sorted list without searching for any terms? -- *Alexei Martchenko* | *CEO* | Superdownloads ale...@superdownloads.com.br | ale...@martchenko.com.br | (11) 5083.1018/5080.3535/5080.3533
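The debug request for the dismax query under discussion would look like this (parameters taken from the thread):

```
/solr/select?q=indie+music&defType=dismax&qf=all_lists_text&debugQuery=true
```

The explain section of the response breaks each document's score into its tf, idf, boost and fieldNorm contributions, which usually shows exactly why a document with fewer phrase occurrences still ranks higher.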
Re: Same id on two shards
On 8/8/2011 4:07 PM, simon wrote: Only one should be returned, but it's non-deterministic. See http://wiki.apache.org/solr/DistributedSearch#Distributed_Searching_Limitations I had heard it was based on which one responded first. This is part of why we have a small index that contains the newest content and only distribute content to the other shards once a day. The hope is that the small index (less than 1GB, fits into RAM on that virtual machine) will always respond faster than the other larger shards (over 18GB each). Is this an incorrect assumption on our part? The build system does do everything it can to ensure that periods of overlap are limited to the time it takes to commit a change across all of the shards, which should amount to just a few seconds once a day. There might be situations when the index gets out of whack and we have duplicate id values for a longer time period, but in practice it hasn't happened yet. Thanks, Shawn
Re: merge factor performance
What version of Solr are you using? And how are you sending your docs to Solr? Bumping your JVM heap size and bumping your RAM buffer size to 128MB also might help. And where are you getting the docs from? Are you sure that Solr is your problem or is it your data acquisition? (hint: just comment out the call to Solr if you're using SolrJ)... Bottom line: there isn't much information to go on here... And have you seen: http://wiki.apache.org/solr/FAQ#How_can_indexing_be_accelerated.3F Best Erick also what about RAM size (default is 32 MB)? Which other factors do we need to consider? When should we consider optimize? Would any other deviation from the defaults help us in achieving the target? We are allocating a JVM max heap size of 512 MB, with the default concurrent mark sweep set for garbage collection. Thanks Naveen
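The two knobs being discussed live in solrconfig.xml's index settings; a sketch with the values mentioned above (whether they actually help depends on the workload):

```xml
<indexDefaults>
  <ramBufferSizeMB>128</ramBufferSizeMB>  <!-- default is 32 -->
  <mergeFactor>10</mergeFactor>           <!-- higher favors indexing speed over search speed -->
</indexDefaults>
```

A larger RAM buffer means fewer segment flushes during bulk indexing, which is usually the cheaper win before touching mergeFactor.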
Re: MultiSearcher/ParallelSearcher - searching over multiple cores?
I think you'll have to make this go yourself; I don't see how to make Solr do it for you. And even if it could, the scores aren't comparable, so combining them for presentation to the user will be interesting... Best Erick On Thu, Aug 4, 2011 at 2:27 PM, Ralf Musick ra...@gmx.de wrote: Hi Erik, I have several types with different properties, but they are supposed to be combined into one search. Imagine a book with property title and a journal with property name. (The types in my project have of course more complex properties.) So I created a new core with combined search fields: field name is indexed, title is indexed, and some shared properties are indexed, like id. Further, an additional Solr field type is created. Of course there are several indexers, one per type. A specific type indexer stores only the fields of that type and also stores the type information, e.g. book. After indexing, all types are in the same core. To search over all types, the query has to look like this: ((title: bla) and (type: book)) or ((name: bla) and (type: journal)). At least you get books or journals sorted by boost factor - and you have the type information as a return field to distinguish the search results. I hope it is coherent. Thanks for your answer, Best Ralf
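In Lucene query syntax, the combined query Ralf describes would be written like this (field names taken from his example):

```
q=(title:bla AND type:book) OR (name:bla AND type:journal)
```

Because all types live in the same core, this single request returns one ranked list of mixed books and journals, and the stored type field tells the application how to render each hit.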
Re: Records skipped when using DataImportHandler
Spend some time in the admin/analysis page, that'll show you what part of the analysis chain is doing what to your data. It'll save you a world of headache... But at a guess, WordDelimiterFilterFactory is your culprit... Best Erick On Thu, Aug 4, 2011 at 6:08 PM, anand sridhar anand.for...@gmail.com wrote: Ok. After analysis, I narrowed the reduced result set down to the fact that the zipcode field is not indexed 'as is'. i.e. the zipcode field values are broken down into tokens and then stored. Hence, if there are 10 documents with zipcode fields varying from 91000-91009, then the zipcode fields are not stored as 91000, 91001 etc.; instead, the most common recurrences are grabbed together and stored as tokens, hence resulting in a reduced result set. The net effect is I cannot search for a value like 91000 since it's not stored as it is. I suspect this has to do with the type of field the zipcode is associated with. Right now, zipcode is a field of type text_general, where the StandardTokenizerFactory may be breaking the values into tokens. However, I want to store them without tokenizing. What's the best field type to do this? I already explored the String fieldtype which is supposed to store the values as is, but I see that the values are still being tokenized. Thanks, Anand On Wed, Aug 3, 2011 at 7:24 PM, Erick Erickson erickerick...@gmail.com wrote: Sorry, I'm on a restricted machine so can't get the precise URL. But there's a debug page for DIH that might allow you to see what the query actually returns. I'd guess one of two things: 1) you aren't getting the number of rows you think. 2) you aren't committing the documents you add. But that's just a guess. Best Erick On Aug 3, 2011 2:15 PM, anand sridhar anand.for...@gmail.com wrote: Hi, I am a newbie to Solr and have been trying to learn using DataImportHandler. I have a query in data-config.xml that fetches about 5 records when I fire it in SQL Query manager. 
However, when Solr does a full import, it skips 4 records and only imports 1. What could be the reason for that? My data-config.xml looks like this:

<dataConfig>
  <dataSource type="JdbcDataSource" name="GeoService"
              driver="net.sourceforge.jtds.jdbc.Driver"
              url="jdbc:jtds:sqlserver://10.168.50.104/ZipCodeLookup"
              user="sa" password="psiuser"/>
  <document>
    <entity name="city" dataSource="GeoService"
            query="select ll.cityId as id, ll.zip as zipCode, c.cityName as cityName,
                   st.stateName as state, ct.countryName as country
                   from latlonginfo ll, city c, state st, country ct
                   where ll.cityId = c.cityID and c.stateID = st.stateID
                   and st.countryID = ct.countryID order by ll.areacode">
      <field column="zipCode" name="zipCode"/>
      <field column="cityName" name="cityName"/>
      <field column="state" name="state"/>
      <field column="country" name="country"/>
    </entity>
  </document>
</dataConfig>

My field definitions in schema.xml look like this:

<field name="CityName" type="text_general" indexed="true" stored="true"/>
<field name="zipCode" type="text_general" indexed="true" stored="true"/>
<field name="state" type="text_general" indexed="true" stored="true"/>
<field name="country" type="text_general" indexed="true" stored="true"/>

One observation I made is that the one record being indexed is the last record in the result set. I have verified that there are no duplicate records being retrieved. For example, if the result set from the database is:

zipcode  CityName    state  country
-------  ----------  -----  -------
91324    Northridge  CA     USA
91325    Northridge  CA     USA
91327    Northridge  CA     USA
91328    Northridge  CA     USA
91329    Northridge  CA     USA
91330    Northridge  CA     USA

the record being indexed is always the last one. Any suggestions are welcome.

Thanks,
Anand
Re: Same id on two shards
I think the first one to respond is indeed the way it works, but that's only deterministic up to a point (if your small index is in the throes of a commit and everything required for a response happens to be cached on the larger shard... who knows?).

On Mon, Aug 8, 2011 at 7:10 PM, Shawn Heisey s...@elyograg.org wrote:

On 8/8/2011 4:07 PM, simon wrote: Only one should be returned, but it's non-deterministic. See http://wiki.apache.org/solr/DistributedSearch#Distributed_Searching_Limitations

I had heard it was based on which one responded first. This is part of why we have a small index that contains the newest content and only distribute content to the other shards once a day. The hope is that the small index (less than 1 GB, fits into RAM on that virtual machine) will always respond faster than the other, larger shards (over 18 GB each). Is this an incorrect assumption on our part? The build system does everything it can to ensure that periods of overlap are limited to the time it takes to commit a change across all of the shards, which should amount to just a few seconds once a day. There might be situations when the index gets out of whack and we have duplicate id values for a longer period, but in practice it hasn't happened yet.

Thanks,
Shawn
Re: Suggestions for copying fields across cores...
Not that I know of. Separate cores are pretty distinct to Solr, so you're probably stuck with sending the request to each core...

Best
Erick

On Fri, Aug 5, 2011 at 5:51 PM, josh lucas j...@lucasjosh.com wrote:

Is there a suggested way to copy data in fields to additional fields that will only be in a different core? Obviously I could index the data separately, and I could build that into my current indexing process, but I'm curious if there might be an easier, more automated way.

Thanks!
josh
Re: how to enable MMapDirectory in solr 1.4?
Thank you. I will try it.

On Mon, Aug 8, 2011 at 11:18 PM, Rich Cariens richcari...@gmail.com wrote:

We patched our 1.4.1 build with SOLR-1969 (https://issues.apache.org/jira/browse/SOLR-1969, making MMapDirectory configurable) and realized a 64% search performance boost on our Linux hosts.

On Mon, Aug 8, 2011 at 10:05 AM, Dyer, James james.d...@ingrambook.com wrote:

If you want to try MMapDirectory with Solr 1.4, copy the class org.apache.solr.core.MMapDirectoryFactory from 3.x or trunk, and either add it to the .war file (you can just add it under src/java and re-package the war), or put it in its own .jar file in the lib directory under solr_home. Then, in solrconfig.xml, add this entry under the root config element:

<directoryFactory class="org.apache.solr.core.MMapDirectoryFactory"/>

I'm not sure whether MMapDirectory will perform better for you on Linux than NIOFSDir. I'm pretty sure that in trunk/4.0 it's the default for Windows and maybe Solaris. On Windows, there is a definite advantage to using MMapDirectory on a 64-bit system.

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311

-----Original Message-----
From: Li Li [mailto:fancye...@gmail.com]
Sent: Monday, August 08, 2011 4:09 AM
To: solr-user@lucene.apache.org
Subject: how to enable MMapDirectory in solr 1.4?

Hi all, I read the Apache Solr 3.1 release notes today and found that MMapDirectory is now the default implementation on 64-bit systems. I am now using Solr 1.4 with a 64-bit JVM on Linux. How can I use MMapDirectory? Will it improve performance?
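Pulling James's instructions together, the resulting solrconfig.xml would look roughly like this — a sketch assuming the MMapDirectoryFactory class copied from 3.x/trunk is on Solr's classpath:

```xml
<config>
  <!-- ... existing solrconfig.xml contents ... -->

  <!-- Memory-map index files instead of using regular file I/O.
       Requires org.apache.solr.core.MMapDirectoryFactory backported to 1.4. -->
  <directoryFactory class="org.apache.solr.core.MMapDirectoryFactory"/>
</config>
```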