Re: counter field
Yes, before indexing we check whether that document is already in the index, because along with the document we also have metadata that needs to be appended. We have a few multivalued metadata fields, which we update if the same document is found again.

On Fri, Apr 6, 2012 at 10:17 AM, Walter Underwood wun...@wunderwood.org wrote: So you will need to do a search for each document before adding it to the index, in case it is already there. That will be slow. And where do you store the last-assigned number? And there are plenty of other problems, like reloading after a corrupted index (disk failure), or deleted documents which are re-added later, or duplicates, or splitting content across shards (which requires a global lock across all shards to index each document), ... Two recommendations: 1. Having two different unique IDs is likely to cause problems, so choose one. 2. If you must have two IDs, use one table in a lightweight relational database to store the relationships between the md5 value and the serial number. wunder

On Apr 5, 2012, at 9:37 PM, Manish Bafna wrote: Actually, no. If I am updating an existing document, I need to keep the old number. Maybe we can do it this way: if we pass a value for the field, it takes that value; if we don't, it auto-increments. On an update, I will have the old number and will pass it as a field again.

On Fri, Apr 6, 2012 at 9:59 AM, Walter Underwood wun...@wunderwood.org wrote: Why? When you reindex, is it OK if they all change? If you reindex one document, is it OK if it gets a new sequential number? wunder

On Apr 5, 2012, at 9:23 PM, Manish Bafna wrote: We already have a unique key (we use the md5 value). We need another id (sequential numbers).

On Fri, Apr 6, 2012 at 9:47 AM, Chris Hostetter hossman_luc...@fucit.org wrote: : We need to have a document id available for every document (per core). : We can pass docid as one of the parameters for fq, and it will return the : docid in the search result. So it sounds like you need a *unique* id, but nothing you described requires that it be a counter. Take a look at the UUIDField, or consider using the SignatureUpdateProcessor to generate a key based on a hash of all the field values. -Hoss

-- Walter Underwood wun...@wunderwood.org
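For reference, the two options Hoss mentions look roughly like this. A UUID key is a two-line schema.xml change, and the SignatureUpdateProcessor is configured as an update chain in solrconfig.xml (the chain name and the field list below are illustrative, not from this thread):

<!-- schema.xml: auto-generated unique key -->
<fieldType name="uuid" class="solr.UUIDField" indexed="true"/>
<field name="id" type="uuid" indexed="true" stored="true" default="NEW"/>

<!-- solrconfig.xml: hash-based key derived from selected field values -->
<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">id</str>
    <bool name="overwriteDupes">true</bool>
    <str name="fields">title,body</str> <!-- illustrative field list -->
    <str name="signatureClass">solr.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

With overwriteDupes=true, re-adding the same content overwrites the earlier copy instead of requiring a search-before-add; the chain is hooked into the update handler's defaults (update.chain in recent releases).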
Re: It costs so much memory with solrj 3.5; how can I decrease it?
Studying the update test more deeply, I logged every elapsedTime value from the UpdateResponse; the results are listed below. It seems that adding/updating one document generally took almost 20 ms, so I counted calls that spent less than 20 ms adding one document as normal, and the others as abnormal. I couldn't get a proper setup of Solr 1.4, so I used Solr 3.2, which showed the same performance as Solr 1.4 during the test.

solr 3.5 vs solr 3.2:

solr 3.5 - total docs: 31998; total elapsed time: 1218344 ms; average: 38.0744 ms/doc
           normal docs: 28409; normal elapsed time: 442258 ms; average: 15.5675 ms/doc
           normal percentage: 28409/31998 = 88.78%; abnormal docs: 3590

solr 3.2 - total docs: 31998; total elapsed time: 852935 ms; average: 26.6559 ms/doc
           normal docs: 28416; normal elapsed time: 443045 ms; average: 15.5914 ms/doc
           normal percentage: 28409/31998 = 88.80%; abnormal docs: 3160

What can be analyzed from this? B.R. murphy

On Fri, Apr 6, 2012 at 10:28 AM, a sd liurx.cn@gmail.com wrote: Hi, Erick. Thanks, first of all. I watched the JVM status at runtime with jconsole and jmap. 1. When Xmx was not assigned, the Old Gen area filled up to 1.5 GB, and its major content was String instances; when the whole heap reached its maximum (about 2 GB), the JVM ran gc(), which wasted CPU time, and performance degraded sharply, from 100,000 docs per minute to 10,000 docs per minute. As a test, I deliberately assigned Xmx=1024m, and throughput dropped to 1,000 docs per minute. 2. When I assigned Xmx=4096m, I found that the Old Gen grew to 2.1 GB and the total JVM size to 3 GB, but the 100,000 docs per minute could be attained. During all of the tests above I only adjusted the client settings; it connects to the identical Solr server, and I empty the data directory of the Solr home before every test. By the way, I know the client code is very ugly and occupies a lot of heap too, but I am not permitted to improve it before I obtain a benchmark using solrj 3.5 matching what the old version did using solrj 1.4. B.R murphy

On Fri, Apr 6, 2012 at 5:54 AM, Erick Erickson erickerick...@gmail.com wrote: What's memory? Really, how are you measuring it? If it's virtual, you don't need to worry about it. Is this causing you a real problem or are you just nervous about the difference? Best Erick

On Wed, Apr 4, 2012 at 11:23 PM, a sd liurx...@gmail.com wrote: Hi, all. I have written a program which sends data to Solr using the update request handler. When I used the client library (namely solrj) with version 4.0 or 3.2, the JVM heap size was about 1.0 GB, but when I moved everything to Solr 3.5 (both server and client libs), the heap size climbed to 3.0 GB! The server configuration and the program are identical. What's wrong with the new solrj 3.5? I have looked at the source code, and there is no difference between solrj 3.2 and solrj 3.5 in the code paths my program invokes. What can I do to decrease the memory used by solrj 3.5? Any advice will be appreciated! murphy
Re: Choosing tokenizer based on language of document
Hi, Yes, I agree it is not an easy issue. Indexing all languages with the appropriate char filter, tokenizer, and filters for each language is not possible without developing a new text type and a new analyzer. If you plan to index up to 10 different languages, I suggest one text field per language or one index per language. One field for all languages can be interesting if you plan to index a lot of different languages in the same index. In that case, having one field per language (text_en, text_fr, ...) can be complicated if you want the user to be able to retrieve documents in any language with one query: the query becomes complex if you have 50 different languages (text_en:... OR text_fr:... OR ...). In order to achieve this you will need to develop a specific analyzer. This analyzer is in charge of using the correct char filter, tokenizer, and filters for the language of the document. You will need a configurable analyzer in order to change language-specific settings (enable stemming or not, choose a specific stopwords file, ...). I did this several years ago for Solr 1.4.1, and it still works for Solr 3.x. The drawback of that analyzer is that all language settings are hard-coded (tokenizer, filters, stopwords, ...). With Solr 4.0 the analyzer does not work anymore, so I decided to redevelop it in order to configure all language settings in an external configuration file, with nothing hard-coded. I had to develop the analyzer but also a field type. The main issue is that the analyzer is not aware of the values in other fields, so it is not possible to use another field to specify the content language. The only way I found is to start the content with a specific char sequence: [en]... or [fr]... The analyzer needs to know the language of the query too, so query criteria for the multilingual field have to include the specific char sequence: [en]... If you are interested in this work, let me know. If someone knows a way other than mine to provide the content language to the analyzer at index time, or the query language at query time, I am interested :). Regards, Dominique

On 05/04/12 23:36, Erick Erickson wrote: This is really difficult to imagine working well. Even if you do choose the appropriate analysis chain (and it must be a chain here), and manage to appropriately tokenize for each language, what happens at query time? How do you expect to get matches on, say, Ukrainian when the tokens of the query are in Erse? This feels like an XY problem; can you explain at a higher level what your requirements are? Best Erick

On Wed, Apr 4, 2012 at 8:29 AM, Prakashganesh, Prabhu prabhu.prakashgan...@dowjones.com wrote: Hi, I have documents in different languages and I want to choose the tokenizer to use for a document based on the language of the document. The language of the document is already known and is indexed in a field. What I want to do is, when I index the text in the document, choose the tokenizer based on the value of the language field. I want to use one field for the text in the document (defining multiple fields for each language is not an option). It seems I can define a tokenizer for a field, so I guess what I need to do is write a custom tokenizer that looks at the language field value of the document and calls the appropriate tokenizer for that language (e.g. StandardTokenizer for English, CJKTokenizer for CJK languages, etc.).
From whatever I have read, it seems quite straightforward to write a custom tokenizer, but how would this custom tokenizer know the language of the document? Is there some way I can pass this value in to the tokenizer? Or is there some way the tokenizer can have access to other fields in the document? It would be really helpful if someone could provide an answer. Thanks Prabhu
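For the field-per-language route Dominique describes, a minimal schema.xml sketch could look like this (two languages shown; the type names and filter choices are illustrative, not from this thread):

<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
  </analyzer>
</fieldType>
<fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.CJKTokenizerFactory"/>
  </analyzer>
</fieldType>

<field name="text_en"  type="text_en"  indexed="true" stored="true"/>
<field name="text_cjk" type="text_cjk" indexed="true" stored="true"/>

Each document is then indexed into the field matching its language, and cross-language search is the OR-of-fields query discussed above.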
Re: A tool for frequent re-indexing...
I've implemented something like what is described in https://issues.apache.org/jira/browse/SOLR-3246. The idea is to add an update request processor at the end of the update chain in the core you want to copy. The processor converts each SolrInputDocument to XML (there is a utility method for doing this) and dumps the XML into a file which can be fed into Solr again with curl. If you have many documents you will probably want to distribute the XML files into different directories using some common prefix in the id field.

On Fri, Apr 6, 2012 at 5:18 AM, Ahmet Arslan iori...@yahoo.com wrote: I am considering writing a small tool that would read from one Solr core and write to another as a means of quickly re-indexing data. I have a large-ish set (hundreds of thousands) of documents that I've already parsed with Tika, and I keep changing bits and pieces in the schema and config to try new things. Instead of having to go through the process of re-indexing from docs (and some DBs), I thought it may be much faster to just read from one core and write into a new core with the new schema, analysers and/or settings. I was wondering if anyone else has done anything similar already? It would be handy if I could use this sort of thing to spin off another core, write to it, and then swap the two cores, discarding the older one. You might find these relevant: https://issues.apache.org/jira/browse/SOLR-3246 http://wiki.apache.org/solr/DataImportHandler#SolrEntityProcessor
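Wiring such a dumping processor in is just solrconfig.xml plumbing; a sketch, under the assumption that DumpingUpdateProcessorFactory is your own class along the lines of SOLR-3246 (the utility method referred to is presumably ClientUtils.toXML(SolrInputDocument) from SolrJ):

<updateRequestProcessorChain name="dump">
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
  <!-- hypothetical custom processor: writes each doc as XML under dir -->
  <processor class="com.example.DumpingUpdateProcessorFactory">
    <str name="dir">/var/solr/dump</str>
  </processor>
</updateRequestProcessorChain>

Assuming each dump file wraps its <doc> elements in an <add> element, the files can then be re-posted to the new core with something like:

curl http://localhost:8983/solr/newcore/update -H 'Content-type:text/xml' --data-binary @dump/doc1.xml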
Re: Creating a query-able dictionary using Solr
Hi Joel, Not an advanced Solr user myself - I've only been looking at it for a while. Still, maybe you are looking for a suggester? http://wiki.apache.org/solr/Suggester (the examples at the bottom of the page are very helpful). I haven't worked with PDF documents in Solr yet, but the suggester does seem to have the behavior you're looking for (when generating the suggestions from an index). Kind regards, Serdyn du Toit

On Tue, Mar 6, 2012 at 6:25 AM, Beach, Joel jtbe...@qualcomm.com wrote: Hi there, I am looking at using Solr to perform the following tasks: 1. Push a lot of PDF documents into Solr. 2. Build a database of all the words encountered in those documents. 3. Be able to query for a list of words matching a string like a*. For example, if the collection contains the words aardvark, apple, doctor and zebra, I would expect a query of a* to return the list: [ aardvark, apple ]. I have googled around for this in Solr and found similar things involving spell-checkers, but nothing that seems exactly the same. Has anyone already done this, or something similar, in Solr and is willing to point me in the right direction? Cheers, Joel
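A minimal Suggester setup following the wiki pattern looks like this in solrconfig.xml (the source field name is an assumption; point it at whatever field holds the extracted PDF text):

<searchComponent name="suggest" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">suggest</str>
    <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
    <str name="lookupImpl">org.apache.solr.spelling.suggest.tst.TSTLookup</str>
    <str name="field">text</str>            <!-- field to build suggestions from -->
    <str name="buildOnCommit">true</str>
  </lst>
</searchComponent>

<requestHandler name="/suggest" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="spellcheck">true</str>
    <str name="spellcheck.dictionary">suggest</str>
    <str name="spellcheck.count">10</str>
  </lst>
  <arr name="components">
    <str>suggest</str>
  </arr>
</requestHandler>

A request like /suggest?q=a then returns the indexed words starting with "a".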
Re: A little confusion with maxPosAsterisk
Let's first figure out why reversing a token is helpful for leading wildcard searches. I'll assume you refer to ReversedWildcardFilterFactory. If you have the query *foo, then with a straightforward approach you would need to scan the entire dictionary of terms (which can be billions) in your Solr index and try to match the suffix foo (which can follow any prefix) = very time-consuming and non-optimal. If we use the ReversedWildcardFilterFactory instead, it reverses every token in the index and stores both: koongfoo (the original token) and oofgnook (the reversed token). Now when searching for *foo, we can also reverse it, to oof*, and scan only the part of the dictionary with terms starting with the letter o; further, by applying a binary search, we can jump directly to the tokens that have oof as their prefix. Thus we have turned an ineffective suffix search into an effective prefix search. Back to your question: the maxPosAsterisk parameter controls when an ineffective suffix query term should be identified and an effective prefix query term generated from it. With the default of 2, both *foo and f*oo (as examples) are treated as suffix queries to be turned into the prefix queries oof* and oo*f. Hope this helps. Dmitry

On Fri, Apr 6, 2012 at 5:11 AM, neosky neosk...@yahoo.com wrote: "maxPosAsterisk - maximum position (1-based) of the asterisk wildcard ('*') that triggers the reversal of the query term. An asterisk that occurs at a position higher than this value will not cause the reversal of the query term. Defaults to 2, meaning that asterisks at positions 1 and 2 will cause a reversal." I can't understand "will cause a reversal". I know Solr will keep both the original token and the reversed token when the withOriginal parameter is on. Does that mean the searcher will use the reversed one to help process the query when it "causes a reversal"?

-- Regards, Dmitry Kan
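For context, the reversal is configured entirely on the index-time analyzer in schema.xml; a sketch along the lines of the stock example schema (the tokenizer choice is illustrative):

<fieldType name="text_rev" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- index both the original and the reversed form of each token -->
    <filter class="solr.ReversedWildcardFilterFactory" withOriginal="true"
            maxPosAsterisk="2" maxPosQuestion="1" maxFractionAsterisk="0.33"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

The query parser notices the filter on the index-time analyzer and rewrites qualifying wildcard terms against the reversed tokens automatically, which is why the query-time analyzer does not include it.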
Re: SolrCloud replica and leader out of Sync somehow
awesome Yonik. I'll indeed try this. Thanks! On Thu, Apr 5, 2012 at 10:20 AM, Yonik Seeley yo...@lucidimagination.com wrote: On Thu, Apr 5, 2012 at 12:19 AM, Jamie Johnson jej2...@gmail.com wrote: Not sure if this got lost in the shuffle, were there any thoughts on this? Sorting by id could be pretty expensive (memory-wise), so I don't think it should be the default or anything. We also need a way for a client to hit the same set of servers again anyway (to handle other possible variations like commit time). To handle the tiebreak stuff, you could also sort by _version_ - that should be unique in an index, is already used under the covers, and hence shouldn't add any extra memory overhead. Versions increase over time, so _version_ desc should give you newer documents first. -Yonik lucenerevolution.com - Lucene/Solr Open Source Search Conference. Boston May 7-10

On Wed, Mar 21, 2012 at 11:02 AM, Jamie Johnson jej2...@gmail.com wrote: Given that in a distributed environment the docids are not guaranteed to be the same across shards, should the sorting use the uniqueId field as the tiebreaker by default? On Tue, Mar 20, 2012 at 2:10 PM, Yonik Seeley yo...@lucidimagination.com wrote: On Tue, Mar 20, 2012 at 2:02 PM, Jamie Johnson jej2...@gmail.com wrote: I'll try to dig for the JIRA. Also, I'm assuming this could happen on any sort, not just score, correct? Meaning if we sorted by a date field and there were duplicates in that date field, order wouldn't be guaranteed for the same reasons, right? Correct - internal docid is the tiebreaker for all sorts. -Yonik lucenerevolution.com - Lucene/Solr Open Source Search Conference. Boston May 7-10
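As a concrete sketch of that tiebreak (the host, core, and primary sort field here are illustrative):

http://localhost:8983/solr/select?q=*:*&sort=date+desc,_version_+desc

The trailing _version_ desc clause makes the result order deterministic across replicas whose internal docids differ, and favors newer documents among otherwise-equal ones.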
Re: Is there any performance cost of using lots of OR in the solr query
On 4/5/2012 3:49 PM, Erick Erickson wrote: Of course putting more clauses in an OR query will have a performance cost; there's more work to do. OK, smart-alec remarks aside: you will probably be fine with a few hundred clauses. The question is simply whether the performance hit is acceptable. I'm afraid that question can't be answered in the abstract, you'll have to test... Since you're putting them in an fq, there's also some chance that they'll be re-used from the cache, at least if there are common patterns.

Roz, I have a similar situation going on in my index. Because employees have access to far more than real users, they get filter queries constructed with a HUGE number of clauses. We have implemented a new field for a feature that we call "search groups", but it has not penetrated all aspects of the application yet. Also, until we can make those groups use a hierarchy, which is not a trivial undertaking, we may be stuck with large filter queries. These complex filters have led to a problem that you have probably not considered - really long filterCache autowarm times. I have reduced the autowarmCount on my filterCache to FOUR, and there are still times that the autowarm takes up to 60 seconds. Most of the time it is only a few seconds, with up to 30 seconds being relatively common. I just thought of a new localparam feature for this situation and filed SOLR-. I will talk to our developers about using the existing localparam that skips the filterCache entirely. Thanks, Shawn
Re: counter field
On 4/5/2012 1:53 AM, Manish Bafna wrote: Hi, Is it possible to define a field as a counter column which can be auto-incremented? Manish, Where does your data come from? Can you add the autoincrement field to the data source? My data comes from MySQL, where the primary key is an autoincrement field. MySQL is very good at autoincrement fields. Walter, we do have two unique ID values in our system, enforced by MySQL, and it hasn't caused us any problems yet. One is the autoincrement field just mentioned and the other is another id that is specific to our application. We use the autoincrement field to identify deleted documents and as a position indicator for the build program when adding new documents to Solr. The other unique field is Solr's uniqueKey. Thanks, Shawn
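If the source tables can be altered, MySQL will do the counting for you; an illustrative statement (the table and column names are made up):

ALTER TABLE documents
  ADD COLUMN did BIGINT NOT NULL AUTO_INCREMENT UNIQUE;

MySQL backfills existing rows with sequential values and assigns the next value on every later INSERT; the UNIQUE index (or PRIMARY KEY, if the table has none) satisfies AUTO_INCREMENT's requirement that the column be keyed.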
SolrCloud Zookeeper view does not work on latest snapshot
I just downloaded the latest snapshot and fired it up to take a look around and I'm getting the following error when looking at the Cloud view. Loading of undefined failed with HTTP-Status 404 The request I see going out is as follows http://localhost:8501/solr/slice1_shard1/zookeeper?wt=json this doesn't work but this does http://localhost:8501/solr/zookeeper?wt=json Any thoughts why this would happen?
Re: SolrCloud Zookeeper view does not work on latest snapshot
I looked at our old system and indeed it used to make a call to /solr/zookeeper, not /solr/corename/zookeeper. I am making a change locally so I can run with this, but is this a bug or did I muck something up with my configuration? On Fri, Apr 6, 2012 at 9:33 AM, Jamie Johnson jej2...@gmail.com wrote: I just downloaded the latest snapshot and fired it up to take a look around and I'm getting the following error when looking at the Cloud view. Loading of undefined failed with HTTP-Status 404 The request I see going out is as follows http://localhost:8501/solr/slice1_shard1/zookeeper?wt=json this doesn't work but this does http://localhost:8501/solr/zookeeper?wt=json Any thoughts why this would happen?
Re: waitFlush and waitSearcher with SolrServer.add(docs, commitWithinMs)
You've got it. That's the post I was talking about; I was rushed and couldn't find it quickly... LucidWorks Enterprise uses a trunk version of Solr, so DWPT is in that code in 2.0. For Solr-only, you can just check out a trunk build. Best Erick

On Thu, Apr 5, 2012 at 7:54 PM, Mike O'Leary tmole...@uw.edu wrote: First of all, what I was seeing was different from what I thought I was seeing, because a few weeks ago I uncommented the autoCommit block in the solrconfig.xml file and didn't realize it until yesterday just before I went home, so that was controlling the commits more than the add and commit calls that I was making. When I commented that block out again, the times for indexing with add(docs, commitWithinMs) and with add(docs) plus commit(false, false) were very similar. Both of them were about 20 minutes faster (38 minutes instead of about an hour) than indexing with autoCommit set to commit after every 1,000 documents or fifteen minutes. Is this the blog post you are talking about: http://www.searchworkings.org/blog/-/blogs/gimme-all-resources-you-have-i-can-use-them!/? It seems to be about the right topic. I am using Solr 3.5. The feature matrix on one of the Lucid Imagination web pages says that DocumentsWriterPerThread is available in Solr 4.0 and LucidWorks 2.0. I assume that means LucidWorks Enterprise. Is that right? Thanks, Mike

-Original Message- From: Erick Erickson [mailto:erickerick...@gmail.com] Sent: Thursday, April 05, 2012 2:45 PM To: solr-user@lucene.apache.org Subject: Re: waitFlush and waitSearcher with SolrServer.add(docs, commitWithinMs) Solr version? I suspect your outlier is due to merging segments; if so, this should have happened quite some time into the run. See Simon Willnauer's blog post on the DocumentsWriterPerThread (trunk) code. What commitWithin time are you using? Best Erick

On Wed, Apr 4, 2012 at 7:50 PM, Mike O'Leary tmole...@uw.edu wrote: I am indexing some database contents using add(docs, commitWithinMs), and those add calls are taking over 80% of the time once the database begins returning results. I was wondering if setting waitSearcher to false would speed this up. Many of the calls take 1 to 6 seconds, with one outlier that took over 11 minutes. Thanks, Mike

-Original Message- From: Mark Miller [mailto:markrmil...@gmail.com] Sent: Wednesday, April 04, 2012 4:15 PM To: solr-user@lucene.apache.org Subject: Re: waitFlush and waitSearcher with SolrServer.add(docs, commitWithinMs) On Apr 4, 2012, at 6:50 PM, Mike O'Leary wrote: If you index a set of documents with SolrJ and use StreamingUpdateSolrServer.add(Collection<SolrInputDocument> docs, int commitWithinMs), it will perform a commit within the time specified, and it seems to use default values for waitFlush and waitSearcher. Is there a place where you can specify different values for waitFlush and waitSearcher, or if you want to use different values do you have to call StreamingUpdateSolrServer.add(Collection<SolrInputDocument> docs) and then call StreamingUpdateSolrServer.commit(waitFlush, waitSearcher) explicitly? Thanks, Mike

waitFlush actually does nothing in recent versions of Solr. waitSearcher doesn't seem so important when the commit is not done explicitly by the user or a client. - Mark Miller lucidimagination.com
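The autoCommit block Mike mentions lives in solrconfig.xml; settings matching his "every 1,000 documents or fifteen minutes" test would look like this (commenting the whole block out disables autoCommit):

<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxDocs>1000</maxDocs>     <!-- commit after 1,000 pending docs -->
    <maxTime>900000</maxTime>   <!-- ...or after 15 minutes (in ms), whichever comes first -->
  </autoCommit>
</updateHandler>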
Re: Is there any performance cost of using lots of OR in the solr query
Shawn: Ahhh, so *that* was what your JIRA was about. Consider https://issues.apache.org/jira/browse/SOLR-2429 for your ACL calculations; that's what it was developed for. The basic idea is that you can write a custom filter that decides whether a document should be included in the result set, and that is only called _after_ all other clauses (search and fqs) have been satisfied. Here's the issue: normally, fqs are calculated across the entire document set. That's what allows them to be cached and re-used. But, as you've found, doing ACL calculations for the entire document set is expensive, so this is an attempt at a lower-cost alternative. The downside is that it is NOT cached, so it must be calculated anew each time. But it's only calculated for a subset of documents. Best Erick

On Fri, Apr 6, 2012 at 9:00 AM, Shawn Heisey s...@elyograg.org wrote: On 4/5/2012 3:49 PM, Erick Erickson wrote: Of course putting more clauses in an OR query will have a performance cost; there's more work to do. OK, smart-alec remarks aside: you will probably be fine with a few hundred clauses. The question is simply whether the performance hit is acceptable. I'm afraid that question can't be answered in the abstract, you'll have to test... Since you're putting them in an fq, there's also some chance that they'll be re-used from the cache, at least if there are common patterns. Roz, I have a similar situation going on in my index. Because employees have access to far more than real users, they get filter queries constructed with a HUGE number of clauses. We have implemented a new field for a feature that we call "search groups", but it has not penetrated all aspects of the application yet. Also, until we can make those groups use a hierarchy, which is not a trivial undertaking, we may be stuck with large filter queries. These complex filters have led to a problem that you have probably not considered - really long filterCache autowarm times. I have reduced the autowarmCount on my filterCache to FOUR, and there are still times that the autowarm takes up to 60 seconds. Most of the time it is only a few seconds, with up to 30 seconds being relatively common. I just thought of a new localparam feature for this situation and filed SOLR-. I will talk to our developers about using the existing localparam that skips the filterCache entirely. Thanks, Shawn
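For reference, the localparams that came out of SOLR-2429 look like this on a filter (the field and group values are illustrative):

fq={!cache=false cost=150}acl_groups:(groupA OR groupB)

cache=false alone keeps the filter out of the filterCache (and therefore out of autowarming); a cost of 100 or more additionally runs it as a post filter, after all other clauses, provided the underlying query type implements the PostFilter interface.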
RE: upgrade solr from 1.4 to 3.5 not working
Note that I am trying to upgrade from the Lucid Imagination distribution of Solr 1.4, dunno if that makes a difference. We have an existing index of 11 million documents which I am trying to preserve in the upgrade process.

-Original Message- From: Robert Petersen [mailto:rober...@buy.com] Sent: Thursday, April 05, 2012 2:21 PM To: solr-user@lucene.apache.org Subject: upgrade solr from 1.4 to 3.5 not working

Hi folks, I'm a little stumped here. I have an existing solr 1.4 setup which is well configured. I want to upgrade to the latest solr release, and after reading release notes, the wiki, etc., I concluded the correct path would be to not change any config items and just replace the solr.war file in tomcat's webapps folder with the new one and then start tomcat back up. This worked fine, solr came up. The problem is that on the solr info page it still says that I am running solr 1.4, even after several restarts and even a server reboot. Am I missing something? Info says this, though there is no solr 1.4 war file anywhere under the tomcat root:

Solr Specification Version: 1.4.0.2009.12.10.10.34.34
Solr Implementation Version: 1.4 exported - sam - 2009-12-10 10:34:34
Lucene Specification Version: 2.9.1
Lucene Implementation Version: 2.9.1 exported - 2009-12-10 10:32:14
Current Time: Thu Apr 05 12:56:12 PDT 2012
Server Start Time: Thu Apr 05 12:52:25 PDT 2012

Any help would be appreciated. Thanks Robi
RE: upgrade solr from 1.4 to 3.5 not working
OK, I found in the tomcat documentation that I not only have to drop the war file into webapps but also have to delete the expanded version of the war that tomcat creates. Now tomcat doesn't find the velocity response writer, which I seem to recall seeing some note about. I'll try to find that again. Thanks for the help? Oh well...

-Original Message- From: Robert Petersen [mailto:rober...@buy.com] Sent: Friday, April 06, 2012 8:27 AM To: solr-user@lucene.apache.org Subject: RE: upgrade solr from 1.4 to 3.5 not working

Note that I am trying to upgrade from the Lucid Imagination distribution of Solr 1.4, dunno if that makes a difference. We have an existing index of 11 million documents which I am trying to preserve in the upgrade process.

-Original Message- From: Robert Petersen [mailto:rober...@buy.com] Sent: Thursday, April 05, 2012 2:21 PM To: solr-user@lucene.apache.org Subject: upgrade solr from 1.4 to 3.5 not working

Hi folks, I'm a little stumped here. I have an existing solr 1.4 setup which is well configured. I want to upgrade to the latest solr release, and after reading release notes, the wiki, etc., I concluded the correct path would be to not change any config items and just replace the solr.war file in tomcat's webapps folder with the new one and then start tomcat back up. This worked fine, solr came up. The problem is that on the solr info page it still says that I am running solr 1.4, even after several restarts and even a server reboot. Am I missing something? Info says this, though there is no solr 1.4 war file anywhere under the tomcat root:

Solr Specification Version: 1.4.0.2009.12.10.10.34.34
Solr Implementation Version: 1.4 exported - sam - 2009-12-10 10:34:34
Lucene Specification Version: 2.9.1
Lucene Implementation Version: 2.9.1 exported - 2009-12-10 10:32:14
Current Time: Thu Apr 05 12:56:12 PDT 2012
Server Start Time: Thu Apr 05 12:52:25 PDT 2012

Any help would be appreciated. Thanks Robi
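For anyone following along, the full redeploy sequence on a stock Tomcat layout is roughly this (the paths and service name are illustrative; adjust to your installation):

sudo service tomcat6 stop
rm -rf /var/lib/tomcat6/webapps/solr       # the stale expanded webapp
cp apache-solr-3.5.0.war /var/lib/tomcat6/webapps/solr.war
sudo service tomcat6 start                 # Tomcat expands the new war

Deleting the expanded directory matters because Tomcat will otherwise keep serving the old exploded copy rather than unpacking the new war.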
Re: schema design question
I'd consider a field like associated_with_album, and a field that identifies the kind of record this is, track or album. Then you can form a query like -associated_with_album:true (where '-' is the Lucene NOT). And then group by kind to get separate groups of albums and tracks. Hope this helps. Erick

On Thu, Apr 5, 2012 at 9:00 PM, N. Tucker ntucker-ml-solr-us...@august20th.com wrote: Apologies if this is a very straightforward schema design problem that should be fairly obvious, but I'm not seeing a good way to do it. Let's say I have an index that wants to model Albums and Tracks, and they all have arbitrary tags attached to them (represented by multivalue string type fields). Tracks also have an album id field which can be used to associate them with an album. I'd like to perform a query which shows both Track and Album results, but suppresses Tracks that are associated with Albums in the result set. I am tempted to use a join here, but I have reservations because it is my understanding that joins cannot work across shards, and I'm not sure it's a good idea to limit myself in that way if possible. Any suggestions? Is there a standard solution to this type of problem where you've got hierarchical items and you don't want children shown in the same result as the parent?
Re: A little confusion with maxPosAsterisk
great! thanks!
Re: schema design question
Thanks, but I don't want to exclude all tracks that are associated with albums; I want to exclude tracks that are associated with albums *which match the query* (tracks and their associated albums may have different tags). I don't think your suggestion covers that.

On Fri, Apr 6, 2012 at 9:35 AM, Erick Erickson erickerick...@gmail.com wrote: I'd consider a field like associated_with_album, and a field that identifies the kind of record this is, track or album. Then you can form a query like -associated_with_album:true (where '-' is the Lucene NOT). And then group by kind to get separate groups of albums and tracks. Hope this helps. Erick

On Thu, Apr 5, 2012 at 9:00 PM, N. Tucker ntucker-ml-solr-us...@august20th.com wrote: Apologies if this is a very straightforward schema design problem that should be fairly obvious, but I'm not seeing a good way to do it. Let's say I have an index that wants to model Albums and Tracks, and they all have arbitrary tags attached to them (represented by multivalue string type fields). Tracks also have an album id field which can be used to associate them with an album. I'd like to perform a query which shows both Track and Album results, but suppresses Tracks that are associated with Albums in the result set. I am tempted to use a join here, but I have reservations because it is my understanding that joins cannot work across shards, and I'm not sure it's a good idea to limit myself in that way if possible. Any suggestions? Is there a standard solution to this type of problem where you've got hierarchical items and you don't want children shown in the same result as the parent?
SolrEntityProcessor Configuration Problem
Dear all, I'm facing a problem with SolrEntityProcessor when it is configured under a JDBC datasource. My configuration looks like this:

<entity name="V_MARKET_STUDIES" datasource="jdbc-2"
        query="select * from V_MARKET_STUDIES" transformer="ClobTransformer">
  <field column="ID" name="id" />
  <field column="TYPE" name="type" />
  <field column="LOCALE" name="locale" />
  <field column="TITLE" name="title" />
  <field column="KEYWORDS" name="keywords" clob="true" />
  <field column="TOPICS" name="topics" />
  <field column="EXTENDED_KEYWORDS" name="extended_keywords" clob="true" />
  <field column="PUBLICATION_DATE" name="publication_date" />
  <field column="OWNER" name="owner" />
  <field column="DL_FILE_ENTRY_ID" name="dl_file_entry_id" />
  <field column="DL_FILE_VERSION_ID" name="dl_file_version_id" />
  <field column="DL_FOLDER_ID" name="dl_folder_id" />
  <field column="FILE_NAME" name="file_name" />
  <field column="EXTENSION" name="extension" />
  <field column="URL_LINK" name="urllink" />
  <entity name="sep" processor="SolrEntityProcessor" fl="content"
          url="http://vmcenter120:8983/solr/"
          query="folderId:${V_MARKET_STUDIES.DL_FOLDER_ID}"
          fq="entryClassPK:${V_MARKET_STUDIES.DL_FILE_ENTRY_ID}">
    <field column="content" name="content" />
  </entity>
</entity>

I have 6 rows in the Oracle database, but only the first row is processed correctly, meaning the second Solr is queried and the results go into the document; the remaining 5 rows are processed without querying the second Solr and therefore don't have the content field filled. Any suggestions? Did I configure something wrong, or misunderstand something? Thanks for your help. Best regards, Michael
solr analysis-extras configuration
Hello, I'm running into an odd problem trying to use ICUTokenizer under a Solr installation running under Tomcat on Ubuntu. It seems that all the appropriate JAR files are loaded:

INFO: Adding 'file:/usr/share/solr/lib/lucene-stempel-3.5.0.jar' to classloader
INFO: Adding 'file:/usr/share/solr/lib/lucene-smartcn-3.5.0.jar' to classloader
INFO: Adding 'file:/usr/share/solr/lib/icu4j-4_8_1_1.jar' to classloader
INFO: Adding 'file:/usr/share/solr/lib/lucene-icu-3.5.0.jar' to classloader
INFO: Adding 'file:/usr/share/solr/lib/apache-solr-analysis-extras-3.5.0.jar' to classloader

... but later: ... SEVERE: java.lang.NoClassDefFoundError: org/apache/lucene/analysis/icu/segmentation/ICUTokenizer

I'm not too clear on the correct way to add the contrib bits other than copying them into the 'lib' directory under solrhome. They are obviously found there (and I have verified that ICUTokenizer is in lucene-icu-3.5.0.jar), but there's still a problem loading the ICUTokenizer class. Any tips on troubleshooting this? Are there more dependencies that I'm unaware of?
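For what it's worth, the field type that exercises the tokenizer looks like this in schema.xml (a minimal sketch; the folding filter is optional):

<fieldType name="text_icu" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
</fieldType>

Both factories live in the analysis-extras contrib, which is why apache-solr-analysis-extras, lucene-icu, and icu4j all need to be visible to the same classloader that loads the schema.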
Re: SolrCloud Zookeeper view does not work on latest snapshot
There have been a bunch of changes getting the zookeeper info and UI looking good. The info moved from being on the core to using a servlet at the root level. Note, it is not a request handler anymore, so the wt=XXX has no effect. It is always JSON ryan On Fri, Apr 6, 2012 at 7:01 AM, Jamie Johnson jej2...@gmail.com wrote: I looked at our old system and indeed it used to make a call to /solr/zookeeper not /solr/corename/zookeeper. I am making a change locally so I can run with this but is this a bug or did I much something up with my configuration? On Fri, Apr 6, 2012 at 9:33 AM, Jamie Johnson jej2...@gmail.com wrote: I just downloaded the latest snapshot and fired it up to take a look around and I'm getting the following error when looking at the Cloud view. Loading of undefined failed with HTTP-Status 404 The request I see going out is as follows http://localhost:8501/solr/slice1_shard1/zookeeper?wt=json this doesn't work but this does http://localhost:8501/solr/zookeeper?wt=json Any thoughts why this would happen?
Re: Solr dismax not returning expected results
Adding autoGeneratePhraseQueries=true to my field definitions has solved the problem.
Re: solr analysis-extras configuration
Further info: I can make this work if I stay out of Tomcat -- I download a fresh Solr binary distro, copy those five JARs from 'dist' and 'contrib' into example/solr/lib/, copy in my solrconfig.xml and schema.xml, and run 'java -jar start.jar', and it works fine. But adding those same JARs to my Tomcat instance's solrhome/lib doesn't work. Any ideas on how to troubleshoot?

On Fri, Apr 6, 2012 at 12:15 PM, N. Tucker ntucker-ml-solr-us...@august20th.com wrote: Hello, I'm running into an odd problem trying to use ICUTokenizer under a Solr installation running under Tomcat on Ubuntu. It seems that all the appropriate JAR files are loaded:

INFO: Adding 'file:/usr/share/solr/lib/lucene-stempel-3.5.0.jar' to classloader
INFO: Adding 'file:/usr/share/solr/lib/lucene-smartcn-3.5.0.jar' to classloader
INFO: Adding 'file:/usr/share/solr/lib/icu4j-4_8_1_1.jar' to classloader
INFO: Adding 'file:/usr/share/solr/lib/lucene-icu-3.5.0.jar' to classloader
INFO: Adding 'file:/usr/share/solr/lib/apache-solr-analysis-extras-3.5.0.jar' to classloader

... but later: ... SEVERE: java.lang.NoClassDefFoundError: org/apache/lucene/analysis/icu/segmentation/ICUTokenizer

I'm not too clear on the correct way to add the contrib bits other than copying them into the 'lib' directory under solrhome. They are obviously found there (and I have verified that ICUTokenizer is in lucene-icu-3.5.0.jar), but there's still a problem loading the ICUTokenizer class. Any tips on troubleshooting this? Are there more dependencies that I'm unaware of?
Re: SolrCloud Zookeeper view does not work on latest snapshot
Thanks Ryan. So to be clear, this is a bug then? I went into cloud.js and changed the URL used to access this information so that it would work; wasn't sure if it was kosher or not. On 4/6/12, Ryan McKinley ryan...@gmail.com wrote: There have been a bunch of changes getting the zookeeper info and UI looking good. The info moved from being on the core to using a servlet at the root level. Note, it is not a request handler anymore, so the wt=XXX has no effect. It is always JSON ryan On Fri, Apr 6, 2012 at 7:01 AM, Jamie Johnson jej2...@gmail.com wrote: I looked at our old system and indeed it used to make a call to /solr/zookeeper, not /solr/corename/zookeeper. I am making a change locally so I can run with this, but is this a bug or did I muck something up with my configuration? On Fri, Apr 6, 2012 at 9:33 AM, Jamie Johnson jej2...@gmail.com wrote: I just downloaded the latest snapshot and fired it up to take a look around and I'm getting the following error when looking at the Cloud view. Loading of undefined failed with HTTP-Status 404 The request I see going out is as follows http://localhost:8501/solr/slice1_shard1/zookeeper?wt=json this doesn't work but this does http://localhost:8501/solr/zookeeper?wt=json Any thoughts why this would happen?
Re: SolrEntityProcessor Configuration Problem
The SolrEntityProcessor resolves all of its parameters at start time, not for each query. This technique cannot work. I filed it: https://issues.apache.org/jira/browse/SOLR-3336

On Fri, Apr 6, 2012 at 11:13 AM, michael.k...@basf.com wrote: Dear all, I'm facing a problem with SolrEntityProcessor when it is configured under a JDBC datasource. My configuration looks like this:

<entity name="V_MARKET_STUDIES" datasource="jdbc-2"
        query="select * from V_MARKET_STUDIES" transformer="ClobTransformer">
  <field column="ID" name="id" />
  <field column="TYPE" name="type" />
  <field column="LOCALE" name="locale" />
  <field column="TITLE" name="title" />
  <field column="KEYWORDS" name="keywords" clob="true" />
  <field column="TOPICS" name="topics" />
  <field column="EXTENDED_KEYWORDS" name="extended_keywords" clob="true" />
  <field column="PUBLICATION_DATE" name="publication_date" />
  <field column="OWNER" name="owner" />
  <field column="DL_FILE_ENTRY_ID" name="dl_file_entry_id" />
  <field column="DL_FILE_VERSION_ID" name="dl_file_version_id" />
  <field column="DL_FOLDER_ID" name="dl_folder_id" />
  <field column="FILE_NAME" name="file_name" />
  <field column="EXTENSION" name="extension" />
  <field column="URL_LINK" name="urllink" />
  <entity name="sep" processor="SolrEntityProcessor" fl="content"
          url="http://vmcenter120:8983/solr/"
          query="folderId:${V_MARKET_STUDIES.DL_FOLDER_ID}"
          fq="entryClassPK:${V_MARKET_STUDIES.DL_FILE_ENTRY_ID}">
    <field column="content" name="content" />
  </entity>
</entity>

I have 6 rows in the Oracle database, but only the first row is processed correctly, meaning the second Solr is queried and the results go into the document; the remaining 5 rows are processed without querying the second Solr and therefore don't have the content field filled. Any suggestions? Did I configure something wrong, or misunderstand something? Thanks for your help. Best regards, Michael

-- Lance Norskog goks...@gmail.com
Re: schema design question
(albums:query OR tracks:query) AND NOT(tracks:query - albums:query) Is this it? That last clause does sound like a join. How do you shard? Is it possible to put all associated albums and tracks in one shard? You can then do a join query against each shard and merge the output yourself.

On Fri, Apr 6, 2012 at 9:59 AM, Neal Tucker ntuc...@august20th.com wrote: Thanks, but I don't want to exclude all tracks that are associated with albums; I want to exclude tracks that are associated with albums *which match the query* (tracks and their associated albums may have different tags). I don't think your suggestion covers that. On Fri, Apr 6, 2012 at 9:35 AM, Erick Erickson erickerick...@gmail.com wrote: I'd consider a field like associated_with_album, and a field that identifies the kind of record this is, track or album. Then you can form a query like -associated_with_album:true (where '-' is the Lucene NOT). And then group by kind to get separate groups of albums and tracks. Hope this helps. Erick On Thu, Apr 5, 2012 at 9:00 PM, N. Tucker ntucker-ml-solr-us...@august20th.com wrote: Apologies if this is a very straightforward schema design problem that should be fairly obvious, but I'm not seeing a good way to do it. Let's say I have an index that wants to model Albums and Tracks, and they all have arbitrary tags attached to them (represented by multivalue string type fields). Tracks also have an album id field which can be used to associate them with an album. I'd like to perform a query which shows both Track and Album results, but suppresses Tracks that are associated with Albums in the result set. I am tempted to use a join here, but I have reservations because it is my understanding that joins cannot work across shards, and I'm not sure it's a good idea to limit myself in that way if possible. Any suggestions? Is there a standard solution to this type of problem where you've got hierarchical items and you don't want children shown in the same result as the parent?

-- Lance Norskog goks...@gmail.com
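On trunk (4.0), the join parser from SOLR-2272 can express this within a single core; a rough, untested sketch assuming fields named kind, id, album_id, and tags:

q=tags:rock
fq=kind:album OR (+kind:track -_query_:"{!join from=id to=album_id}+kind:album +tags:rock")

The join clause matches tracks whose album_id equals the id of an album that itself matches the query, so those tracks are excluded while orphan matching tracks survive. As the original question notes, this does not distribute across shards, which is why co-locating an album with its tracks in one shard matters.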
Re: solr analysis-extras configuration
Tomcat needs an explicit parameter somewhere to use UTF-8 text; the wiki explains how to set it. On Fri, Apr 6, 2012 at 4:41 PM, N. Tucker ntucker-ml-solr-us...@august20th.com wrote: Further info: I can make this work if I stay out of Tomcat -- I download a fresh Solr binary distro, copy those five JARs from 'dist' and 'contrib' into example/solr/lib/, copy in my solrconfig.xml and schema.xml, and run 'java -jar start.jar', and it works fine. But adding those same JARs to my Tomcat instance's solrhome/lib doesn't work. Any ideas on how to troubleshoot? On Fri, Apr 6, 2012 at 12:15 PM, N. Tucker ntucker-ml-solr-us...@august20th.com wrote: Hello, I'm running into an odd problem trying to use ICUTokenizer under a Solr installation running under Tomcat on Ubuntu. It seems that all the appropriate JAR files are loaded:

INFO: Adding 'file:/usr/share/solr/lib/lucene-stempel-3.5.0.jar' to classloader
INFO: Adding 'file:/usr/share/solr/lib/lucene-smartcn-3.5.0.jar' to classloader
INFO: Adding 'file:/usr/share/solr/lib/icu4j-4_8_1_1.jar' to classloader
INFO: Adding 'file:/usr/share/solr/lib/lucene-icu-3.5.0.jar' to classloader
INFO: Adding 'file:/usr/share/solr/lib/apache-solr-analysis-extras-3.5.0.jar' to classloader

... but later: ... SEVERE: java.lang.NoClassDefFoundError: org/apache/lucene/analysis/icu/segmentation/ICUTokenizer

I'm not too clear on the correct way to add the contrib bits other than copying them into the 'lib' directory under solrhome. They are obviously found there (and I have verified that ICUTokenizer is in lucene-icu-3.5.0.jar), but there's still a problem loading the ICUTokenizer class. Any tips on troubleshooting this? Are there more dependencies that I'm unaware of?

-- Lance Norskog goks...@gmail.com