Re: solr 1.4 highlighting issue
Koji, This looks strange to me, because I would assume that the highlighter applies the same boolean logic as the query parser. In that way of thinking, drilling should only be highlighted if ships occurred in the same document, which wasn't the case in the example. Dmitry On Wed, Sep 14, 2011 at 2:20 PM, Koji Sekiguchi k...@r.email.ne.jp wrote: (11/09/14 15:54), Dmitry Kan wrote: Hello list, Not sure how many of you are still using Solr 1.4 in production, but here is an issue with highlighting that we've noticed: The query is: (drill AND ships) OR rigs Excerpt from the highlighting list: <arr name="Contents"><str>Within the fleet of 27 floating <em>rigs</em> (semisubmersibles and drillships) are 21 deepwater <em>drilling</em></str></arr></lst> Why did Solr highlight drilling even though there is no ships in the text? Dmitry, This is expected, even if you use the latest version of Solr. You got the document because rigs was hit in the document, but then the Highlighter tries to search the individual terms of the query in the document again. koji -- Check out Query Log Visualizer for Apache Solr http://www.rondhuit-demo.com/loganalyzer/loganalyzer.html http://www.rondhuit.com/en/ -- Regards, Dmitry Kan
Re: solr 1.4 highlighting issue
Hi Mike, Actually, the example I gave is the whole document in this case. So there was no ships, only drilling. Dmitry On Wed, Sep 14, 2011 at 1:59 PM, Michael Sokolov soko...@ifactory.com wrote: The highlighter gives you snippets of text surrounding words (terms) drawn from the query. The whole document should satisfy the query (i.e. it probably has ship(s) somewhere else in it), but each snippet won't generally have all the terms. -Mike On 9/14/2011 2:54 AM, Dmitry Kan wrote: Hello list, Not sure how many of you are still using Solr 1.4 in production, but here is an issue with highlighting that we've noticed: The query is: (drill AND ships) OR rigs Excerpt from the highlighting list: <arr name="Contents"><str>Within the fleet of 27 floating <em>rigs</em> (semisubmersibles and drillships) are 21 deepwater <em>drilling</em></str></arr></lst> Why did Solr highlight drilling even though there is no ships in the text? -- Regards, Dmitry Kan -- Regards, Dmitry Kan
Re: math with date and modulo
Okay, thanks a lot. I guess it isn't possible to get the month in my case =( I will try out another way. - --- System One Server, 12 GB RAM, 2 Solr Instances, 8 Cores, 1 Core with 45 Million Documents other Cores 200.000 - Solr1 for Search-Requests - commit every Minute - 5GB Xmx - Solr2 for Update-Request - delta every Minute - 4GB Xmx -- View this message in context: http://lucene.472066.n3.nabble.com/math-with-date-and-modulo-tp3335800p3338207.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Norms - scoring issue
It seems that the fieldNorm difference is coming from the field named 'text', and you didn't include the definition of the text field. Did you omit norms for that field too? By the way, I see that you have store=true in some places but it should be store*d*=true. --- On Wed, 9/14/11, Adolfo Castro Menna adolfo.castrome...@gmail.com wrote: From: Adolfo Castro Menna adolfo.castrome...@gmail.com Subject: Norms - scoring issue To: solr-user@lucene.apache.org Date: Wednesday, September 14, 2011, 11:13 PM Hi All, I hope someone can shed some light on the issue I'm facing with Solr 3.1.0. It looks like it's computing different fieldNorm values despite my configuration, which aims to ignore them. <field name="item_name" type="textgen" indexed="true" store="true" omitNorms="true" omitTermFrequencyAndPositions="true" /> <field name="item_description" type="textTight" indexed="true" store="true" omitNorms="true" omitTermFrequencyAndPositions="true" /> <field name="item_tags" type="text" indexed="true" stored="true" multiValued="true" omitNorms="true" omitTermFrequencyAndPositions="true" /> I also have a custom class that extends DefaultSimilarity to override the idf method. Query: <str name="q">item_name:octopus seafood OR item_description:octopus seafood OR item_tags:octopus seafood</str> <str name="sort">score desc,item_ranking desc</str> The first 2 results are: <doc> <float name="score">0.5217492</float> <str name="item_name">Grilled Octopus</str> <arr name="item_tags"><str>Seafood, tapas</str></arr> </doc> <doc> <float name="score">0.49379835</float> <str name="item_name">octopus marisco</str> <arr name="item_tags"><str>Appetizer, Mexican, Seafood, food</str></arr> </doc> Does anyone know why they get a different score? I'm expecting them to have the same score because both matched the two search terms. I checked the debug information and it seems that the difference involves the fieldNorm values.
1) Grilled Octopus 0.52174926 = (MATCH) product of: 0.7826238 = (MATCH) sum of: 0.4472136 = (MATCH) weight(item_name:octopus in 69), product of: 0.4472136 = queryWeight(item_name:octopus), product of: 1.0 = idf(docFreq=2, maxDocs=449) 0.4472136 = queryNorm 1.0 = (MATCH) fieldWeight(item_name:octopus in 69), product of: 1.0 = tf(termFreq(item_name:octopus)=1) 1.0 = idf(docFreq=2, maxDocs=449) 1.0 = fieldNorm(field=item_name, doc=69) 0.1118034 = (MATCH) weight(text:seafood in 69), product of: 0.4472136 = queryWeight(text:seafood), product of: 1.0 = idf(docFreq=8, maxDocs=449) 0.4472136 = queryNorm 0.25 = (MATCH) fieldWeight(text:seafood in 69), product of: 1.0 = tf(termFreq(text:seafood)=1) 1.0 = idf(docFreq=8, maxDocs=449) 0.25 = fieldNorm(field=text, doc=69) 0.1118034 = (MATCH) weight(text:seafood in 69), product of: 0.4472136 = queryWeight(text:seafood), product of: 1.0 = idf(docFreq=8, maxDocs=449) 0.4472136 = queryNorm 0.25 = (MATCH) fieldWeight(text:seafood in 69), product of: 1.0 = tf(termFreq(text:seafood)=1) 1.0 = idf(docFreq=8, maxDocs=449) 0.25 = fieldNorm(field=text, doc=69) 0.1118034 = (MATCH) weight(text:seafood in 69), product of: 0.4472136 = queryWeight(text:seafood), product of: 1.0 = idf(docFreq=8, maxDocs=449) 0.4472136 = queryNorm 0.25 = (MATCH) fieldWeight(text:seafood in 69), product of: 1.0 = tf(termFreq(text:seafood)=1) 1.0 = idf(docFreq=8, maxDocs=449) 0.25 = fieldNorm(field=text, doc=69) 0.667 = coord(4/6) 2) octopus marisco 0.49379835 = (MATCH) product of: 0.7406975 = (MATCH) sum of: 0.4472136 = (MATCH) weight(item_name:octopus in 81), product of: 0.4472136 = queryWeight(item_name:octopus), product of: 1.0 = idf(docFreq=2, maxDocs=449) 0.4472136 = queryNorm 1.0 = (MATCH) fieldWeight(item_name:octopus in 81), product of: 1.0 = tf(termFreq(item_name:octopus)=1) 1.0 = idf(docFreq=2, maxDocs=449) 1.0 = fieldNorm(field=item_name, doc=81) 0.09782797 = (MATCH) weight(text:seafood in 81), product of: 0.4472136 = queryWeight(text:seafood), product of: 1.0 = idf(docFreq=8, maxDocs=449) 0.4472136 = queryNorm 0.21875 = (MATCH) fieldWeight(text:seafood in 81), product of: 1.0 = tf(termFreq(text:seafood)=1) 1.0 = idf(docFreq=8, maxDocs=449) 0.21875 = fieldNorm(field=text, doc=81) 0.09782797 = (MATCH) weight(text:seafood in 81), product of: 0.4472136 = queryWeight(text:seafood), product of: 1.0 = idf(docFreq=8, maxDocs=449) 0.4472136 = queryNorm 0.21875 = (MATCH) fieldWeight(text:seafood in 81), product of: 1.0 = tf(termFreq(text:seafood)=1) 1.0 = idf(docFreq=8, maxDocs=449) 0.21875 = fieldNorm(field=text, doc=81)
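A minimal sketch of the kind of change being suggested in this thread, assuming the catch-all field is literally named text and is filled via copyField (names and types here are illustrative, not taken from Adolfo's actual schema):

    <!-- catch-all default search field; omitting norms removes the length normalization
         that produced the differing fieldNorm values in the explain output above -->
    <field name="text" type="textgen" indexed="true" stored="false"
           multiValued="true" omitNorms="true" />
    <copyField source="item_name" dest="text" />
    <copyField source="item_tags" dest="text" />

As noted later in the thread, setting omitNorms="true" on the fieldType definition works as well.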
Re: Out of memory
Hello, Since you use caching, you can monitor the eviction parameter on the solr admin page (http://localhost:port/solr/admin/stats.jsp#cache). If it is non zero, the cache can be made e.g. bigger. queryResultWindowSize=50 in my case. Not sure, if solr 3.1 supports, but in 1.4 I have: HashDocSet maxSize=1000 loadFactor=0.75/ Does the OOM happen on update/commit or search? Dmitry On Wed, Sep 14, 2011 at 2:47 PM, Rohit ro...@in-rev.com wrote: Thanks Dmirty for the offer to help, I am using some caching in one of the cores not. Earlier I was using on other cores too, but now I have commented them out because of frequent OOM, also some warming up in one of the core. I have share the links for my config files for all the 4 cores, http://haklus.com/crssConfig.xml http://haklus.com/rssConfig.xml http://haklus.com/twitterConfig.xml http://haklus.com/facebookConfig.xml Thanks again Rohit -Original Message- From: Dmitry Kan [mailto:dmitry@gmail.com] Sent: 14 September 2011 10:23 To: solr-user@lucene.apache.org Subject: Re: Out of memory Hi, OK 64GB fits into one shard quite nicely in our setup. But I have never used multicore setup. In total you have 79,9 GB. We try to have 70-100GB per shard with caching on. Do you do warming up of your index on starting? Also, there was a setting of pre-populating the cache. It could also help, if you can show some parts of your solrconfig file. What is the solr version you use? Regards, Dmitry On Wed, Sep 14, 2011 at 11:38 AM, Rohit ro...@in-rev.com wrote: Hi Dimtry, To answer your questions, -Do you use caching? I do user caching, but will disable it and give it a go. -How big is your index in size on the disk? These are the size of the data folder for each of the cores. Core1 : 64GB Core2 : 6.1GB Core3 : 7.9GB Core4 : 1.9GB Will try attaching a jconsole to my solr as suggested to get a better picture. Regards, Rohit -Original Message- From: Dmitry Kan [mailto:dmitry@gmail.com] Sent: 14 September 2011 08:15 To: solr-user@lucene.apache.org Subject: Re: Out of memory Hi Rohit, Do you use caching? How big is your index in size on the disk? What is the stack trace contents? The OOM problems that we have seen so far were related to the index physical size and usage of caching. I don't think we have ever found the exact cause of these problems, but sharding has helped to keep each index relatively small and OOM have gone away. You can also attach jconsole onto your SOLR via the jmx and monitor the memory / cpu usage in a graphical interface. I have also run garbage collector manually through jconsole sometimes and it was of a help. Regards, Dmitry On Wed, Sep 14, 2011 at 9:10 AM, Rohit ro...@in-rev.com wrote: Thanks Jaeger. Actually I am storing twitter streaming data into the core, so the rate of index is about 12tweets(docs)/second. The same solr contains 3 other cores but these cores are not very heavy. Now the twitter core has become very large (77516851) and its taking a long time to query (Mostly facet queries based on date, string fields). After sometime about 18-20hr solr goes out of memory, the thread dump doesn't show anything. How can I improve this besides adding more ram into the system. Regards, Rohit Mobile: +91-9901768202 About Me: http://about.me/rohitg -Original Message- From: Jaeger, Jay - DOT [mailto:jay.jae...@dot.wi.gov] Sent: 13 September 2011 21:06 To: solr-user@lucene.apache.org Subject: RE: Out of memory numDocs is not the number of documents in memory. It is the number of documents currently in the index (which is kept on disk). 
Same goes for maxDocs, except that it is a count of all of the documents that have ever been in the index since it was created or optimized (including deleted documents). Your subject indicates that something is giving you some kind of Out of memory error. We might better be able to help you if you provide more information about your exact problem. JRJ -Original Message- From: Rohit [mailto:ro...@in-rev.com] Sent: Tuesday, September 13, 2011 2:29 PM To: solr-user@lucene.apache.org Subject: Out of memory I have solr running on a machine with 18Gb Ram , with 4 cores. One of the core is very big containing 77516851 docs, the stats for searcher given below searcherName : Searcher@5a578998 main caching : true numDocs : 77516851 maxDoc : 77518729 lockFactory=org.apache.lucene.store.NativeFSLockFactory@5a9c5842 indexVersion : 1308817281798 openedAt : Tue Sep 13 18:59:52 GMT 2011 registeredAt : Tue Sep 13 19:00:55 GMT 2011 warmupTime : 63139 . Is there a way to reduce the number of docs loaded into memory for this core? . At any given
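As a rough illustration of the settings mentioned above, the relevant solrconfig.xml fragment (1.4/3.x style) looks like this; the sizes are placeholders, and the eviction counts on the stats page tell you whether they are too small:

    <query>
      <filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="128"/>
      <queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="128"/>
      <documentCache class="solr.LRUCache" size="512" initialSize="512"/>
      <queryResultWindowSize>50</queryResultWindowSize>
      <HashDocSet maxSize="1000" loadFactor="0.75"/>
    </query>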
Re: Terms.regex performance issue
Hi, I have the same problem; I am looking for infix autocomplete. Could you elaborate a bit on your QueryConverter/Suggester solution? Thank you! -- View this message in context: http://lucene.472066.n3.nabble.com/Terms-regex-performance-issue-tp3268994p3338273.html Sent from the Solr - User mailing list archive at Nabble.com.
RE: Out of memory
It's happening more in search and search has become very slow particularly on the core with 69GB index data. Regards, Rohit -Original Message- From: Dmitry Kan [mailto:dmitry@gmail.com] Sent: 15 September 2011 07:51 To: solr-user@lucene.apache.org Subject: Re: Out of memory Hello, Since you use caching, you can monitor the eviction parameter on the solr admin page (http://localhost:port/solr/admin/stats.jsp#cache). If it is non zero, the cache can be made e.g. bigger. queryResultWindowSize=50 in my case. Not sure, if solr 3.1 supports, but in 1.4 I have: HashDocSet maxSize=1000 loadFactor=0.75/ Does the OOM happen on update/commit or search? Dmitry On Wed, Sep 14, 2011 at 2:47 PM, Rohit ro...@in-rev.com wrote: Thanks Dmirty for the offer to help, I am using some caching in one of the cores not. Earlier I was using on other cores too, but now I have commented them out because of frequent OOM, also some warming up in one of the core. I have share the links for my config files for all the 4 cores, http://haklus.com/crssConfig.xml http://haklus.com/rssConfig.xml http://haklus.com/twitterConfig.xml http://haklus.com/facebookConfig.xml Thanks again Rohit -Original Message- From: Dmitry Kan [mailto:dmitry@gmail.com] Sent: 14 September 2011 10:23 To: solr-user@lucene.apache.org Subject: Re: Out of memory Hi, OK 64GB fits into one shard quite nicely in our setup. But I have never used multicore setup. In total you have 79,9 GB. We try to have 70-100GB per shard with caching on. Do you do warming up of your index on starting? Also, there was a setting of pre-populating the cache. It could also help, if you can show some parts of your solrconfig file. What is the solr version you use? Regards, Dmitry On Wed, Sep 14, 2011 at 11:38 AM, Rohit ro...@in-rev.com wrote: Hi Dimtry, To answer your questions, -Do you use caching? I do user caching, but will disable it and give it a go. -How big is your index in size on the disk? These are the size of the data folder for each of the cores. Core1 : 64GB Core2 : 6.1GB Core3 : 7.9GB Core4 : 1.9GB Will try attaching a jconsole to my solr as suggested to get a better picture. Regards, Rohit -Original Message- From: Dmitry Kan [mailto:dmitry@gmail.com] Sent: 14 September 2011 08:15 To: solr-user@lucene.apache.org Subject: Re: Out of memory Hi Rohit, Do you use caching? How big is your index in size on the disk? What is the stack trace contents? The OOM problems that we have seen so far were related to the index physical size and usage of caching. I don't think we have ever found the exact cause of these problems, but sharding has helped to keep each index relatively small and OOM have gone away. You can also attach jconsole onto your SOLR via the jmx and monitor the memory / cpu usage in a graphical interface. I have also run garbage collector manually through jconsole sometimes and it was of a help. Regards, Dmitry On Wed, Sep 14, 2011 at 9:10 AM, Rohit ro...@in-rev.com wrote: Thanks Jaeger. Actually I am storing twitter streaming data into the core, so the rate of index is about 12tweets(docs)/second. The same solr contains 3 other cores but these cores are not very heavy. Now the twitter core has become very large (77516851) and its taking a long time to query (Mostly facet queries based on date, string fields). After sometime about 18-20hr solr goes out of memory, the thread dump doesn't show anything. How can I improve this besides adding more ram into the system. 
Regards, Rohit Mobile: +91-9901768202 About Me: http://about.me/rohitg -Original Message- From: Jaeger, Jay - DOT [mailto:jay.jae...@dot.wi.gov] Sent: 13 September 2011 21:06 To: solr-user@lucene.apache.org Subject: RE: Out of memory numDocs is not the number of documents in memory. It is the number of documents currently in the index (which is kept on disk). Same goes for maxDocs, except that it is a count of all of the documents that have ever been in the index since it was created or optimized (including deleted documents). Your subject indicates that something is giving you some kind of Out of memory error. We might better be able to help you if you provide more information about your exact problem. JRJ -Original Message- From: Rohit [mailto:ro...@in-rev.com] Sent: Tuesday, September 13, 2011 2:29 PM To: solr-user@lucene.apache.org Subject: Out of memory I have solr running on a machine with 18Gb Ram , with 4 cores. One of the core is very big containing 77516851 docs, the stats for searcher given below searcherName : Searcher@5a578998 main caching : true numDocs : 77516851 maxDoc : 77518729
Re: why we need the index information in a database ?
On Thu, Sep 15, 2011 at 2:53 PM, kiran.bodigam kiran.bodi...@gmail.com wrote: why we need the index information in a database is because it is clusterable. In other words, we may have/need more than one instance of the SOLR engine running. [...] Not sure if you are after multiple instances that replicate between each other, or a solution that scales on demand. Both are possible: Please see, e.g., http://wiki.apache.org/solr/SolrReplication http://wiki.apache.org/solr/SolrCloud If you could explain details of what you want, people might be better able to advise you. As people have pointed out, putting Solr's index into a database makes no sense, and will almost certainly never be officially supported. Regards, Gora
Re: Out of memory
If you have many users you could scale vertically, i.e. do replication. Buf before that you could do sharding, for example by indexing entries based on a hash function. Let's say split 69GB to two shards first and experiment with it. Regards, Dmitry On Thu, Sep 15, 2011 at 12:22 PM, Rohit ro...@in-rev.com wrote: It's happening more in search and search has become very slow particularly on the core with 69GB index data. Regards, Rohit -Original Message- From: Dmitry Kan [mailto:dmitry@gmail.com] Sent: 15 September 2011 07:51 To: solr-user@lucene.apache.org Subject: Re: Out of memory Hello, Since you use caching, you can monitor the eviction parameter on the solr admin page (http://localhost:port/solr/admin/stats.jsp#cache). If it is non zero, the cache can be made e.g. bigger. queryResultWindowSize=50 in my case. Not sure, if solr 3.1 supports, but in 1.4 I have: HashDocSet maxSize=1000 loadFactor=0.75/ Does the OOM happen on update/commit or search? Dmitry On Wed, Sep 14, 2011 at 2:47 PM, Rohit ro...@in-rev.com wrote: Thanks Dmirty for the offer to help, I am using some caching in one of the cores not. Earlier I was using on other cores too, but now I have commented them out because of frequent OOM, also some warming up in one of the core. I have share the links for my config files for all the 4 cores, http://haklus.com/crssConfig.xml http://haklus.com/rssConfig.xml http://haklus.com/twitterConfig.xml http://haklus.com/facebookConfig.xml Thanks again Rohit -Original Message- From: Dmitry Kan [mailto:dmitry@gmail.com] Sent: 14 September 2011 10:23 To: solr-user@lucene.apache.org Subject: Re: Out of memory Hi, OK 64GB fits into one shard quite nicely in our setup. But I have never used multicore setup. In total you have 79,9 GB. We try to have 70-100GB per shard with caching on. Do you do warming up of your index on starting? Also, there was a setting of pre-populating the cache. It could also help, if you can show some parts of your solrconfig file. What is the solr version you use? Regards, Dmitry On Wed, Sep 14, 2011 at 11:38 AM, Rohit ro...@in-rev.com wrote: Hi Dimtry, To answer your questions, -Do you use caching? I do user caching, but will disable it and give it a go. -How big is your index in size on the disk? These are the size of the data folder for each of the cores. Core1 : 64GB Core2 : 6.1GB Core3 : 7.9GB Core4 : 1.9GB Will try attaching a jconsole to my solr as suggested to get a better picture. Regards, Rohit -Original Message- From: Dmitry Kan [mailto:dmitry@gmail.com] Sent: 14 September 2011 08:15 To: solr-user@lucene.apache.org Subject: Re: Out of memory Hi Rohit, Do you use caching? How big is your index in size on the disk? What is the stack trace contents? The OOM problems that we have seen so far were related to the index physical size and usage of caching. I don't think we have ever found the exact cause of these problems, but sharding has helped to keep each index relatively small and OOM have gone away. You can also attach jconsole onto your SOLR via the jmx and monitor the memory / cpu usage in a graphical interface. I have also run garbage collector manually through jconsole sometimes and it was of a help. Regards, Dmitry On Wed, Sep 14, 2011 at 9:10 AM, Rohit ro...@in-rev.com wrote: Thanks Jaeger. Actually I am storing twitter streaming data into the core, so the rate of index is about 12tweets(docs)/second. The same solr contains 3 other cores but these cores are not very heavy. 
Now the twitter core has become very large (77516851) and its taking a long time to query (Mostly facet queries based on date, string fields). After sometime about 18-20hr solr goes out of memory, the thread dump doesn't show anything. How can I improve this besides adding more ram into the system. Regards, Rohit Mobile: +91-9901768202 About Me: http://about.me/rohitg -Original Message- From: Jaeger, Jay - DOT [mailto:jay.jae...@dot.wi.gov] Sent: 13 September 2011 21:06 To: solr-user@lucene.apache.org Subject: RE: Out of memory numDocs is not the number of documents in memory. It is the number of documents currently in the index (which is kept on disk). Same goes for maxDocs, except that it is a count of all of the documents that have ever been in the index since it was created or optimized (including deleted documents). Your subject indicates that something is giving you some kind of Out of memory error. We might better be able to help you if you provide more information about your exact problem. JRJ -Original Message-
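A minimal sketch of the hash-based routing idea mentioned above, assuming the split is done by the client at index time and the shards are queried with the standard shards parameter (shard layout and names are illustrative):

    // Route each document to one of N cores by hashing its unique key.
    static int shardFor(String uniqueKey, int shardCount) {
        return (uniqueKey.hashCode() & 0x7fffffff) % shardCount;
    }

    // Query time: any one core can aggregate results across both shards, e.g.
    // http://host:8983/solr/shard0/select?q=text:foo
    //   &shards=host:8983/solr/shard0,host:8983/solr/shard1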
Re: Count rows with tokens
Facet indexing is a good solution for me :) Thanks for your help! -- View this message in context: http://lucene.472066.n3.nabble.com/Count-rows-with-tokens-tp3274643p3338556.html Sent from the Solr - User mailing list archive at Nabble.com.
can we share the same index directory for multiple cores?
If we implement the multi-core functionality in Solr, is there any possibility that the same index information can be shared by two different cores (redundancy)? In other words, can we share the same index directory between multiple cores? If I query from the admin page, which core will respond? The documentation suggests querying a specific core, e.g. http://localhost:8983/solr/core0/select?q=*:*, but I don't want to do this. I would like to know how the multi-core functionality works. -- View this message in context: http://lucene.472066.n3.nabble.com/can-we-share-the-same-index-directory-for-multiple-cores-tp3338571p3338571.html Sent from the Solr - User mailing list archive at Nabble.com.
Distinct elements in a field
Simple question: I want to know how many distinct values I have in a field, among the documents that match a query. Do you know if there's a way to do it today in 3.4? I saw SOLR-1814 and SOLR-2242. SOLR-1814 seems fairly easy to use. What do you think? Thank you
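One way to get at this in 3.x without those patches, assuming the field's cardinality is low enough for faceting to be practical: facet on the field, restricted by the query, and count the returned entries (field and query names are placeholders):

    http://localhost:8983/solr/select?q=your_query&rows=0&facet=true&facet.field=your_field&facet.limit=-1&facet.mincount=1

The number of facet entries returned is the number of distinct values among the matching documents; for very high-cardinality fields this gets expensive, which is exactly what SOLR-1814 and SOLR-2242 aim to address.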
Delete documents with empty fields
I want to delete all documents with an empty title field. If I run the query -title:[* TO *] I obtain the correct list of documents, but when I submit the delete command to Solr: curl http://localhost:8080/solr/web/update\?commit=true -H 'Content-Type: text/xml' --data-binary \ '<delete><query>-title:[* TO *]</query></delete>' none of the documents are deleted. After a bit of debugging I have noted that the query is internally rewritten by org.apache.lucene.search.Searcher.createNormalizedWeight to an empty query. Is it a bug, or is there another way to do this operation? (Or is there no way?) Regards Massimo
Re: Solandra - select query error
Hi Jake, I was reproduce example of my error (commit release 3408a30): 1. I have used schema.xml from reuters-demo, with my fields definition: . fields field name=id type=long indexed=true stored=true required=true / field name=text type=text indexed=true stored=true termPositions=true/ field name=doma_type type=long indexed=true stored=true required=true / field name=sentiment_type type=long indexed=true stored=true required=true / field name=date_check type=long indexed=true stored=true required=true / /fields uniqueKeyid/uniqueKey defaultSearchFieldtext/defaultSearchField . 2. I have populated 1 items with two iterations (5000) to index sampleIndex.sub 3. Then I execute many selects to above index sampleIndex.sub with many combinations queries: QueryResponse r = client.query(combination query); - doma_type:(2) AND sentiment_type:(1) AND text:(piwo nie może) -- combination query - ERROR - doma_type:(2 1) AND sentiment_type:(1) AND text:(piwo nie może) - doma_type:(3 2 1) AND sentiment_type:(1) AND text:(piwo nie może) - doma_type:(3 2 1) AND sentiment_type:(1) AND text:(może) - doma_type:(3 2 1) AND sentiment_type:(1) AND text:(piwo nie) - etc. (all combinations of numbers 1 2 3 and words piwo nie może) 4. In results, I have received error for a combination query (one from above). The error combination query is not repeatable. This error does not always occur. If error does not occur, then try above steps again (selects should be performed immediately after index/write data). I may have a bad configuration for this situation (I have standard configuration). MY CONSOLE ERROR: org.apache.solr.client.solrj.SolrServerException: Error executing query at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:95) at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:118) at com.its.bt.solandra.dao.PostDao.countPosts(PostDao.java:383) at com.its.bt.solandra.dao.PostDao.countPostsByTokens(PostDao.java:338) at com.its.bt.solandra.dao.ProjectDao.main(ProjectDao.java:42) Caused by: org.apache.solr.common.SolrException: 4 java.lang.ArrayIndexOutOfBoundsException: 4 at org.apache.lucene.search.BooleanScorer2.score(BooleanScorer2.java:310) at org.apache.lucene.search.ConjunctionScorer.score(ConjunctionScorer.java:135) at org.apache.lucene.search.BooleanScorer2$2.score(BooleanScorer2.java:182) at org.apache.lucene.search.BooleanScorer2.score(BooleanScorer2.java:309) at org.apache.lucene.search.TopScoreDocCollector$InOrderTopScoreDocCollector.collect(TopScoreDocCollector.java:47) at org.apache.lucene.search.BooleanScorer2.score(BooleanScorer2.java:281) at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:526) at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:320) at org.apache.solr.search.SolrIndexSearcher.getDocListNC(SolrIndexSearcher.java:1178) at org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:1066) at org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:358) at org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:258) at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:194) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1368)at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356) at solandra.SolandraDispatchFilter.execute(SolandraDispatchFilter.java:171) at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252) at solandra.SolandraDispatchFilter.doFilter(SolandraDispatchFilter.java:137) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399) at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182) at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766) at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152) at org.mortbay.jetty.Server.handle(Server.ja 4 java.lang.ArrayIndexOutOfBoundsException: 4 at org.apache.lucene.search.BooleanScorer2.score(BooleanScorer2.java:310) at org.apache.lucene.search.ConjunctionScorer.score(ConjunctionScorer.java:135) at org.apache.lucene.search.BooleanScorer2$2.score(BooleanScorer2.java:182) at org.apache.lucene.search.BooleanScorer2.score(BooleanScorer2.java:309) at org.apache.lucene.search.TopScoreDocCollector$InOrderTopScoreDocCollector.collect(TopScoreDocCollector.java:47) at org.apache.lucene.search.BooleanScorer2.score(BooleanScorer2.java:281) at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:526) at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:320) at
Re: Delete documents with empty fields
I want to delete all documents with an empty title field. If I run the query -title:[* TO *] I obtain the correct list of documents, but when I submit the delete command to Solr: curl http://localhost:8080/solr/web/update\?commit=true -H 'Content-Type: text/xml' --data-binary \ '<delete><query>-title:[* TO *]</query></delete>' none of the documents are deleted. After a bit of debugging I have noted that the query is internally rewritten by org.apache.lucene.search.Searcher.createNormalizedWeight to an empty query. Is it a bug, or is there another way to do this operation? (Or is there no way?) Not sure, but '<delete><query>+*:* -title:[* TO *]</query></delete>' may do the trick.
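For completeness, the full command from the original post with that workaround applied would look roughly like this (same URL and core as in the example above; untested):

    curl 'http://localhost:8080/solr/web/update?commit=true' -H 'Content-Type: text/xml' \
      --data-binary '<delete><query>+*:* -title:[* TO *]</query></delete>'

The leading +*:* gives the purely negative clause a positive set to subtract from, which is why the query no longer rewrites to an empty query.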
Re: indexing data from rich documents - Tika with solr3.1
Maybe this quick script will get you running? http://www.lucidimagination.com/blog/2011/08/31/indexing-rich-files-into-solr-quickly-and-easily/ On Sep 15, 2011, at 00:44, scorpking wrote: Hi Erick Erickson, We have many file formats (doc, ppt, pdf, ...). The purpose of these files is to make their detailed educational content searchable. Because I am new to Solr, maybe I don't understand Apache Tika in enough depth. At the moment I can't index PDF files over HTTP, although indexing a single file works fine. Thanks for your attention. -- View this message in context: http://lucene.472066.n3.nabble.com/indexing-data-from-rich-documents-Tika-with-solr3-1-tp3322555p3337963.html Sent from the Solr - User mailing list archive at Nabble.com.
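For a single local file, the extracting request handler can also be exercised directly with curl; a minimal sketch, assuming the stock /update/extract handler is enabled in solrconfig.xml and the unique key field is named id:

    curl 'http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true' \
      -F 'myfile=@/path/to/file.pdf'

Indexing files fetched over HTTP would need either downloading them first or pulling them in through DIH's TikaEntityProcessor.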
How to write core's name in log
Hi, I have multiple cores in Solr and I want to write the core name to the log through log4j. I've found in SolrException a method called log(Logger log, Throwable e), but when I try to build an Exception it doesn't have the core's name. The Exception message is built in the toStr() method of the SolrException class, so I would need to put the core's name into the Exception message. I'm thinking of adding an MDC variable holding the name of the core, and then using it in the log4j configuration ConversionPattern as %X{core}. The idea is that when Solr receives a request I'll set this variable to the name of the core. But I don't know if it's a good idea or not. Or does a solution already exist for adding the name of the core to the log? Thanks Joan
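A minimal sketch of the MDC idea; the place where you set the value (a servlet filter, a request handler wrapper, etc.) is up to you and is hypothetical here, but the MDC calls and the ConversionPattern are standard log4j:

    import org.apache.log4j.MDC;

    // wherever the request is received and the core is known:
    MDC.put("core", coreName);      // expose the core name to log4j on this thread
    try {
        // process the request; every log line emitted here can include %X{core}
    } finally {
        MDC.remove("core");         // don't leak the value to the next request on this thread
    }

    // log4j.properties (appender name is illustrative):
    // log4j.appender.file.layout.ConversionPattern=%d [%X{core}] %-5p %c - %m%n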
RE: Out of memory
Thanks Dmitry, let me look into sharading concepts. Regards, Rohit Mobile: +91-9901768202 About Me: http://about.me/rohitg -Original Message- From: Dmitry Kan [mailto:dmitry@gmail.com] Sent: 15 September 2011 10:15 To: solr-user@lucene.apache.org Subject: Re: Out of memory If you have many users you could scale vertically, i.e. do replication. Buf before that you could do sharding, for example by indexing entries based on a hash function. Let's say split 69GB to two shards first and experiment with it. Regards, Dmitry On Thu, Sep 15, 2011 at 12:22 PM, Rohit ro...@in-rev.com wrote: It's happening more in search and search has become very slow particularly on the core with 69GB index data. Regards, Rohit -Original Message- From: Dmitry Kan [mailto:dmitry@gmail.com] Sent: 15 September 2011 07:51 To: solr-user@lucene.apache.org Subject: Re: Out of memory Hello, Since you use caching, you can monitor the eviction parameter on the solr admin page (http://localhost:port/solr/admin/stats.jsp#cache). If it is non zero, the cache can be made e.g. bigger. queryResultWindowSize=50 in my case. Not sure, if solr 3.1 supports, but in 1.4 I have: HashDocSet maxSize=1000 loadFactor=0.75/ Does the OOM happen on update/commit or search? Dmitry On Wed, Sep 14, 2011 at 2:47 PM, Rohit ro...@in-rev.com wrote: Thanks Dmirty for the offer to help, I am using some caching in one of the cores not. Earlier I was using on other cores too, but now I have commented them out because of frequent OOM, also some warming up in one of the core. I have share the links for my config files for all the 4 cores, http://haklus.com/crssConfig.xml http://haklus.com/rssConfig.xml http://haklus.com/twitterConfig.xml http://haklus.com/facebookConfig.xml Thanks again Rohit -Original Message- From: Dmitry Kan [mailto:dmitry@gmail.com] Sent: 14 September 2011 10:23 To: solr-user@lucene.apache.org Subject: Re: Out of memory Hi, OK 64GB fits into one shard quite nicely in our setup. But I have never used multicore setup. In total you have 79,9 GB. We try to have 70-100GB per shard with caching on. Do you do warming up of your index on starting? Also, there was a setting of pre-populating the cache. It could also help, if you can show some parts of your solrconfig file. What is the solr version you use? Regards, Dmitry On Wed, Sep 14, 2011 at 11:38 AM, Rohit ro...@in-rev.com wrote: Hi Dimtry, To answer your questions, -Do you use caching? I do user caching, but will disable it and give it a go. -How big is your index in size on the disk? These are the size of the data folder for each of the cores. Core1 : 64GB Core2 : 6.1GB Core3 : 7.9GB Core4 : 1.9GB Will try attaching a jconsole to my solr as suggested to get a better picture. Regards, Rohit -Original Message- From: Dmitry Kan [mailto:dmitry@gmail.com] Sent: 14 September 2011 08:15 To: solr-user@lucene.apache.org Subject: Re: Out of memory Hi Rohit, Do you use caching? How big is your index in size on the disk? What is the stack trace contents? The OOM problems that we have seen so far were related to the index physical size and usage of caching. I don't think we have ever found the exact cause of these problems, but sharding has helped to keep each index relatively small and OOM have gone away. You can also attach jconsole onto your SOLR via the jmx and monitor the memory / cpu usage in a graphical interface. I have also run garbage collector manually through jconsole sometimes and it was of a help. 
Regards, Dmitry On Wed, Sep 14, 2011 at 9:10 AM, Rohit ro...@in-rev.com wrote: Thanks Jaeger. Actually I am storing twitter streaming data into the core, so the rate of index is about 12tweets(docs)/second. The same solr contains 3 other cores but these cores are not very heavy. Now the twitter core has become very large (77516851) and its taking a long time to query (Mostly facet queries based on date, string fields). After sometime about 18-20hr solr goes out of memory, the thread dump doesn't show anything. How can I improve this besides adding more ram into the system. Regards, Rohit Mobile: +91-9901768202 About Me: http://about.me/rohitg -Original Message- From: Jaeger, Jay - DOT [mailto:jay.jae...@dot.wi.gov] Sent: 13 September 2011 21:06 To: solr-user@lucene.apache.org Subject: RE: Out of memory numDocs is not the number of documents in memory. It is the number of documents currently in the index (which is kept on disk). Same goes for maxDocs, except that it is a count of all of the documents that have ever been in the index since it was created or optimized
Replication and ExternalFileField
Hi all, I'm trying to find some good information regarding replication, especially for the ExternalFileField. As I understand it: - the external files must be in the data dir. - replication only replicates data/indexes and possibly confFiles from the conf dir. Does anyone have suggestions or ideas on how this should work? Best regards, Per Osbeck
Re: Replication and ExternalFileField
Perhaps a symlink will do the trick. On Thursday 15 September 2011 14:04:47 Per Osbeck wrote: Hi all, I'm trying to find some good information regarding replication, especially for the ExternalFileField. As I understand it; - the external files must be in data dir. - replication only replicates data/indexes and possibly confFiles from the conf dir. Does anyone have suggestions or ideas on how this should would work? Best regards, Per Osbeck -- Markus Jelsma - CTO - Openindex http://www.linkedin.com/in/markus17 050-8536620 / 06-50258350
RE: Replication and ExternalFileField
Probably would have worked on *nix but unfortunately running Windows. Best regards, Per -Original Message- From: Markus Jelsma [mailto:markus.jel...@openindex.io] Sent: den 15 september 2011 14:07 To: solr-user@lucene.apache.org Subject: Re: Replication and ExternalFileField Perhaps a symlink will do the trick. On Thursday 15 September 2011 14:04:47 Per Osbeck wrote: Hi all, I'm trying to find some good information regarding replication, especially for the ExternalFileField. As I understand it; - the external files must be in data dir. - replication only replicates data/indexes and possibly confFiles from the conf dir. Does anyone have suggestions or ideas on how this should would work? Best regards, Per Osbeck -- Markus Jelsma - CTO - Openindex http://www.linkedin.com/in/markus17 050-8536620 / 06-50258350
Re: Delete documents with empty fields
On 15/09/2011 13:01, Ahmet Arslan wrote: +*:* -title:[* TO *] Worked fine. Thanks a lot! Massimo
Re: Index not getting refreshed
Is it possible you have two Solr instances running off the same index folder? This was a mistake I stumbled into early on - I was writing with one and reading with the other, so I didn't see updates. -Mike On 09/15/2011 12:37 AM, Pawan Darira wrote: I am committing but not doing replication now. My sort order also includes the last login timestamp. The new profiles are being reflected in my Solr admin db, but they are not listed on my website. On Thu, Sep 15, 2011 at 4:25 AM, Chris Hostetter hossman_luc...@fucit.org wrote: : I am using Solr 3.2 on a live website. i get live user's data of about 2000 : per day. I do an incremental index every 8 hours. but my search results : always show the same result with same sorting order. when i check the same Are you committing? Are you using replication? Are you using a sort order that might not make it obvious that the new docs are actually there? (i.e.: sort=timestamp asc) -Hoss
Re: Terms.regex performance issue
Read http://lucene.472066.n3.nabble.com/suggester-issues-td3262718.html for more info about the QueryConverter. IMO the Suggester should make it easier to choose between QueryConverters. As for infix matching, the wiki says it's a planned feature, but the Suggester hasn't been worked on for a couple of months. So I guess we will have to wait :) -- View this message in context: http://lucene.472066.n3.nabble.com/Terms-regex-performance-issue-tp3268994p3338899.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Norms - scoring issue
Hi Ahmet, You're right. It was related to the text field, which is the default search field. I also added omitNorms=true to the fieldType definition and it's now working as expected. Thanks, Adolfo.
Re: OOM issue
Hi Erick, Thanks for the reply. It is very useful for me. For point 1: I do need 10 cores, and the number will keep increasing in future. I have documents that belong to different workspaces, so 1 workspace = 1 core; I can't go with one core. Currently I have 10 cores, but in future the count may go to 40+. For point 2: Currently I have not given it any thought, but yes, I think in future I may have to go for the master/slave setup. For point 3: the current cache size for the document cache, filter cache and query cache is 512 each; the ramBufferSizeMB is 512M. Shall I reduce it to 128M? For point 4: I did not get why I should use SolrJ with Tika. Do you mean sending the new/updated documents to Tika for reindexing? I am already doing that using data-config; I have written the query in data-config in a way that it picks up the paths of updated/new documents. Thanks in advance! Regards, Abhijit Multiple webapps will not help you, they're still using the same underlying memory. In fact, it'll make matters worse since they won't share resources. So the questions become: 1 Why do you have 10 cores? Putting 10 cores on the same machine doesn't really do much. It can make lots of sense to put 10 cores on the same machine for *indexing*, then replicate them out. But putting 10 cores on one machine in hopes of making better use of memory isn't useful. It may be useful to just go to one core. 2 Indexing, reindexing and searching on a single machine is requiring a lot from that machine. Really you should consider having a master/slave setup. 3 But assuming more hardware of any sort isn't in the cards, sure, reduce your cache sizes. Look at ramBufferSizeMB and make it small. 4 Consider indexing with Tika via SolrJ and only sending the finished document to Solr. Best Erick On Mon, Sep 12, 2011 at 5:42 AM, Manish Bafna manish.bafna...@gmail.com wrote: Reducing the number of caches is definitely going to reduce heap usage. Can you run those xlsx files separately with Tika and see if you are getting the OOM issue. On Mon, Sep 12, 2011 at 3:09 PM, abhijit bashetti abhijitbashe...@gmail.com wrote: I am facing the OOM issue. Other than increasing the RAM, can we change some other parameters to avoid the OOM issue, such as minimizing the filter cache size, document cache size etc.? Can you suggest some other options to avoid the OOM issue? Thanks in advance! Regards, Abhijit
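For point 3, the kind of change being discussed would look roughly like this in solrconfig.xml (1.4/3.x style; the value is an example, not a recommendation):

    <indexDefaults>
      <!-- smaller RAM buffer: documents are flushed to disk sooner, lowering peak heap use during indexing -->
      <ramBufferSizeMB>128</ramBufferSizeMB>
    </indexDefaults>

Shrinking the documentCache/filterCache/queryResultCache sizes in the <query> section trades cache hit rate for heap headroom in the same way.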
Re: Performance troubles with solr
Thank you all for your fast replies. Changing photo_id:* to a boolean has_photo field via a transformer when importing data *fixed my problems*, reducing query times to *30~ ms*. I'll try to optimize further based on your advice on filter query usage and the int to tint change (I will read up on it first). On Thu, Sep 15, 2011 at 1:31 AM, Chris Hostetter hossman_luc...@fucit.org wrote: : q=photo_id:* AND gender:true AND country:MALAWI AND online:false photo_id:* does not mean what you probably think it means. You most likely want photo_id:[* TO *] given your current schema, but I would recommend adding a new has_photo boolean field and using that instead. That alone should explain a big part of why those queries would be slow. You didn't describe how your q param varies in your test queries (just your fq). I'm assuming gender and online can vary, and that you sometimes don't use the photo_id clauses, and that the country clause can vary, but that these clauses are always all mandatory. In which case I would suggest using fq for all of them individually, and leaving your q param as *:* (unless you sometimes sort on the actual Solr score, in which case leave it as whatever part of the query you actually want to contribute to the score). Lastly: I don't remember off the top of my head how int and tint are defined in the example solrconfig files, but you should consider your usage of them carefully -- particularly with the precisionStep and which fields you do range queries on. -Hoss
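Spelled out as a request, Hoss's suggestion would look roughly like this (field names from the thread; has_photo is the boolean added at import time):

    http://localhost:8983/solr/select?q=*:*&fq=has_photo:true&fq=gender:true&fq=online:false&fq=country:MALAWI

Each fq clause is cached independently in the filterCache, so recurring combinations of these restrictions are answered from cache instead of being re-scored on every query.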
Re: Schema fieldType y-m-d ?!?!
Thx =) I think I will save this as a string if ranges really work =) - --- System One Server, 12 GB RAM, 2 Solr Instances, 8 Cores, 1 Core with 45 Million Documents other Cores 200.000 - Solr1 for Search-Requests - commit every Minute - 5GB Xmx - Solr2 for Update-Request - delta every Minute - 4GB Xmx -- View this message in context: http://lucene.472066.n3.nabble.com/Schema-fieldType-y-m-d-tp3335359p3339160.html Sent from the Solr - User mailing list archive at Nabble.com.
Multiple shards on same machine find matches but return 0 results.
Hi, Recently we have been trying to scale up our Solandra setup to make use of a more powerful server. To improve query speeds we tried reducing the index size, and thus increasing the number of shards on a single machine. While we had no trouble searching and return results when we had a single shard, with multiple shards we are getting an interesting bug where the search is able to find the matching documents and returns an accurate count, but returns 0 documents. What could be causing this problem? Are we missing an obvious parameter? While using multiple shards, if I set isShard=true, we do get back results with total number found from only one shard. We tried hitting each of the cores directly by using the setParam from SolrJ, but are not getting any results back. We found the names of the core from the logs (ie.10.1.10.200:8983/solandra/df~1). To debug we have set up a test environment that has the latest release of Solandra and uses the default settings. In this setting with just two shards we are seeing the same issue. We have also tested with different size shards ranging from 256 to 4194304. In all cases as soon as we have more than 1 shard Solandra stops returning results even though it is clear the results were found. Below is some of the log information. Server specs: 8 cores 32 GB of memory (though we are only allocating 16GB for Solandra) Using the default settings in Solandra Properties, we added in 2,000,000 documents to ensure there were two shards on the same machine. For a query that has not been cached: DEBUG 10:28:55,004 core: df DEBUG 10:28:55,005 Adding shard(df): 10.1.10.200:8983/solandra/df~0 DEBUG 10:28:55,005 Adding shard(df): 10.1.10.200:8983/solandra/df~1 DEBUG 10:28:55,014 Fetching 0 Docs INFO 10:28:55,015 [df] webapp=/solandra path=/select params={fl=key,scorestart=0q=province:azisShard=truewt=javabinfsv=truerows=10version=2} hits=0 status=0 QTime=3 INFO 10:28:55,821 GC for ParNew: 258 ms, 586012000 reclaimed leaving 2122387984 used; max is 16955473920 DEBUG 10:28:58,034 Fetching 10 Docs DEBUG 10:28:58,035 Going to bulk load 10 documents DEBUG 10:28:58,099 Document read took: 63ms INFO 10:28:58,099 [df] webapp=/solandra path=/select params={fl=key,scorestart=0q=province:azisShard=truewt=javabinfsv=truerows=10version=2} hits=99470 status=0 QTime=3087 DEBUG 10:28:58,101 Document read took: 1ms DEBUG 10:28:58,102 Document read took: 1ms DEBUG 10:28:58,104 Document read took: 1ms DEBUG 10:28:58,105 Document read took: 1ms DEBUG 10:28:58,107 Document read took: 2ms DEBUG 10:28:58,108 Document read took: 1ms DEBUG 10:28:58,109 Document read took: 1ms DEBUG 10:28:58,110 Document read took: 1ms DEBUG 10:28:58,112 Document read took: 1ms DEBUG 10:28:58,113 Document read took: 1ms DEBUG 10:28:58,118 Fetching 0 Docs INFO 10:28:58,118 [df] webapp=/solandra path=/select params={isShard=truewt=javabinq=province:azids=[us/az/yuma/1152s4thave],[us/az/tempe/208sriverdr],[us/az/mundspark/475pinewoodblvd],[us/az/phoenix/2338wstellaln],[us/az/tucson/3341wwildwooddr],[us/az/surprise/15128wbellrd],[us/az/phoenix/3222egeorgiaave],[us/az/lakehavasucity/2250catamarandr],[us/az/huachucacity/264shuachucablvd],[us/az/tucson/6161sparkave]version=2} status=0 QTime=1 INFO 10:28:58,119 [df] webapp=/solandra path=/select params={wt=javabinq=province:azversion=2} status=0 QTime=3115 For a query that has been cached: DEBUG 10:27:36,350 core: df INFO 10:27:36,351 ShardInfo for df has expired INFO 10:27:36,353 Found reserved shard1(106758077800188110322537822484278066430):178410 TO 180224 DEBUG 
10:27:36,353 Adding shard(df): 10.1.10.200:8983/solandra/df~0 DEBUG 10:27:36,353 Adding shard(df): 10.1.10.200:8983/solandra/df~1 DEBUG 10:27:36,359 Fetching 0 Docs INFO 10:27:36,360 [df] webapp=/solandra path=/select params={fl=key,scorestart=0q=province:akisShard=truewt=javabinfsv=truerows=10version=2} hits=0 status=0 QTime=2 DEBUG 10:27:36,362 Fetching 10 Docs DEBUG 10:27:36,363 Found doc in cache INFO 10:27:36,363 [df] webapp=/solandra path=/select params={fl=key,scorestart=0q=province:akisShard=truewt=javabinfsv=truerows=10version=2} hits=14707 status=0 QTime=5 DEBUG 10:27:36,363 Found doc in cache DEBUG 10:27:36,363 Found doc in cache DEBUG 10:27:36,363 Found doc in cache DEBUG 10:27:36,364 Found doc in cache DEBUG 10:27:36,364 Found doc in cache DEBUG 10:27:36,364 Found doc in cache DEBUG 10:27:36,364 Found doc in cache DEBUG 10:27:36,364 Found doc in cache DEBUG 10:27:36,365 Found doc in cache DEBUG 10:27:36,365 Found doc in cache DEBUG 10:27:36,369 Fetching 0 Docs INFO 10:27:36,369 [df] webapp=/solandra path=/select params={isShard=truewt=javabinq=province:akids=[us/ak/fairbanks/1483ballainerd],[us/ak/anchorage/4451etudorrd],[us/ak/anchorage/600cordovast],[us/ak/anchorage/6048e6thave],[us/ak/anchorage/940tyonekdr],[us/ak/fairbanks/3800universityaves],[us/ak/kenai/47189sherwoodcir],[us/ak/anchorage/12801oldsewardhwy],[us/ak/anchorage/8400raintreecir],[us/ak/juneau/9150skywoodln]version=2} status=0
Re: glassfish, solrconfig.xml and SolrException: Error loading DataImportHandler
Thanks for telling me this issue. However, I would think this is a bug. ^=^ From: Chris Hostetter hossman_luc...@fucit.org To: solr-user@lucene.apache.org solr-user@lucene.apache.org; Xue-Feng Yang just4l...@yahoo.com Sent: Wednesday, September 14, 2011 6:19:24 PM Subject: Re: glassfish, solrconfig.xml and SolrException: Error loading DataImportHandler : References: 41dfe0136ddf091e98d45dea9f0da1ab@localhost : cab_8yd9obtkvkdktqpfnuzmey-afbzajyvgahh58+mccgiq...@mail.gmail.com : Message-ID: 1316011545.626.yahoomail...@web110411.mail.gq1.yahoo.com : Subject: glassfish, solrconfig.xml and SolrException: Error loading : DataImportHandler : In-Reply-To: : cab_8yd9obtkvkdktqpfnuzmey-afbzajyvgahh58+mccgiq...@mail.gmail.com https://people.apache.org/~hossman/#threadhijack Thread Hijacking on Mailing Lists When starting a new discussion on a mailing list, please do not reply to an existing message, instead start a fresh email. Even if you change the subject line of your email, other mail headers still track which thread you replied to and your question is hidden in that thread and gets less attention. It makes following discussions in the mailing list archives particularly difficult. -Hoss
location of solr folder when deploy to servlet container
Hi, how do I configure Solr to use a specific directory as its solr home folder when deploying to a servlet container? Regards, kiwi
Re: location of solr folder when deploy to servlet container
In Tomcat you can set an environment variable in Solr's context and set your home directory: <Environment name="solr/home" type="java.lang.String" value="/opt/solr/" /> Hi, how do I configure Solr to use a specific directory as its solr home folder when deploying to a servlet container? Regards, kiwi
Re: how would I use the new join feature given my schema.
Anyone know the query I would do to get the join to work? I'm unable to get it to work. On Wed, Sep 14, 2011 at 10:49 AM, Jason Toy jason...@gmail.com wrote: I've been reading the information on the new join feature and am not quite sure how I would use it given my schema structure. I have User docs and BlogPost docs and I want to return all BlogPosts that match the fulltext title cool that belong to Users that match the description solr. Here are the 2 docs I have: <?xml version="1.0" encoding="UTF-8"?> <add> <doc> <field name="class_name">User</field> <field name="login_s">jtoy</field> <field name="user_id_i">192123</field> <field name="description_text">a solr user</field> </doc> <doc> <field name="class_name">BlogPost</field> <field name="user_id_i">192123</field> <field name="body_text">this is the description</field> <field name="title_text">this is a cool title</field> </doc> </add> <?xml version="1.0" encoding="UTF-8"?> <commit/> Is it possible to do this with the join functionality? If not, how would I do this? I'd appreciate any pointers or help on this. Jason -- - sent from my mobile 6176064373
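A sketch of what the request might look like with the trunk join parser, assuming the two document types above share the user_id_i field (untested; syntax per the JoinQParserPlugin documentation):

    q=class_name:BlogPost AND title_text:cool AND _query_:"{!join from=user_id_i to=user_id_i}class_name:User AND description_text:solr"

The inner {!join} query selects the matching User docs and maps their user_id_i values onto documents with the same user_id_i; the outer clauses then restrict the result to BlogPosts whose title matches cool.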
query for point in time
Hi I have a scenario that I am not sure how to write the query for. Here is the scenario - I have an employee record with multiple values for project, start date and end date. It looks something like:

John Smith
  web site bug fix   2010-01-01  2010-01-03
  unit testing       2010-01-04  2010-01-06
  QA support         2010-01-07  2010-01-12
  implementation     2010-01-13  2010-01-22

I want to find what project John Smith was working on on 2010-01-05. Is this possible, or do I have to go back to my database? Thanks
Re: query for point in time
You didn't tell us what your schema looks like, what fields with what types are involved. But similar to how you'd do it in your database, you need to find 'documents' that have a start date before your date in question, and an end date after your date in question, to find the ones whose range includes your date in question. Something like this: q=start_date:[* TO '2010-01-05'] AND end_date:['2010-01-05' TO *] Of course, you need to add on your restriction to just documents about 'John Smith', through another AND clause or an 'fq'. But in general, if you've got a db with this info already, and this is all you need, why not just use the db? Multi-hieararchy data like this is going to give you trouble in Solr eventually, you've got to arrange the solr indexes/schema to answer your questions, and eventually you're going to have two questions which require mutually incompatible schema to answer. An rdbms is a great general purpose question answering tool for structured data. lucene/Solr is a great indexing tool for text matching. On 9/15/2011 2:55 PM, gary tam wrote: Hi I have a scenario that I am not sure how to write the query for. Here is the scenario - have an employee record with multi value for project, started date, end date. looks something like John Smith web site bug fix 2010-01-01 2010-01-03 unit testing 2010-01-04 2010-01-06 QA support 2010-01-07 2010-01-12 implementation 2010-01-13 2010-01-22 I want to find what project John Smith was working on 2010-01-05 Is this possible or I have to back to my database ? Thanks
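Assuming each project assignment is indexed as its own document with fields such as employee, project, start_date and end_date (these names are illustrative), and remembering that Solr date range syntax takes full unquoted ISO timestamps, the request might look like:

    q=*:*&fq=employee:"John Smith"&fq=start_date:[* TO 2010-01-05T23:59:59Z]&fq=end_date:[2010-01-05T00:00:00Z TO *]

Only assignments whose range spans 2010-01-05 satisfy both fq clauses, which gives the point-in-time answer asked about above.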
Re: Sorting on multiValued fields via function query
Was there a solution here? Is there a ticket related to the sort=max(FIELD) solution? -brian -- View this message in context: http://lucene.472066.n3.nabble.com/Sorting-on-multiValued-fields-via-function-query-tp2681833p3340145.html Sent from the Solr - User mailing list archive at Nabble.com.
[DIH] How to use combine Regex and HTML transformers
Hello, I need to pull out the price and imageURL for products in an Amazon RSS feed. PROBLEM STATEMENT: The following: <field column="description" xpath="/rss/channel/item/description" /> <field column="price" regex=".*?\$(\d*.\d*)" sourceColName="description" /> <field column="imageUrl" regex=".*?img src=&quot;(.*?)&quot;.*" sourceColName="description" /> works, but I am left with HTML junk inside the description! USELESS WORKAROUND: If I try to strip the HTML from the data being fed into description while pointing price and imageURL directly at the RSS feed field, like so: <field column="description" xpath="/rss/channel/item/description" stripHTML="true" /> <field column="price" regex=".*?\$(\d*.\d*)" xpath="/rss/channel/item/description" /> <field column="imageUrl" regex=".*?img src=&quot;(.*?)&quot;.*" xpath="/rss/channel/item/description" /> then this fails and only the last configured field in this list (imageURL) ends up having any data imported. Is this a bug? CRUX OF THE PROBLEM: I also tried to create a field just to hold the raw HTML data, like so, but this configuration yields no results for the description field, so I'm back where I started: <field column="rawDescription" xpath="/rss/channel/item/description" /> <field column="description" regex=".*" sourceColName="rawDescription" stripHTML="true" /> <field column="price" regex=".*?\$(\d*.\d*)" sourceColName="rawDescription" /> <field column="imageUrl" regex=".*?img src=&quot;(.*?)&quot;.*" sourceColName="rawDescription" /> I was suspicious of combining sourceColName with stripHTML to begin with... I was hoping that the regex transformer would run first and copy all the HTML data as-is, which would then be stripped out later by the HTML transformer, but this didn't work. Why? What else can I do? Thanks! - Pulkit
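One configuration worth trying for the "crux" layout, assuming the transformers listed in the entity's transformer attribute are applied in the order given, so the regex fields can read the raw column before HTML stripping produces the clean description (an untested sketch; the entity attributes are illustrative):

    <entity name="product" processor="XPathEntityProcessor" url="http://example.com/feed.rss"
            forEach="/rss/channel/item" transformer="RegexTransformer,HTMLStripTransformer">
      <field column="rawDescription" xpath="/rss/channel/item/description" />
      <field column="price" regex=".*?\$(\d*.\d*)" sourceColName="rawDescription" />
      <field column="imageUrl" regex=".*?img src=&quot;(.*?)&quot;.*" sourceColName="rawDescription" />
      <field column="description" regex="(.*)" sourceColName="rawDescription" stripHTML="true" />
    </entity>

If that ordering assumption doesn't hold in your Solr version, a small ScriptTransformer that copies rawDescription into description before stripping is another route.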
Re: Can index size increase when no updates/optimizes are happening?
On 9/14/2011 2:36 PM, Erick Erickson wrote: What is the machine used for? Was your user looking at a master? Slave? Something used for both? Stand-alone machine with multiple Solr cores. No replication. Measuring the size of all the files in the index? Or looking at memory? Disk space. The index files shouldn't be getting bigger unless there were indexing operations going on. That's what I thought. Is it at all possible that DIH was configured to run automatically (or any other indexing job for that matter) and your user didn't realize it? There's no DIH, but there is a custom app that submits docs for indexing via SolrJ. Supposedly, the Solr logs were not showing any updates overnight, so the assumption was that no new docs were added. I'd write it off as a user error, but wanted to double-check with the community that no other internal Solr/Lucene task can change the index file size in the absence of submits.
Generating large datasets for Solr proof-of-concept
Hello Everyone, I have a goal of populating Solr with a million unique products in order to create a test environment for a proof of concept. I started out by using DIH with Amazon RSS feeds but I've quickly realized that there's no way I can glean a million products from one RSS feed. And I'd go mad if I just sat at my computer all day looking for feeds and punching them into DIH config for Solr. Has anyone ever had to create large mock/dummy datasets for test environments or for POCs/Demos to convince folks that Solr was the wave of the future? Any tips would be greatly appreciated. I suppose it sounds a lot like crawling even though it started out as innocent DIH usage. - Pulkit
Re: Generating large datasets for Solr proof-of-concept
I've done it using SolrJ and a *lot* of parallel processes feeding dummy data into the server. On Thu, Sep 15, 2011 at 4:54 PM, Pulkit Singhal pulkitsing...@gmail.com wrote: Hello Everyone, I have a goal of populating Solr with a million unique products in order to create a test environment for a proof of concept. I started out by using DIH with Amazon RSS feeds but I've quickly realized that there's no way I can glean a million products from one RSS feed. And I'd go mad if I just sat at my computer all day looking for feeds and punching them into DIH config for Solr. Has anyone ever had to create large mock/dummy datasets for test environments or for POCs/Demos to convince folks that Solr was the wave of the future? Any tips would be greatly appreciated. I suppose it sounds a lot like crawling even though it started out as innocent DIH usage. - Pulkit
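A rough sketch of that approach, under some assumptions: the field names (id, item_name, price) and the queue/thread counts are invented, and StreamingUpdateSolrServer (the SolrJ client that buffers adds and sends them from background threads) is used because a plain add() per document from a single thread is usually the bottleneck.

import java.util.Locale;
import java.util.Random;
import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class DummyProductFeeder {
    public static void main(String[] args) throws Exception {
        // buffer up to 100 docs, send with 4 background threads
        StreamingUpdateSolrServer server =
                new StreamingUpdateSolrServer("http://localhost:8983/solr", 100, 4);
        Random rnd = new Random(42);
        for (int i = 0; i < 1000000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "product-" + i);
            doc.addField("item_name", "Dummy product " + i);
            doc.addField("price", String.format(Locale.US, "%.2f", 1 + rnd.nextDouble() * 999));
            server.add(doc);
        }
        server.commit();
    }
}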
RE: Replication and ExternalFileField
Actually, Windoze also has symbolic links. You have to manipulate them from the command line, but they do exist. http://en.wikipedia.org/wiki/NTFS_symbolic_link -Original Message- From: Per Osbeck [mailto:per.osb...@lbi.com] Sent: Thursday, September 15, 2011 7:15 AM To: solr-user@lucene.apache.org Subject: RE: Replication and ExternalFileField Probably would have worked on *nix but unfortunately running Windows. Best regards, Per -Original Message- From: Markus Jelsma [mailto:markus.jel...@openindex.io] Sent: den 15 september 2011 14:07 To: solr-user@lucene.apache.org Subject: Re: Replication and ExternalFileField Perhaps a symlink will do the trick. On Thursday 15 September 2011 14:04:47 Per Osbeck wrote: Hi all, I'm trying to find some good information regarding replication, especially for the ExternalFileField. As I understand it: - the external files must be in data dir. - replication only replicates data/indexes and possibly confFiles from the conf dir. Does anyone have suggestions or ideas on how this should work? Best regards, Per Osbeck -- Markus Jelsma - CTO - Openindex http://www.linkedin.com/in/markus17 050-8536620 / 06-50258350
Re: Generating large datasets for Solr proof-of-concept
If we want to test with huge amounts of data we feed portions of the internet. The problem is it takes a lot of bandwidth and lots of computing power to get to a `reasonable` size. On the positive side, you deal with real text so it's easier to tune for relevance. I think it's easier to create a simple XML generator with mock data, prices, popularity rates etc. It's fast to generate millions of mock products and once you have a large quantity of XML files, you can easily index, test, change config or schema and reindex. On the other hand, the sample data that comes with the Solr example is a good set as well, as it proves the concepts well, especially with the stock Velocity templates. We know Solr will handle enormous sets but quantity is not always a part of a PoC. Hello Everyone, I have a goal of populating Solr with a million unique products in order to create a test environment for a proof of concept. I started out by using DIH with Amazon RSS feeds but I've quickly realized that there's no way I can glean a million products from one RSS feed. And I'd go mad if I just sat at my computer all day looking for feeds and punching them into DIH config for Solr. Has anyone ever had to create large mock/dummy datasets for test environments or for POCs/Demos to convince folks that Solr was the wave of the future? Any tips would be greatly appreciated. I suppose it sounds a lot like crawling even though it started out as innocent DIH usage. - Pulkit
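If the XML-generator route sounds right, a throwaway sketch of what it might look like is below. It writes files in Solr's add/doc/field update format so they can be posted later with post.jar or curl; the field names and value ranges are made up for illustration.

import java.io.FileWriter;
import java.io.PrintWriter;
import java.util.Random;

public class MockCatalogGenerator {
    public static void main(String[] args) throws Exception {
        Random rnd = new Random();
        int docsPerFile = 10000;
        for (int file = 0; file < 100; file++) {  // 100 files x 10,000 docs = 1M products
            PrintWriter out = new PrintWriter(new FileWriter("products-" + file + ".xml"));
            out.println("<add>");
            for (int i = 0; i < docsPerFile; i++) {
                int id = file * docsPerFile + i;
                out.println("  <doc>");
                out.println("    <field name=\"id\">product-" + id + "</field>");
                out.println("    <field name=\"item_name\">Mock product " + id + "</field>");
                out.println("    <field name=\"price\">" + (1 + rnd.nextInt(999)) + ".99</field>");
                out.println("    <field name=\"popularity\">" + rnd.nextInt(100) + "</field>");
                out.println("  </doc>");
            }
            out.println("</add>");
            out.close();
        }
    }
}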
Re: query for point in time
Thanks for the reply. We had the search within the database initially, but it proved to be too slow. With solr we have much better performance. One more question, how could I find the most current job for each employee? My data looks like John Smith department A web site bug fix 2010-01-01 2010-01-03 unit testing 2010-01-04 2010-01-06 QA support 2010-01-07 2010-01-12 implementation 2010-01-13 2010-01-22 Jane Doe department A QA support 2010-01-01 2010-05-01 implementation 2010-05-02 2010-09-28 Joe Doe department A PHP development 2011-01-01 2011-08-31 Java Development 2011-09-01 2011-09-15 I would like to return this as my search result John Smith department A implementation 2010-01-13 2010-01-22 Jane Doe department A implementation 2010-05-02 2010-09-28 Joe Doe department A Java Development 2011-09-01 2011-09-15 Thanks in advance Gary On Thu, Sep 15, 2011 at 3:33 PM, Jonathan Rochkind rochk...@jhu.edu wrote: You didn't tell us what your schema looks like, what fields with what types are involved. But similar to how you'd do it in your database, you need to find 'documents' that have a start date before your date in question, and an end date after your date in question, to find the ones whose range includes your date in question. Something like this: q=start_date:[* TO '2010-01-05'] AND end_date:['2010-01-05' TO *] Of course, you need to add on your restriction to just documents about 'John Smith', through another AND clause or an 'fq'. But in general, if you've got a db with this info already, and this is all you need, why not just use the db? Multi-hierarchy data like this is going to give you trouble in Solr eventually, you've got to arrange the solr indexes/schema to answer your questions, and eventually you're going to have two questions which require mutually incompatible schema to answer. An rdbms is a great general purpose question answering tool for structured data. lucene/Solr is a great indexing tool for text matching. On 9/15/2011 2:55 PM, gary tam wrote: Hi I have a scenario that I am not sure how to write the query for. Here is the scenario - have an employee record with multi value for project, started date, end date. looks something like John Smith web site bug fix 2010-01-01 2010-01-03 unit testing 2010-01-04 2010-01-06 QA support 2010-01-07 2010-01-12 implementation 2010-01-13 2010-01-22 I want to find what project John Smith was working on 2010-01-05 Is this possible or do I have to go back to my database? Thanks
Re: query for point in time
I think there's something wrong with your database then, but okay. You still haven't said what your Solr schema looks like -- that list of values doesn't say what the solr field names or types are. I think this is maybe because you don't actually have a Solr database and have no idea how Solr works, you're just asking in theory? On the other hand, you just said you have better performance with solr -- I'm not sure how you were able to tell the performance of solr in answering these queries if you don't even know how to make them! But, again, assuming your data is set up like I'm guessing it is, it's quite similar to what you'd do with an rdbms. What does 'most current' mean? Can jobs be overlapping? To find the project with the latest start date for a given person, just limit to documents with that current person in a 'q' or 'fq', and then sort by start_date desc. Perhaps limit to 1 if you really only want one hit. Same principle as you would in an rdbms. Again, this requires setting up your solr index in such a way to answer these sorts of questions. Each document in Solr will represent a person-project pair. It'll have fields for person (or multiple fields, personID, personFirst, personLast, etc), project name, project start date, project end date. This will make it easy/possible to answer questions like your examples with Solr, but will make it hard to answer many other sorts of questions -- unlike an rdbms, it is difficult to set up a Solr index that can flexibly answer just about any question you throw at it, particularly when you have hierarchical or otherwise multi-entity data. If you are interested, the standard Solr tutorial is pretty good: http://lucene.apache.org/solr/tutorial.html On 9/15/2011 6:39 PM, gary tam wrote: Thanks for the reply. We had the search within the database initially, but it proved to be too slow. With solr we have much better performance. One more question, how could I find the most current job for each employee? My data looks like John Smith department A web site bug fix 2010-01-01 2010-01-03 unit testing 2010-01-04 2010-01-06 QA support 2010-01-07 2010-01-12 implementation 2010-01-13 2010-01-22 Jane Doe department A QA support 2010-01-01 2010-05-01 implementation 2010-05-02 2010-09-28 Joe Doe department A PHP development 2011-01-01 2011-08-31 Java Development 2011-09-01 2011-09-15 I would like to return this as my search result John Smith department A implementation 2010-01-13 2010-01-22 Jane Doe department A implementation 2010-05-02 2010-09-28 Joe Doe department A Java Development 2011-09-01 2011-09-15 Thanks in advance Gary On Thu, Sep 15, 2011 at 3:33 PM, Jonathan Rochkind rochk...@jhu.edu wrote: You didn't tell us what your schema looks like, what fields with what types are involved. But similar to how you'd do it in your database, you need to find 'documents' that have a start date before your date in question, and an end date after your date in question, to find the ones whose range includes your date in question. Something like this: q=start_date:[* TO '2010-01-05'] AND end_date:['2010-01-05' TO *] Of course, you need to add on your restriction to just documents about 'John Smith', through another AND clause or an 'fq'. But in general, if you've got a db with this info already, and this is all you need, why not just use the db? 
Multi-hierarchy data like this is going to give you trouble in Solr eventually, you've got to arrange the solr indexes/schema to answer your questions, and eventually you're going to have two questions which require mutually incompatible schema to answer. An rdbms is a great general purpose question answering tool for structured data. lucene/Solr is a great indexing tool for text matching. On 9/15/2011 2:55 PM, gary tam wrote: Hi I have a scenario that I am not sure how to write the query for. Here is the scenario - have an employee record with multi value for project, started date, end date. looks something like John Smith web site bug fix 2010-01-01 2010-01-03 unit testing 2010-01-04 2010-01-06 QA support 2010-01-07 2010-01-12 implementation 2010-01-13 2010-01-22 I want to find what project John Smith was working on 2010-01-05 Is this possible or do I have to go back to my database? Thanks
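To make the 'filter to the person, sort by start date descending, take the top hit' suggestion concrete, here is a minimal SolrJ sketch. It assumes one Solr document per person-project pair with fields named person, project and start_date, which is only a guess at the schema, and it answers for a single named employee; getting the latest project for every employee in one request is a different problem.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class LatestProjectForPerson {
    public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        SolrQuery q = new SolrQuery();
        q.setQuery("*:*");
        q.addFilterQuery("person:\"John Smith\"");           // hypothetical field name
        q.addSortField("start_date", SolrQuery.ORDER.desc);  // newest project first
        q.setRows(1);                                        // keep only the top hit
        QueryResponse rsp = server.query(q);
        if (!rsp.getResults().isEmpty()) {
            System.out.println(rsp.getResults().get(0).getFieldValue("project"));
        }
    }
}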
Lucene-SOLR transition
I've been using lucene for a number of years. We've now decided to move to SOLR. I have a couple of questions. 1. I'm used to creating Boolean queries, filter queries, term queries, etc. for lucene. Am I right in thinking that for SOLR my only option is creating string queries (with q and fq components) for solrj? 2. Assuming that the answer to 1 is correct, then is there an easy way to take a lucene query (with nested Boolean queries, filter queries, etc.) and generate a SOLR query string with q and fq components? Thanks Scott
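Not an authoritative answer, but a sketch of what the SolrJ side usually looks like: you pass Solr query strings (q, fq, sort and so on) through SolrQuery rather than Lucene Query objects. If your existing Lucene queries are plain BooleanQuery/TermQuery combinations, their toString() output is often, though not always, acceptable to the lucene query parser on the Solr side, so it can serve as a starting point rather than a guaranteed translation; Lucene filters generally map onto fq parameters. The field names below are invented.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class SolrJQueryExample {
    public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        SolrQuery q = new SolrQuery();
        // main query string, roughly what a Lucene BooleanQuery.toString() could produce
        q.setQuery("+title:solr +(body:search body:index)");
        // filter queries take the place of Lucene Filters and are cached separately
        q.addFilterQuery("type:article");
        q.addFilterQuery("date:[2011-01-01T00:00:00Z TO *]");
        QueryResponse rsp = server.query(q);
        System.out.println(rsp.getResults().getNumFound());
    }
}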
Re: Generating large datasets for Solr proof-of-concept
Ah missing } doh! BTW I still welcome any ideas on how to build an e-commerce test base. It doesn't have to be Amazon, that was just my approach. Anyone? - Pulkit On Thu, Sep 15, 2011 at 8:52 PM, Pulkit Singhal pulkitsing...@gmail.com wrote: Thanks for all the feedback thus far. Now to get a little technical about it :) I was thinking of feeding a file with all the tags of amazon that yield close to roughly 5 results each into a file and then running my rss DIH off of that, I came up with the following config but something is amiss, can someone please point out what is off about this? document entity name=amazonFeeds processor=LineEntityProcessor url=file:///xxx/yyy/zzz/amazonfeeds.txt rootEntity=false dataSource=myURIreader1 transformer=RegexTransformer,DateFormatTransformer entity name=feed pk=link url=${amazonFeeds.rawLine processor=XPathEntityProcessor forEach=/rss/channel | /rss/channel/item transformer=RegexTransformer,HTMLStripTransformer,DateFormatTransformer,script:skipRow ... The rawLine should feed into the url key but instead I get: Caused by: java.net.MalformedURLException: no protocol: null${amazonFeeds.rawLine at org.apache.solr.handler.dataimport.URLDataSource.getData(URLDataSource.java:90) Sep 15, 2011 8:48:01 PM org.apache.solr.update.DirectUpdateHandler2 rollback INFO: start rollback Sep 15, 2011 8:48:01 PM org.apache.solr.handler.dataimport.SolrWriter rollback SEVERE: Exception while solr rollback. Thanks in advance! On Thu, Sep 15, 2011 at 4:12 PM, Markus Jelsma markus.jel...@openindex.io wrote: If we want to test with huge amounts of data we feed portions of the internet. The problem is it takes a lot of bandwidth and lots of computing power to get to a `reasonable` size. On the positive side, you deal with real text so it's easier to tune for relevance. I think it's easier to create a simple XML generator with mock data, prices, popularity rates etc. It's fast to generate millions of mock products and once you have a large quantity of XML files, you can easily index, test, change config or schema and reindex. On the other hand, the sample data that comes with the Solr example is a good set as well, as it proves the concepts well, especially with the stock Velocity templates. We know Solr will handle enormous sets but quantity is not always a part of a PoC. Hello Everyone, I have a goal of populating Solr with a million unique products in order to create a test environment for a proof of concept. I started out by using DIH with Amazon RSS feeds but I've quickly realized that there's no way I can glean a million products from one RSS feed. And I'd go mad if I just sat at my computer all day looking for feeds and punching them into DIH config for Solr. Has anyone ever had to create large mock/dummy datasets for test environments or for POCs/Demos to convince folks that Solr was the wave of the future? Any tips would be greatly appreciated. I suppose it sounds a lot like crawling even though it started out as innocent DIH usage. - Pulkit
Re: Generating large datasets for Solr proof-of-concept
Thanks for all the feedback thus far. Now to get a little technical about it :) I was thinking of feeding a file with all the tags of amazon that yield close to roughly 5 results each into a file and then running my rss DIH off of that, I came up with the following config but something is amiss, can someone please point out what is off about this? document entity name=amazonFeeds processor=LineEntityProcessor url=file:///xxx/yyy/zzz/amazonfeeds.txt rootEntity=false dataSource=myURIreader1 transformer=RegexTransformer,DateFormatTransformer entity name=feed pk=link url=${amazonFeeds.rawLine processor=XPathEntityProcessor forEach=/rss/channel | /rss/channel/item transformer=RegexTransformer,HTMLStripTransformer,DateFormatTransformer,script:skipRow ... The rawLine should feed into the url key but instead I get: Caused by: java.net.MalformedURLException: no protocol: null${amazonFeeds.rawLine at org.apache.solr.handler.dataimport.URLDataSource.getData(URLDataSource.java:90) Sep 15, 2011 8:48:01 PM org.apache.solr.update.DirectUpdateHandler2 rollback INFO: start rollback Sep 15, 2011 8:48:01 PM org.apache.solr.handler.dataimport.SolrWriter rollback SEVERE: Exception while solr rollback. Thanks in advance! On Thu, Sep 15, 2011 at 4:12 PM, Markus Jelsma markus.jel...@openindex.io wrote: If we want to test with huge amounts of data we feed portions of the internet. The problem is it takes a lot of bandwidth and lots of computing power to get to a `reasonable` size. On the positive side, you deal with real text so it's easier to tune for relevance. I think it's easier to create a simple XML generator with mock data, prices, popularity rates etc. It's fast to generate millions of mock products and once you have a large quantity of XML files, you can easily index, test, change config or schema and reindex. On the other hand, the sample data that comes with the Solr example is a good set as well, as it proves the concepts well, especially with the stock Velocity templates. We know Solr will handle enormous sets but quantity is not always a part of a PoC. Hello Everyone, I have a goal of populating Solr with a million unique products in order to create a test environment for a proof of concept. I started out by using DIH with Amazon RSS feeds but I've quickly realized that there's no way I can glean a million products from one RSS feed. And I'd go mad if I just sat at my computer all day looking for feeds and punching them into DIH config for Solr. Has anyone ever had to create large mock/dummy datasets for test environments or for POCs/Demos to convince folks that Solr was the wave of the future? Any tips would be greatly appreciated. I suppose it sounds a lot like crawling even though it started out as innocent DIH usage. - Pulkit
ClassCastException: SmartChineseWordTokenFilterFactory to TokenizerFactory
Hi all, I am trying to use SmartChineseWordTokenFilterFactory in Solr 3.4.0, but I get the error SEVERE: java.lang.ClassCastException: org.apache.solr.analysis.SmartChineseWordTokenFilterFactory cannot be cast to org.apache.solr.analysis.TokenizerFactory Any thoughts?
Re: hi. allowLeadingWildcard is it possible or not yet?
I wonder the same thing... so I want to revive the topic: is it possible? - Smart, but doesn't work... If it worked, it would do the job... -- View this message in context: http://lucene.472066.n3.nabble.com/hi-allowLeadingWildcard-is-it-possible-or-not-yet-tp495457p3340838.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Generating large datasets for Solr proof-of-concept
http://aws.amazon.com/datasets DBPedia might be the easiest to work with: http://aws.amazon.com/datasets/2319 Amazon has a lot of these things. Infochimps.com is a marketplace for free and paid versions. Lance On Thu, Sep 15, 2011 at 6:55 PM, Pulkit Singhal pulkitsing...@gmail.com wrote: Ah missing } doh! BTW I still welcome any ideas on how to build an e-commerce test base. It doesn't have to be Amazon, that was just my approach. Anyone? - Pulkit On Thu, Sep 15, 2011 at 8:52 PM, Pulkit Singhal pulkitsing...@gmail.com wrote: Thanks for all the feedback thus far. Now to get a little technical about it :) I was thinking of feeding a file with all the tags of amazon that yield close to roughly 5 results each into a file and then running my rss DIH off of that, I came up with the following config but something is amiss, can someone please point out what is off about this? document entity name=amazonFeeds processor=LineEntityProcessor url=file:///xxx/yyy/zzz/amazonfeeds.txt rootEntity=false dataSource=myURIreader1 transformer=RegexTransformer,DateFormatTransformer entity name=feed pk=link url=${amazonFeeds.rawLine processor=XPathEntityProcessor forEach=/rss/channel | /rss/channel/item transformer=RegexTransformer,HTMLStripTransformer,DateFormatTransformer,script:skipRow ... The rawLine should feed into the url key but instead I get: Caused by: java.net.MalformedURLException: no protocol: null${amazonFeeds.rawLine at org.apache.solr.handler.dataimport.URLDataSource.getData(URLDataSource.java:90) Sep 15, 2011 8:48:01 PM org.apache.solr.update.DirectUpdateHandler2 rollback INFO: start rollback Sep 15, 2011 8:48:01 PM org.apache.solr.handler.dataimport.SolrWriter rollback SEVERE: Exception while solr rollback. Thanks in advance! On Thu, Sep 15, 2011 at 4:12 PM, Markus Jelsma markus.jel...@openindex.io wrote: If we want to test with huge amounts of data we feed portions of the internet. The problem is it takes a lot of bandwidth and lots of computing power to get to a `reasonable` size. On the positive side, you deal with real text so it's easier to tune for relevance. I think it's easier to create a simple XML generator with mock data, prices, popularity rates etc. It's fast to generate millions of mock products and once you have a large quantity of XML files, you can easily index, test, change config or schema and reindex. On the other hand, the sample data that comes with the Solr example is a good set as well, as it proves the concepts well, especially with the stock Velocity templates. We know Solr will handle enormous sets but quantity is not always a part of a PoC. Hello Everyone, I have a goal of populating Solr with a million unique products in order to create a test environment for a proof of concept. I started out by using DIH with Amazon RSS feeds but I've quickly realized that there's no way I can glean a million products from one RSS feed. And I'd go mad if I just sat at my computer all day looking for feeds and punching them into DIH config for Solr. Has anyone ever had to create large mock/dummy datasets for test environments or for POCs/Demos to convince folks that Solr was the wave of the future? Any tips would be greatly appreciated. I suppose it sounds a lot like crawling even though it started out as innocent DIH usage. - Pulkit -- Lance Norskog goks...@gmail.com
Re: ClassCastException: SmartChineseWordTokenFilterFactory to TokenizerFactory
Tokenizers and TokenFilters are different. Look in the schema for how other TokenFilterFactory classes are used. On Thu, Sep 15, 2011 at 8:05 PM, Xue-Feng Yang just4l...@yahoo.com wrote: Hi all, I am trying to use SmartChineseWordTokenFilterFactory in solr 3.4.0, but come to the error SEVERE: java.lang.ClassCastException: org.apache.solr.analysis.SmartChineseWordTokenFilterFactory cannot be cast to org.apache.solr.analysis.TokenizerFactory Any thought? -- Lance Norskog goks...@gmail.com