Re: reader/searcher refresh after replication (commit)
Yes, I consciously let my slaves run away from the master in order to reduce update latency, but every now and then they sync up with the master, which does the heavy lifting. The price you pay is that the slaves do not see the same documents as the master, but that is the case with replication anyhow. In my setup a slave may run ahead of the master with updates; this delta gets zeroed after replication and the game starts again. What you have to take into account is a very small time window where you may go back in time on the slaves (not seeing documents that were already there), but we are talking about seconds, and a couple of documents out of 200 million (only those documents that were soft-committed on the slave during replication, between the commit on the master and the postCommit on the slave). Why do you think something is strange here?

"What are you expecting a BeforeCommitListener could do for you, if one would exist?"

Why should I be expecting something? I just need to read the user commit data as soon as replication is done, and I am looking for a proper/easy way to do it (a postCommit listener is what I use now). What makes me slightly nervous are the life cycle questions, e.g. when I issue an update command before and after the postCommit event, which index gets updated: the one just replicated, or the one that was there just before replication? There are definitely ways to optimize this, for example to force the replication handler to copy only delta files when the index gets updated on both slave and master (there is already a todo somewhere on the Solr replication wiki, I think). Right now the ReplicationHandler copies the complete index when this is detected... I am all ears if there are better proposals for low-latency updates in a multi-server setup...

On Tue, Feb 21, 2012 at 11:53 PM, Em mailformailingli...@yahoo.de wrote: Eks, that sounds strange! Am I getting you right? You have a master which indexes batch-updates from time to time. Furthermore you have some slaves pulling data from that master to keep themselves up-to-date with the newest batch-updates. Additionally your slaves index their own content in soft-commit mode that needs to be available as soon as possible. In consequence, the slaves are not in sync with the master. I am not 100% certain, but chances are good that Solr's replication mechanism only copies those segments that are not in sync with the master. What are you expecting a BeforeCommitListener could do for you, if one would exist? Kind regards, Em

On 21.02.2012 21:10, eks dev wrote: Thanks Mark, Hmm, I would like to have this information asap, not to wait until the first search gets executed (which depends on the user). Is Solr going to create the new searcher as part of the replication transaction? Just to make it clear why I need it... I have a simple master/many-slaves config where the master does batch updates in big chunks (things users can wait longer to see on the search side), but the slaves work in soft-commit mode internally, where I permit them to run away slightly from the master. In order to know where the incremental update should start, I read it from the userData. Basically, ideally, before the commit (after successful replication) ends, I would like to read in these counters to let the incremental update run from the right point... I need to prevent updating the replicated index before I read this information (duplicates can appear). Are there any IndexWriter listeners around? Thanks again, eks.

On Tue, Feb 21, 2012 at 8:03 PM, Mark Miller markrmil...@gmail.com wrote: Post commit calls are made before a new searcher is opened.
Might be easier to try to hook in with a new searcher listener?

On Feb 21, 2012, at 8:23 AM, eks dev wrote: Hi all, I am a bit confused with the IndexSearcher refresh lifecycles... In a master/slave setup, I override the postCommit listener on the slave (Solr trunk version) to read some user information stored in the userCommitData on the master:

@Override
public final void postCommit() {
  // This returns stale information that was present before replication finished
  RefCounted<SolrIndexSearcher> refC = core.getNewestSearcher(true);
  Map<String, String> userData = refC.get().getIndexReader().getIndexCommit().getUserData();
}

I expected core.getNewestSearcher(true) to return a refreshed SolrIndexSearcher, but it didn't. When is this information going to be refreshed to the state of the replicated index? I repeat, this is a postCommit listener. What is the way to get the information from the last commit point? Maybe like this? core.getDeletionPolicy().getLatestCommit().getUserData(); Or do I need to explicitly open a new searcher (doesn't Solr do this behind the scenes?): core.openNewSearcher(false, false). Not critical, since reopening a new searcher works, but I would like to understand these lifecycles, i.e. when Solr loads the latest commit point... Thanks, eks

- Mark Miller lucidimagination.com
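Building on the snippet above, here is a minimal sketch of a postCommit listener that reads the user data from the latest commit point instead of from a possibly stale searcher. This is an illustration rather than the confirmed answer; the counter key is a made-up placeholder, and it assumes the trunk-era core.getDeletionPolicy() API quoted above:

@Override
public final void postCommit() {
  try {
    // Read the userData straight from the newest commit point on disk,
    // bypassing whichever searcher happens to be registered right now.
    IndexCommit latest = core.getDeletionPolicy().getLatestCommit();
    Map<String, String> userData = latest.getUserData();
    String counter = userData.get("incremental.update.start"); // hypothetical key
    // ... resume the incremental update from 'counter' ...
  } catch (IOException e) {
    SolrException.log(SolrCore.log, "could not read commit user data", e);
  }
}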
Re: Unique key constraint and optimistic locking (versioning)
Thanks a lot. We will use the UniqueKey feature and build versioning ourselves. Do you think it would be a good idea if we built a versioning feature into Solr/Lucene instead of doing it outside, so that others can benefit from the feature as well? I guess contributions are made according to http://wiki.apache.org/solr/HowToContribute. Is it possible for outsiders (like us) to get an SVN branch at svn.apache.org to prepare contributions, or do we have to use our own SVN? Are there any plans for migrating the Lucene/Solr codebase to Git, which would make it easier to get a separate area to work on the code (making a Git fork) and to suggest the contribution back to core Lucene/Solr (via a Git pull request)? Thanks! Per Steffensen

Em wrote: Hi Per, Solr provides the so-called UniqueKey field. Refer to the wiki to learn more: http://wiki.apache.org/solr/UniqueKey Optimistic locking (versioning) is not provided by Solr out of the box. If you add a new document with the same UniqueKey, it replaces the old one. You have to do the versioning on your own (and keep concurrent updates in mind). Kind regards, Em

On 21.02.2012 13:50, Per Steffensen wrote: Hi, does Solr/Lucene provide any mechanism for a unique key constraint and optimistic locking (versioning)? Unique key constraint: that a client will not succeed in creating a new document in Solr/Lucene if a document already exists with the same value in some field (e.g. an id field). Of course implemented right, so that even if two or more threads concurrently try to create a new document with the same value in this field, only one of them will succeed. Optimistic locking (versioning): that a client will only succeed in updating a document if the updated document is based on the version of the document currently stored in Solr/Lucene. Implemented in the optimistic way, where clients during an update have to tell which version of the document they fetched from Solr and therefore used as the starting point for their updated document. So basically: a version field on the document that clients increase by one before sending it to Solr for update, and some code in Solr that only lets the update succeed if the version number of the updated document is exactly one higher than the version number of the document already stored. Again implemented right, so that even if two or more threads concurrently try to update a document, all with their updated document based on the current version in Solr/Lucene, only one of them will succeed. Or do I have to do stuff like this myself outside Solr/Lucene, e.g. in the client using Solr? Regards, Per Steffensen
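To make the division of labor concrete, here is a minimal client-side sketch of the versioning scheme described above, using SolrJ. The version_l field name is a made-up placeholder, and, as the thread says, Solr itself does not enforce the check, so the read-bump-write below is not atomic without server-side support:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;

// A sketch, not a safe implementation: a concurrent writer can still slip in
// between the query and the add, which is exactly the gap discussed above.
void updateWithVersion(SolrServer server, String id) throws Exception {
  SolrDocument current = server.query(new SolrQuery("id:" + id)).getResults().get(0);
  long seen = ((Number) current.getFieldValue("version_l")).longValue(); // hypothetical field
  SolrInputDocument updated = new SolrInputDocument();
  updated.addField("id", id);
  updated.addField("version_l", seen + 1);
  // ... copy/modify the remaining fields here ...
  server.add(updated);  // replaces the old document via the UniqueKey
  server.commit();
  // Detecting a lost update means re-reading and comparing version_l,
  // or building the check into Solr as proposed in this thread.
}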
Re: Unique key constraint and optimistic locking (versioning)
Sorry, I didn't see the Eclipse (using Git) chapter on http://wiki.apache.org/solr/HowToContribute. We might contribute in that area. Thanks! Per Steffensen
Re: reader/searcher refresh after replication (commit)
Sounds much clearer to me than before. :) Ad hoc, I have two ideas.

First: let replication run asynchronously. If shard1 is pulling the new index from the master and therefore very recent documents aren't available there anymore, shard2 will find them in the meantime. As soon as shard1 is up-to-date (including the most recent documents), shard2 can pull its update from the master. However, being out of sync between two shards that should serve the same data has its own problems, I think.

Second: you can have another SolrCore for the most recent documents. This one could be based on a RAMDirectory for reduced latency (or even use NRT features, if available in your Solr version). Your master/slave setup becomes simpler, since you do not have to worry about out-of-sync scenarios anymore. The challenge here is to handle duplicate documents (i.e. newer versions in the RAMDirectory) and proper relevancy, due to shards that are unbalanced by design.

Kind regards, Em
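To make the second idea concrete, here is a minimal sketch of a multi-core layout with an in-memory core for the newest documents; the core names are placeholders, and it assumes a Solr version that ships solr.RAMDirectoryFactory. In solr.xml:

<solr persistent="true">
  <cores adminPath="/admin/cores">
    <core name="main" instanceDir="main"/>
    <core name="recent" instanceDir="recent"/> <!-- only the newest documents -->
  </cores>
</solr>

And in recent/conf/solrconfig.xml:

<!-- keep the low-latency core entirely in RAM; its contents are lost on restart -->
<directoryFactory name="DirectoryFactory" class="solr.RAMDirectoryFactory"/>

Queries would then treat the two cores as shards, which is where the duplicate handling mentioned above comes in.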
Fields, Facets, and Search Results
I'm new to Solr and trying to get a proper understanding of what's going on with fields, facets, and search results. I've modified the example schema.xml and solrconfig.xml that come with Solr to reflect some fields I want to experiment with. I've also modified the Velocity templates in Solritas accordingly. I've created some sample docs to post to the index that have the fields/data I want to experiment with. Everything compiles and works, but my search results are not what I expect, and I'm trying to understand why. Every field I have is defined the same way (here are two examples):

<field name="title" type="text_general" indexed="true" stored="true" omitNorms="true"/>
<field name="section_text_content" type="text_general" indexed="true" stored="true" omitNorms="true"/>

The puzzle is that I'm getting search results on every term that's in the title field, but nothing on terms in the section_text_content field. I have no idea why. I thought at first it was because I'd specified the title field to also be a facet, but I removed that and things remain as described (except now, of course, the facet for title is gone). Can anyone provide some insight? Don
how to mock solr server solr_ruby
Hi, I am using solr_ruby in my Ruby code, and for that I start the Solr server using start.jar. Now I want to write mock objects for the Solr connection and for the code in my Ruby file that searches data from Solr. Can anybody suggest how to do this testing without starting the Solr server?
Re: Fields, Facets, and Search Results
Check your schema config file first. It looks like you missed copying the section_text_content field's content to your default search field:

<defaultSearchField>text</defaultSearchField>

<copyField source="section_text_content" dest="text"/>
'location' fieldType indexation impossible
Hi, when I try to index my location field I get this error for each document:

ATTENTION: Error creating document ... Error adding field 'emploi_city_geoloc'='48.85,2.5525'

(so I have 0 documents indexed). Here is my schema.xml:

<field name="emploi_city_geoloc" type="location" indexed="true" stored="false"/>

I really don't understand why it isn't working, because it was working on my local server with the same configuration (Solr 3.5.0) and the same database! If I use geohash instead of location, indexing works, but then my geodist query on the front end doesn't work anymore... Any ideas? Best regards, Xavier
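For what it's worth, the location type in the stock 3.5 example schema is solr.LatLonType, which splits each point into two dynamic subfields at index time; if the schema on the failing server is missing that dynamic field, every add fails with this kind of error. A sketch of the pieces that must be present together (an assumption about the cause, not a confirmed diagnosis):

<fieldType name="location" class="solr.LatLonType" subFieldSuffix="_coordinate"/>
<!-- LatLonType writes the latitude and longitude into *_coordinate subfields;
     without this dynamicField, document creation fails -->
<dynamicField name="*_coordinate" type="tdouble" indexed="true" stored="false"/>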
Re: Date filter query
Hi all, thanks for your responses. I tried [NOW/DAY-30DAY+TO+NOW/DAY-1DAY-1SECOND] and it seems to work fine for me. Thanks a lot!
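Spelled out as a full filter-query parameter, with the URL-encoded + signs turned back into spaces (created_dt is a hypothetical field name):

fq=created_dt:[NOW/DAY-30DAY TO NOW/DAY-1DAY-1SECOND]

Both endpoints round to day boundaries, and the trailing -1SECOND shaves one second off the upper bound so it effectively excludes the following day.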
How is Data Indexed in HBase?
Dear all, I wonder how data in HBase is indexed. Solr is used in my system now because data is managed in an inverted index; such an index is suited to retrieving unstructured data in huge amounts. How does HBase deal with this issue? Could I replace Solr with HBase? Thanks so much! Best regards, Bing
Re: Fast Vector Highlighter Working for some records only
Koji Sekiguchi wrote (12/02/22 11:58):

dhaivat wrote: Thanks for the reply, but can you please tell me why it's working for some documents and not for others?

Since Solr 1.4.1 does not recognize the hl.useFastVectorHighlighter flag, it just ignores it; but because hl=true is there, Solr tries to create highlight snippets using the existing (traditional, i.e. non-FVH) Highlighter. The Highlighter (including FVH) sometimes cannot produce snippets for various reasons; for those cases you can use the hl.alternateField parameter. http://wiki.apache.org/solr/HighlightingParameters#hl.alternateField koji -- Query Log Visualizer for Apache Solr http://soleami.com/

Thank you so much for the explanation. I have updated my Solr version and am using 3.5. Could you please tell me: when I am using a custom tokenizer on the field, do I need to make any changes related to the Solr highlighter? Here is my custom analyzer:

<fieldType name="custom_text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="ns.solr.analyser.CustomIndexTokeniserFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="ns.solr.analyser.CustomSearcherTokeniserFactory"/>
  </analyzer>
</fieldType>

And here is the field info:

<field name="contents" type="custom_text" indexed="true" stored="true" multiValued="true" termPositions="true" termVectors="true" termOffsets="true"/>

I am creating tokens using my custom analyzer, and when I try to use the highlighter it doesn't work properly for the contents field. But when I tried Solr's inbuilt tokenizer, the word was highlighted for the same query. Can you please help me out with this? Thanks in advance, Dhaivat
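One thing worth checking with a custom tokenizer (an assumption about the cause, since the inbuilt tokenizer works for you): the FastVectorHighlighter builds its fragments from the term vector offsets, so the tokenizer must report each token's real character positions through OffsetAttribute. A minimal whitespace tokenizer showing the offset bookkeeping, written against the Lucene 3.5 API:

import java.io.IOException;
import java.io.Reader;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

public final class OffsetAwareTokenizer extends Tokenizer {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
  private int pos = 0; // character position in the original input

  public OffsetAwareTokenizer(Reader in) { super(in); }

  @Override
  public boolean incrementToken() throws IOException {
    clearAttributes();
    int c;
    while ((c = input.read()) != -1 && Character.isWhitespace(c)) pos++; // skip gaps
    if (c == -1) return false;
    final int start = pos;
    termAtt.setEmpty();
    do {
      termAtt.append((char) c);
      pos++;
    } while ((c = input.read()) != -1 && !Character.isWhitespace(c));
    final int end = pos;
    if (c != -1) pos++; // we consumed one whitespace char past the token
    // these offsets are what FVH reads back from the term vectors
    offsetAtt.setOffset(correctOffset(start), correctOffset(end));
    return true;
  }

  @Override
  public void reset(Reader in) throws IOException {
    super.reset(in);
    pos = 0;
  }
}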
Re: reader/searcher refresh after replication (commit)
You'll *really like* the SolrCloud stuff going into trunk when it's baked for a while. Best, Erick
Re: reader/searcher refresh after replication (commit)
Erick,

"You'll *really like* the SolrCloud stuff going into trunk when it's baked for a while"

How stable is SolrCloud at the moment? I cannot wait to try it out. Kind regards, Em
Re: How to handle to run testcases in ruby code for solr
Hi Erik, I tried the links you gave. While running rake I am getting this error:

Errno::ECONNREFUSED: No connection could be made because the target machine actively refused it. - connect(2)
Solr on netty
Is anybody aware of any effort to port Solr to Netty (or any other async-IO based framework)? Even at medium load (10 parallel clients) with 16 shards, performance seems to deteriorate quite sharply as load increases, compared to an alternative (async-IO based) solution. -Prasenjit
Re: How to handle to run testcases in ruby code for solr
I'm not sure what to suggest at this point... obviously your test setup is trying to hit a Solr server that isn't running. Check the host and port that it is trying, and ensure that Solr is running as your tests expect, or use the mock approach that I just replied about. Note, again, that solr-ruby is deprecated and unsupported at this point. I recommend you give the RSolr project a try if you want support in the future. Erik
solr 3.5 and indexing performance
Hello, I wanted to switch to a new version of Solr, specifically 3.5, but I'm getting a big drop in indexing speed. I'm on 3.1, and after a few tests I discovered that 3.4 does a lot better than 3.5. My schema is really simple, a few fields using the text type:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.HunspellStemFilterFactory" dictionary="pl_PL.dic" affix="pl_PL.aff"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.HunspellStemFilterFactory" dictionary="pl_PL.dic" affix="pl_PL.aff"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
  </analyzer>
</fieldType>

All data and configuration are the same: same schema, same solrconfig, same Jetty.

SOLR 3.5:

Feb 22, 2012 3:40:33 PM org.apache.solr.core.SolrDeletionPolicy onInit
INFO: SolrDeletionPolicy.onInit: commits:num=1
  commit{dir=/vol/home/mciurla/proj/solr/accordion3.5/example/solr/data/index,segFN=segments_bl,version=1329831219365,generation=417,filenames=[...]}
Feb 22, 2012 3:40:33 PM org.apache.solr.core.SolrDeletionPolicy updateCommits
INFO: newest commit = 1329831219365
Feb 22, 2012 3:40:47 PM org.apache.solr.update.processor.LogUpdateProcessor finish
INFO: {add=[2271874, 2271875, 2271876, 2271877, 2271878, 2271879, 2271880, 2271881, ... (100 adds)]} 0 14213
Feb 22, 2012 3:40:47 PM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/update params={} status=0 QTime=14213

When on Solr 3.4:

Feb 22, 2012 3:42:56 PM org.apache.solr.core.SolrDeletionPolicy onInit
INFO: SolrDeletionPolicy.onInit: commits:num=1
  commit{dir=/vol/home/mciurla/proj/solr/accordion3.4/example/solr/data/index,segFN=segments_29,version=1329918470592,generation=81,filenames=[...]}
Re: reader/searcher refresh after replication (commit)
It's certainly stable enough to start experimenting with, and I know that it's under pretty active development now. I've seen a lot of back-and-forth between Mark Miller and Jamie Johnson, Jamie trying things and Mark responding. It's part of trunk, so be prepared for occasional re-indexing being required; this isn't related to SolrCloud, just the fact that it's only available on trunk. And I'm certain that the more eyes look at it, the better it'll be, so I'd say go for it. I tried out the example here: http://wiki.apache.org/solr/SolrCloud and it went quite well, but I didn't stress it much yet (that's next). Personally, I'd put it through some pretty heavy testing before deploying to production at this point, just because of all the new features on trunk. But having people work with it is the best way to move the effort forward. So feel free! Erick
Re: Solr on netty
By 16 shards do you mean you have 16 nodes, and each single client request causes a distributed search across all of them? How many concurrent requests are your 10 clients making to each node? NIO works well when there are many clients but servicing those client requests needs only intermittent CPU. That's not the pattern we see for search. You *can* easily configure Solr's Jetty to use NIO when accepting client connections, but it won't do you any good, just as switching to Netty wouldn't do anything here. Where NIO could help a little is with the requests that Solr makes to other Solr instances. Solr is already architected for async request-response to other nodes, but the current underlying implementation uses HttpClient 3 (which doesn't have NIO). Anyway, it's unlikely that NIO vs BIO will make much of a difference with the numbers you're talking about (16 shards). Someone else reported that we have the number of connections per host set too low, and they saw big gains by increasing it. There's an issue open to make this configurable in 3x: https://issues.apache.org/jira/browse/SOLR-3079 We should probably up the max connections per host by default. -Yonik lucidimagination.com
Unusually long data import time?
Hello, Would it be unusual for an import of 160 million documents to take 18 hours? Each document is less than 1kb and I have the DataImportHandler using the jdbc driver to connect to SQL Server 2008. The full-import query calls a stored procedure that contains only a select from my target table. Is there any way I can speed this up? I saw recently someone on this list suggested a new user could get all their Solr data imported in under an hour. I sure hope that's true! Devon Baumgarten
Re: Unusually long data import time?
Import times will depend on:
- hardware (speed of disks, cpu, # of cpus, amount of memory, etc)
- Java configuration (heap size, etc)
- Lucene/Solr configuration (many ...)
- Index configuration - how many fields, indexed how; faceting, etc
- OS configuration (this usually to a lesser degree; _usually_)
- Network issues if non-local
- DB configuration (driver, etc)

If you can give more information about the above, people on this list should be able to better indicate whether 18 hours sounds right for your situation.

-Glen Newton
http://zzzoot.blogspot.com/
Re: solr 3.5 and indexing performance
"I wanted to switch to a new version of Solr, specifically 3.5, but I'm getting a big drop in indexing speed."

Could it be the autoCommit configuration in solrconfig.xml?
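For reference, the block in question lives in the updateHandler section of solrconfig.xml; a sketch with placeholder values. If the 3.5 config commits more aggressively than the 3.4 one being compared, the commit cost alone could explain a large slowdown:

<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxDocs>10000</maxDocs>   <!-- placeholder: commit after this many docs -->
    <maxTime>60000</maxTime>   <!-- placeholder: or after this many ms -->
  </autoCommit>
</updateHandler>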
SnapPull failed :org.apache.solr.common.SolrException: Error opening new searcher
We started observing strange failures from the ReplicationHandler when we commit on master (trunk version, 4-5 days old). It works sometimes and sometimes not; I didn't dig deeper yet. It looks like the real culprit hides behind: org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed. Does this look familiar to somebody?

120222 154959 SEVERE SnapPull failed :org.apache.solr.common.SolrException: Error opening new searcher
  at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1138)
  at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1251)
  at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1043)
  at org.apache.solr.update.DirectUpdateHandler2.commit(Unknown Source)
  at org.apache.solr.handler.SnapPuller.doCommit(SnapPuller.java:503)
  at org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:348)
  at org.apache.solr.handler.ReplicationHandler.doFetch(Unknown Source)
  at org.apache.solr.handler.SnapPuller$1.run(SnapPuller.java:163)
  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
  at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351)
  at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178)
  at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
  at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
  at java.lang.Thread.run(Thread.java:722)
Caused by: org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed
  at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:810)
  at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:815)
  at org.apache.lucene.index.IndexWriter.nrtIsCurrent(IndexWriter.java:3984)
  at org.apache.lucene.index.StandardDirectoryReader.doOpenFromWriter(StandardDirectoryReader.java:254)
  at org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:233)
  at org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:223)
  at org.apache.lucene.index.DirectoryReader.openIfChanged(DirectoryReader.java:170)
  at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1095)
  ... 15 more
Re: Solr on netty
Thanks for the response. Yes, we have 16 shards/partitions, each on one of 16 different nodes, and a separate master Solr receiving continuous parallel requests from 10 client threads running on a single separate machine. Our observation was that performance degraded non-linearly as the load (number of concurrent clients) increased. Some follow-up questions:

1. What is the default maximum number of threads configured when a Solr instance makes calls to the other 16 partitions?
2. How do I increase the maximum number of connections for Solr-to-Solr interactions, as you mentioned in your mail?
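Regarding the second question, the SOLR-3079 patch Yonik pointed to makes this configurable per search handler in solrconfig.xml. A sketch of that configuration with placeholder values; the exact parameter set depends on the build, so check the issue for what your version accepts:

<requestHandler name="standard" class="solr.SearchHandler" default="true">
  <!-- placeholder value: raises the cap on concurrent connections
       that this node opens to each shard -->
  <shardHandlerFactory class="HttpShardHandlerFactory">
    <int name="maxConnectionsPerHost">100</int>
  </shardHandlerFactory>
</requestHandler>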
RE: Unusually long data import time?
Oh sure! As best as I can, anyway. I have not set the Java heap size, or really configured it at all. The server, running both SQL Server and Solr, has:

* 2 Intel Xeon X5660 (each one is 2.8 GHz, 6 cores, 12 logical processors)
* 64 GB RAM
* One Solr instance (no shards)

I'm not using faceting. My schema has these fields:

<field name="Id" type="string" indexed="true" stored="true"/>
<field name="RecordId" type="int" indexed="true" stored="true"/>
<field name="RecordType" type="string" indexed="true" stored="true"/>
<field name="Name" type="LikeText" indexed="true" stored="true" termVectors="true"/>
<field name="NameFuzzy" type="FuzzyText" indexed="true" stored="true" termVectors="true"/>
<copyField source="Name" dest="NameFuzzy"/>
<field name="NameType" type="string" indexed="true" stored="true"/>

Custom types:

LikeText: PatternReplaceCharFilterFactory (\W+ = ), KeywordTokenizerFactory, StopFilterFactory (~40 words in stoplist), ASCIIFoldingFilterFactory, LowerCaseFilterFactory, EdgeNGramFilterFactory, LengthFilterFactory (min:3, max:512)

FuzzyText: PatternReplaceCharFilterFactory (\W+ = ), KeywordTokenizerFactory, StopFilterFactory (~40 words in stoplist), ASCIIFoldingFilterFactory, LowerCaseFilterFactory, NGramFilterFactory, LengthFilterFactory (min:3, max:512)

Devon Baumgarten
Re: Fields, Facets, and Search Results
Hi darul, you're right, I was not using defaultSearchField. So, following your suggestions, I added

<defaultSearchField>text</defaultSearchField>

and

<copyField source="section_text_content" dest="text"/>

This required that I add a field named text, which is fine; I did that. Now, when I commit the doc for indexing, I get this error: SOLR returned a #400 Error: Error adding field section_text_content. . .
Solr Performance Improvement and degradation Help
As I've mentioned before, I'm very new to Solr; I'm not a Java guy or an Apache guy, I'm a .NET guy. We have a rather large schema: some 100+ fields plus a large number of dynamic fields. We've been trying to improve performance and finally got around to implementing fast vector highlighting, which gave us an immediate improvement in QTime (nearly 70%) and also improved the overall response time by over 20%. With that, we also bring back an extraordinarily large amount of data in the XML; some results (20 records) come back with a payload between 3 MB and even 17 MB. We have a lot of report text that is used for searching and highlighting. We recently implemented field-list wildcards on two versions of Solr to test them out. This allowed us to leave the report text off the response and decreased the payload significantly, by nearly 85% in the large cases... So we'd expect a performance boost there; however, we are seeing greatly increased response times on these builds of Solr even though the QTime is incredibly fast.

To put it in perspective: our original Solr core is 4.0, I believe the 4.0.0.2010.12.10.08.54.56 version. On our test boxes, we have one running the 4.0.0.2011.11.17 version and one running the 4.0.0.2012.02.16 version. With the older version (not having the wildcard field list), it returns a payload of approximately 13 MB in an average of 1.5 seconds. With the new version (2012.02.16), which is on the same machines as the older version (so network traffic/latency/hardware/etc. are all the same), it returns the reduced payload (approximately 1.5 MB) in an average of 3.5-4 seconds. I will say that we reloaded the core once and briefly saw the 1.5 MB payload come back in 150-200 milliseconds, but within minutes we were back to the 3.5-4 seconds. We also noticed the CPU was being pegged for seconds at a time when running the queries on the new build with the wildcard field list.

We have a lower-scale box running the 2011.11.17 version and had more success for a while: we were getting the 150-200 ms response time on the reduced payload for probably 30 minutes or so, and then it did the same thing, bumping up to 3-4 seconds in response time. Does anyone have any experience with this type of random yet consistent performance degradation, or have insight into what might be causing the issue and how to fix it? We'd love to have not only the performance boost from fast vector highlighting, but also the decreased payload size. Thanks in advance!
Re: How to merge an autofacet with a predefined facet
I'm not sure I understand your solution. When (and how) would the 'word' detection in the full text happen: beforehand (on my own), or during Solr indexing (with Solr)?
Problem parsing queries with forward slashes and multiple fields
I'm running into a problem with queries that contain forward slashes and more than one field. For example, these queries work fine:

fieldName:/a
fieldName:/*

But if I have two fields with similar syntax in the same query, it fails. For simplicity, I'm using the same field twice:

fieldName:/a fieldName:/a

results in: no field name specified in query and no defaultSearchField defined in schema.xml

SEVERE: org.apache.solr.common.SolrException: no field name specified in query and no defaultSearchField defined in schema.xml
  at org.apache.solr.search.SolrQueryParser.checkNullField(SolrQueryParser.java:106)
  at org.apache.solr.search.SolrQueryParser.getFieldQuery(SolrQueryParser.java:124)
  at org.apache.lucene.queryparser.classic.QueryParserBase.handleBareTokenQuery(QueryParserBase.java:1058)
  at org.apache.lucene.queryparser.classic.QueryParser.Term(QueryParser.java:358)
  at org.apache.lucene.queryparser.classic.QueryParser.Clause(QueryParser.java:257)
  at org.apache.lucene.queryparser.classic.QueryParser.Query(QueryParser.java:212)
  at org.apache.lucene.queryparser.classic.QueryParser.TopLevelQuery(QueryParser.java:170)
  at org.apache.lucene.queryparser.classic.QueryParserBase.parse(QueryParserBase.java:118)
  at org.apache.solr.search.LuceneQParser.parse(LuceneQParserPlugin.java:74)
  at org.apache.solr.search.QParser.getQuery(QParser.java:143)

fieldName:/* fieldName:/*

results in: null

java.lang.NullPointerException
  at org.apache.solr.schema.IndexSchema$DynamicReplacement.matches(IndexSchema.java:747)
  at org.apache.solr.schema.IndexSchema.getDynamicFieldType(IndexSchema.java:1026)
  at org.apache.solr.schema.IndexSchema.getFieldType(IndexSchema.java:980)
  at org.apache.solr.search.SolrQueryParser.getWildcardQuery(SolrQueryParser.java:172)
  at org.apache.lucene.queryparser.classic.QueryParserBase.handleBareTokenQuery(QueryParserBase.java:1039)
  at org.apache.lucene.queryparser.classic.QueryParser.Term(QueryParser.java:358)
  at org.apache.lucene.queryparser.classic.QueryParser.Clause(QueryParser.java:257)
  at org.apache.lucene.queryparser.classic.QueryParser.Query(QueryParser.java:212)
  at org.apache.lucene.queryparser.classic.QueryParser.TopLevelQuery(QueryParser.java:170)
  at org.apache.lucene.queryparser.classic.QueryParserBase.parse(QueryParserBase.java:118)
  at org.apache.solr.search.LuceneQParser.parse(LuceneQParserPlugin.java:74)
  at org.apache.solr.search.QParser.getQuery(QParser.java:143)

Any ideas as to what may be wrong and how I can make these work? I'm on a 4.0 snapshot from Nov 29, 2011.
RE: Unusually long data import time?
I changed the heap size (Xmx1582m was as high as I could go). The import is at about 5% now, and from that I now estimate about 13 hours. It's hard to say though.. it keeps going up little by little. If I get approval to use Solr for this project, I'll have them install a 64-bit JVM instead, but is there anything else I can do?

Devon Baumgarten
Application Developer

-Original Message-
From: Devon Baumgarten [mailto:dbaumgar...@nationalcorp.com]
Sent: Wednesday, February 22, 2012 10:32 AM
To: 'solr-user@lucene.apache.org'
Subject: RE: Unusually long data import time?

Oh sure! As best as I can, anyway. I have not set the Java heap size, or really configured it at all. The server running both the SQL Server and Solr has:
* 2 Intel Xeon X5660 (each one is 2.8 GHz, 6 cores, 12 logical processors)
* 64 GB RAM
* One Solr instance (no shards)

I'm not using faceting. My schema has these fields:

<field name="Id" type="string" indexed="true" stored="true"/>
<field name="RecordId" type="int" indexed="true" stored="true"/>
<field name="RecordType" type="string" indexed="true" stored="true"/>
<field name="Name" type="LikeText" indexed="true" stored="true" termVectors="true"/>
<field name="NameFuzzy" type="FuzzyText" indexed="true" stored="true" termVectors="true"/>
<copyField source="Name" dest="NameFuzzy"/>
<field name="NameType" type="string" indexed="true" stored="true"/>

Custom types:

*LikeText
PatternReplaceCharFilterFactory (\W+ => " ")
KeywordTokenizerFactory
StopFilterFactory (~40 words in stoplist)
ASCIIFoldingFilterFactory
LowerCaseFilterFactory
EdgeNGramFilterFactory
LengthFilterFactory (min:3, max:512)

*FuzzyText
PatternReplaceCharFilterFactory (\W+ => " ")
KeywordTokenizerFactory
StopFilterFactory (~40 words in stoplist)
ASCIIFoldingFilterFactory
LowerCaseFilterFactory
NGramFilterFactory
LengthFilterFactory (min:3, max:512)

Devon Baumgarten

-Original Message-
From: Glen Newton [mailto:glen.new...@gmail.com]
Sent: Wednesday, February 22, 2012 9:24 AM
To: solr-user@lucene.apache.org
Subject: Re: Unusually long data import time?

Import times will depend on:
- hardware (speed of disks, cpu, # of cpus, amount of memory, etc)
- Java configuration (heap size, etc)
- Lucene/Solr configuration (many ...)
- Index configuration - how many fields, indexed how; faceting, etc
- OS configuration (this usually to a lesser degree; _usually_)
- Network issues if non-local
- DB configuration (driver, etc)

If you can give more information about the above, people on this list should be able to better indicate whether 18 hours sounds right for your situation.

-Glen Newton

On Wed, Feb 22, 2012 at 10:14 AM, Devon Baumgarten dbaumgar...@nationalcorp.com wrote:

Hello, Would it be unusual for an import of 160 million documents to take 18 hours? Each document is less than 1kb and I have the DataImportHandler using the jdbc driver to connect to SQL Server 2008. The full-import query calls a stored procedure that contains only a select from my target table. Is there any way I can speed this up? I saw recently someone on this list suggested a new user could get all their Solr data imported in under an hour. I sure hope that's true!

Devon Baumgarten

--
-
http://zzzoot.blogspot.com/
-
How to check if a field is a multivalue field with java
Hello, is there any way to check if a field of a SolrDocument is a multivalued field with Java (SolrJ)? Greets Thomas
Re: Solr HBase - Re: How is Data Indexed in HBase?
Mr Gupta, Thanks so much for your reply! In my use cases, retrieving data by keyword is one of them. I think Solr is a proper choice. However, Solr does not provide complex enough support for ranking. And frequent updating is also not suitable in Solr. So it is difficult to retrieve data randomly based on values other than keyword frequency in text. In this case, I attempt to use HBase. But I don't know how HBase supports high performance when it needs to keep consistency in a large-scale distributed system. Now both of them are used in my system. I will check out ElasticSearch. Best regards, Bing

On Thu, Feb 23, 2012 at 1:35 AM, T Vinod Gupta tvi...@readypulse.com wrote:

Bing, It's a classic battle on whether to use Solr or HBase or a combination of both. Both systems are very different, but there is some overlap in their utility. They also differ vastly when it comes to computation power, storage needs, etc. So in the end, it all boils down to your use case. You need to pick the technology that is best suited to your needs. I'm still not clear on your use case though. Btw, if you haven't started using Solr yet, you might want to check out ElasticSearch. I spent over a week researching between Solr and ES and eventually chose ES due to its cool merits. thanks

On Wed, Feb 22, 2012 at 9:31 AM, Ted Yu yuzhih...@gmail.com wrote:

There is no secondary index support in HBase at the moment. It's on our road map. FYI

On Wed, Feb 22, 2012 at 9:28 AM, Bing Li lbl...@gmail.com wrote:

Jacques, Yes. But I still have questions about that. In my system, when users search with a keyword arbitrarily, the query is forwarded to Solr. No updating operations other than appending new indexes exist in the Solr-managed data. When I need to retrieve data based on ranking values, HBase is used. And the ranking values need to be updated all the time. Is that correct? My question is that the performance must be low if keeping consistency in a large-scale distributed environment. How does HBase handle this issue? Thanks so much! Bing

On Thu, Feb 23, 2012 at 1:17 AM, Jacques whs...@gmail.com wrote:

It is highly unlikely that you could replace Solr with HBase. They're really apples and oranges.

On Wed, Feb 22, 2012 at 1:09 AM, Bing Li lbl...@gmail.com wrote:

Dear all, I wonder how data in HBase is indexed? Now Solr is used in my system because data is managed in an inverted index. Such an index is suitable for retrieving unstructured and huge amounts of data. How does HBase deal with the issue? May I replace Solr with HBase? Thanks so much! Best regards, Bing
Re: Unusually long data import time?
In my first try with the DIH, I had several sub-entities and it was making six queries per document. My 20M doc load was going to take many hours, most of a day. I rewrote it to eliminate those, and now it makes a single query for the whole load and takes 70 minutes. These are small documents, just the metadata for each book.

wunder
Search Guy
Chegg

On Feb 22, 2012, at 9:41 AM, Devon Baumgarten wrote:

I changed the heap size (Xmx1582m was as high as I could go). The import is at about 5% now, and from that I now estimate about 13 hours. It's hard to say though.. it keeps going up little by little. If I get approval to use Solr for this project, I'll have them install a 64-bit JVM instead, but is there anything else I can do?

Devon Baumgarten
Re: How to merge an autofacet with a predefined facet
If you use the suggested solution, it will detect the words at indexing time. However, Solr's FilterFactory lifecycle keeps no track of whether a file for synonyms, keywords etc. has been changed since Solr's last startup. Therefore a change within these files is not visible until you reload your core. Furthermore, keywords for old documents aren't added automatically if you change your keywords (and reload the core) - you have to write a routine that finds documents matching the new keywords and reindex those documents.

Example: Your keyword list at time t1 contains two words:

keyword
codeword

You are indexing two documents:

doc1: {content: "I am about a secret codeword."}
doc2: {content: "Happy keyword and the gang."}

Your filter will mark codeword in doc1 and keyword in doc2 as words to keep and remove everything else. Therefore their content for your keep-word field contains only:

doc1: {indexedContent: "codeword"}
doc2: {indexedContent: "keyword"}

However, if you add the word gang to your keyword list AND reload your SolrCore, doc2 will still only contain the term keyword until it gets reindexed.

Kind regards, Em

Am 22.02.2012 17:56, schrieb Xavier:

I'm not sure I understand your solution. When (and how) does the 'word' detection in the full text happen: beforehand (on my own) or during (with) Solr indexing?
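For completeness, a minimal sketch of the kind of keep-word analyzer described above (the type name and file name are illustrative, not taken from your schema):

<fieldType name="keepword_text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- keeps only the tokens listed in keywords.txt and discards everything else -->
    <filter class="solr.KeepWordFilterFactory" words="keywords.txt" ignoreCase="true"/>
  </analyzer>
</fieldType>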
Re: How to check if a field is a multivalue field with java
Hi Thomas, With Java (from within a custom handler in Solr) you can get a handle to the IndexSchema from the request, like so:

IndexSchema schema = req.getSchema();
SchemaField sf = schema.getField(fieldname);
boolean isMultiValued = sf.multiValued();

From within SolrJ code, you can use SolrDocument.getFieldValue(), which returns an Object, so you could do an instanceof check - if it's a Collection, it's multivalued, else not.

Object value = sdoc.getFieldValue(fieldname);
boolean isMultiValued = value instanceof Collection;

At least this is what I do; I don't think there is a way to get a handle to the IndexSchema object over SolrJ...

-sujit

On Feb 22, 2012, at 9:41 AM, tschiela wrote:

Hello, is there any way to check if a field of a SolrDocument is a multivalued field with Java (SolrJ)? Greets Thomas
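Put together as a small self-contained sketch (the document and field name are placeholders; note that this only inspects what came back in the response, not the schema):

import java.util.Collection;
import org.apache.solr.common.SolrDocument;

public class MultiValuedCheck {
    // true if the stored value for this field was returned as a collection of values
    public static boolean isMultiValued(SolrDocument sdoc, String fieldname) {
        Object value = sdoc.getFieldValue(fieldname);
        return value instanceof Collection;
    }
}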
Re: How to merge an autofacet with a predefined facet
Btw: Solr has no downtime while reloading the core. It loads the new core and, while loading the new one, it still serves requests with the old one. When the new one is ready (and warmed up), it finally replaces the old core.

Best, Em

Am 22.02.2012 17:56, schrieb Xavier:

I'm not sure I understand your solution. When (and how) does the 'word' detection in the full text happen: beforehand (on my own) or during (with) Solr indexing?
Re: Problem parsing queries with forward slashes and multiple fields
Yury, are you sure your request has proper URL encoding?

Kind regards, Em

Am 22.02.2012 18:25, schrieb Yury Kats:

I'm running into a problem with queries that contain forward slashes and more than one field. For example, these queries work fine: fieldName:/a fieldName:/* But if I have two fields with similar syntax in the same query, it fails. For simplicity, I'm using the same field twice: fieldName:/a fieldName:/a
Re: Problem parsing queries with forward slashes and multiple fields
On 2/22/2012 12:25 PM, Yury Kats wrote: I'm running into a problem with queries that contain forward slashes and more than one field. For example, these queries work fine: fieldName:/a fieldName:/* But if I have two fields with similar syntax in the same query, it fails. For simplicity, I'm using the same field twice: fieldName:/a fieldName:/a Looks like escaping forward slashes makes the query work, eg fieldName:\/a fieldName:\/a This is a bit puzzling as the forward slash is not part of the query language, is it?
Re: Problem parsing queries with forward slashes and multiple fields
On 2/22/2012 1:05 PM, Em wrote: Yury, are you sure your request has a proper url-encoding? Yes
Re: Solr HBase - Re: How is Data Indexed in HBase?
Solr does not provide a complex enough support to rank.

I believe Solr has a bunch of pluggability to write your own custom ranking approach. If you think you can't do your desired ranking with Solr, you're probably wrong and need to ask for help from the Solr community.

retrieving data by keyword is one of them. I think Solr is a proper choice

The key to keyword retrieval is the construction of the data. Among other things, this is one of the key things that Solr is very good at: creating a very efficient organization of the data so that you can retrieve quickly. At their core, Solr, ElasticSearch, Lily and Katta all use Lucene to construct this data. HBase is bad at this.

how HBase support high performance when it needs to keep consistency in a large scale distributed system

HBase is primarily built for retrieving a single row at a time based on a predetermined and known location (the key). It is also very efficient at splitting massive datasets across multiple machines and allowing sequential batch analyses of these datasets. HBase can maintain high performance in this way because consistency only ever exists at the row level. This is what HBase is good at.

You need to focus on what you're doing and then write it out. Figure out how you think the pieces should work together. Read the documentation. Then ask specific questions where you feel the documentation is unclear or you feel confused. Your general questions are very difficult to answer in any kind of really helpful way.

thanks, Jacques

On Wed, Feb 22, 2012 at 9:51 AM, Bing Li lbl...@gmail.com wrote:

Mr Gupta, Thanks so much for your reply! In my use cases, retrieving data by keyword is one of them. I think Solr is a proper choice.
Re: Problem parsing queries with forward slashes and multiple fields
2012/2/22 Yury Kats yuryk...@yahoo.com: On 2/22/2012 12:25 PM, Yury Kats wrote: I'm running into a problem with queries that contain forward slashes and more than one field. For example, these queries work fine: fieldName:/a fieldName:/* But if I have two fields with similar syntax in the same query, it fails. For simplicity, I'm using the same field twice: fieldName:/a fieldName:/a Looks like escaping forward slashes makes the query work, eg fieldName:\/a fieldName:\/a This is a bit puzzling as the forward slash is not part of the query language, is it? Regex queries were added that use forward slashes: https://issues.apache.org/jira/browse/LUCENE-2604 -Yonik lucidimagination.com
Re: Unusually long data import time?
Would it be unusual for an import of 160 million documents to take 18 hours? Each document is less than 1kb and I have the DataImportHandler using the jdbc driver to connect to SQL Server 2008. The full-import query calls a stored procedure that contains only a select from my target table. Is there any way I can speed this up? I saw recently someone on this list suggested a new user could get all their Solr data imported in under an hour. I sure hope that's true!

Do you have autoCommit or autoSoftCommit configured in solrconfig.xml?
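For reference, this is the sort of block meant here in solrconfig.xml (the thresholds are only examples):

<updateHandler class="solr.DirectUpdateHandler2">
  <!-- commit automatically after this many added docs or this many milliseconds -->
  <autoCommit>
    <maxDocs>100000</maxDocs>
    <maxTime>600000</maxTime>
  </autoCommit>
</updateHandler>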
Re: Problem parsing queries with forward slashes and multiple fields
That's strange. Could you provide a sample dataset? I'd like to try it out. Kind regards, Em Am 22.02.2012 19:17, schrieb Yury Kats: On 2/22/2012 1:05 PM, Em wrote: Yury, are you sure your request has a proper url-encoding? Yes
Re: Problem parsing queries with forward slashes and multiple fields
On 2/22/2012 1:25 PM, Em wrote: That's strange. Could you provide a sample dataset? Data set does not matter. The query fails to parse, long before it gets to the data.
Re: Problem parsing queries with forward slashes and multiple fields
On 2/22/2012 1:24 PM, Yonik Seeley wrote: This is a bit puzzling as the forward slash is not part of the query language, is it? Regex queries were added that use forward slashes: https://issues.apache.org/jira/browse/LUCENE-2604 Oh, so / is a special character now? I don't think it is mentioned as such on any of the wiki pages, or in org.apache.solr.client.solrj.util.ClientUtils
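In the meantime, escaping the slash by hand on the client side works; a minimal sketch (the helper class is hypothetical, and whether your snapshot's ClientUtils.escapeQueryChars already covers '/' is worth checking):

public class SlashEscape {
    // escape '/' so the classic query parser does not treat the term as a regex
    public static String escapeSlashes(String s) {
        return s.replace("/", "\\/");
    }

    public static void main(String[] args) {
        String term = escapeSlashes("/a");
        // prints: fieldName:\/a fieldName:\/a
        System.out.println("fieldName:" + term + " fieldName:" + term);
    }
}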
RE: Unusually long data import time?
Ahmet, I do not. I commented autoCommit out.

Devon Baumgarten

-Original Message-
From: Ahmet Arslan [mailto:iori...@yahoo.com]
Sent: Wednesday, February 22, 2012 12:25 PM
To: solr-user@lucene.apache.org
Subject: Re: Unusually long data import time?

Do you have autoCommit or autoSoftCommit configured in solrconfig.xml?
maxClauseCount error
Hi, I am suddenly getting a maxClauseCount error and don't know why. I am using Solr 3.5.
maxClauseCount Exception
Hi, I am suddenly getting a maxClauseCount exception for no reason. I am using Solr 3.5. I have only 206 documents in my index. Any ideas? This is weird.

QUERY PARAMS: [hl, hl.snippets, hl.simple.pre, hl.simple.post, fl, hl.mergeContiguous, hl.usePhraseHighlighter, hl.requireFieldMatch, echoParams, hl.fl, q, rows, start]|#]

[#|2012-02-22T13:40:13.129-0500|INFO|glassfish3.1.1|org.apache.solr.core.SolrCore|_ThreadID=22;_ThreadName=Thread-2;|[] webapp=/solr3 path=/select params={hl=true&hl.snippets=4&hl.simple.pre=<b>&hl.simple.post=</b>&fl=*,score&hl.mergeContiguous=true&hl.usePhraseHighlighter=true&hl.requireFieldMatch=true&echoParams=all&hl.fl=text_t&q={!lucene+q.op%3DOR+df%3Dtext_t}+(+kind_s:doc+OR+kind_s:xml)+AND+(type_s:[*+TO+*])+AND+(usergroup_sm:admin)&rows=20&start=0&wt=javabin&version=2} hits=204 status=500 QTime=166 |#]

[#|2012-02-22T13:40:13.131-0500|SEVERE|glassfish3.1.1|org.apache.solr.servlet.SolrDispatchFilter|_ThreadID=22;_ThreadName=Thread-2;|org.apache.lucene.search.BooleanQuery$TooManyClauses: maxClauseCount is set to 1024
at org.apache.lucene.search.BooleanQuery.add(BooleanQuery.java:136)
at org.apache.lucene.search.BooleanQuery.add(BooleanQuery.java:127)
at org.apache.lucene.search.ScoringRewrite$1.addClause(ScoringRewrite.java:51)
at org.apache.lucene.search.ScoringRewrite$1.addClause(ScoringRewrite.java:41)
at org.apache.lucene.search.ScoringRewrite$3.collect(ScoringRewrite.java:95)
at org.apache.lucene.search.TermCollectingRewrite.collectTerms(TermCollectingRewrite.java:38)
at org.apache.lucene.search.ScoringRewrite.rewrite(ScoringRewrite.java:93)
at org.apache.lucene.search.MultiTermQuery.rewrite(MultiTermQuery.java:304)
at org.apache.lucene.search.highlight.WeightedSpanTermExtractor.extract(WeightedSpanTermExtractor.java:158)
at org.apache.lucene.search.highlight.WeightedSpanTermExtractor.extract(WeightedSpanTermExtractor.java:98)
at org.apache.lucene.search.highlight.WeightedSpanTermExtractor.getWeightedSpanTerms(WeightedSpanTermExtractor.java:385)
at org.apache.lucene.search.highlight.QueryScorer.initExtractor(QueryScorer.java:217)
at org.apache.lucene.search.highlight.QueryScorer.init(QueryScorer.java:185)
at org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(Highlighter.java:205)
at org.apache.solr.highlight.DefaultSolrHighlighter.doHighlightingByHighlighter(DefaultSolrHighlighter.java:490)
at org.apache.solr.highlight.DefaultSolrHighlighter.doHighlighting(DefaultSolrHighlighter.java:401)
at org.apache.solr.handler.component.HighlightComponent.process(HighlightComponent.java:131)
at org.apache.so
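The trace shows the highlighter rewriting a multi-term query into a BooleanQuery, which is where the 1024-clause limit is hit. If simply raising the cap is an acceptable workaround (it does not explain why the query expands so far), that limit lives in solrconfig.xml:

<!-- global cap on the number of clauses a rewritten BooleanQuery may contain; 4096 is only an example -->
<maxBooleanClauses>4096</maxBooleanClauses>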
Re: Problem parsing queries with forward slashes and multiple fields
On 2/22/2012 1:24 PM, Yonik Seeley wrote: Looks like escaping forward slashes makes the query work, eg fieldName:\/a fieldName:\/a This is a bit puzzling as the forward slash is not part of the query language, is it? Regex queries were added that use forward slashes: https://issues.apache.org/jira/browse/LUCENE-2604 Looks like regex matching happens across multiple fields though. Feels like a bug to me?
Re: Unusually long data import time?
Devon, you ought to try updating from many threads (I do not know if DIH can do it - check), but Lucene does a great job if fed from many update threads... It depends on where your time gets lost, but it is usually a) the analysis chain or b) the database. If it is a) and your server has spare CPU cores, you can scale at roughly the number of cores.

On Wed, Feb 22, 2012 at 7:41 PM, Devon Baumgarten dbaumgar...@nationalcorp.com wrote:

Ahmet, I do not. I commented autoCommit out.

Devon Baumgarten
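if you feed Solr from your own client instead of DIH, SolrJ already gives you multi-threaded feeding; a sketch (the URL, queue size and thread count are just examples):

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class ParallelFeeder {
    public static void main(String[] args) throws Exception {
        // buffers up to 10000 docs and sends them to Solr from 4 parallel threads
        SolrServer server = new StreamingUpdateSolrServer("http://localhost:8983/solr", 10000, 4);
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "1");
        server.add(doc);
        server.commit();
    }
}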
dih and solr cloud
out of curiosity, trying to see if the new cloud features can replace what I use now... how is (batch) update forwarding solved at the cloud level? Imagine a simple one shard, one replica case: if I fire up a DIH update, is this going to be replicated to the replica shard? If yes, is it going to be:
- sent document by document (network; imagine 100Mio+ update commands going to a replica for big batches)
- somehow batched into packages to reduce load
- distributed at index level somehow

This is an important case, solved today with master/slave Solr replication, but it is not mentioned at http://wiki.apache.org/solr/SolrCloud
Re: Fields, Facets, and Search Results
Well, you probably need to clear your index first: remove the index directory, restart your server, and try again. Let me know if it works or not.
Re: Fields, Facets, and Search Results
And check your log file; you may have some errors at the start of your server, due to some mistake - bad syntax in your schema file, for example...
org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out
Hi, I am getting the below error while running a delta import, and my index is not updated. Could you please let me know what might be causing this issue? I am using Solr 3.5, and around 60+ documents are supposed to be updated by the delta import.

[org.apache.solr.handler.dataimport.SolrWriter] - Error creating document : SolrInputDocument[...]
org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: NativeFSLock@/var/solr/data/5159200/index/write.lock
at org.apache.lucene.store.Lock.obtain(Lock.java:84)
at org.apache.lucene.index.IndexWriter.init(IndexWriter.java:1108)
at org.apache.solr.update.SolrIndexWriter.init(SolrIndexWriter.java:83)
at org.apache.solr.update.UpdateHandler.createMainIndexWriter(UpdateHandler.java:101)
at org.apache.solr.update.DirectUpdateHandler2.openWriter(DirectUpdateHandler2.java:171)
at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:219)
at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:61)
at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:115)
at org.apache.solr.handler.dataimport.SolrWriter.upload(SolrWriter.java:73)
at org.apache.solr.handler.dataimport.DataImportHandler$1.upload(DataImportHandler.java:293)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:636)
at org.apache.solr.handler.dataimport.DocBuilder.doDelta(DocBuilder.java:303)
at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:179)
at org.apache.solr.handler.dataimport.DataImporter.doDeltaImport(DataImporter.java:390)
at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:429)
at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:408)
Re: org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out
Hi Uomesh, I was facing similar issues a few days ago and was able to resolve it by deleting the lock file created in the index directory and restarting my Solr server. I have documented the same in one of the posts at http://www.params.me/2011/12/solr-index-lock-issue.html

Hope it helps!
-param

On 2/22/12 2:36 PM, Uomesh uom...@gmail.com wrote:

Hi, I am getting the below error while running a delta import, and my index is not updated. Could you please let me know what might be causing this issue? I am using Solr 3.5, and around 60+ documents are supposed to be updated by the delta import.

[org.apache.solr.handler.dataimport.SolrWriter] - Error creating document : SolrInputDocument[...]
org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: NativeFSLock@/var/solr/data/5159200/index/write.lock
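If stale locks after crashes keep recurring, Solr 3.x can also clear them for you at startup. Use this with care, and only when you are certain no other process is writing to the index (it goes in the mainIndex section of solrconfig.xml):

<!-- removes a leftover write.lock when the core starts; dangerous if another writer is still live -->
<unlockOnStartup>true</unlockOnStartup>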
Re: Solr Performance Improvement and degradation Help
As an update to this... I tried running a query against the 4.0.0.2010.12.10.08.54.56 version and the newer 4.0.0.2012.02.16 (both on the same box). So the query params were the same and the returned results were the same, but the 4.0.0.2010.12.10.08.54.56 version returned the results in about 1.6 seconds and the newer (4.0.0.2012.02.16) version returned the results in about 4 seconds. If I add the wildcard field list to the newer version, the time increases anywhere from .5-1 second. These are all averages after running the queries several times over a 30 minute period (allowing for warming and cache). Anybody have any insight into why the newer versions are performing a bit slower?
Re: solr 3.5 and indexing performance
I have it all commented out in the updateHandler; I'm pretty sure there is no default autoCommit:

<updateHandler class="solr.DirectUpdateHandler2">

iorixxx wrote:

I wanted to switch to a new version of Solr, exactly 3.5, but I'm getting a big drop in indexing speed.

Could it be the autoCommit configuration in solrconfig.xml?
result present in Solr 1.4, but missing in Solr 3.5, dismax only
I am working on upgrading Solr from 1.4 to 3.5, and I have hit a problem. I have a test checking for a search result in Solr, and the test passes in Solr 1.4 but fails in Solr 3.5. Dismax is the desired QueryParser -- I just included output from the lucene QueryParser to prove the document exists and is found. I am completely stumped. Here are the debugQuery details:

***Solr 3.5***

lucene QueryParser:
URL: q=all_search:"The Beatles as musicians : Revolver through the Anthology"
final query: all_search:"the beatl as musician revolv through the antholog"
6.0562754 = (MATCH) weight(all_search:"the beatl as musician revolv through the antholog" in 1064395), product of:
  1.0 = queryWeight(all_search:"the beatl as musician revolv through the antholog"), product of:
    48.450203 = idf(all_search: the=3531140 beatl=398 as=645923 musician=11805 revolv=872 through=81366 the=3531140 antholog=11611)
    0.02063975 = queryNorm
  6.0562754 = fieldWeight(all_search:"the beatl as musician revolv through the antholog" in 1064395), product of:
    1.0 = tf(phraseFreq=1.0)
    48.450203 = idf(all_search: the=3531140 beatl=398 as=645923 musician=11805 revolv=872 through=81366 the=3531140 antholog=11611)
    0.125 = fieldNorm(field=all_search, doc=1064395)

dismax QueryParser:
URL: qf=all_search&pf=all_search&q="The Beatles as musicians : Revolver through the Anthology"
final query: +(all_search:"the beatl as musician revolv through the antholog"~1)~0.01 (all_search:"the beatl as musician revolv through the antholog"~3)~0.01
(no matches)

***Solr 1.4***

lucene QueryParser:
URL: q=all_search:"The Beatles as musicians : Revolver through the Anthology"
final query: all_search:"the beatl as musician revolv through the antholog"
5.2676983 = fieldWeight(all_search:"the beatl as musician revolv through the antholog" in 3469163), product of:
  1.0 = tf(phraseFreq=1.0)
  48.16181 = idf(all_search: the=3542123 beatl=391 as=749890 musician=11955 revolv=820 through=88238 the=3542123 antholog=11205)
  0.109375 = fieldNorm(field=all_search, doc=3469163)

dismax QueryParser:
URL: qf=all_search&pf=all_search&q="The Beatles as musicians : Revolver through the Anthology"
final query: +(all_search:"the beatl as musician revolv through the antholog"~1)~0.01 (all_search:"the beatl as musician revolv through the antholog"~3)~0.01
score: 7.449651 = (MATCH) sum of:
  3.7248254 = weight(all_search:"the beatl as musician revolv through the antholog"~1 in 3469163), product of:
    0.7071068 = queryWeight(all_search:"the beatl as musician revolv through the antholog"~1), product of:
      48.16181 = idf(all_search: the=3542123 beatl=391 as=749890 musician=11955 revolv=820 through=88238 the=3542123 antholog=11205)
      0.014681898 = queryNorm
    5.2676983 = fieldWeight(all_search:"the beatl as musician revolv through the antholog" in 3469163), product of:
      1.0 = tf(phraseFreq=1.0)
      48.16181 = idf(all_search: the=3542123 beatl=391 as=749890 musician=11955 revolv=820 through=88238 the=3542123 antholog=11205)
      0.109375 = fieldNorm(field=all_search, doc=3469163)
  3.7248254 = weight(all_search:"the beatl as musician revolv through the antholog"~3 in 3469163), product of:
    0.7071068 = queryWeight(all_search:"the beatl as musician revolv through the antholog"~3), product of:
      48.16181 = idf(all_search: the=3542123 beatl=391 as=749890 musician=11955 revolv=820 through=88238 the=3542123 antholog=11205)
      0.014681898 = queryNorm
    5.2676983 = fieldWeight(all_search:"the beatl as musician revolv through the antholog" in 3469163), product of:
      1.0 = tf(phraseFreq=1.0)
      48.16181 = idf(all_search: the=3542123 beatl=391 as=749890 musician=11955 revolv=820 through=88238 the=3542123 antholog=11205)
      0.109375 = fieldNorm(field=all_search, doc=3469163)
RE: Unusually long data import time?
Thank you everyone for your patience and suggestions. It turns out I was doing something really unreasonable in my schema. I mistakenly edited the max EdgeNgram size to 512, when I meant to set the lengthFilter max to 512. I brought this to a more reasonable number, and my estimated time to import is now down to 4 hours. Based on the size of my record set, this time is more consistent with Walter's observations in his own project.

Thanks again for your help,
Devon Baumgarten
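For anyone who hits the same thing, this is the shape of the correction (the gram sizes are illustrative; the point is that the 512 belongs on the length filter, not on the n-gram):

<!-- wrong: maxGramSize=512 generates enormous numbers of grams per token -->
<!-- right: keep the grams short and cap token length separately -->
<filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15"/>
<filter class="solr.LengthFilterFactory" min="3" max="512"/>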
Re: nutch and solr
Thanks for your reply, but it doesn't work. I get the same message: can't convert empty path, and additionally: cannot find class org.apache.nutch.crawl.injector ..

On 22 February 2012 06:14, tamanjit.bin...@yahoo.co.in tamanjit.bin...@yahoo.co.in wrote:

Try this command:

bin/nutch crawl urls/<folder name>/<url file>.txt -dir crawl/<folder name> -threads 10 -depth 2 -topN 1000

Your folder structure will look like this:

nutch folder -- urls -- <folder name> -- <url file>.txt
             |
             -- crawl -- <folder name>

The folder name will be for different domains. So for each domain folder in the urls folder there has to be a corresponding folder (with the same name) in the crawl folder.
Re: 'location' fieldType indexation impossible
Make sure that your schema file is exactly the same on both your local server and the remote server. In particular, there should be a dynamic field definition like:

<dynamicField name="*_coordinate" type="tdouble" indexed="true" stored="false"/>

and you should see a couple of fields appear, like emploi_city_geoloc_0_coordinate and emploi_city_geoloc_1_coordinate, when you index a location type in the field you indicated. This has tripped me up in the past.

If that doesn't apply, then you need to provide more information: more of the stack trace, what you've tried, etc. Because saying "I really don't understand why it isn't working because it was working on my local server with the same configuration (Solr 3.5.0) and the same database !!!" is another way of saying "Something's different between the two versions, I just don't know what yet" G...

So I'd start (make a backup first) by just copying my entire configuration from my local machine to the remote one, restarting Solr and trying again.

Best
Erick

On Wed, Feb 22, 2012 at 5:53 AM, Xavier xav...@audivox.fr wrote:

Hi, When I try to index my location field I get this error for each document:

ATTENTION: Error creating document : Error adding field 'emploi_city_geoloc'='48.85,2.5525'

(so I have 0 files indexed). Here is my schema.xml:

<field name="emploi_city_geoloc" type="location" indexed="true" stored="false"/>

I really don't understand why it isn't working, because it was working on my local server with the same configuration (Solr 3.5.0) and the same database !!! If I try to use geohash instead of location, it works for indexing, but then my geodist query on the front end no longer works... Any ideas?

Best regards, Xavier
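Once the dynamic field is in place and documents index cleanly, a typical Solr 3.x spatial query against that field looks something like this (the point and distance are only examples):

http://localhost:8983/solr/select?q=*:*&sfield=emploi_city_geoloc&pt=48.85,2.5525&fq={!geofilt d=10}&sort=geodist() asc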
Re: Same id on two shards
Hi, I stumbled across this thread after running into the same question. The answers presented here seem a little vague and I was hoping to renew the discussion. I am using a branch of Solr 4, distributed searching over 12 shards. I want the documents in the first shard to always be selected over documents that appear in the other 11 shards. The queries to these shards look something like this:

http://solrserver/shard_1_app/select?shards=solr_server:/shard_1_app/,solr_server:/shard_2_app, ... ,solr_server:/shard_12_app&q=id:

When I execute a query for an ID that I know exists in shard_1 and another shard, I do always get the result from shard 1. Here are some questions that I have:

1. Has anyone rigorously tested the comment in the wiki: "If docs with duplicate unique keys are encountered, Solr will make an attempt to return valid results, but the behavior may be non-deterministic"?
2. Who is relying on this behavior (the document of the first shard is returned) today? When do you notice the wrong document is selected? Do you have a feeling for how frequently your distributed search returns the document from a shard other than the first?
3. Is there a good web source other than the Solr wiki for information about Solr distributed queries?

Thanks, Jerry M.

On Mon, Aug 8, 2011 at 7:41 PM, simon mtnes...@gmail.com wrote:

I think the first one to respond is indeed the way it works, but that's only deterministic up to a point (if your small index is in the throes of a commit and everything required for a response happens to be cached on the larger shard ... who knows?)

On Mon, Aug 8, 2011 at 7:10 PM, Shawn Heisey s...@elyograg.org wrote:

On 8/8/2011 4:07 PM, simon wrote:

Only one should be returned, but it's non-deterministic. See http://wiki.apache.org/solr/DistributedSearch#Distributed_Searching_Limitations

I had heard it was based on which one responded first. This is part of why we have a small index that contains the newest content and only distribute content to the other shards once a day. The hope is that the small index (less than 1GB, fits into RAM on that virtual machine) will always respond faster than the other larger shards (over 18GB each). Is this an incorrect assumption on our part? The build system does do everything it can to ensure that periods of overlap are limited to the time it takes to commit a change across all of the shards, which should amount to just a few seconds once a day. There might be situations when the index gets out of whack and we have duplicate id values for a longer time period, but in practice it hasn't happened yet.

Thanks, Shawn
need to support bi-directional synonyms
hello all, i need to support the following: if the user enters sprayer in the desc field, then they get results for BOTH sprayer and washer. And in the other direction: if the user enters washer in the desc field, then they get results for BOTH washer and sprayer. Would I set up my synonym file like this, assuming expand = true?

sprayer => washer
washer => sprayer

thank you, mark
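Or, if I read the SynonymFilterFactory docs right, a single comma-separated group may already be bidirectional when expand is true:

# synonyms.txt - with expand="true", every term in the group maps to all terms in it
sprayer, washer

with the filter configured as:

<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>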
Trunk build errors
Hi, I am getting numerous errors preventing a build of solrcloud trunk:

[licenses] MISSING LICENSE for the following file:

Any tips to get a clean build working? thanks
Re: Fast Vector Highlighter Working for some records only
Hi dhaivat, I think you may want to use analysis.jsp: http://localhost:8983/solr/admin/analysis.jsp Go to the URL, look into how your custom tokenizer produces tokens, and compare with the output of Solr's inbuilt tokenizer.

koji
--
Query Log Visualizer for Apache Solr
http://soleami.com/

(12/02/22 21:35), dhaivat wrote:

Koji Sekiguchi wrote (12/02/22 11:58):

dhaivat wrote: Thanks for the reply, but can you please tell me why it's working for some documents and not for others?

As Solr 1.4.1 cannot recognize the hl.useFastVectorHighlighter flag, Solr just ignores it, but because hl=true is there, Solr tries to create highlight snippets by using the (existing; traditional; I mean not FVH) Highlighter. Highlighter (including FVH) cannot produce snippets sometimes for some reasons; you can use the hl.alternateField parameter. http://wiki.apache.org/solr/HighlightingParameters#hl.alternateField

koji

Thank you so much for the explanation. I have updated my Solr version and am using 3.5. Could you please tell me: when I am using a custom Tokenizer on the field, do I need to make any changes related to the Solr highlighter? Here is my custom analyser:

<fieldType name="custom_text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="ns.solr.analyser.CustomIndexTokeniserFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="ns.solr.analyser.CustomSearcherTokeniserFactory"/>
  </analyzer>
</fieldType>

Here is the field info:

<field name="contents" type="custom_text" indexed="true" stored="true" multiValued="true" termPositions="true" termVectors="true" termOffsets="true"/>

I am creating tokens using my custom analyser, and when I try to use the highlighter it's not working properly for the contents field. But when I tried Solr's inbuilt tokeniser, I found the word highlighted for a particular query. Can you please help me out with this?

Thanks in advance
Dhaivat
Re: result present in Solr 1.4, but missing in Solr 3.5, dismax only
So I don't really know what I'm talking about, and I'm not really sure if it's related or not, but your particular query: The Beatles as musicians : Revolver through the Anthology With the lone word that's a ':', reminds me of a dismax stopwords-type problem I ran into. Now, I ran into it on 1.4. I don't know why it would be different on 1.4 and 3.x. And I see you aren't even using a multi-field dismax in your sample query, so it couldn't possibly be what I ran into... I don't think. But I'll write this anyway in case it gives someone some ideas. The problem I ran into is caused by different analysis in two fields both used in a dismax, one that ends up keeping : as a token, and one that doesn't. Which ends up having the same effect as the famous 'dismax stopwords problem'. Maybe somehow your schema changed such to produce this problem in 3.x but not in 1.4? Although again I realize the fact that you are only using a single field in your demo dismax query kind of suggests it's not this problem. Wonder if you try the query without the :, if the problem goes away, that might be a hint. Or, maybe someone more skilled at understanding what's in those Solr debug statements than I am (it's kind of all greek to me) will be able to take this hint and rule out or confirm that it may have something to do with your problem. Here I write up the issue I ran into (which may or may not have anything to do with what you ran into) http://bibwild.wordpress.com/2011/06/15/more-dismax-gotchas-varying-field-analysis-and-mm/ Also, you don't say what your 'mm' is in your dismax queries, that could be relevant if it's got anything to do with anything similar to the issue I'm talking about. Hmm, I wonder if Solr 3.x changes the way dismax calculates number of tokens for 'mm' in such a way that the 'varying field analysis dismax gotcha' can manifest with only one field, if the way dismax counts tokens for 'mm' differs from number of tokens the single field's analysis produces? Jonathan On 2/22/2012 2:55 PM, Naomi Dushay wrote: I am working on upgrading Solr from 1.4 to 3.5, and I have hit a problem. I have a test checking for a search result in Solr, and the test passes in Solr 1.4, but fails in Solr 3.5. Dismax is the desired QueryParser -- I just included output from lucene QueryParser to prove the document exists and is found I am completely stumped. 
Re: result present in Solr 1.4, but missing in Solr 3.5, dismax only
I forgot to include the field definition information:

schema.xml:

<field name="all_search" type="text" indexed="true" stored="false"/>

Solr 3.5:

<fieldtype name="text" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="1" generateWordParts="1" catenateWords="1" splitOnNumerics="0" generateNumberParts="1" catenateNumbers="1" catenateAll="0" preserveOriginal="0" stemEnglishPossessive="1"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldtype>

Solr 1.4:

<fieldtype name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="schema.UnicodeNormalizationFilterFactory" version="icu4j" composed="false" remove_diacritics="true" remove_modifiers="true" fold="true"/>
    <filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="1" generateWordParts="1" catenateWords="1" splitOnNumerics="0" generateNumberParts="1" catenateNumbers="1" catenateAll="0" preserveOriginal="0" stemEnglishPossessive="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldtype>

And the analysis page shows the same results for Solr 3.5 and 1.4.

Solr 3.5:

position:    1     2     3     4        5      6       7     8
term text:   the   beatl as    musician revolv through the   antholog
keyword:     false false false false    false  false   false false
startOffset: 0     4     12    15       27     36      44    48
endOffset:   3     11    14    24       35     43      47    57
type:        word  word  word  word     word   word    word  word

Solr 1.4:

term position:    1    2     3     4        5      6       7     8
term text:        the  beatl as    musician revolv through the   antholog
term type:        word word  word  word     word   word    word  word
source start,end: 0,3  4,11  12,14 15,24    27,35  36,43   44,47 48,57

- Naomi
Re: String search in Dismax handler
Two things:

1> What version of Solr are you using? qt=dismax isn't going to any request handler, I don't think.
2> What do you get when you add &debugQuery=on? Try that with both results and perhaps that will shed some light. If not, can you post the results?

Best
Erick

On Wed, Feb 22, 2012 at 7:47 AM, mechravi25 mechrav...@yahoo.co.in wrote:

Hi, The string I am searching for is "Pass By Value". I am using qt=dismax (in the request query) as well. When I search the above string with the double quotes, the data is fetched, but the same query string without any double quotes gives no results. Following is the dismax request handler in solrconfig.xml:

<requestHandler name="dismax" class="solr.DisMaxRequestHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <str name="fl">id,score</str>
    <str name="q.alt">*:*</str>
    <str name="f.name.hl.fragsize">0</str>
    <str name="f.name.hl.alternateField">name</str>
    <str name="f.text.hl.fragmenter">regex</str>
  </lst>
</requestHandler>

The same query string works fine with and without double quotes when I use the default request handler. Following is the default request handler in solrconfig.xml:

<requestHandler name="standard" class="solr.StandardRequestHandler" default="true">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
  </lst>
</requestHandler>

Please provide some suggestions as to why the string search without quotes is returning no records when the dismax handler is used. Am I missing out on something? Thanks.
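One more thing worth checking, since the dismax handler config you pasted defines no qf: dismax scores against the fields listed in qf, so a minimal request usually looks something like this (the field names are guesses):

http://localhost:8983/solr/select?defType=dismax&qf=name+text&q=Pass+By+Value&debugQuery=on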
Re: result present in Solr 1.4, but missing in Solr 3.5, dismax only
Jonathan, I have the same problem without the colon - I tested that, but didn't mention it. mm can't be the issue either: in Solr 3.5, if I remove one of the occurrences of "the" (doesn't matter which), I get results. Removing any other word does NOT get results. And if the query isn't a phrase query, it gets results. And no, it can't be related to what you refer to as the dismax stopwords problem, since I can demonstrate the problem with a single field.

I have run into problems in the past with a non-alpha character surrounded by spaces tanking my search results for dismax … but I fixed that with this fieldType:

  <!-- single token with punctuation terms removed so dismax doesn't look for punctuation terms in these fields -->
  <!-- On client side, Lucene query parser breaks things up by whitespace *before* field analysis for dismax -->
  <!-- so punctuation terms ( : ;) are stopwords to allow results from other fields when these chars are surrounded by spaces in query -->
  <!-- do not lowercase -->
  <fieldType name="string_punct_stop" class="solr.TextField" omitNorms="true">
    <analyzer type="index">
      <tokenizer class="solr.KeywordTokenizerFactory" />
      <filter class="solr.ICUNormalizer2FilterFactory" name="nfkc" mode="compose" />
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.KeywordTokenizerFactory" />
      <filter class="solr.ICUNormalizer2FilterFactory" name="nfkc" mode="compose" />
      <!-- removing punctuation for Lucene query parser issues -->
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_punctuation.txt" enablePositionIncrements="true" />
    </analyzer>
  </fieldType>

My stopwords_punctuation.txt file is:

  # Punctuation characters we want to ignore in queries
  :
  ;
  /

and I used this type instead of string for fields in my dismax qf. Thus, the punctuation terms in the query are not present for the fields that were formerly string fields.

- Naomi

On Feb 22, 2012, at 3:41 PM, Jonathan Rochkind wrote:

So I don't really know what I'm talking about, and I'm not really sure if it's related or not, but your particular query: "The Beatles as musicians : Revolver through the Anthology" With the lone word that's a ':', reminds me of a dismax stopwords-type problem I ran into. Now, I ran into it on 1.4. I don't know why it would be different on 1.4 and 3.x. And I see you aren't even using a multi-field dismax in your sample query, so it couldn't possibly be what I ran into... I don't think. But I'll write this anyway in case it gives someone some ideas. The problem I ran into is caused by different analysis in two fields both used in a dismax, one that ends up keeping ':' as a token, and one that doesn't. Which ends up having the same effect as the famous 'dismax stopwords problem'. Maybe somehow your schema changed such as to produce this problem in 3.x but not in 1.4? Although again I realize the fact that you are only using a single field in your demo dismax query kind of suggests it's not this problem. Wonder if you try the query without the ':', if the problem goes away, that might be a hint. Or, maybe someone more skilled at understanding what's in those Solr debug statements than I am (it's kind of all greek to me) will be able to take this hint and rule out or confirm that it may have something to do with your problem.
Here I write up the issue I ran into (which may or may not have anything to do with what you ran into): http://bibwild.wordpress.com/2011/06/15/more-dismax-gotchas-varying-field-analysis-and-mm/

Also, you don't say what your 'mm' is in your dismax queries; that could be relevant if it's got anything to do with anything similar to the issue I'm talking about. Hmm, I wonder if Solr 3.x changes the way dismax calculates the number of tokens for 'mm' in such a way that the 'varying field analysis dismax gotcha' can manifest with only one field, if the way dismax counts tokens for 'mm' differs from the number of tokens the single field's analysis produces?

Jonathan

On 2/22/2012 2:55 PM, Naomi Dushay wrote:

I am working on upgrading Solr from 1.4 to 3.5, and I have hit a problem. I have a test checking for a search result in Solr, and the test passes in Solr 1.4, but fails in Solr 3.5. Dismax is the desired QueryParser -- I just included output from the lucene QueryParser to prove the document exists and is found. I am completely stumped. Here are the debugQuery details:

***Solr 3.5***

lucene QueryParser:
URL: q=all_search:"The Beatles as musicians : Revolver through the Anthology"
final query: all_search:"the beatl as musician revolv through the antholog"

6.0562754 = (MATCH) weight(all_search:"the beatl as musician revolv through the antholog" in 1064395), product of:
  1.0 = queryWeight(all_search:"the beatl as musician revolv through the antholog"), product of:
    48.450203 = idf(all_search:
Re: SnapPull failed :org.apache.solr.common.SolrException: Error opening new searcher
Looks like an issue around the replication IndexWriter reboot, soft commits, and hard commits. I think I've got a workaround for it:

Index: solr/core/src/java/org/apache/solr/handler/SnapPuller.java
===================================================================
--- solr/core/src/java/org/apache/solr/handler/SnapPuller.java (revision 1292344)
+++ solr/core/src/java/org/apache/solr/handler/SnapPuller.java (working copy)
@@ -499,6 +499,17 @@
       // reboot the writer on the new index and get a new searcher
       solrCore.getUpdateHandler().newIndexWriter();
+      Future[] waitSearcher = new Future[1];
+      solrCore.getSearcher(true, false, waitSearcher, true);
+      if (waitSearcher[0] != null) {
+        try {
+          waitSearcher[0].get();
+        } catch (InterruptedException e) {
+          SolrException.log(LOG, e);
+        } catch (ExecutionException e) {
+          SolrException.log(LOG, e);
+        }
+      }
       // update our commit point to the right dir
       solrCore.getUpdateHandler().commit(new CommitUpdateCommand(req, false));

That should allow the searcher that the following commit command prompts to see the *new* IndexWriter.

On Feb 22, 2012, at 10:56 AM, eks dev wrote:

We started observing strange failures from ReplicationHandler when we commit on master (trunk version 4-5 days old). It works sometimes, and sometimes not; didn't dig deeper yet. Looks like the real culprit hides behind: org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed. Looks familiar to somebody?

120222 154959 SEVERE SnapPull failed :org.apache.solr.common.SolrException: Error opening new searcher
        at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1138)
        at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1251)
        at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1043)
        at org.apache.solr.update.DirectUpdateHandler2.commit(Unknown Source)
        at org.apache.solr.handler.SnapPuller.doCommit(SnapPuller.java:503)
        at org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:348)
        at org.apache.solr.handler.ReplicationHandler.doFetch(Unknown Source)
        at org.apache.solr.handler.SnapPuller$1.run(SnapPuller.java:163)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
        at java.lang.Thread.run(Thread.java:722)
Caused by: org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed
        at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:810)
        at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:815)
        at org.apache.lucene.index.IndexWriter.nrtIsCurrent(IndexWriter.java:3984)
        at org.apache.lucene.index.StandardDirectoryReader.doOpenFromWriter(StandardDirectoryReader.java:254)
        at org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:233)
        at org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:223)
        at org.apache.lucene.index.DirectoryReader.openIfChanged(DirectoryReader.java:170)
        at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1095)
        ... 15 more

- Mark Miller
lucidimagination.com
Re: Solr Highlighting not working with PayloadTermQueries
(12/02/22 7:53), Nitin Arora wrote:

Hi, I'm using Solr and Lucene in my application for search. I'm facing an issue where highlighting via FastVectorHighlighter does not work when I use PayloadTermQueries as clauses of a BooleanQuery. After debugging I found that in DefaultSolrHighlighter.java, fvh.getFieldQuery does not return any term in the termMap.

  FastVectorHighlighter fvh = new FastVectorHighlighter(
      // FVH cannot process the hl.usePhraseHighlighter parameter on a per-field basis
      params.getBool(HighlightParams.USE_PHRASE_HIGHLIGHTER, true),
      // FVH cannot process the hl.requireFieldMatch parameter on a per-field basis
      params.getBool(HighlightParams.FIELD_MATCH, false));
  FieldQuery fieldQuery = fvh.getFieldQuery(query);

The reason for the empty termMap is that PayloadTermQuery is discarded while constructing the FieldQuery:

  void flatten(Query sourceQuery, Collection<Query> flatQueries) {
    if (sourceQuery instanceof BooleanQuery) {
      BooleanQuery bq = (BooleanQuery) sourceQuery;
      for (BooleanClause clause : bq.getClauses()) {
        if (!clause.isProhibited())
          flatten(clause.getQuery(), flatQueries);
      }
    } else if (sourceQuery instanceof DisjunctionMaxQuery) {
      DisjunctionMaxQuery dmq = (DisjunctionMaxQuery) sourceQuery;
      for (Query query : dmq) {
        flatten(query, flatQueries);
      }
    } else if (sourceQuery instanceof TermQuery) {
      if (!flatQueries.contains(sourceQuery))
        flatQueries.add(sourceQuery);
    } else if (sourceQuery instanceof PhraseQuery) {
      if (!flatQueries.contains(sourceQuery)) {
        PhraseQuery pq = (PhraseQuery) sourceQuery;
        if (pq.getTerms().length > 1)
          flatQueries.add(pq);
        else if (pq.getTerms().length == 1) {
          flatQueries.add(new TermQuery(pq.getTerms()[0]));
        }
      }
    }
    // else discard queries
  }

What is the best way to get highlighting working with payload term queries?

Hi Nitin,

Thank you for reporting this problem! Your assumption is correct: FVH discards PayloadTermQueries in the flatten() method. I'm not that familiar with SpanQueries, but it looks like SpanTermQuery, which is the superclass of PayloadTermQuery, has a getTerm() method. Do you think that if flatten() could recognize SpanTermQuery and then add the term to flatQueries, it would solve your problem? If so, please open a JIRA ticket. And if you can, attaching a patch would help a lot!

koji
--
Query Log Visualizer for Apache Solr
http://soleami.com/
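For what it's worth, the change Koji describes would be roughly the following extra branch in flatten(). This is a sketch only, not a tested patch; it relies on the fact that PayloadTermQuery extends SpanTermQuery:

  // requires: import org.apache.lucene.search.spans.SpanTermQuery;
  } else if (sourceQuery instanceof SpanTermQuery) {
    // unwrap the span/payload term query into a plain TermQuery
    // so FVH can put its term into the termMap
    TermQuery tq = new TermQuery(((SpanTermQuery) sourceQuery).getTerm());
    if (!flatQueries.contains(tq))
      flatQueries.add(tq);
  }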
Do nested entities have a representation in Solr indexes?
The data-config.xml file that I have for indexing database contents has nested entity nodes within a document node, and each of the entities contains field nodes. Lucene indexes consist of documents that contain fields. What about entities? If you change the way entities are structured in a data-config.xml file, in what way (if any) does it change how the contents are stored in the index? When I created the entities I am using, and defined the fields in one of the inner entities to be multivalued, I thought that the fields of that entity type would be grouped logically somehow in the index. But then I remembered that Lucene doesn't have a concept of sub-documents (that I know of), so each of the field values will be added to a list, and the extent of the logical grouping would be that the field values that were indexed together would be at the same position in their respective lists. Am I understanding this right, or do entities as defined in data-config.xml have some kind of representation in the index like document and field do?

Thanks,
Mike
RE: Recovering from database connection resets in DataimportHandler
Could you point me to the most non-intimidating introduction to SolrJ that you know of? I have a passing familiarity with Javascript and, with few exceptions, I haven't developed software that has a graphical user interface of any kind in about 25 years. I like the idea of having finer control over data imported from a database, though.

Thanks,
Mike

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Monday, February 13, 2012 6:19 AM
To: solr-user@lucene.apache.org
Subject: Re: Recovering from database connection resets in DataimportHandler

I'd seriously consider using SolrJ and your favorite JDBC driver instead. It's actually quite easy to create one, although as always it may be a bit intimidating to get started. This allows you much finer control over error conditions than DIH does, so it may be more suited to your needs.

Best
Erick

On Sat, Feb 11, 2012 at 2:40 AM, Mike O'Leary tmole...@uw.edu wrote:

I am trying to use Solr's DataImportHandler to index a large number of database records in a SQL Server database that is owned and managed by a group we are collaborating with. The indexing jobs I have run so far, except for the initial very small test runs, have failed due to database connection resets. I have gotten indexing jobs to go further by using CachedSqlEntityProcessor and specifying responseBuffering=adaptive in the connection URL, but I think that in order to index that data I'm going to have to work out how to catch database connection reset exceptions and resubmit the queries that failed. Can anyone suggest a good way to approach this? Or have any of you encountered this problem and worked out a solution to it already?

Thanks,
Mike
Re: distributed deletes working?
I know everyone is busy, but I was wondering if anyone had found anything with this? Any suggestions on what I could be doing wrong would be greatly appreciated. On Fri, Feb 17, 2012 at 4:08 PM, Mark Miller markrmil...@gmail.com wrote: On Feb 17, 2012, at 3:56 PM, Jamie Johnson wrote: id field is a UUID. Strange - was using UUID's myself in same test this morning... I'll try again soon. - Mark Miller lucidimagination.com
Is there a way to write a DataImportHandler deltaQuery that compares contents still to be imported to contents in the index?
I am working on indexing the contents of a database that I don't have permission to alter. This matters because the DataImportHandler examples that show how to specify a deltaQuery attribute value use database tables that have a last_modified column, and they compare these values with the last_index_time values stored in the dataimport.properties file. The tables in the database I am working with don't have anything like a last_modified column. An indexing job I was running yesterday failed, and I would like to restart it so that it only imports the data that it hasn't already indexed. As a one-off, I could create a list of the keys of the database records that have been indexed and hack in something that reads that list as part of how it figures out what to index, but I was wondering if there is something built in that would allow me to do the same kind of comparison in a likely far more elegant way. What kinds of information do the deltaQuery attributes have access to, apart from the database tables, columns, etc., and do they have access to any information that would help me with what I want to do?

Thanks,
Mike

P.S. While we're on the subject of delta... attributes, can someone explain to me what the difference is between the deltaQuery and the deltaImportQuery attributes?
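On the P.S.: as a hedged sketch of the usual pattern (the item table and its columns here are hypothetical), deltaQuery selects only the primary keys of rows that changed since the last import, while deltaImportQuery fetches the full row for each of those keys via the ${dih.delta.id} variable that DIH fills in per key:

  <entity name="item" pk="id"
          query="SELECT id, name FROM item"
          deltaQuery="SELECT id FROM item
                      WHERE last_modified &gt; '${dataimporter.last_index_time}'"
          deltaImportQuery="SELECT id, name FROM item
                            WHERE id = '${dih.delta.id}'">
  </entity>

Without a last_modified-style column there is nothing built in for deltaQuery to compare against, which is why a workaround such as an external key list (or SolrJ, as suggested elsewhere in this thread) comes up.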
Re: distributed deletes working?
Yonik did fix an issue around peer sync and deletes a few days ago - long chance that was involved? Otherwise, neither Sami nor I have replicated these results so far. On Feb 22, 2012, at 8:56 PM, Jamie Johnson wrote: I know everyone is busy, but I was wondering if anyone had found anything with this? Any suggestions on what I could be doing wrong would be greatly appreciated. On Fri, Feb 17, 2012 at 4:08 PM, Mark Miller markrmil...@gmail.com wrote: On Feb 17, 2012, at 3:56 PM, Jamie Johnson wrote: id field is a UUID. Strange - was using UUID's myself in same test this morning... I'll try again soon. - Mark Miller lucidimagination.com - Mark Miller lucidimagination.com
Re: distributed deletes working?
Perhaps if you could give me the steps you're using to test I can find an error in what I'm doing. On Wed, Feb 22, 2012 at 9:24 PM, Mark Miller markrmil...@gmail.com wrote: Yonik did fix an issue around peer sync and deletes a few days ago - long chance that was involved? Otherwise, neither Sami nor I have replicated these results so far. On Feb 22, 2012, at 8:56 PM, Jamie Johnson wrote: I know everyone is busy, but I was wondering if anyone had found anything with this? Any suggestions on what I could be doing wrong would be greatly appreciated. On Fri, Feb 17, 2012 at 4:08 PM, Mark Miller markrmil...@gmail.com wrote: On Feb 17, 2012, at 3:56 PM, Jamie Johnson wrote: id field is a UUID. Strange - was using UUID's myself in same test this morning... I'll try again soon. - Mark Miller lucidimagination.com - Mark Miller lucidimagination.com
Re: result present in Solr 1.4, but missing in Solr 3.5, dismax only
Jonathan has brought it to my attention that BOTH of my failing searches happen to have 8 terms, and one of the terms is repeated:

  The Beatles as musicians : Revolver through the Anthology        (repeated: "the")
  Color-blindness [print/digital]; its dangers and its detection   (repeated: "its")

but this is a PHRASE search. In case it's relevant, both Solr 1.4 and Solr 3.5:

- do NOT use stopwords in the fieldtype;
- mm is 6<-1 6<90% for dismax;
- qs is 1;
- ps is 3.

And both use this filter last:

  <filter class="solr.RemoveDuplicatesTokenFilterFactory" />

… but I believe that filter is only applied to consecutive tokens. Lastly, "Color-blindness [print/digital]; its and its detection" works ("dangers" is removed, rather than one of the repeated "its").

- Naomi

On Feb 22, 2012, at 3:41 PM, Jonathan Rochkind wrote:

So I don't really know what I'm talking about, and I'm not really sure if it's related or not, but your particular query: "The Beatles as musicians : Revolver through the Anthology" With the lone word that's a ':', reminds me of a dismax stopwords-type problem I ran into. Now, I ran into it on 1.4. I don't know why it would be different on 1.4 and 3.x. And I see you aren't even using a multi-field dismax in your sample query, so it couldn't possibly be what I ran into... I don't think. But I'll write this anyway in case it gives someone some ideas. The problem I ran into is caused by different analysis in two fields both used in a dismax, one that ends up keeping ':' as a token, and one that doesn't. Which ends up having the same effect as the famous 'dismax stopwords problem'. Maybe somehow your schema changed such as to produce this problem in 3.x but not in 1.4? Although again I realize the fact that you are only using a single field in your demo dismax query kind of suggests it's not this problem. Wonder if you try the query without the ':', if the problem goes away, that might be a hint. Or, maybe someone more skilled at understanding what's in those Solr debug statements than I am (it's kind of all greek to me) will be able to take this hint and rule out or confirm that it may have something to do with your problem. Here I write up the issue I ran into (which may or may not have anything to do with what you ran into): http://bibwild.wordpress.com/2011/06/15/more-dismax-gotchas-varying-field-analysis-and-mm/ Also, you don't say what your 'mm' is in your dismax queries; that could be relevant if it's got anything to do with anything similar to the issue I'm talking about. Hmm, I wonder if Solr 3.x changes the way dismax calculates the number of tokens for 'mm' in such a way that the 'varying field analysis dismax gotcha' can manifest with only one field, if the way dismax counts tokens for 'mm' differs from the number of tokens the single field's analysis produces?

Jonathan

On 2/22/2012 2:55 PM, Naomi Dushay wrote:

I am working on upgrading Solr from 1.4 to 3.5, and I have hit a problem. I have a test checking for a search result in Solr, and the test passes in Solr 1.4, but fails in Solr 3.5. Dismax is the desired QueryParser -- I just included output from the lucene QueryParser to prove the document exists and is found. I am completely stumped.
Here are the debugQuery details:

***Solr 3.5***

lucene QueryParser:
URL: q=all_search:"The Beatles as musicians : Revolver through the Anthology"
final query: all_search:"the beatl as musician revolv through the antholog"

6.0562754 = (MATCH) weight(all_search:"the beatl as musician revolv through the antholog" in 1064395), product of:
  1.0 = queryWeight(all_search:"the beatl as musician revolv through the antholog"), product of:
    48.450203 = idf(all_search: the=3531140 beatl=398 as=645923 musician=11805 revolv=872 through=81366 the=3531140 antholog=11611)
    0.02063975 = queryNorm
  6.0562754 = fieldWeight(all_search:"the beatl as musician revolv through the antholog" in 1064395), product of:
    1.0 = tf(phraseFreq=1.0)
    48.450203 = idf(all_search: the=3531140 beatl=398 as=645923 musician=11805 revolv=872 through=81366 the=3531140 antholog=11611)
    0.125 = fieldNorm(field=all_search, doc=1064395)

dismax QueryParser:
URL: qf=all_search&pf=all_search&q="The Beatles as musicians : Revolver through the Anthology"
final query: +(all_search:"the beatl as musician revolv through the antholog"~1)~0.01 (all_search:"the beatl as musician revolv through the antholog"~3)~0.01
(no matches)

***Solr 1.4***

lucene QueryParser:
URL: q=all_search:"The Beatles as musicians : Revolver through the Anthology"
final query: all_search:"the beatl as musician revolv through the antholog"

5.2676983 = fieldWeight(all_search:"the beatl as musician revolv through the antholog" in 3469163), product of:
  1.0 = tf(phraseFreq=1.0)
  48.16181 = idf(all_search: the=3542123 beatl=391 as=749890 musician=11955 revolv=820 through=88238 the=3542123 antholog=11205)
  0.109375 =
Re: Recovering from database connection resets in DataimportHandler
It *just happens* that I wrote a blog on this very topic, see: http://www.lucidimagination.com/blog/2012/02/14/indexing-with-solrj/

That code contains two rather different methods, one that indexes from a SQL database and one that indexes random files with client-side Tika.

Best
Erick

On Wed, Feb 22, 2012 at 8:51 PM, Mike O'Leary tmole...@uw.edu wrote:

Could you point me to the most non-intimidating introduction to SolrJ that you know of? I have a passing familiarity with Javascript and, with few exceptions, I haven't developed software that has a graphical user interface of any kind in about 25 years. I like the idea of having finer control over data imported from a database, though.

Thanks,
Mike

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Monday, February 13, 2012 6:19 AM
To: solr-user@lucene.apache.org
Subject: Re: Recovering from database connection resets in DataimportHandler

I'd seriously consider using SolrJ and your favorite JDBC driver instead. It's actually quite easy to create one, although as always it may be a bit intimidating to get started. This allows you much finer control over error conditions than DIH does, so it may be more suited to your needs.

Best
Erick

On Sat, Feb 11, 2012 at 2:40 AM, Mike O'Leary tmole...@uw.edu wrote:

I am trying to use Solr's DataImportHandler to index a large number of database records in a SQL Server database that is owned and managed by a group we are collaborating with. The indexing jobs I have run so far, except for the initial very small test runs, have failed due to database connection resets. I have gotten indexing jobs to go further by using CachedSqlEntityProcessor and specifying responseBuffering=adaptive in the connection URL, but I think that in order to index that data I'm going to have to work out how to catch database connection reset exceptions and resubmit the queries that failed. Can anyone suggest a good way to approach this? Or have any of you encountered this problem and worked out a solution to it already?

Thanks,
Mike
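As a rough sketch of the SolrJ-plus-JDBC approach (this is not Erick's blog code; the URLs, table, and column names are made up for illustration, and it assumes the SolrJ 3.x CommonsHttpSolrServer client). The point is the part DIH makes hard: resuming after a connection reset instead of restarting the whole job.

  import java.sql.Connection;
  import java.sql.DriverManager;
  import java.sql.ResultSet;
  import java.sql.SQLException;
  import java.sql.Statement;
  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
  import org.apache.solr.common.SolrInputDocument;

  public class JdbcIndexer {
    public static void main(String[] args) throws Exception {
      // hypothetical Solr URL and JDBC connection string
      SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
      String lastId = "";   // resume point, so a reset doesn't restart the job
      boolean done = false;
      while (!done) {       // a real job would cap the number of retries
        Connection conn = DriverManager.getConnection(
            "jdbc:sqlserver://dbhost;databaseName=mydb;responseBuffering=adaptive",
            "user", "password");
        try {
          Statement stmt = conn.createStatement();
          // read rows in key order so we can resume from the last indexed key
          ResultSet rs = stmt.executeQuery(
              "SELECT id, title, body FROM documents WHERE id > '"
              + lastId + "' ORDER BY id");
          while (rs.next()) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", rs.getString("id"));
            doc.addField("title", rs.getString("title"));
            doc.addField("text", rs.getString("body"));
            solr.add(doc);
            lastId = rs.getString("id");
          }
          done = true;      // made it through the whole result set
        } catch (SQLException e) {
          // connection reset: log and loop around, re-querying from lastId
          System.err.println("DB error after id " + lastId + ", reconnecting: " + e);
        } finally {
          try { conn.close(); } catch (SQLException ignore) {}
        }
      }
      solr.commit();
    }
  }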
Re: result present in Solr 1.4, but missing in Solr 3.5, dismax only
On Wed, Feb 22, 2012 at 7:35 PM, Naomi Dushay ndus...@stanford.edu wrote:

Jonathan has brought it to my attention that BOTH of my failing searches happen to have 8 terms, and one of the terms is repeated: "The Beatles as musicians : Revolver through the Anthology"; "Color-blindness [print/digital]; its dangers and its detection" -- but this is a PHRASE search.

Can you take your same phrase queries, simply add some slop to them (e.g. ~3), and ensure they still match with the lucene queryparser? SloppyPhraseQuery has a bit of a history with repeated terms since the Lucene 2.9 you were using:

https://issues.apache.org/jira/browse/LUCENE-3068
https://issues.apache.org/jira/browse/LUCENE-3215
https://issues.apache.org/jira/browse/LUCENE-3412

--
lucidimagination.com
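Concretely, the check being asked for is just the earlier lucene-parser query with slop appended, for example:

  q=all_search:"The Beatles as musicians : Revolver through the Anthology"~3

The exact phrase already matches with the lucene parser, and dismax adds slop (qs/ps) when it builds its phrase queries; so if this sloppy version stops matching, the problem is in SloppyPhraseQuery's handling of repeated terms (the JIRA issues above) rather than in dismax itself.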
default fq in dismax request handler being overridden
I have a dismax request handler with a default fq parameter:

  <requestHandler name="dismax" class="solr.DisMaxRequestHandler">
    <lst name="defaults">
      <str name="echoParams">explicit</str>
      <float name="tie">0.01</float>
      <str name="qf">
        sku^9.0 upc^9.1 searchKeyword^1.9 series^2.8 productTitle^1.2 productID^9.0
        manufacturer^4.0 masterFinish^1.5 theme^1.1 categoryName^2.0 finish^1.4
      </str>
      <str name="pf">
        searchKeyword^2.1 text^0.2 productTitle^1.5 manufacturer^4.0 finish^1.9
      </str>
      <str name="bq">isTopSeller:true^1.30</str>
      <str name="bf">linear(popularity,1,2)^3.0</str>
      <str name="fl">productID,manufacturer</str>
      <str name="mm">3&lt;-1 5&lt;-2 6&lt;90%</str>
      <int name="ps">100</int>
      <int name="qs">3</int>
      <str name="fq">discontinued:false</str>
    </lst>
  </requestHandler>

I understand that when I send a search request like

  /select?qt=dismax&q=f-0&sort=score%20desc&fq=type_string:faucet

the fq in the request overrides the default fq defined in the handler. What I would like to know is whether there is a way to always include the fq that is defined in the query handler and not have it be overridden, but appended automatically to any solr searches that use the query handler.
Re: need to support bi-directional synonyms
Same question here... On Wednesday, February 22, 2012, geeky2 gee...@hotmail.com wrote:

hello all,

i need to support the following:

if the user enters sprayer in the desc field - then they get results for BOTH sprayer and washer.

and in the other direction

if the user enters washer in the desc field - then they get results for BOTH washer and sprayer.

would i set up my synonym file like this? assuming expand = true...

  sprayer => washer
  washer => sprayer

thank you,
mark
Re: default fq in dismax request handler being overridden
Think I answered my own question... I need to use an appends list.
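For the record, that would look something like this (a sketch based on the handler config above; the appends list is standard solrconfig.xml, and parameters placed there are added to every request rather than replaced by request parameters):

  <requestHandler name="dismax" class="solr.DisMaxRequestHandler">
    <lst name="defaults">
      <!-- qf, pf, mm, etc. as before, but without the fq -->
    </lst>
    <lst name="appends">
      <str name="fq">discontinued:false</str>
    </lst>
  </requestHandler>

With that in place, a request fq like type_string:faucet is ANDed with discontinued:false instead of replacing it.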
Re: Development inside or outside of Solr?
Hi François Schiettecatte,

Thank you for the reply all the same, but I have chosen to stick with Solr (wrapped with the Tika language API) and make the changes outside Solr.

Best Regards,
Bing
problem with parsering (using Tika) on remote glassfish
Hi all!

I'm using the Tika parser to index my files into Solr. I created my own parser (which extends XMLParser). It uses my own mimetype. I created a jar file which inside looks like this:

  src
  |- main
  |  |- some_packages
  |     |- MyParser.java
  |- resources
     |- META-INF
     |  |- services
     |     |- org.apache.tika.parser.Parser   (which contains some_packages.MyParser.java)
     |- org
        |- apache
           |- tika
              |- mime
                 |- custom-mimetypes.xml

In custom-mimetypes.xml I put the definition of the new mimetype, because my xml files have some special tags.

Now here is the problem: I've been testing parsing and indexing with Solr on glassfish installed on my local machine. It worked just fine. Then I wanted to install it on some remote server. There is the same version of glassfish installed (3.1.1). I copied over the Solr application and its home directory with all libraries (including the tika jars and the jar with my custom parser). Unfortunately it doesn't work. After posting files to Solr I can see in the content-type field that it detected my custom mime type. But none of the fields that are supposed to be there appear, as if the MyParser class was never run. The only fields I get are the ones from Dublin Core. I checked (by simply adding some printlns) that Tika is only using XMLParser.

Has anyone had a similar problem? How do I handle this?

Regards,
Ola
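One thing worth double-checking here (hedged, since the "(which contains some_packages.MyParser.java)" in the listing may just be loose shorthand): a Java service-loader file must list the fully qualified *class name*, not the source file name. That is, META-INF/services/org.apache.tika.parser.Parser should contain exactly:

  some_packages.MyParser

If that file lists anything else, or the jar isn't on the classpath Tika actually uses on the remote glassfish, the custom parser is never registered and Tika falls back to the parsers it can find, which could explain the behaviour described (mimetype detected, but only XMLParser/Dublin Core output).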
Re: need to support bi-directional synonyms
Use a single comma-separated line:

  sprayer, washer

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory

Regards
Bernd

On 23.02.2012 07:03, remi tassing wrote:

Same question here... On Wednesday, February 22, 2012, geeky2 gee...@hotmail.com wrote: hello all, i need to support the following: if the user enters sprayer in the desc field - then they get results for BOTH sprayer and washer. and in the other direction if the user enters washer in the desc field - then they get results for BOTH washer and sprayer. would i set up my synonym file like this? assuming expand = true...

  sprayer => washer
  washer => sprayer

thank you,
mark
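To spell out why: with expand="true", a comma-separated group is symmetric, so each member expands to all members and the two one-directional "=>" rules aren't needed. A sketch of the analyzer side (file name as in the wiki example):

  <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
          ignoreCase="true" expand="true"/>

and in synonyms.txt:

  sprayer, washer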
Re: SnapPull failed :org.apache.solr.common.SolrException: Error opening new searcher
thanks Mark, I will give it a go and report back...

On Thu, Feb 23, 2012 at 1:31 AM, Mark Miller markrmil...@gmail.com wrote:

Looks like an issue around the replication IndexWriter reboot, soft commits, and hard commits. I think I've got a workaround for it:

Index: solr/core/src/java/org/apache/solr/handler/SnapPuller.java
===================================================================
--- solr/core/src/java/org/apache/solr/handler/SnapPuller.java (revision 1292344)
+++ solr/core/src/java/org/apache/solr/handler/SnapPuller.java (working copy)
@@ -499,6 +499,17 @@
       // reboot the writer on the new index and get a new searcher
       solrCore.getUpdateHandler().newIndexWriter();
+      Future[] waitSearcher = new Future[1];
+      solrCore.getSearcher(true, false, waitSearcher, true);
+      if (waitSearcher[0] != null) {
+        try {
+          waitSearcher[0].get();
+        } catch (InterruptedException e) {
+          SolrException.log(LOG, e);
+        } catch (ExecutionException e) {
+          SolrException.log(LOG, e);
+        }
+      }
       // update our commit point to the right dir
       solrCore.getUpdateHandler().commit(new CommitUpdateCommand(req, false));

That should allow the searcher that the following commit command prompts to see the *new* IndexWriter.

On Feb 22, 2012, at 10:56 AM, eks dev wrote:

We started observing strange failures from ReplicationHandler when we commit on master (trunk version 4-5 days old). It works sometimes, and sometimes not; didn't dig deeper yet. Looks like the real culprit hides behind: org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed. Looks familiar to somebody?

120222 154959 SEVERE SnapPull failed :org.apache.solr.common.SolrException: Error opening new searcher
        at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1138)
        at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1251)
        at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1043)
        at org.apache.solr.update.DirectUpdateHandler2.commit(Unknown Source)
        at org.apache.solr.handler.SnapPuller.doCommit(SnapPuller.java:503)
        at org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:348)
        at org.apache.solr.handler.ReplicationHandler.doFetch(Unknown Source)
        at org.apache.solr.handler.SnapPuller$1.run(SnapPuller.java:163)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
        at java.lang.Thread.run(Thread.java:722)
Caused by: org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed
        at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:810)
        at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:815)
        at org.apache.lucene.index.IndexWriter.nrtIsCurrent(IndexWriter.java:3984)
        at org.apache.lucene.index.StandardDirectoryReader.doOpenFromWriter(StandardDirectoryReader.java:254)
        at org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:233)
        at org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:223)
        at org.apache.lucene.index.DirectoryReader.openIfChanged(DirectoryReader.java:170)
        at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1095)
        ... 15 more

- Mark Miller
lucidimagination.com
Re: Development inside or outside of Solr?
Hi Erick,

The example is impressive. Thank you.

For the first, we decided not to do that, as Tika extraction is the time-consuming part of indexing large files, and the dual call makes the situation worse. For the second, for now we chose DSpace to connect to the DB, with Discovery (Solr) as the index/query layer. Thus, we may make our revisions in DSpace.

Best Regards,
Bing
Re: Do nested entities have a representation in Solr indexes?
Hello Mike,

Solr is still too flat. Work is in progress: https://issues.apache.org/jira/browse/SOLR-3076

A good introduction is in Michael's blog http://blog.mikemccandless.com/2012/01/searching-relational-content-with.html but it's only about Lucene queries. A colleague of mine blogged about the same problem, but solved it with an alternative approach: http://blog.griddynamics.com/search/label/Solr

In the end we gave up on term positions/spans and are considering BJQ (block-join queries) as a solution.

Regards

On Thu, Feb 23, 2012 at 5:37 AM, Mike O'Leary tmole...@uw.edu wrote:

The data-config.xml file that I have for indexing database contents has nested entity nodes within a document node, and each of the entities contains field nodes. Lucene indexes consist of documents that contain fields. What about entities? If you change the way entities are structured in a data-config.xml file, in what way (if any) does it change how the contents are stored in the index? When I created the entities I am using, and defined the fields in one of the inner entities to be multivalued, I thought that the fields of that entity type would be grouped logically somehow in the index. But then I remembered that Lucene doesn't have a concept of sub-documents (that I know of), so each of the field values will be added to a list, and the extent of the logical grouping would be that the field values that were indexed together would be at the same position in their respective lists. Am I understanding this right, or do entities as defined in data-config.xml have some kind of representation in the index like document and field do?

Thanks,
Mike

--
Sincerely yours
Mikhail Khludnev
Lucid Certified Apache Lucene/Solr Developer
Grid Dynamics
http://www.griddynamics.com
mkhlud...@griddynamics.com
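To make "BJQ" concrete, here is a rough sketch of the block-join idea from Mike McCandless's post, using the Lucene join module as it exists on trunk (in older branches the class was contrib BlockJoinQuery, so names may differ); the field names are hypothetical:

  import org.apache.lucene.index.Term;
  import org.apache.lucene.search.CachingWrapperFilter;
  import org.apache.lucene.search.Filter;
  import org.apache.lucene.search.Query;
  import org.apache.lucene.search.QueryWrapperFilter;
  import org.apache.lucene.search.TermQuery;
  import org.apache.lucene.search.join.ScoreMode;
  import org.apache.lucene.search.join.ToParentBlockJoinQuery;

  // child documents must be indexed in the same block as their parent
  // (IndexWriter.addDocuments), with the parent last and marked type:parent
  Query childQuery = new TermQuery(new Term("color", "red"));
  Filter parentsFilter = new CachingWrapperFilter(
      new QueryWrapperFilter(new TermQuery(new Term("type", "parent"))));
  // matches parents whose children match childQuery, scoring by the avg child score
  Query parentsMatchingChildren =
      new ToParentBlockJoinQuery(childQuery, parentsFilter, ScoreMode.Avg);

Until SOLR-3076 lands, this is a Lucene-level tool only; from Solr, the entities in data-config.xml are flattened exactly as Mike describes.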