Re: Doc Transformer to remove document from the response
Thanks Hoss, I probably did not formulate the question properly, but you gave me an answer. I already do it in a SearchComponent; I just wanted to centralise this control of the depth and width of the response in a single place in the code [style={minimal, verbose, full...}]. It just sounds logical to me to have this possibility in a DocTransformer, as a null document is a kind of "extremely" modified document. Even better, it might actually work… (did not try it yet) @Override setContext( TransformContext context ) { context.iterator = new FilteringIterator(context.iterator) } Simply by providing my own FilteringIterator that would skip documents I do not need in the response? Does this sound right from a "legitimate API usage" perspective? I did not look at where pagination happens, but it looks like the DocTransformer gets applied at the very end (response writer), which in turn means pagination is not an issue; some pages might just get shorter due to this additional filtering, but that is quite OK for me. On Mon, Oct 29, 2012 at 7:59 PM, Chris Hostetter wrote: > > : Transformer is great to augment Documents before shipping to response, > : but what would be a way to prevent document from being delivered? > > DocTransformers can only modify the documents -- not hte Document List. > > what you are describing would have to be done as a SearchComponent (or in > the QParser) -- take a look at QueryElevation component for an example of > how to do something like this that plays nicely with pagination. > > > -Hoss
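For what it's worth, a generic look-ahead FilteringIterator of the kind sketched above could look roughly like this in plain Java (a minimal sketch; the class name and predicate are my own, and in the real Solr case T would be the document type behind TransformContext's iterator):

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.function.Predicate;

// Look-ahead iterator that silently skips elements failing a predicate.
// In the scenario above, T would be the response document type and the
// predicate would decide whether a document should appear in the response.
class FilteringIterator<T> implements Iterator<T> {
    private final Iterator<T> delegate;
    private final Predicate<T> keep;
    private T next;
    private boolean hasNext;

    FilteringIterator(Iterator<T> delegate, Predicate<T> keep) {
        this.delegate = delegate;
        this.keep = keep;
        advance(); // pre-fetch the first element that passes the filter
    }

    private void advance() {
        hasNext = false;
        while (delegate.hasNext()) {
            T candidate = delegate.next();
            if (keep.test(candidate)) {
                next = candidate;
                hasNext = true;
                return;
            }
        }
    }

    @Override
    public boolean hasNext() {
        return hasNext;
    }

    @Override
    public T next() {
        if (!hasNext) throw new NoSuchElementException();
        T result = next;
        advance();
        return result;
    }
}
```

A transformer's setContext could then wrap context.iterator with this, with the caveat from the mail above: since the filtering happens after pagination, pages simply come back shorter.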
Doc Transformer to remove document from the response
A Transformer is great for augmenting Documents before shipping them to the response, but what would be a way to prevent a document from being delivered? I have some search components that draw some conclusions after search (duplicate removal, clustering) and one Augmenter (Solr Transformer) to shape the response up, but I need to stop some documents from being delivered. What is the way to do it? thanks, e.
Re: Solr 4.0 and production environments
I have been here on Lucene as a user since the project started, even before Solr came to life, many many years. And I was always using the trunk version for pretty big customers, and *never* experienced any serious problems. The worst thing that can happen is to notice a bug somewhere, and if you have some reasonable testing for your product, you will see it quickly. But, with this community, *you will definitely not have to wait long to get it fixed*. Not only will they fix it, they will thank you for bringing it up! I can, as an old user, 100% vouch for what Robert said below. Simply, just go for it, test your application a bit and make your users happy. On Wed, Mar 7, 2012 at 5:55 PM, Robert Muir wrote: > On Wed, Mar 7, 2012 at 11:47 AM, Dirceu Vieira wrote: >> Hi All, >> >> Has anybody started using Solr 4.0 in production environments? Is it stable >> enough? >> I'm planning to create a proof of concept using solr 4.0, we have some >> projects that will gain a lot with features such as near real time search, >> joins and others, that are available only on version 4. >> >> Is it too risky to think of using it right now? >> What are your thoughts and experiences with that? >> > > In general, we try to keep our 'trunk' (slated to be 4.0) in very > stable condition. > > Really, it should be 'ready-to-release' at any time, of course 4.0 has > had many drastic changes: both at the Lucene and Solr level. > > Before deciding what is stable, you should define stability: is it: > * api stability: will i be able to upgrade to a more recent snapshot > of 4.0 without drastic changes to my app? > * index format stability: will i be able to upgrade to a more recent > snapshot of 4.0 without re-indexing? > * correctness: is 4.0 dangerous in some way that it has many bugs > since much of the code is new? > > I think you should limit your concerns to only the first 2 items, as > far as correctness, just look at the tests. 
For any open source > project, you can easily judge its quality by its tests: this is a > fact. > > For lucene/solr the testing strategy, in my opinion, goes above and > beyond many other projects: for example random testing: > http://www.lucidimagination.com/devzone/events/conferences/ApacheLuceneEurocon2011_presentations#dawid_weiss > > and the new solr cloud functionality also adds the similar chaosmonkey > concept on top of this already. > > If you are worried about bugs, is a lucene/solr trunk snapshot less > reliable than even a released version of alternative software? its an > interesting question. look at their tests. > > -- > lucidimagination.com
Re: [SolrCloud] Slow indexing
hmm, looks like you are facing exactly the phenomenon I asked about. See my question here: http://comments.gmane.org/gmane.comp.jakarta.lucene.solr.user/61326 On Sun, Mar 4, 2012 at 9:24 PM, Markus Jelsma wrote: > Hi, > > With auto-committing disabled we can now index many millions of documents in > our test environment on a 5-node cluster with 5 shards and a replication > factor of 2. The documents are uploaded from map/reduce. No significant > changes were made to solrconfig and there are no update processors enabled. > We are using a trunk revision from this weekend. > > The indexing speed is well below what we are used to see, we can easily > index 5 millions documents on a non-cloud enabled Solr 3.x instance within > an hour. What could be going on? There aren't many open TCP connections and > the number of file descriptors is stable and I/O is low but CPU-time is > high! Each node has two Solr cores both writing to their dedicated disk. > > The indexing speed is stable, it was slow at start and still is. It's now > running for well over 6 hours and only 3.5 millions documents are indexed. > Another strange detail is that the node receiving all incoming documents > (we're not yet using a client side Solr server pool) has a much larger disk > usage than all other nodes. This is peculiar as we expected all replica's to > be a about the same size. > > The receiving node has slightly higher CPU than the other nodes but the > thread dump shows a very large amount of threads of type > cmdDistribExecutor-8-thread-292260 (295090) with 0-100ms CPU-time. At the > top of the list these threads all have < 20ms time but near the bottom it > rises to just over 100ms. All nodes have a couple of http-80-30 (121994) > threads with very high CPU-time each. > > Is this a known issue? Did i miss something? Any ideas? > > Thanks
Re: Solr Cloud, Commits and Master/Slave configuration
Thanks Mark, Good, this is probably good enough to give it a try. My analyzers are normally fast, so doing duplicate analysis (at each replica) is probably not going to cost a lot if there is some decent "batching". Can this be somehow controlled (depth of this buffer / time till flush, or some such)? Which "events" trigger this flushing to replicas (softCommit, commit, something new)? What I found useful is to always think in terms of incremental (low latency) and batch (high throughput) updates. I just need some knobs to tweak the behavior of this update process. I would really like to move away from Master/Slave; Cloud makes a lot of things way simpler for us users ... Will give it a try in a couple of weeks. Later we can even think about putting replication at the segment level for "extremely expensive analysis, batch cases", or "initial cluster seeding" as a replication option. But this is then just an optimization. Cheers, eks On Thu, Mar 1, 2012 at 5:24 AM, Mark Miller wrote: > We actually do currently batch updates - we are being somewhat loose when we > say a document at a time. There is a buffer of updates per replica that gets > flushed depending on the requests coming through and the buffer size. > > - Mark Miller > lucidimagination.com > > On Feb 28, 2012, at 3:38 AM, eks dev wrote: > >> SolrCluod is going to be great, NRT feature is really huge step >> forward, as well as central configuration, elasticity ... >> >> The only thing I do not yet understand is treatment of cases that were >> traditionally covered by Master/Slave setup. Batch update >> >> If I get it right (?), updates to replicas are sent one by one, >> meaning when one server receives update, it gets forwarded to all >> replicas. This is great for reduced update latency case, but I do not >> know how is it implemented if you hit it with "batch" update. This >> would cause huge amount of update commands going to replicas. Not so >> good for throughput. 
>> >> - Master slave does distribution at segment level, (no need to >> replicate analysis, far less network traffic). Good for batch updates >> - SolrCloud does par update command (low latency, but chatty and >> Analysis step is done N_Servers times). Good for incremental updates >> >> Ideally, some sort of "batching" is going to be available in >> SolrCloud, and some cont roll over it, e.g. forward batches of 1000 >> documents (basically keep update log slightly longer and forward it as >> a batch update command). This would still cause duplicate analysis, >> but would reduce network traffic. >> >> Please bare in mind, this is more of a question than a statement, I >> didn't look at the cloud code. It might be I am completely wrong here! >> >> >> >> >> >> On Tue, Feb 28, 2012 at 4:01 AM, Erick Erickson >> wrote: >>> As I understand it (and I'm just getting into SolrCloud myself), you can >>> essentially forget about master/slave stuff. If you're using NRT, >>> the soft commit will make the docs visible, you don't ned to do a hard >>> commit (unlike the master/slave days). Essentially, the update is sent >>> to each shard leader and then fanned out into the replicas for that >>> leader. All automatically. Leaders are elected automatically. ZooKeeper >>> is used to keep the cluster information. >>> >>> Additionally, SolrCloud keeps a transaction log of the updates, and replays >>> them if the indexing is interrupted, so you don't risk data loss the way >>> you used to. >>> >>> There aren't really masters/slaves in the old sense any more, so >>> you have to get out of that thought-mode (it's hard, I know). >>> >>> The code is under pretty active development, so any feedback is >>> valuable >>> >>> Best >>> Erick >>> >>> On Mon, Feb 27, 2012 at 3:26 AM, roz dev wrote: >>>> Hi All, >>>> >>>> I am trying to understand features of Solr Cloud, regarding commits and >>>> scaling. 
>>>> >>>> >>>> - If I am using Solr Cloud then do I need to explicitly call commit >>>> (hard-commit)? Or, a soft commit is okay and Solr Cloud will do the job >>>> of >>>> writing to disk? >>>> >>>> >>>> - Do We still need to use Master/Slave setup to scale searching? If we >>>> have to use Master/Slave setup then do i need to issue hard-commit to >>>> make >>>> my changes visible to slaves? >>>> - If I were to use NRT with Master/Slave setup with soft commit then >>>> will the slave be able to see changes made on master with soft commit? >>>> >>>> Any inputs are welcome. >>>> >>>> Thanks >>>> >>>> -Saroj
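The batching idea discussed in this thread (buffer updates per replica, flush on size or on an event such as a commit) can be illustrated with a small stand-alone Java sketch. This is only an illustration of the latency/throughput trade-off, not SolrCloud's actual code; the forward callback is a hypothetical stand-in for sending a batch of update commands to a replica:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Buffers update commands and forwards them in chunks instead of one by one,
// trading a little latency for much less per-command network chatter.
class UpdateBatcher<T> {
    private final int batchSize;
    private final Consumer<List<T>> forward; // e.g. send a batch to a replica
    private final List<T> buffer = new ArrayList<>();

    UpdateBatcher(int batchSize, Consumer<List<T>> forward) {
        this.batchSize = batchSize;
        this.forward = forward;
    }

    void add(T update) {
        buffer.add(update);
        if (buffer.size() >= batchSize) flush();
    }

    // Would also be called on commit / timer expiry so stragglers
    // are not held in the buffer forever.
    void flush() {
        if (buffer.isEmpty()) return;
        forward.accept(new ArrayList<>(buffer));
        buffer.clear();
    }
}
```

With batchSize = 1 this degenerates to the per-document forwarding described above; larger values amortize the network round trips at the cost of update latency.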
Re: Solr Cloud, Commits and Master/Slave configuration
SolrCloud is going to be great; the NRT feature is a really huge step forward, as well as central configuration, elasticity ... The only thing I do not yet understand is the treatment of cases that were traditionally covered by a Master/Slave setup: batch updates. If I get it right (?), updates to replicas are sent one by one, meaning when one server receives an update, it gets forwarded to all replicas. This is great for the reduced-update-latency case, but I do not know how it is implemented if you hit it with a "batch" update. This would cause a huge amount of update commands going to replicas. Not so good for throughput. - Master/Slave does distribution at the segment level (no need to replicate analysis, far less network traffic). Good for batch updates - SolrCloud does per-update-command distribution (low latency, but chatty, and the analysis step is done N_Servers times). Good for incremental updates Ideally, some sort of "batching" is going to be available in SolrCloud, and some control over it, e.g. forward batches of 1000 documents (basically keep the update log slightly longer and forward it as a batch update command). This would still cause duplicate analysis, but would reduce network traffic. Please bear in mind, this is more of a question than a statement; I didn't look at the cloud code. It might be I am completely wrong here! On Tue, Feb 28, 2012 at 4:01 AM, Erick Erickson wrote: > As I understand it (and I'm just getting into SolrCloud myself), you can > essentially forget about master/slave stuff. If you're using NRT, > the soft commit will make the docs visible, you don't ned to do a hard > commit (unlike the master/slave days). Essentially, the update is sent > to each shard leader and then fanned out into the replicas for that > leader. All automatically. Leaders are elected automatically. ZooKeeper > is used to keep the cluster information. 
> > Additionally, SolrCloud keeps a transaction log of the updates, and replays > them if the indexing is interrupted, so you don't risk data loss the way > you used to. > > There aren't really masters/slaves in the old sense any more, so > you have to get out of that thought-mode (it's hard, I know). > > The code is under pretty active development, so any feedback is > valuable > > Best > Erick > > On Mon, Feb 27, 2012 at 3:26 AM, roz dev wrote: >> Hi All, >> >> I am trying to understand features of Solr Cloud, regarding commits and >> scaling. >> >> >> - If I am using Solr Cloud then do I need to explicitly call commit >> (hard-commit)? Or, a soft commit is okay and Solr Cloud will do the job of >> writing to disk? >> >> >> - Do We still need to use Master/Slave setup to scale searching? If we >> have to use Master/Slave setup then do i need to issue hard-commit to make >> my changes visible to slaves? >> - If I were to use NRT with Master/Slave setup with soft commit then >> will the slave be able to see changes made on master with soft commit? >> >> Any inputs are welcome. >> >> Thanks >> >> -Saroj
Re: SnapPull failed :org.apache.solr.common.SolrException: Error opening new searcher
it looks like it works with the patch; after a couple of hours of testing under the same conditions I didn't see it happen (without it, approx. every 15 minutes). I do not think it will happen again with this patch. Thanks again and my respect for your debugging capacity; my bug report was really thin. On Thu, Feb 23, 2012 at 8:47 AM, eks dev wrote: > thanks Mark, I will give it a go and report back... > > On Thu, Feb 23, 2012 at 1:31 AM, Mark Miller wrote: >> Looks like an issue around replication IndexWriter reboot, soft commits and >> hard commits. >> >> I think I've got a workaround for it: >> >> Index: solr/core/src/java/org/apache/solr/handler/SnapPuller.java >> === >> --- solr/core/src/java/org/apache/solr/handler/SnapPuller.java (revision >> 1292344) >> +++ solr/core/src/java/org/apache/solr/handler/SnapPuller.java (working >> copy) >> @@ -499,6 +499,17 @@ >> >> // reboot the writer on the new index and get a new searcher >> solrCore.getUpdateHandler().newIndexWriter(); >> + Future[] waitSearcher = new Future[1]; >> + solrCore.getSearcher(true, false, waitSearcher, true); >> + if (waitSearcher[0] != null) { >> + try { >> + waitSearcher[0].get(); >> + } catch (InterruptedException e) { >> + SolrException.log(LOG,e); >> + } catch (ExecutionException e) { >> + SolrException.log(LOG,e); >> + } >> + } >> // update our commit point to the right dir >> solrCore.getUpdateHandler().commit(new CommitUpdateCommand(req, >> false)); >> >> That should allow the searcher that the following commit command prompts to >> see the *new* IndexWriter. >> >> On Feb 22, 2012, at 10:56 AM, eks dev wrote: >> >>> We started observing strange failures from ReplicationHandler when we >>> commit on master trunk version 4-5 days old. >>> It works sometimes, and sometimes not didn't dig deeper yet. >>> >>> Looks like the real culprit hides behind: >>> org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed >>> >>> Looks familiar to somebody? 
>>> >>> >>> 120222 154959 SEVERE SnapPull failed >>> :org.apache.solr.common.SolrException: Error opening new searcher >>> at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1138) >>> at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1251) >>> at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1043) >>> at org.apache.solr.update.DirectUpdateHandler2.commit(Unknown Source) >>> at org.apache.solr.handler.SnapPuller.doCommit(SnapPuller.java:503) >>> at >>> org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:348) >>> at org.apache.solr.handler.ReplicationHandler.doFetch(Unknown Source) >>> at org.apache.solr.handler.SnapPuller$1.run(SnapPuller.java:163) >>> at >>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) >>> at >>> java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351) >>> at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178) >>> at >>> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178) >>> at >>> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) >>> at >>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) >>> at >>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) >>> at java.lang.Thread.run(Thread.java:722) >>> Caused by: org.apache.lucene.store.AlreadyClosedException: this >>> IndexWriter is closed >>> at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:810) >>> at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:815) >>> at >>> org.apache.lucene.index.IndexWriter.nrtIsCurrent(IndexWriter.java:3984) >>> at >>> org.apache.lucene.index.StandardDirectoryReader.doOpenFromWriter(StandardDirectoryReader.java:254) >>> at >>> org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:233) >>> at >>> 
org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:223) >>> at >>> org.apache.lucene.index.DirectoryReader.openIfChanged(DirectoryReader.java:170) >>> at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1095) >>> ... 15 more >> >> - Mark Miller >> lucidimagination.com >> >> >> >> >> >> >> >> >> >> >>
Re: SnapPull failed :org.apache.solr.common.SolrException: Error opening new searcher
thanks Mark, I will give it a go and report back... On Thu, Feb 23, 2012 at 1:31 AM, Mark Miller wrote: > Looks like an issue around replication IndexWriter reboot, soft commits and > hard commits. > > I think I've got a workaround for it: > > Index: solr/core/src/java/org/apache/solr/handler/SnapPuller.java > === > --- solr/core/src/java/org/apache/solr/handler/SnapPuller.java (revision > 1292344) > +++ solr/core/src/java/org/apache/solr/handler/SnapPuller.java (working copy) > @@ -499,6 +499,17 @@ > > // reboot the writer on the new index and get a new searcher > solrCore.getUpdateHandler().newIndexWriter(); > + Future[] waitSearcher = new Future[1]; > + solrCore.getSearcher(true, false, waitSearcher, true); > + if (waitSearcher[0] != null) { > + try { > + waitSearcher[0].get(); > + } catch (InterruptedException e) { > + SolrException.log(LOG,e); > + } catch (ExecutionException e) { > + SolrException.log(LOG,e); > + } > + } > // update our commit point to the right dir > solrCore.getUpdateHandler().commit(new CommitUpdateCommand(req, false)); > > That should allow the searcher that the following commit command prompts to > see the *new* IndexWriter. > > On Feb 22, 2012, at 10:56 AM, eks dev wrote: > >> We started observing strange failures from ReplicationHandler when we >> commit on master trunk version 4-5 days old. >> It works sometimes, and sometimes not didn't dig deeper yet. >> >> Looks like the real culprit hides behind: >> org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed >> >> Looks familiar to somebody? 
>> >> >> 120222 154959 SEVERE SnapPull failed >> :org.apache.solr.common.SolrException: Error opening new searcher >> at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1138) >> at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1251) >> at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1043) >> at org.apache.solr.update.DirectUpdateHandler2.commit(Unknown Source) >> at org.apache.solr.handler.SnapPuller.doCommit(SnapPuller.java:503) >> at >> org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:348) >> at org.apache.solr.handler.ReplicationHandler.doFetch(Unknown Source) >> at org.apache.solr.handler.SnapPuller$1.run(SnapPuller.java:163) >> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) >> at >> java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351) >> at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178) >> at >> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178) >> at >> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) >> at >> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) >> at >> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) >> at java.lang.Thread.run(Thread.java:722) >> Caused by: org.apache.lucene.store.AlreadyClosedException: this >> IndexWriter is closed >> at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:810) >> at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:815) >> at org.apache.lucene.index.IndexWriter.nrtIsCurrent(IndexWriter.java:3984) >> at >> org.apache.lucene.index.StandardDirectoryReader.doOpenFromWriter(StandardDirectoryReader.java:254) >> at >> org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:233) >> at >> 
org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:223) >> at >> org.apache.lucene.index.DirectoryReader.openIfChanged(DirectoryReader.java:170) >> at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1095) >> ... 15 more > > - Mark Miller > lucidimagination.com > > > > > > > > > > >
dih and solr cloud
out of curiosity, trying to see if the new cloud features can replace what I use now... how is this (batch) update forwarding solved at the cloud level? Imagine the simple one-shard, one-replica case: if I fire up a DIH update, is this going to be replicated to the replica shard? If yes, - is it going to be sent document by document (network: imagine 100M+ update commands going to the replica for big batches) - is it somehow batched into "packages" to reduce load - is it distributed at the index level somehow This is an important case today with master/slave Solr replication, but it is not mentioned at http://wiki.apache.org/solr/SolrCloud
Re: Unusually long data import time?
Devon, you ought to try to update from many threads (I do not know if DIH can do it, check it), but Lucene does a great job if fed from many update threads... It depends where your time gets lost, but it is usually a) the analysis chain or b) the database. If it is a) and your server has spare CPU cores, you can scale at roughly the number-of-cores rate. On Wed, Feb 22, 2012 at 7:41 PM, Devon Baumgarten wrote: > Ahmet, > > I do not. I commented autoCommit out. > > Devon Baumgarten > > > > -Original Message- > From: Ahmet Arslan [mailto:iori...@yahoo.com] > Sent: Wednesday, February 22, 2012 12:25 PM > To: solr-user@lucene.apache.org > Subject: Re: Unusually long data import time? > >> Would it be unusual for an import of 160 million documents >> to take 18 hours? Each document is less than 1kb and I >> have the DataImportHandler using the jdbc driver to connect >> to SQL Server 2008. The full-import query calls a stored >> procedure that contains only a select from my target table. >> >> Is there any way I can speed this up? I saw recently someone >> on this list suggested a new user could get all their Solr >> data imported in under an hour. I sure hope that's true! > > Do have autoCommit or autoSoftCommit configured in solrconfig.xml?
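The multi-threaded feed suggested above might be sketched like this in plain Java (illustrative only; indexDoc is a hypothetical stand-in for whatever call sends one document to Solr, e.g. an HTTP POST or a SolrJ add):

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Feeds documents to the indexer from several worker threads at once,
// so the per-document analysis chain can run on as many cores as there
// are workers instead of being serialized on one thread.
class ParallelIndexer {
    private final AtomicInteger indexed = new AtomicInteger();

    // Placeholder for the real per-document indexing call.
    void indexDoc(String doc) {
        indexed.incrementAndGet();
    }

    int indexAll(List<String> docs, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (String doc : docs) {
            pool.submit(() -> indexDoc(doc));
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return indexed.get();
    }
}
```

If the bottleneck is a) the analysis chain, raising the thread count toward the number of spare cores should raise throughput accordingly; if it is b) the database, extra client threads won't help much.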
SnapPull failed :org.apache.solr.common.SolrException: Error opening new searcher
We started observing strange failures from ReplicationHandler when we commit, on a master trunk version 4-5 days old. It works sometimes and sometimes not; didn't dig deeper yet. Looks like the real culprit hides behind: org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed Looks familiar to somebody? 120222 154959 SEVERE SnapPull failed :org.apache.solr.common.SolrException: Error opening new searcher at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1138) at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1251) at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1043) at org.apache.solr.update.DirectUpdateHandler2.commit(Unknown Source) at org.apache.solr.handler.SnapPuller.doCommit(SnapPuller.java:503) at org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:348) at org.apache.solr.handler.ReplicationHandler.doFetch(Unknown Source) at org.apache.solr.handler.SnapPuller$1.run(SnapPuller.java:163) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351) at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:722) Caused by: org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:810) at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:815) at org.apache.lucene.index.IndexWriter.nrtIsCurrent(IndexWriter.java:3984) at 
org.apache.lucene.index.StandardDirectoryReader.doOpenFromWriter(StandardDirectoryReader.java:254) at org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:233) at org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:223) at org.apache.lucene.index.DirectoryReader.openIfChanged(DirectoryReader.java:170) at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1095) ... 15 more
Re: reader/searcher refresh after replication (commit)
Yes, I consciously let my slaves run away from the master in order to reduce update latency, but every now and then they sync up with the master, which is doing the heavy lifting. The price you pay is that slaves do not see the same documents as the master, but this is the case anyhow with replication; in my setup a slave may go ahead of the master with updates, this delta gets zeroed after replication, and the game starts again. What you have to take into account with this is a very small time window where you may "go back in time" on slaves (not seeing documents that were already there), but we are talking about seconds and a couple out of 200M documents (only those documents that were soft-committed on the slave during replication, i.e. between the commit on master and postCommit on slave). Why do you think something is strange here? > What are you expecting a BeforeCommitListener could do for you, if one > would exist? Why should I be expecting something? I just need to read the userCommitData as soon as replication is done, and I am looking for a proper/easy way to do it (a postCommit listener is what I use now). What makes me slightly nervous are those life-cycle questions, e.g. when I issue an update command before and after the postCommit event, which index gets updated: the one just replicated or the one that was there just before replication? There are definitely ways to optimize this, for example to force the replication handler to copy only delta files if the index gets updated on slave and master (there is already a todo somewhere on the Solr replication wiki, I think). Right now the replication handler copies the complete index if this gets detected ... I am all ears if there are better proposals for low-latency updates in a multi-server setup... On Tue, Feb 21, 2012 at 11:53 PM, Em wrote: > Eks, > > that sounds strange! > > Am I getting you right? > You have a master which indexes batch-updates from time to time. 
> Furthermore you got some slaves, pulling data from that master to keep > them up-to-date with the newest batch-updates. > Additionally your slaves index own content in soft-commit mode that > needs to be available as soon as possible. > In consequence the slavesare not in sync with the master. > > I am not 100% certain, but chances are good that Solr's > replication-mechanism only changes those segments that are not in sync > with the master. > > What are you expecting a BeforeCommitListener could do for you, if one > would exist? > > Kind regards, > Em > > Am 21.02.2012 21:10, schrieb eks dev: >> Thanks Mark, >> Hmm, I would like to have this information asap, not to wait until the >> first search gets executed (depends on user) . Is solr going to create >> new searcher as a part of "replication transaction"... >> >> Just to make it clear why I need it... >> I have simple master, many slaves config where master does "batch" >> updates in big chunks (things user can wait longer to see on search >> side) but slaves work in soft commit mode internally where I permit >> them to run away slightly from master in order to know where >> "incremental update" should start, I read it from UserData >> >> Basically, ideally, before commit (after successful replication is >> finished) ends, I would like to read in these counters to let >> "incremental update" run from the right point... >> >> I need to prevent updating "replicated index" before I read this >> information (duplicates can appear) are there any "IndexWriter" >> listeners around? >> >> >> Thanks again, >> eks. >> >> >> >> On Tue, Feb 21, 2012 at 8:03 PM, Mark Miller wrote: >>> Post commit calls are made before a new searcher is opened. >>> >>> Might be easier to try to hook in with a new searcher listener? >>> >>> On Feb 21, 2012, at 8:23 AM, eks dev wrote: >>> >>>> Hi all, >>>> I am a bit confused with IndexSearcher refresh lifecycles... 
>>>> In a master slave setup, I override postCommit listener on slave >>>> (solr trunk version) to read some user information stored in >>>> userCommitData on master >>>> >>>> -- >>>> @Override >>>> public final void postCommit() { >>>> // This returnes "stale" information that was present before >>>> replication finished >>>> RefCounted refC = core.getNewestSearcher(true); >>>> Map userData = >>>> refC.get().getIndexReader().getIndexCommit().getUserData(); >>>> } >>>> >>>> I expected core.getNewestSearcher(true); to return refreshed >>>>
Re: reader/searcher refresh after replication (commit)
And drinks on me to those who decoupled implicit commit from close... this was a tricky trap On Tue, Feb 21, 2012 at 9:10 PM, eks dev wrote: > Thanks Mark, > Hmm, I would like to have this information asap, not to wait until the > first search gets executed (depends on user) . Is solr going to create > new searcher as a part of "replication transaction"... > > Just to make it clear why I need it... > I have simple master, many slaves config where master does "batch" > updates in big chunks (things user can wait longer to see on search > side) but slaves work in soft commit mode internally where I permit > them to run away slightly from master in order to know where > "incremental update" should start, I read it from UserData > > Basically, ideally, before commit (after successful replication is > finished) ends, I would like to read in these counters to let > "incremental update" run from the right point... > > I need to prevent updating "replicated index" before I read this > information (duplicates can appear) are there any "IndexWriter" > listeners around? > > > Thanks again, > eks. > > > > On Tue, Feb 21, 2012 at 8:03 PM, Mark Miller wrote: >> Post commit calls are made before a new searcher is opened. >> >> Might be easier to try to hook in with a new searcher listener? >> >> On Feb 21, 2012, at 8:23 AM, eks dev wrote: >> >>> Hi all, >>> I am a bit confused with IndexSearcher refresh lifecycles... 
>>> In a master slave setup, I override postCommit listener on slave >>> (solr trunk version) to read some user information stored in >>> userCommitData on master >>> >>> -- >>> @Override >>> public final void postCommit() { >>> // This returnes "stale" information that was present before >>> replication finished >>> RefCounted refC = core.getNewestSearcher(true); >>> Map userData = >>> refC.get().getIndexReader().getIndexCommit().getUserData(); >>> } >>> >>> I expected core.getNewestSearcher(true); to return refreshed >>> SolrIndexSearcher, but it didn't >>> >>> When is this information going to be refreshed to the status from the >>> replicated index, I repeat this is postCommit listener? >>> >>> What is the way to get the information from the last commit point? >>> >>> Maybe like this? >>> core.getDeletionPolicy().getLatestCommit().getUserData(); >>> >>> Or I need to explicitly open new searcher (isn't solr does this behind >>> the scenes?) >>> core.openNewSearcher(false, false) >>> >>> Not critical, reopening new searcher works, but I would like to >>> understand these lifecycles, when solr loads latest commit point... >>> >>> Thanks, eks >> >> - Mark Miller >> lucidimagination.com >> >> >> >> >> >> >> >> >> >> >>
Re: reader/searcher refresh after replication (commit)
Thanks Mark, Hmm, I would like to have this information asap, not to wait until the first search gets executed (depends on user) . Is solr going to create new searcher as a part of "replication transaction"... Just to make it clear why I need it... I have simple master, many slaves config where master does "batch" updates in big chunks (things user can wait longer to see on search side) but slaves work in soft commit mode internally where I permit them to run away slightly from master in order to know where "incremental update" should start, I read it from UserData Basically, ideally, before commit (after successful replication is finished) ends, I would like to read in these counters to let "incremental update" run from the right point... I need to prevent updating "replicated index" before I read this information (duplicates can appear) are there any "IndexWriter" listeners around? Thanks again, eks. On Tue, Feb 21, 2012 at 8:03 PM, Mark Miller wrote: > Post commit calls are made before a new searcher is opened. > > Might be easier to try to hook in with a new searcher listener? > > On Feb 21, 2012, at 8:23 AM, eks dev wrote: > >> Hi all, >> I am a bit confused with IndexSearcher refresh lifecycles... >> In a master slave setup, I override postCommit listener on slave >> (solr trunk version) to read some user information stored in >> userCommitData on master >> >> -- >> @Override >> public final void postCommit() { >> // This returnes "stale" information that was present before >> replication finished >> RefCounted refC = core.getNewestSearcher(true); >> Map userData = >> refC.get().getIndexReader().getIndexCommit().getUserData(); >> } >> >> I expected core.getNewestSearcher(true); to return refreshed >> SolrIndexSearcher, but it didn't >> >> When is this information going to be refreshed to the status from the >> replicated index, I repeat this is postCommit listener? >> >> What is the way to get the information from the last commit point? >> >> Maybe like this? 
>> core.getDeletionPolicy().getLatestCommit().getUserData(); >> >> Or I need to explicitly open new searcher (isn't solr does this behind >> the scenes?) >> core.openNewSearcher(false, false) >> >> Not critical, reopening new searcher works, but I would like to >> understand these lifecycles, when solr loads latest commit point... >> >> Thanks, eks > > - Mark Miller > lucidimagination.com > > > > > > > > > > >
reader/searcher refresh after replication (commit)
Hi all,
I am a bit confused with the IndexSearcher refresh lifecycle...

In a master/slave setup, I override the postCommit listener on a slave (Solr trunk version) to read some user information stored in userCommitData on the master:

--
@Override
public final void postCommit() {
    // This returns "stale" information that was present before replication finished
    RefCounted<SolrIndexSearcher> refC = core.getNewestSearcher(true);
    Map<String, String> userData =
        refC.get().getIndexReader().getIndexCommit().getUserData();
}

I expected core.getNewestSearcher(true) to return a refreshed SolrIndexSearcher, but it didn't.

When is this information going to be refreshed to the state from the replicated index? I repeat, this is a postCommit listener.

What is the way to get the information from the last commit point? Maybe like this?
core.getDeletionPolicy().getLatestCommit().getUserData();

Or do I need to explicitly open a new searcher (doesn't Solr do this behind the scenes?)
core.openNewSearcher(false, false)

Not critical, reopening a new searcher works, but I would like to understand these lifecycles, i.e. when Solr loads the latest commit point...

Thanks, eks
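The timing described in this thread (postCommit fires before the cached searcher is reopened, so only the latest commit point carries the fresh userData) can be simulated without Solr. A minimal sketch; all class and method names here are illustrative, not the Solr API:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Map;

public class CommitUserData {
    // Each commit point carries its own userData snapshot, standing in for
    // IndexCommit.getUserData() on a real index.
    final Deque<Map<String, String>> commits = new ArrayDeque<>();
    Map<String, String> cachedSearcherData; // what the stale searcher still sees

    void replicateCommit(Map<String, String> userData) {
        commits.push(userData); // the new commit point lands on disk...
        // ...but the cached searcher is NOT reopened here, mirroring
        // "post commit calls are made before a new searcher is opened".
    }

    Map<String, String> fromCachedSearcher() { return cachedSearcherData; }
    Map<String, String> fromLatestCommit()   { return commits.peek(); }

    public static void main(String[] args) {
        CommitUserData core = new CommitUserData();
        core.cachedSearcherData = Map.of("seq", "100"); // state before replication
        core.replicateCommit(Map.of("seq", "200"));     // replication finishes
        System.out.println("searcher sees:      " + core.fromCachedSearcher().get("seq"));
        System.out.println("latest commit sees: " + core.fromLatestCommit().get("seq"));
    }
}
```

This is why reading from the deletion policy's latest commit works in the postCommit hook while the searcher-based path returns pre-replication data.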
Re: codec="Pulsing" per field broken?
Thanks Robert, I've missed LUCENE-3490... Awesome! On Sun, Dec 11, 2011 at 6:37 PM, Robert Muir wrote: > On Sun, Dec 11, 2011 at 11:34 AM, eks dev wrote: >> on the latest trunk, my schema.xml with field type declaration >> containing //codec="Pulsing"// does not work any more (throws >> exception from FieldType). It used to work wit approx. a month old >> trunk version. >> >> I didn't dig deeper, can be that the old schema.xml was broken and >> worked by accident. >> > > Hi, > > The short answer is, you should change this to //postingsFormat="Pulsing40"// > See > http://svn.apache.org/repos/asf/lucene/dev/trunk/solr/core/src/test-files/solr/conf/schema_codec.xml > > The longer answer is that the Codec API in lucene trunk was extended recently: > https://issues.apache.org/jira/browse/LUCENE-3490 > > Previously "Codec" only allowed you to customize the format of the > postings lists. > We are working to have it cover the entire index segment (at the > moment nearly everything except deletes and encoding of compound files > can be customized). > > For example, look at SimpleText now: > http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/src/java/org/apache/lucene/index/codecs/simpletext/ > As you see, it now implements plain-text stored fields, term vectors, > norms, segments file, fieldinfos, etc. > See Codec.java > (http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/src/java/org/apache/lucene/index/codecs/Codec.java) > or LUCENE-3490 for more details. > > Because of this, what you had before is now just "PostingsFormat", as > Pulsing is just a wrapper around a postings implementation that > inlines low frequency terms. > Lucene's default Codec uses a per-field postings setup, so you can > still configure the postings per-field, just differently. > > -- > lucidimagination.com
codec="Pulsing" per field broken?
on the latest trunk, my schema.xml with a field type declaration containing //codec="Pulsing"// does not work any more (throws an exception from FieldType). It used to work with an approx. one-month-old trunk version.

I didn't dig deeper; it could be that the old schema.xml was broken and worked by accident.

org.apache.solr.common.SolrException: Plugin Initializing failure for [schema.xml] fieldType
    at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:183)
    at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:368)
    at org.apache.solr.schema.IndexSchema.<init>(IndexSchema.java:107)
    at org.apache.solr.core.CoreContainer.create(CoreContainer.java:651)
    at org.apache.solr.core.CoreContainer.load(CoreContainer.java:409)
    at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:243)
    at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:93)
    at org.mortbay.jetty.servlet.FilterHolder.doStart(FilterHolder.java:97)
    at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
    at org.mortbay.jetty.servlet.ServletHandler.initialize(ServletHandler.java:713)
    at org.mortbay.jetty.servlet.Context.startContext(Context.java:140)
    at org.mortbay.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1282)
    at org.mortbay.jetty.handler.ContextHandler.doStart(ContextHandler.java:518)
    at org.mortbay.jetty.webapp.WebAppContext.doStart(WebAppContext.java:499)
    at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
    at org.mortbay.jetty.handler.HandlerWrapper.doStart(HandlerWrapper.java:130)
    at org.mortbay.jetty.Server.doStart(Server.java:224)
    at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
    at runjettyrun.Bootstrap.main(Bootstrap.java:86)
Caused by: java.lang.RuntimeException: schema fieldtype storableCity(X.StorableField) invalid arguments:{codec=Pulsing}
    at org.apache.solr.schema.FieldType.setArgs(FieldType.java:177)
    at org.apache.solr.schema.FieldTypePluginLoader.init(FieldTypePluginLoader.java:127)
    at org.apache.solr.schema.FieldTypePluginLoader.init(FieldTypePluginLoader.java:43)
    at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:180)
    ... 18 more
Re: capacity planning
Re. "I have little experience with VM servers for search."

We had a huge performance penalty on VMs; CPU was the bottleneck. We couldn't freely run measurements to figure out what the problem really was (hosting was contracted by the customer...), but it was something pretty scary, around 8-10 times slower than the advertised dedicated equivalent.

Whatever it's worth: if you can afford it, keep Lucene away from VMs. Lucene is a highly optimized machine, and someone twiddling with context switches is not welcome there. Of course, if you get IO bound, it makes no big difference anyhow.

This is just my singular experience; it might be that the hosting team did not configure it right, or something has changed in the meantime (~4 years old experience), but we burnt our fingers so hard that I still remember it.

On Tue, Oct 11, 2011 at 7:49 PM, Toke Eskildsen wrote:
> Travis Low [t...@4centurion.com] wrote:
> > Toke, thanks. Comments embedded (hope that's okay):
>
> Inline or top-posting? Long discussion, but for mailing lists I clearly
> prefer the former.
>
> [Toke: Estimate characters]
>
> > Yes. We estimate each of the 23K DB records has 600 pages of text for the
> > combined documents, 300 words per page, 5 characters per word. Which
> > coincidentally works out to about 21GB, so good guessing there. :)
>
> Heh. Lucky Guess indeed, although the factors were off. Anyway, 21GB does
> not sound scary at all.
>
> > The way it works is we have researchers modifying the DB records during the
> > day, and they may upload documents at that time. We estimate 50-60 uploads
> > throughout the day. If possible, we'd like to index them as they are
> > uploaded, but if that would negatively affect the search, then we can
> > rebuild the index nightly.
> >
> > Which is better?
>
> The analyzing part is only CPU and you're running multi-core, so as long as
> you only analyze using one thread you're safe there. That leaves us with
> I/O: even for spinning drives, a daily load of just 60 updates of 1MB of
> extracted text each shouldn't have any real effect - with the usual caveat
> that large merges should be avoided by either optimizing at night or
> tweaking merge policy to avoid large segments. With such a relatively small
> index, (re)opening and warm up should be painless too.
>
> Summary: 300GB is a fair amount of data and takes some power to crunch.
> However, in the Solr/Lucene end your index size and your update rates are
> nothing to worry about. Usual caveat for advanced use and all that applies.
>
> [Toke: i7, 8GB, 1TB spinning, 256GB SSD]
>
> > We have a very beefy VM server that we will use for benchmarking, but your
> > specs provide a starting point. Thanks very much for that.
>
> I have little experience with VM servers for search. Although we use a lot
> of virtual machines, we use dedicated machines for our searchers, primarily
> to ensure low latency for I/O. They might be fine for that too, but we
> haven't tried it yet.
>
> Glad to be of help,
> Toke
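The back-of-the-envelope estimate quoted in this thread (23K records x 600 pages x 300 words x 5 characters per word, roughly one byte per character) checks out; a small sketch with the factors taken straight from the emails:

```java
public class SizeEstimate {
    // Rough corpus size: records * pages/record * words/page * chars/word,
    // treating one character of extracted text as one byte.
    static double estimateGb(long records, long pagesPerRecord,
                             long wordsPerPage, long charsPerWord) {
        double bytes = (double) records * pagesPerRecord * wordsPerPage * charsPerWord;
        return bytes / 1e9;
    }

    public static void main(String[] args) {
        // Factors from the thread: 23K records, 600 pages, 300 words, 5 chars.
        System.out.printf("~%.1f GB%n", estimateGb(23_000, 600, 300, 5)); // ~20.7 GB
    }
}
```

That lands at ~20.7 GB, matching the "about 21GB" figure in the discussion.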
Re: Update ingest rate drops suddenly
Just to bring closure on this one, we were slurping data from the wrong DB (hardly desktop class machine)... Solr did not cough on 41Mio records @34k updates / sec., single threaded. Great! On Sat, Sep 24, 2011 at 9:18 PM, eks dev wrote: > just looking for hints where to look for... > > We were testing single threaded ingest rate on solr, trunk version on > atypical collection (a lot of small documents), and we noticed > something we are not able to explain. > > Setup: > We use defaults for index settings, windows 64 bit, jdk 7 U2. on SSD, > machine with enough memory and 8 cores. Schema has 5 stored fields, > 4 of them indexed no positions no norms. > Average net document size (optimized index size / number of documents) > is around 100 bytes. > > On a test with 40 Mio document: > - we had update ingest rate on first 4,4Mio documents @ incredible > 34k records / second... > - then it dropped, suddenly to 20k records per second and this rate > remained stable (variance 1k) until... > - we hit 13Mio, where ingest rate dropped again really hard, from one > instant in time to another to 10k records per second. > > it stayed there until we reached the end @40Mio (slightly reducing, to > ca 9k, but this is not long enough to see trend). > > Nothing unusual happening with jvm memory ( tooth-saw 200- 450M fully > regular). CPU in turn was following the ingest rate trend, inicating > that we were waiting on something. No searches , no commits, nothing. > > autoCommit was turned off. Updates were streaming directly from the database. > > - > I did not expect something like this, knowing lucene merges in > background. Also, having such sudden drops in ingest rate is > indicative that we are not leaking something. (drop would have been > much more gradual). It is some caches, but why two really significant > drops? 33k/sec to 20k and than to 10k... 
We would love to keep it @34 > k/second :) > > I am not really acquainted with the new MergePolicy and flushing > settings, but I suspect this is something there we could tweak. > > Could it be windows is somehow, hmm, quirky with solr default > directory on win64/jvm (I think it is MMAP by default)... We did not > saturate IO with such a small documents I guess, It is a just couple > of Gig over 1-2 hours. > > All in all, it works good, but is having such hard update ingest rate > drops normal? > > Thanks, > eks. >
Re: Update ingest rate drops suddenly
Thanks Otis, we will look into these issues again, slightly deeper. Network problems are not likely, but the DB, I do not know; this is a huge select... We will try to scan the DB without indexing, just to see if it can sustain the rate. But gut feeling says: nope, this is not the one.

IO saturation would surprise me, but you never know. It might very well be that the SSD is somehow having problems with this sustained throughput.

8 cores... no, this was a single update thread.

We left the default index settings (do not tweak if it works :), i.e. 32. 32MB sounds like a lot of our documents (100b average on-disk size). Assuming a RAM efficiency of 50% (?), we land at ~100k buffered documents. Yes, this is kind of smallish, as every ~3 seconds we fill up the ramBuffer (our analyzers surprised me with 30k+ records per second). 256 will do the job; ~24 seconds should be plenty of "idle" time for IO/OS/JVM to sort out MMAP issues, if any (Windows was never an MMAP performance champion when used from Java, but once you dance around it, it works ok)...

Max JVM heap on this test was 768m; memory never went above 500m. Using -XX:-UseParallelGC ... this is definitely not a GC problem.

cheers, eks

On Sun, Sep 25, 2011 at 6:20 AM, Otis Gospodnetic wrote:
> eks,
>
> This is clear as day - you're using Winblows! Kidding.
>
> I'd:
> * watch IO with something like vmstat 2 and see if the rate drops correlate
> to increased disk IO or IO wait time
> * monitor the DB from which you were pulling the data - maybe the DB or the
> server that runs it had issues
> * monitor the network over which you pull data from DB
>
> If none of the above reveals the problem I'd still:
> * grab all data you need to index and copy it locally
> * index everything locally
>
> Out of curiosity, how big is your ramBufferSizeMB and your -Xmx?
> And on that 8-core box you have ~8 indexing threads going?
> > Otis > > Sematext is Hiring -- http://sematext.com/about/jobs.html > > > > >> >>From: eks dev >>To: solr-user >>Sent: Saturday, September 24, 2011 3:18 PM >>Subject: Update ingest rate drops suddenly >> >>just looking for hints where to look for... >> >>We were testing single threaded ingest rate on solr, trunk version on >>atypical collection (a lot of small documents), and we noticed >>something we are not able to explain. >> >>Setup: >>We use defaults for index settings, windows 64 bit, jdk 7 U2. on SSD, >>machine with enough memory and 8 cores. Schema has 5 stored fields, >>4 of them indexed no positions no norms. >>Average net document size (optimized index size / number of documents) >>is around 100 bytes. >> >>On a test with 40 Mio document: >>- we had update ingest rate on first 4,4Mio documents @ incredible >>34k records / second... >>- then it dropped, suddenly to 20k records per second and this rate >>remained stable (variance 1k) until... >>- we hit 13Mio, where ingest rate dropped again really hard, from one >>instant in time to another to 10k records per second. >> >>it stayed there until we reached the end @40Mio (slightly reducing, to >>ca 9k, but this is not long enough to see trend). >> >>Nothing unusual happening with jvm memory ( tooth-saw 200- 450M fully >>regular). CPU in turn was following the ingest rate trend, inicating >>that we were waiting on something. No searches , no commits, nothing. >> >>autoCommit was turned off. Updates were streaming directly from the database. >> >>- >>I did not expect something like this, knowing lucene merges in >>background. Also, having such sudden drops in ingest rate is >>indicative that we are not leaking something. (drop would have been >>much more gradual). It is some caches, but why two really significant >>drops? 33k/sec to 20k and than to 10k... 
We would love to keep it @34 >>k/second :) >> >>I am not really acquainted with the new MergePolicy and flushing >>settings, but I suspect this is something there we could tweak. >> >>Could it be windows is somehow, hmm, quirky with solr default >>directory on win64/jvm (I think it is MMAP by default)... We did not >>saturate IO with such a small documents I guess, It is a just couple >>of Gig over 1-2 hours. >> >>All in all, it works good, but is having such hard update ingest rate >>drops normal? >> >>Thanks, >>eks. >> >> >>
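The arithmetic in the reply above (32MB of ramBuffer holding ~100k of these tiny documents, filled every ~3 seconds at ~34k docs/sec, so 256MB buys ~24 seconds) can be sketched directly. The docs-per-32MB figure comes from the thread; the rest is linear scaling:

```java
public class RamBufferEstimate {
    // Seconds until the ramBuffer fills, scaling linearly from the
    // ~100k-documents-per-32MB figure mentioned in the thread.
    static double secondsToFill(double ramBufferMb, double docsPer32Mb, double docsPerSec) {
        double bufferedDocs = ramBufferMb / 32.0 * docsPer32Mb;
        return bufferedDocs / docsPerSec;
    }

    public static void main(String[] args) {
        System.out.printf("32MB:  %.1f s%n", secondsToFill(32, 100_000, 34_000));  // ~2.9 s
        System.out.printf("256MB: %.1f s%n", secondsToFill(256, 100_000, 34_000)); // ~23.5 s
    }
}
```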
Update ingest rate drops suddenly
just looking for hints where to look...

We were testing single-threaded ingest rate on Solr, trunk version, on an atypical collection (a lot of small documents), and we noticed something we are not able to explain.

Setup: We use defaults for index settings, Windows 64 bit, JDK 7 U2, on SSD, a machine with enough memory and 8 cores. The schema has 5 stored fields, 4 of them indexed, no positions, no norms. Average net document size (optimized index size / number of documents) is around 100 bytes.

On a test with 40 Mio documents:
- we had an update ingest rate on the first 4.4 Mio documents @ an incredible 34k records/second...
- then it dropped, suddenly, to 20k records per second, and this rate remained stable (variance 1k) until...
- we hit 13 Mio, where the ingest rate again dropped really hard, from one instant to the next, to 10k records per second.

It stayed there until we reached the end @40 Mio (slightly reducing, to ca. 9k, but this is not long enough to see a trend).

Nothing unusual was happening with JVM memory (fully regular 200-450M sawtooth). CPU in turn was following the ingest rate trend, indicating that we were waiting on something. No searches, no commits, nothing. autoCommit was turned off. Updates were streaming directly from the database.

-
I did not expect something like this, knowing Lucene merges in the background. Also, having such sudden drops in ingest rate indicates that we are not leaking something (a leak's drop would have been much more gradual). It is some caches, but why two really significant drops? 33k/sec to 20k and then to 10k... We would love to keep it @34k/second :)

I am not really acquainted with the new MergePolicy and flushing settings, but I suspect this is something there we could tweak.

Could it be Windows is somehow, hmm, quirky with the Solr default directory on win64/JVM (I think it is MMAP by default)... We did not saturate IO with such small documents I guess; it is just a couple of gigs over 1-2 hours.

All in all, it works well, but is having such hard update ingest rate drops normal?

Thanks, eks.
Which Solr / Lucene directory for ramfs?
probably a stupid question: which Directory implementation is best suited for an index mounted on ramfs/tmpfs? I guess plain old FSDirectory (or mmap/nio?)
Re: DataImportHandler using new connection on each query
watch out: "running 10 hours" != "idling 10 seconds" and trying again. Those are different cases. It is not dropping *used* connections (good to know it works that well, thanks for reporting!), just not reusing connections that have been idle for more than 10 seconds.

On Fri, Sep 2, 2011 at 10:26 PM, Gora Mohanty wrote:
> On Sat, Sep 3, 2011 at 1:38 AM, Shawn Heisey wrote:
> [...]
>> I use DIH with MySQL. When things are going well, a full rebuild will leave
>> connections open and active for over two hours. This is the case with
>> 1.4.0, 1.4.1, 3.1.0, and 3.2.0. Due to some kind of problem on the database
>> server, last night I had a rebuild going for more than 11 hours with no
>> problems, verified from the processlist on the server.
>
> Will second that. Have had DIH connections open to both
> mysql, and MS-SQL for 8-10h. Dropped connections could
> be traced to network issues, or some other exception.
>
> Regards,
> Gora
Re: DataImportHandler using new connection on each query
I am not sure if the current version has this, but DIH used to reload connections after some idle time:

if (currTime - connLastUsed > CONN_TIME_OUT) {
    synchronized (this) {
        Connection tmpConn = factory.call();
        closeConnection();
        connLastUsed = System.currentTimeMillis();
        return conn = tmpConn;
    }
}

where CONN_TIME_OUT = 10 seconds.

On Fri, Sep 2, 2011 at 12:36 AM, Chris Hostetter wrote:
>
> : However, I tested this against a slower SQL Server and I saw
> : dramatically worse results. Instead of re-using their database, each of
> : the sub-entities is recreating a connection each time the query runs.
>
> are you seeing any specific errors logged before these new connections are
> created?
>
> I don't *think* there's anything in the DIH JDBC/SQL code that causes it
> to timeout existing connections -- is it possible this is something
> specific to the JDBC Driver you are using?
>
> Or maybe you are using the DIH "threads" option along with a JNDI/JDBC
> based pool of connections that is configured to create new Connections on
> demand, and with the fast DB it can reuse them but on the slow DB it does
> enough stuff in parallel to keep asking for new connections to be created?
>
> If it's DIH creating new connections over and over then i'm pretty sure
> you should see an INFO level log message like this for each connection...
>
> LOG.info("Creating a connection for entity "
>     + context.getEntityAttribute(DataImporter.NAME) + " with URL: "
>     + url);
>
> ...are those messages different against your fast DB and your slow DB?
>
> -Hoss
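That reuse-or-recreate pattern is easy to sketch in a self-contained form. This is only an illustration of the idea, not the actual DIH code: the Supplier-based factory, the injected clock, and the class name are all hypothetical:

```java
import java.util.function.Supplier;

public class IdleTimeoutConnection {
    static final long CONN_TIME_OUT = 10_000; // 10 seconds of allowed idle time

    private final Supplier<Object> factory; // stands in for the JDBC connection factory
    private Object conn;
    private long connLastUsed;

    IdleTimeoutConnection(Supplier<Object> factory) {
        this.factory = factory;
    }

    // Reuse the connection if it was last used within CONN_TIME_OUT,
    // otherwise create a fresh one; 'now' is injected to keep this testable.
    synchronized Object get(long now) {
        if (conn == null || now - connLastUsed > CONN_TIME_OUT) {
            conn = factory.get();
        }
        connLastUsed = now;
        return conn;
    }

    public static void main(String[] args) {
        IdleTimeoutConnection pool = new IdleTimeoutConnection(Object::new);
        Object first = pool.get(0);
        System.out.println("reused after 5s idle:    " + (pool.get(5_000) == first));
        System.out.println("recreated after 15s idle: " + (pool.get(20_000) != first));
    }
}
```

A long-running import keeps touching the connection, so it never hits the timeout; only a connection left idle past the threshold is replaced, which is exactly the "running 10 hours" vs "idling 10 seconds" distinction made earlier in the thread.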
NRT in Master-Slave setup, crazy?
Thinking aloud and grateful for sparring..

I need to support a high commit rate (low update latency) in a master-slave setup, and I have bad feelings about it, even with disabling warmup and stripping down everything that slows down refresh. I will try it anyway, but I started thinking about a "backup plan", like NRT on slaves.

The idea is to have the master working on disk, doing commits in a throughput-friendly manner (e.g. every 5-10 minutes), but to let the slaves apply the same updates with softCommit. I am basically going to let the slaves "possibly run out of sync" with the master, by issuing the same updates on all slaves with softCommit... every now and then syncing with the master.

Could this work? The trick is, the index is big (can fit in ca. 16-20G RAM), but the update rate is small and unevenly distributed in time (every couple of seconds a few documents); one hard commit on master + slave update would probably cost much more than add(document) with softCommit on every slave (2-5 of them).

So all in all, the master remains the real master and is there to ensure:
a) seeding if a slave restarts
b) an authoritative master index, if slaves run out of sync (a small diff is ok if they get corrected once a day)

In general, do you find this idea wrong for some reason? Should I be doing something else/better to achieve low update latency in master-slave (for low update throughput)? Is there anything I can do to make standard master-slave latency better, apart from disabling warmup? Would loading an OS ramdisk (tmpfs forced in RAM) on the slaves bring much?

I am talking about a ca. 1 second (plus/minus) update latency target from update to search on a slave... but not more than 0.5-2 updates every second. And from what I have understood so far about how Solr works, this is going to be possible only with NRT on slaves (analysis in my case is fast, so not an issue)...
Re: DIH on sequence (or any type that supports ordering) possible?
Thanks Shawn, nice! I didn't notice you can pass more params all the way to SQL.

So you really do not care about the DIH incremental facility; you use it just as a vehicle to provide:
- SQL import
- a transactional commit to Solr on updates

But keeping DB/Solr in sync is externalized (I am trying to find a simple/robust solution for this part as well...). I am researching possibilities to get this information from the Lucene index itself ("what was the last document added?"), and then read the stored ID field from it to feed a DIH query like yours.

This should be an easy question for Solr/Lucene to answer, but I really do not know a simple and fast way...

cheers, eks

On Sat, Aug 6, 2011 at 8:32 PM, Shawn Heisey wrote:
> On 8/6/2011 8:49 AM, eks dev wrote:
>>
>> I would appreciate some clarifications about DIH
>>
>> I do not have reliable timestamp, but I do have atomic sequence that
>> only grows on inserts/changes.
>
> I use DIH, but I don't use the built-in timestamp facility at all. I have
> an autoincrement field in a MySQL database that tells me what's new. Here
> are the three queries I have defined in dih-config.xml:
>
> query="
>   SELECT * FROM ${dataimporter.request.dataView}
>   WHERE (
>     (
>       did > ${dataimporter.request.minDid}
>       AND did <= ${dataimporter.request.maxDid}
>     )
>     ${dataimporter.request.extraWhere}
>   ) AND (crc32(did) % ${dataimporter.request.numShards})
>   IN (${dataimporter.request.modVal})
> "
> deltaImportQuery="
>   SELECT * FROM ${dataimporter.request.dataView}
>   WHERE (
>     (
>       did > ${dataimporter.request.minDid}
>       AND did <= ${dataimporter.request.maxDid}
>     )
>     ${dataimporter.request.extraWhere}
>   ) AND (crc32(did) % ${dataimporter.request.numShards})
>   IN (${dataimporter.request.modVal})
> "
> deltaQuery="SELECT 1 AS did"
>
> If you look carefully, you'll notice that query and deltaImportQuery are
> identical, and deltaQuery is just something that always returns a value. I
> keep track of did (the primary key for both dih-config and the database) in
> my build system, passing in minDid and maxDid parameters on the DIH URL to
> tell it what to index. I include more parameters to handle sharding and
> special situations. I actually use a different field (with its own unique
> MySQL index) as Solr's uniqueKey.
>
> Currently Solr does not support keeping track of arbitrary data, just the
> current timestamp ... but if you can track it outside of Solr and pass the
> appropriate parameters in with the full-import or delta-import request, you
> can do almost anything.
>
> This is on Solr 3.2, but I used a similar setup when I was running 1.4.1 as
> well.
>
> Shawn
DIH on sequence (or any type that supports ordering) possible?
I would appreciate some clarifications about DIH.

I do not have a reliable timestamp, but I do have an atomic sequence that only grows on inserts/changes. You can understand it as a timestamp in some funky timezone not related to wall-clock time; it is an integer type.

Is DIH keeping track of MAX(committed timestamp), or does it expect the timestamp in the DB to be wall-clock time? If it expects a wall-clock timestamp, casting the integer sequence value to a timestamp (like a number of seconds since a constant point in time) at reading time would not work... Ideally for my case, DIH should keep MAX(whatever_field_specified)...

Maybe an idea would be to modify DIH to support passing the max (of the specified field in the DIH config) to Lucene's IndexWriter.commit(Map commitUserData). Later, just read IndexReader.getCommitUserData() and pass it to SQL as ${last.committed.sequence}. This would have the charming property, in a master-slave setup, of continuing to work after a master failover without touching anything; every slave could take over at any time.

The second question is related to the delta queries as well. I know I have no deletes/modifications in my DataSource, only additions. Can I prevent DIH from trying to resolve deletes? My delta is fully qualified by:

select * from source_table where my_sequence > ${last.committed.sequence}

I imagine this step takes a lot of time, looking up every document ID in the index?

Thanks in advance, eks
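The round trip proposed above (store the max sequence in the commit userData, read it back after a restart or failover, substitute it into the delta SQL) can be sketched without Solr. All names here are illustrative; the map merely stands in for IndexWriter.commit(Map) / IndexReader.getCommitUserData():

```java
import java.util.HashMap;
import java.util.Map;

public class SequenceDelta {
    // Simulated commit userData, standing in for the userData that
    // Lucene persists with each commit point.
    private final Map<String, String> commitUserData = new HashMap<>();

    // Record the highest sequence value seen in this indexing round.
    void commit(long maxSequenceSeen) {
        commitUserData.put("last.committed.sequence", Long.toString(maxSequenceSeen));
    }

    // Build the fully qualified delta query from the stored sequence.
    String deltaQuery() {
        String seq = commitUserData.getOrDefault("last.committed.sequence", "0");
        return "select * from source_table where my_sequence > " + seq;
    }

    public static void main(String[] args) {
        SequenceDelta d = new SequenceDelta();
        d.commit(41_000_000L);
        System.out.println(d.deltaQuery());
    }
}
```

Because the sequence travels with the replicated index itself, any slave promoted to master could resume incremental updates from the right point, which is the failover property described above.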
Re: Matching queries on a per-element basis against a multivalued field
Sure, I know... The point I was trying to make: "if someone serious like Lucid is using Solr 4.x as a core technology for its own customers, the trunk cannot be all that bad" => a release date not as far off as 2012 :)

On Tue, Aug 2, 2011 at 11:33 PM, Smiley, David W. wrote:
> "LucidWorks Enterprise" (which is more than Solr, and a modified Solr at
> that) isn't free; so you can't extract the Solr part of that package and use
> it unless you are willing to pay them.
>
> Lucid's "Certified Solr", on the other hand, is free. But they have yet to
> bump that to trunk/4.x; it was only recently updated to 3.2.
>
> On Aug 2, 2011, at 5:26 PM, eks dev wrote:
>
>> Well, Lucid released "LucidWorks Enterprise"
>> with "Complete Apache Solr 4.x Release Integrated and tested with
>> powerful enhancements"
>>
>> Whatever it means for solr 4.0
>>
>> On Tue, Aug 2, 2011 at 11:10 PM, David Smiley (@MITRE.org)
>> wrote:
>>> My best guess (and it is just a guess) is between December and March.
>>>
>>> The roots of Solr 4 which triggered the major version change is known as
>>> "flexible indexing" (or just "flex" for short amongst developers). The
>>> genesis of it was posted to JIRA as a patch on 18 November 2008 --
>>> LUCENE-1458 (almost 3 years ago!). About a year later it was committed into
>>> a special flex branch that is probably gone now, and then around
>>> April/early-May 2010, it went into trunk whereas the pre-flex code on trunk
>>> went to a newly formed 3x branch. That is ancient history now, and there are
>>> some amazing performance improvements tied to flex that haven't seen the
>>> light of day in an official release. It's a shame, really. So it's been so
>>> long that, well, after it dawns on everyone that the code is 3
>>> friggin years old without a release -- it's time to get on with the show. 
>>> >>> ~ David Smiley >>> >>> - >>> Author: https://www.packtpub.com/solr-1-4-enterprise-search-server/book >>> -- >>> View this message in context: >>> http://lucene.472066.n3.nabble.com/Matching-queries-on-a-per-element-basis-against-a-multivalued-field-tp3217432p3220242.html >>> Sent from the Solr - User mailing list archive at Nabble.com. >>> > >
Re: Matching queries on a per-element basis against a multivalued field
Well, Lucid released "LucidWorks Enterprise" with "Complete Apache Solr 4.x Release Integrated and tested with powerful enhancements". Whatever that means for Solr 4.0. On Tue, Aug 2, 2011 at 11:10 PM, David Smiley (@MITRE.org) wrote: > My best guess (and it is just a guess) is between December and March. > > The roots of Solr 4 which triggered the major version change is known as > "flexible indexing" (or just "flex" for short amongst developers). The > genesis of it was posted to JIRA as a patch on 18 November 2008 -- > LUCENE-1458 (almost 3 years ago!). About a year later it was committed into > a special flex branch that is probably gone now, and then around > April/early-May 2010, it went into trunk whereas the pre-flex code on trunk > went to a newly formed 3x branch. That is ancient history now, and there are > some amazing performance improvements tied to flex that haven't seen the > light of day in an official release. It's a shame, really. So it's been so > long that, well, after it dawns on everyone that the code is 3 > friggin years old without a release -- it's time to get on with the show. > > ~ David Smiley > > - > Author: https://www.packtpub.com/solr-1-4-enterprise-search-server/book > -- > View this message in context: > http://lucene.472066.n3.nabble.com/Matching-queries-on-a-per-element-basis-against-a-multivalued-field-tp3217432p3220242.html > Sent from the Solr - User mailing list archive at Nabble.com. >
Re: conditionally update document on unique id
Hi Yonik, as this recommendation comes from you, I am not going to test it; you are well known as a speed junkie ;) While we are there (in SignatureUpdateProcessor): why is this code not moved to the constructor, instead of remaining in processAdd? ... Signature sig = (Signature) req.getCore().getResourceLoader().newInstance(signatureClass); sig.init(params); ... Should we be expecting on-the-fly changes of signatureClass / params? I am still not all that familiar with Solr life cycles... might be a stupid question. Thanks, eks On Wed, Jun 29, 2011 at 10:36 PM, Yonik Seeley wrote: > On Wed, Jun 29, 2011 at 4:32 PM, eks dev wrote: >> req.getSearcher().getFirstMatch(t) != -1; > > Yep, this is currently the fastest option we have. > > -Yonik > http://www.lucidimagination.com >
Re: conditionally update document on unique id
Thanks Shalin! Would you not expect req.getSearcher().docFreq(t) to be slightly faster? Or maybe even req.getSearcher().getFirstMatch(t) != -1. Which one should be faster; any known side effects? On Wed, Jun 29, 2011 at 1:45 PM, Shalin Shekhar Mangar wrote: > On Wed, Jun 29, 2011 at 2:01 AM, eks dev wrote: > >> Quick question, >> Is there a way with solr to conditionally update document on unique >> id? Meaning, default, add behavior if id is not already in index and >> *not to touch index* if already there. >> >> Deletes are not important (no sync issues). >> >> I am asking because I noticed with deduplication turned on, >> index-files get modified even if I update the same documents again >> (same signatures). >> I am facing very high dupes rate (40-50%), and setup is going to be >> master-slave with high commit rate (requirement is to reduce >> propagation latency for updates). Having unnecessary index >> modifications is going to waste "effort" to ship the same information >> again and again. >> >> if there is no standard way, what would be the fastest way to check if >> Term exists in index from UpdateRequestProcessor? >> >> > I'd suggest that you use the searcher's getDocSet with a TermQuery. > > Use the SolrQueryRequest#getSearcher so you don't need to worry about ref > counting. > > e.g. req.getSearcher().getDocSet(new TermQuery(new Term(signatureField, > sigString))).size(); > > > >> I intend to extend SignatureUpdateProcessor to prevent a document from >> propagating down the chain if this happens? >> Would that be a way to deal with it? I repeat, there are no deletes to >> make headaches with synchronization >> >> > Yes, that should be fine. > > -- > Regards, > Shalin Shekhar Mangar. >
Re: Using RAMDirectoryFactory in Master/Slave setup
Sure, SSD or RAM disks fix these problems with IO. Anyhow, I can see no real alternative to some in-memory index for slaves, especially for low-latency master-slave apps (a high commit rate is a problem). Having the possibility to run slaves in memory that slurp updates from the master seems to me like a preferred method (you need no twiddling with the OS; CPU and RAM are all you need for your slaves: run a slave and point it to the master). I assume that update propagation times could be better by having some sexy ReadOnlySlaveRAMDirectorySlurpingUpdatesFromTheMaster that does reload() directly from the master (maybe even uncommitted, somehow NRT-ish). The point being: lower update latency than the current 1-5 minutes (wiki-recommended values) is not going to be possible with the current master-slave solution, due to the nature of it (commit to disk on master, copy delta to slave disk, reload...). This is a lot of ping-pong... ES and Solandra are by nature better suited if you need update propagation in the seconds range. This is just thinking aloud, and slightly off-topic... Solr/Lucene as it is today rocks anyhow. On Wed, Jun 29, 2011 at 10:55 AM, Toke Eskildsen wrote: > On Wed, 2011-06-29 at 09:35 +0200, eks dev wrote: >> In MMAP, you need to have really smart warm up (MMAP) to beat IO >> quirks, for RAMDir you need to tune gc(), choose your poison :) > > Other alternatives are operating system RAM disks (avoids the GC > problem) and using SSDs (nearly the same performance as RAM). > >
Re: Using RAMDirectoryFactory in Master/Slave setup
...Using RAMDirectory really does not help performance... I kind of agree, but in my experience with Lucene there are cases where RAMDirectory helps a lot, with all its drawbacks (huge heap and gc() tuning). We had very good experience with MMAP on average, but moving to RAMDirectory with properly tuned gc() eliminated 95% of the "slow performers" in the upper range of response times (e.g. the slowest 5% of queries). On average it made practically no difference. Maybe this is mitigated by better warm-up in Solr than our hand-tuned warm-up, maybe not; I do not really know. In MMAP, you need a really smart warm-up to beat IO quirks; for RAMDir you need to tune gc(). Choose your poison :) I argue that in some cases it is very hard to tame IO quirks (e.g. disk is a shared resource; you never know what is really going on in a shared app setup!). Then look at what happens on a major merge, and at all these efforts with the native Linux directory to somehow get a grip on that... If you have spare RAM, you are probably safer with RAMDirectory. From the theoretical perspective, in the ideal case, RAM ought to be faster than disk (and more expensive). If this is not the case, we did something wrong. I have a feeling that the work Mike is doing with in-memory codecs (FST term dictionary, pulsing codec & co) in Lucene 4, native directory features... will make RAMDirectory really obsolete for production setups. Cheers, eks On Wed, Jun 29, 2011 at 6:00 AM, Lance Norskog wrote: > Using RAMDirectory really does not help performance. Java garbage > collection has to work around all of the memory taken by the segments. > It works out that Solr works better (for most indexes) without using > the RAMDirectory. > > > > On Sun, Jun 26, 2011 at 2:07 PM, nipunb wrote: >> PS: Sorry if this is a repost, I was unable to see my message in the mailing >> list - this may have been due to my outgoing email different from the one I >> used to subscribe to the list with. 
>> >> Overview – Trying to evaluate if keeping the index in memory using >> RAMDirectoryFactory can help in query performance. I am trying to perform the >> indexing on the master using solr.StandardDirectoryFactory and make those >> indexes accessible to the slave using solr.RAMDirectoryFactory >> >> Details: >> We have set up Solr in a master/slave environment. The index is built on the >> master and then replicated to slaves which are used to serve the query. >> The replication is done using the in-built Java replication in Solr. >> On the master, in solrconfig.xml we have >> <directoryFactory name="DirectoryFactory" class="solr.StandardDirectoryFactory"/> >> >> On the slave, I tried to use the following: >> <directoryFactory name="DirectoryFactory" class="solr.RAMDirectoryFactory"/> >> >> My slave shows no data for any queries. In solrconfig.xml it is mentioned >> that replication doesn’t work when using RAMDirectoryFactory, however this ( >> https://issues.apache.org/jira/browse/SOLR-1379) mentions that you can use >> it to have the index on disk and then load into memory. >> >> To test the sanity of my set-up, I changed solrconfig.xml on the slave back to >> <directoryFactory name="DirectoryFactory" class="solr.StandardDirectoryFactory"/> >> and replicated; I was able to see the results. >> >> Shouldn’t RAMDirectoryFactory be used for reading the index from disk into >> memory? >> >> Any help/pointers in the right direction would be appreciated. >> >> Thanks! >> >> -- >> View this message in context: >> http://lucene.472066.n3.nabble.com/Using-RAMDirectoryFactory-in-Master-Slave-setup-tp3111792p3111792.html >> Sent from the Solr - User mailing list archive at Nabble.com. >> > > > > -- > Lance Norskog > goks...@gmail.com >
conditionally update document on unique id
Quick question: is there a way with Solr to conditionally update a document on unique id? Meaning: the default add behavior if the id is not already in the index, and *not to touch the index* if it is already there. Deletes are not important (no sync issues). I am asking because I noticed that with deduplication turned on, index files get modified even if I update the same documents again (same signatures). I am facing a very high dupe rate (40-50%), and the setup is going to be master-slave with a high commit rate (the requirement is to reduce propagation latency for updates). Unnecessary index modifications are going to waste "effort" shipping the same information again and again. If there is no standard way, what would be the fastest way to check if a Term exists in the index from an UpdateRequestProcessor? I intend to extend SignatureUpdateProcessor to prevent a document from propagating down the chain if this happens. Would that be a way to deal with it? I repeat, there are no deletes to make headaches with synchronization. Thanks, eks
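A minimal sketch of the processor extension described above, assuming the getFirstMatch(t) != -1 check from elsewhere in this thread. The field name "signature" and the class name are illustrative; error handling and the factory wiring are omitted:

```java
import java.io.IOException;

import org.apache.lucene.index.Term;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;

public class SkipExistingProcessor extends UpdateRequestProcessor {
    private final SolrQueryRequest req;

    public SkipExistingProcessor(SolrQueryRequest req, UpdateRequestProcessor next) {
        super(next);
        this.req = req; // SolrQueryRequest#getSearcher needs no ref counting
    }

    @Override
    public void processAdd(AddUpdateCommand cmd) throws IOException {
        String sig = (String) cmd.getSolrInputDocument().getFieldValue("signature");
        if (sig != null
                && req.getSearcher().getFirstMatch(new Term("signature", sig)) != -1) {
            // a document with this signature is already indexed:
            // swallow the add so the index files stay untouched
            return;
        }
        super.processAdd(cmd); // not seen before, let it continue down the chain
    }
}
```

Note the caveat Hoss raises later in this digest: documents added but not yet committed are invisible to the searcher, so duplicates arriving between commits would still slip through.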
Re: Using RAMDirectoryFactory in Master/Slave setup
Your best bet is MMapDirectoryFactory; you can come very close to the performance of RAMDirectory. Unfortunately, a Master_on_disk -> Slaves_in_ram type of setup is not possible using Solr. We are moving our architecture to Solr at the moment, and this is one of the "missings" we have to somehow figure out. The problem is that MMap works fine on average, but it also has quirks in the upper quantiles of the response times. If you are using RAMDirectory, you do not need to be afraid that occasionally slow IO will kill performance for some of your requests. This happens with MMAP, and not all that rarely, depending on your usage pattern (a high update/commit rate, for example). I repeat, RAMDirectory is hard to beat when it comes to reducing the IO-caused "outliers". We removed some 90% of the slowest response times by using RAMDirectory instead of MMap... Depending on what you want to optimize, MMap can work just fine for you, and it has some nice properties, e.g. you do not need to tune gc() as much as when you manage a bigger heap (RAMDirectory...). But, imo, it would make sense to have some possibility to do this in Solr. On Mon, Jun 27, 2011 at 10:50 AM, Shalin Shekhar Mangar wrote: > On Mon, Jun 27, 2011 at 12:49 PM, nipunb wrote: >> I found a similar post - >> http://lucene.472066.n3.nabble.com/Problems-with-RAMDirectory-in-Solr-td1575223.html >> It mentions that Java based replication might work (This is what I have >> used, but didn't work for me) > > Solr Replication does not work with non-file directory implementations. > > -- > Regards, > Shalin Shekhar Mangar. >
Re: Question about http://wiki.apache.org/solr/Deduplication
Thanks Hoss, externalizing this part is exactly the path we are exploring now, not only for this reason. We already started testing Hadoop SequenceFile as a write-ahead log for updates/deletes. SequenceFile supports append now (simply great!). It was a pain to have to add Hadoop into the mix for "mortal" collection sizes of 200 Mio, but on the other side, having Hadoop around offers huge flexibility. The write-ahead log catches update commands (all Solr slaves fronting clients accept updates, but only to forward them to the WAL). The Solr master is trying to catch up with the update stream, indexing in async fashion, and finally the Solr slaves are chasing the master index with standard Solr replication. Overnight we run simple map-reduce jobs to consolidate, normalize and sort the update stream, and reindex at the end. Deduplication and collection sorting is for us only an optimization if done reasonably often, like once per day/week, but if we do not do it, it doubles HW resources. Imo, native WAL support in Solr would definitely be one nice "nice to have" (for HA, update scalability...). The charming thing with a WAL is that updates never wait/disappear; if there is too much traffic, we only have slightly higher update latency, but updates do get processed. Some basic primitives on the WAL (consolidation, replaying the update stream on Solr etc...) should be supported in this case, a sort of "smallish Hadoop feature subset for Solr clusters", but nothing oversized. Cheers, eks On Sun, Apr 3, 2011 at 1:05 AM, Chris Hostetter wrote: > > : Is it possible in solr to have multivalued "id"? Or I need to make my > : own "mv_ID" for this? Any ideas how to achieve this efficiently? > > This isn't something the SignatureUpdateProcessor is going to be able to > help you with -- it does the deduplication by changing the low level > "update" (implemented as a delete then add) so that the key used to delete > the older documents is based on the signature field instead of the id > field. 
> > in order to do what you are describing, you would need to query the index > for matching signatures, then add the resulting ids to your document > before doing that "update" > > You could possibly do this in a custom UpdateProcessor, but you'd have to > do something tricky to ensure you didn't overlook docs that had been added > but not yet committed when checking for dups. > > I don't have a good suggestion for how to do this internally in Solr -- it > seems like the type of bulk processing logic that would be better suited > for an external process before you ever start indexing (much like link > analysis for back references) > > -Hoss >
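The write-ahead-log piece of the setup described above could look roughly like this. A hedged sketch against the Hadoop-era SequenceFile API; the path, the LongWritable/Text record layout, and the class name are illustrative, not the actual production code:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class UpdateLog {
    private final SequenceFile.Writer writer;

    public UpdateLog(Configuration conf, String path) throws Exception {
        FileSystem fs = FileSystem.get(conf);
        // key = monotonic sequence number, value = serialized update command
        this.writer = SequenceFile.createWriter(fs, conf, new Path(path),
                LongWritable.class, Text.class);
    }

    // Every accepted update command is appended before it is acknowledged;
    // the master indexes from this log asynchronously.
    public synchronized void append(long sequence, String updateXml) throws Exception {
        writer.append(new LongWritable(sequence), new Text(updateXml));
        writer.syncFs(); // flush so the command survives a crash before indexing
    }

    public void close() throws Exception {
        writer.close();
    }
}
```

The sequence key is what makes the nightly consolidation job easy: a map-reduce pass can sort and deduplicate by it before the reindex.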
Deduplication questions
Q1. Is it possible to pass *analyzed* content to public abstract class Signature { public void init(SolrParams nl) { } public abstract String calculate(String content); }

Q2. Method calculate() is using concatenated fields from name,features,cat. Is there any mechanism I could use to build "field dependent signatures"? Use case for this: I have two fields, OWNER and TEXT. I need to eliminate *fuzzy* duplicates per owner; one clean way would be to make a prefixed signature "OWNER/FUZZY_SIGNATURE". Is the idea to make two UpdateProcessors and chain them OK? (It is ugly, but would work):

  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <bool name="overwriteDupes">false</bool>
    <str name="signatureField">exact_signature</str>
    <str name="fields">OWNER</str>
    <str name="signatureClass">ExactSignature</str>
  </processor>
  <!-- hard_signature: should be a not-stored, not-indexed field -->
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <bool name="overwriteDupes">true</bool>
    <str name="signatureField">mixed_signature</str>
    <str name="fields">exact_signature, TEXT</str>
    <str name="signatureClass">MixedSignature</str>
  </processor>

Assuming I know how long my exact_signature is, I could calculate the fuzzy part and mix it properly. Possible? Better ideas? Thanks, eks
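A possible alternative to chaining two processors: a single custom Signature that does the prefixing itself. This is only a sketch under the Signature API quoted above; using TextProfileSignature as the fuzzy part and assuming fields="OWNER,TEXT" (so OWNER arrives first in the concatenated content) are my assumptions:

```java
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.update.processor.Signature;
import org.apache.solr.update.processor.TextProfileSignature;

public class OwnerPrefixedSignature extends Signature {
    // fuzzy signature for the TEXT part; any Signature implementation would do
    private final Signature fuzzy = new TextProfileSignature();

    @Override
    public void init(SolrParams params) {
        fuzzy.init(params);
    }

    @Override
    public String calculate(String content) {
        // the processor hands us "OWNER TEXT..." concatenated;
        // split the owner token off and hash only the rest fuzzily
        int sep = content.indexOf(' ');
        String owner = (sep < 0) ? content : content.substring(0, sep);
        String text = (sep < 0) ? "" : content.substring(sep + 1);
        // same owner + similar text => same signature => deduplicated,
        // while similar text under different owners stays distinct
        return owner + "/" + fuzzy.calculate(text);
    }
}
```

This keeps one processor in the chain and avoids storing the intermediate exact_signature field at all.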
Question about http://wiki.apache.org/solr/Deduplication
Hi, the use case I am trying to figure out is about preserving IDs without re-indexing on duplicate, rather adding the new ID to a list of document id "aliases". Example input collection: "id":1, "text":"dummy text 1", "signature":"A" "id":2, "text":"dummy text 1", "signature":"A" I add the first document to an empty index; the text is going to be indexed, the ID is going to be "1", so far so good. Now the question: if I add the second document with id == "2", instead of deleting/indexing this new document, I would like to store id == 2 in the multivalued field "id". At the end, I would have one document less indexed, and both IDs are going to be "searchable" (and stored as well)... Is it possible in Solr to have a multivalued "id"? Or do I need to make my own "mv_ID" for this? Any ideas how to achieve this efficiently? My target is not to add new documents if the signature matches, but to have the IDs indexed and stored. Thanks, eks
Re: filter query from external list of Solr unique IDs
If your index is read-only in production, can you add the mapping unique_id -> Lucene docId in your kv store and build filters externally? That would make the unique key obsolete in your production index, as you would work at the Lucene doc id level. That way, you move the problem offline to the update/optimize phase. The ugly part is a lot of updates on your kv-store... I am not really familiar with Solr, but working directly with Lucene this is doable, even having a parallel index that has the unique ID as a stored field, and another index with the indexed fields on the update master, and then having only the index with the indexed fields in production. On Fri, Oct 15, 2010 at 8:59 PM, Burton-West, Tom wrote: > Hi Jonathan, > > The advantages of the obvious approach you outline are that it is simple, > it fits in to the existing Solr model, it doesn't require any customization > or modification to Solr/Lucene java code. Unfortunately, it does not scale > well. We originally tried just what you suggest for our implementation of > Collection Builder. For a user's personal collection we had a table that > maps the collection id to the unique Solr ids. > Then when they wanted to search their collection, we just took their search > and added a filter query with the fq=(id:1 OR id:2 OR). I seem to > remember running into a limit on the number of OR clauses allowed. Even if > you can set that limit larger, there are a number of efficiency issues. > > We ended up constructing a separate Solr index where we have a multi-valued > collection number field. Unfortunately, until incremental field updating > gets implemented, this means that every time someone adds a document to a > collection, the entire document (including 700KB of OCR) needs to be > re-indexed just to update the collection number field. This approach has > allowed us to scale up to a total of something under 100,000 documents, but > we don't think we can scale it much beyond that for various reasons. 
> > I was actually thinking of some kind of custom Lucene/Solr component that > would for example take a query parameter such as &lookitUp=123 and the > component might do a JDBC query against a database or kv store and return > results in some form that would be efficient for Solr/Lucene to process. (Of > course this assumes that a JDBC query would be more efficient than just > sending a long list of ids to Solr). The other part of the equation is > mapping the unique Solr ids to internal Lucene ids in order to implement a > filter query. I was wondering if something like the unique id to Lucene id > mapper in zoie might be useful or if that is too specific to zoie. So this > may be totally off-base, since I haven't looked at the zoie code at all yet. > > In our particular use case, we might be able to build some kind of > in-memory map after we optimize an index and before we mount it in > production. In our workflow, we update the index and optimize it before we > release it and once it is released to production there is no > indexing/merging taking place on the production index (so the internal > Lucene ids don't change.) > > Tom > > > > -Original Message- > From: Jonathan Rochkind [mailto:rochk...@jhu.edu] > Sent: Friday, October 15, 2010 1:07 PM > To: solr-user@lucene.apache.org > Subject: RE: filter query from external list of Solr unique IDs > > Definitely interested in this. > > The naive obvious approach would be just putting all the ID's in the query. > Like fq=(id:1 OR id:2 OR). Or making it another clause in the 'q'. > > Can you outline what's wrong with this approach, to make it more clear > what's needed in a solution? > >