Re: Doc Transformer to remove document from the response

2012-10-29 Thread eks dev
Thanks Hoss,
I probably did not formulate the question properly, but you gave me an answer.

I do it already in a SearchComponent; I just wanted to centralise this
control of the depth and width of the response to a single place in
the code [style={minimal, verbose, full...}].

It just sounds logical to me to have this possibility in a
DocTransformer, as a null document is kind of an "extremely" modified
document.

Even better, it might actually work… (did not try it yet)

@Override
public void setContext(TransformContext context) {
  context.iterator = new FilteringIterator(context.iterator);
}

Simply by providing my own FilteringIterator that would skip documents
I do not need in the response? Does this sound right from the
"legitimate API usage" perspective?
I did not look at where pagination happens, but it looks like the
DocTransformer gets applied at the very end (in the response writer), which in
turn means pagination is not an issue; some pages might just get
shorter due to this additional filtering, but that is quite OK for me.
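As a rough illustration of the idea, here is a minimal, generic sketch of such a filtering wrapper (a plain java.util.Iterator is used for simplicity; the real TransformContext iterator yields Solr documents, and the skip predicate here is a hypothetical stand-in for whatever decides which documents to drop):

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.function.Predicate;

// Hypothetical sketch: wraps an iterator and silently skips elements
// matching a caller-supplied predicate. The real DocTransformer context
// iterator would yield Solr documents instead of Strings.
public class FilteringIterator<T> implements Iterator<T> {
    private final Iterator<T> delegate;
    private final Predicate<T> skip;
    private T next;          // look-ahead element
    private boolean hasNext; // whether 'next' is valid

    public FilteringIterator(Iterator<T> delegate, Predicate<T> skip) {
        this.delegate = delegate;
        this.skip = skip;
        advance();
    }

    // Pull from the delegate until we find an element we keep.
    private void advance() {
        hasNext = false;
        while (delegate.hasNext()) {
            T candidate = delegate.next();
            if (!skip.test(candidate)) {
                next = candidate;
                hasNext = true;
                return;
            }
        }
    }

    @Override public boolean hasNext() { return hasNext; }

    @Override public T next() {
        if (!hasNext) throw new NoSuchElementException();
        T result = next;
        advance();
        return result;
    }

    public static void main(String[] args) {
        Iterator<String> it = new FilteringIterator<>(
                Arrays.asList("keep1", "drop", "keep2").iterator(),
                s -> s.startsWith("drop"));
        StringBuilder sb = new StringBuilder();
        while (it.hasNext()) sb.append(it.next()).append(' ');
        System.out.println(sb.toString().trim()); // prints "keep1 keep2"
    }
}
```

Because the wrapper only shortens what the consumer sees, pagination done upstream stays untouched, matching the observation above that pages simply come out shorter.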



On Mon, Oct 29, 2012 at 7:59 PM, Chris Hostetter
 wrote:
>
> : Transformer is great to augment documents before shipping to the response,
> : but what would be a way to prevent a document from being delivered?
>
> DocTransformers can only modify the documents -- not the Document List.
>
> what you are describing would have to be done as a SearchComponent (or in
> the QParser) -- take a look at QueryElevation component for an example of
> how to do something like this that plays nicely with pagination.
>
>
> -Hoss


Doc Transformer to remove document from the response

2012-10-27 Thread eks dev
Transformer is great to augment documents before shipping to the response,
but what would be a way to prevent a document from being delivered?

I have some search components that draw some conclusions after search
(duplicate removal, clustering) and one Augmenter (Solr Transformer)
to shape the response up, but I need to stop some documents from being
delivered. What is the way to do it?


thanks, e.


Re: Solr 4.0 and production environments

2012-03-07 Thread eks dev
I have been here on Lucene as a user since the project started, even before
Solr came to life, many many years. I have always used the trunk
version for pretty big customers and *never* experienced any serious
problems. The worst thing that can happen is to notice a bug somewhere,
and if you have some reasonable testing for your product, you will see
it quickly.
But with this community, *you will definitely not have to wait long to
get it fixed*. Not only will they fix it, they will thank you for
bringing it up!

I can, as an old user, 100% vouch for what Robert said below.

Simply go for it, test your application a bit, and make your users happy.




On Wed, Mar 7, 2012 at 5:55 PM, Robert Muir  wrote:
> On Wed, Mar 7, 2012 at 11:47 AM, Dirceu Vieira  wrote:
>> Hi All,
>>
>> Has anybody started using Solr 4.0 in production environments? Is it stable
>> enough?
>> I'm planning to create a proof of concept using solr 4.0, we have some
>> projects that will gain a lot with features such as near real time search,
>> joins and others, that are available only on version 4.
>>
>> Is it too risky to think of using it right now?
>> What are your thoughts and experiences with that?
>>
>
> In general, we try to keep our 'trunk' (slated to be 4.0) in very
> stable condition.
>
> Really, it should be 'ready-to-release' at any time; of course, 4.0 has
> had many drastic changes, both at the Lucene and Solr level.
>
> Before deciding what is stable, you should define stability. Is it:
> * api stability: will I be able to upgrade to a more recent snapshot
> of 4.0 without drastic changes to my app?
> * index format stability: will I be able to upgrade to a more recent
> snapshot of 4.0 without re-indexing?
> * correctness: is 4.0 dangerous in the sense that it has many bugs
> since much of the code is new?
>
> I think you should limit your concerns to only the first 2 items. As
> far as correctness goes, just look at the tests. For any open source
> project, you can easily judge its quality by its tests: this is a
> fact.
>
> For lucene/solr the testing strategy, in my opinion, goes above and
> beyond many other projects: for example random testing:
> http://www.lucidimagination.com/devzone/events/conferences/ApacheLuceneEurocon2011_presentations#dawid_weiss
>
> and the new SolrCloud functionality also adds a similar ChaosMonkey
> concept on top of this.
>
> If you are worried about bugs, is a Lucene/Solr trunk snapshot less
> reliable than even a released version of alternative software? It's an
> interesting question; look at their tests.
>
> --
> lucidimagination.com


Re: [SolrCloud] Slow indexing

2012-03-04 Thread eks dev
Hmm, looks like you are facing exactly the phenomenon I asked about.
See my question here:
http://comments.gmane.org/gmane.comp.jakarta.lucene.solr.user/61326

On Sun, Mar 4, 2012 at 9:24 PM, Markus Jelsma
 wrote:
> Hi,
>
> With auto-committing disabled we can now index many millions of documents in
> our test environment on a 5-node cluster with 5 shards and a replication
> factor of 2. The documents are uploaded from map/reduce. No significant
> changes were made to solrconfig and there are no update processors enabled.
> We are using a trunk revision from this weekend.
>
> The indexing speed is well below what we are used to seeing; we can easily
> index 5 million documents on a non-cloud-enabled Solr 3.x instance within
> an hour. What could be going on? There aren't many open TCP connections and
> the number of file descriptors is stable and I/O is low but CPU-time is
> high! Each node has two Solr cores both writing to their dedicated disk.
>
> The indexing speed is stable: it was slow at the start and still is. It's now
> running for well over 6 hours and only 3.5 million documents are indexed.
> Another strange detail is that the node receiving all incoming documents
> (we're not yet using a client side Solr server pool) has a much larger disk
> usage than all other nodes. This is peculiar, as we expected all replicas to
> be about the same size.
>
> The receiving node has slightly higher CPU than the other nodes but the
> thread dump shows a very large amount of threads of type
> cmdDistribExecutor-8-thread-292260 (295090) with 0-100ms CPU-time. At the
> top of the list these threads all have < 20ms time but near the bottom it
> rises to just over 100ms. All nodes have a couple of http-80-30 (121994)
> threads with very high CPU-time each.
>
> Is this a known issue? Did I miss something? Any ideas?
>
> Thanks


Re: Solr Cloud, Commits and Master/Slave configuration

2012-03-01 Thread eks dev
Thanks Mark,
Good, this is probably good enough to give it a try. My analyzers are
normally fast, so doing duplicate analysis (at each replica) is
probably not going to cost a lot if there is some decent "batching".

Can this be somehow controlled (depth of this buffer / time till flush,
or some such)? Which "events" trigger this flushing to replicas
(softCommit, commit, something new)?

What I found useful is to always think in terms of incremental (low
latency) and batch (high throughput) updates. I then just need some
knobs to tweak the behavior of this update process.

I would really like to move away from Master/Slave; Cloud makes a lot
of things way simpler for us users ... Will give it a try in a couple
of weeks.

Later we can even think about putting replication at the segment level for
"extremely expensive analysis" batch cases, or for "initial cluster
seeding", as a replication option. But that is then just an
optimization.

Cheers,
eks


On Thu, Mar 1, 2012 at 5:24 AM, Mark Miller  wrote:
> We actually do currently batch updates - we are being somewhat loose when we 
> say a document at a time. There is a buffer of updates per replica that gets 
> flushed depending on the requests coming through and the buffer size.
>
> - Mark Miller
> lucidimagination.com
>
> On Feb 28, 2012, at 3:38 AM, eks dev wrote:
>
>> SolrCloud is going to be great; the NRT feature is a really huge step
>> forward, as well as central configuration, elasticity ...
>>
>> The only thing I do not yet understand is the treatment of cases that were
>> traditionally covered by a Master/Slave setup: batch updates.
>>
>> If I get it right (?), updates to replicas are sent one by one,
>> meaning when one server receives an update, it gets forwarded to all
>> replicas. This is great for the reduced-latency case, but I do not
>> know how it is implemented if you hit it with a "batch" update. This
>> would cause a huge amount of update commands going to the replicas. Not so
>> good for throughput.
>>
>> - Master/slave does distribution at the segment level (no need to
>> replicate analysis, far less network traffic). Good for batch updates.
>> - SolrCloud does per-update commands (low latency, but chatty, and the
>> analysis step is done N_Servers times). Good for incremental updates.
>>
>> Ideally, some sort of "batching" is going to be available in
>> SolrCloud, with some control over it, e.g. forward batches of 1000
>> documents (basically keep the update log slightly longer and forward it as
>> a batch update command). This would still cause duplicate analysis,
>> but would reduce network traffic.
>>
>> Please bear in mind, this is more of a question than a statement; I
>> didn't look at the cloud code. It might be that I am completely wrong here!
>>
>>
>>
>>
>>
>> On Tue, Feb 28, 2012 at 4:01 AM, Erick Erickson  
>> wrote:
>>> As I understand it (and I'm just getting into SolrCloud myself), you can
>>> essentially forget about master/slave stuff. If you're using NRT,
>>> the soft commit will make the docs visible, you don't need to do a hard
>>> commit (unlike the master/slave days). Essentially, the update is sent
>>> to each shard leader and then fanned out into the replicas for that
>>> leader. All automatically. Leaders are elected automatically. ZooKeeper
>>> is used to keep the cluster information.
>>>
>>> Additionally, SolrCloud keeps a transaction log of the updates, and replays
>>> them if the indexing is interrupted, so you don't risk data loss the way
>>> you used to.
>>>
>>> There aren't really masters/slaves in the old sense any more, so
>>> you have to get out of that thought-mode (it's hard, I know).
>>>
>>> The code is under pretty active development, so any feedback is
>>> valuable
>>>
>>> Best
>>> Erick
>>>
>>> On Mon, Feb 27, 2012 at 3:26 AM, roz dev  wrote:
>>>> Hi All,
>>>>
>>>> I am trying to understand features of Solr Cloud, regarding commits and
>>>> scaling.
>>>>
>>>>
>>>>   - If I am using Solr Cloud, then do I need to explicitly call commit
>>>>   (hard-commit)? Or is a soft commit okay, and Solr Cloud will do the job
>>>>   of writing to disk?
>>>>
>>>>
>>>>   - Do we still need to use a Master/Slave setup to scale searching? If we
>>>>   have to use a Master/Slave setup, then do I need to issue a hard-commit
>>>>   to make my changes visible to slaves?
>>>>   - If I were to use NRT with Master/Slave setup with soft commit then
>>>>   will the slave be able to see changes made on master with soft commit?
>>>>
>>>> Any inputs are welcome.
>>>>
>>>> Thanks
>>>>
>>>> -Saroj


Re: Solr Cloud, Commits and Master/Slave configuration

2012-02-28 Thread eks dev
SolrCloud is going to be great; the NRT feature is a really huge step
forward, as well as central configuration, elasticity ...

The only thing I do not yet understand is the treatment of cases that were
traditionally covered by a Master/Slave setup: batch updates.

If I get it right (?), updates to replicas are sent one by one,
meaning when one server receives an update, it gets forwarded to all
replicas. This is great for the reduced-latency case, but I do not
know how it is implemented if you hit it with a "batch" update. This
would cause a huge amount of update commands going to the replicas. Not so
good for throughput.

- Master/slave does distribution at the segment level (no need to
replicate analysis, far less network traffic). Good for batch updates.
- SolrCloud does per-update commands (low latency, but chatty, and the
analysis step is done N_Servers times). Good for incremental updates.

Ideally, some sort of "batching" is going to be available in
SolrCloud, with some control over it, e.g. forward batches of 1000
documents (basically keep the update log slightly longer and forward it as
a batch update command). This would still cause duplicate analysis,
but would reduce network traffic.

Please bear in mind, this is more of a question than a statement; I
didn't look at the cloud code. It might be that I am completely wrong here!
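For illustration only, the proposed batching could look something like the sketch below. The forwarder callback is a hypothetical stand-in for the actual replica-forwarding code, and documents are plain strings here:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Illustrative sketch of the proposed batching: buffer update commands
// and forward them as one batch once a threshold is reached, instead of
// forwarding each update command to the replicas individually.
// Consumer<List<String>> is a stand-in for real replica forwarding.
public class UpdateBatcher {
    private final int batchSize;
    private final Consumer<List<String>> forwarder;
    private final List<String> buffer = new ArrayList<>();

    public UpdateBatcher(int batchSize, Consumer<List<String>> forwarder) {
        this.batchSize = batchSize;
        this.forwarder = forwarder;
    }

    public void add(String doc) {
        buffer.add(doc);
        if (buffer.size() >= batchSize) flush();
    }

    // Forward everything buffered so far as a single batch command.
    public void flush() {
        if (buffer.isEmpty()) return;
        forwarder.accept(new ArrayList<>(buffer));
        buffer.clear();
    }
}
```

With, say, a batch size of 1000, firing 100M update commands would turn into roughly 100k forwarded batches: still duplicate analysis on each replica, but far fewer network round-trips, which is exactly the trade-off described above.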





On Tue, Feb 28, 2012 at 4:01 AM, Erick Erickson  wrote:
> As I understand it (and I'm just getting into SolrCloud myself), you can
> essentially forget about master/slave stuff. If you're using NRT,
> the soft commit will make the docs visible, you don't need to do a hard
> commit (unlike the master/slave days). Essentially, the update is sent
> to each shard leader and then fanned out into the replicas for that
> leader. All automatically. Leaders are elected automatically. ZooKeeper
> is used to keep the cluster information.
>
> Additionally, SolrCloud keeps a transaction log of the updates, and replays
> them if the indexing is interrupted, so you don't risk data loss the way
> you used to.
>
> There aren't really masters/slaves in the old sense any more, so
> you have to get out of that thought-mode (it's hard, I know).
>
> The code is under pretty active development, so any feedback is
> valuable
>
> Best
> Erick
>
> On Mon, Feb 27, 2012 at 3:26 AM, roz dev  wrote:
>> Hi All,
>>
>> I am trying to understand features of Solr Cloud, regarding commits and
>> scaling.
>>
>>
>>   - If I am using Solr Cloud, then do I need to explicitly call commit
>>   (hard-commit)? Or is a soft commit okay, and Solr Cloud will do the job of
>>   writing to disk?
>>
>>
>>   - Do we still need to use a Master/Slave setup to scale searching? If we
>>   have to use a Master/Slave setup, then do I need to issue a hard-commit to
>>   make my changes visible to slaves?
>>   - If I were to use NRT with Master/Slave setup with soft commit then
>>   will the slave be able to see changes made on master with soft commit?
>>
>> Any inputs are welcome.
>>
>> Thanks
>>
>> -Saroj


Re: SnapPull failed :org.apache.solr.common.SolrException: Error opening new searcher

2012-02-23 Thread eks dev
It looks like it works with the patch; after a couple of hours of testing
under the same conditions I didn't see it happen (without it, it happened
approx. every 15 minutes).

I do not think it will happen again with this patch.

Thanks again, and my respect for your debugging capacity; my bug report
was really thin.


On Thu, Feb 23, 2012 at 8:47 AM, eks dev  wrote:
> thanks Mark, I will give it a go and report back...
>
> On Thu, Feb 23, 2012 at 1:31 AM, Mark Miller  wrote:
>> Looks like an issue around replication IndexWriter reboot, soft commits and 
>> hard commits.
>>
>> I think I've got a workaround for it:
>>
>> Index: solr/core/src/java/org/apache/solr/handler/SnapPuller.java
>> ===
>> --- solr/core/src/java/org/apache/solr/handler/SnapPuller.java  (revision 
>> 1292344)
>> +++ solr/core/src/java/org/apache/solr/handler/SnapPuller.java  (working 
>> copy)
>> @@ -499,6 +499,17 @@
>>
>>       // reboot the writer on the new index and get a new searcher
>>       solrCore.getUpdateHandler().newIndexWriter();
>> +      Future[] waitSearcher = new Future[1];
>> +      solrCore.getSearcher(true, false, waitSearcher, true);
>> +      if (waitSearcher[0] != null) {
>> +        try {
>> +         waitSearcher[0].get();
>> +       } catch (InterruptedException e) {
>> +         SolrException.log(LOG,e);
>> +       } catch (ExecutionException e) {
>> +         SolrException.log(LOG,e);
>> +       }
>> +     }
>>       // update our commit point to the right dir
>>       solrCore.getUpdateHandler().commit(new CommitUpdateCommand(req, 
>> false));
>>
>> That should allow the searcher that the following commit command prompts to 
>> see the *new* IndexWriter.
>>
>> On Feb 22, 2012, at 10:56 AM, eks dev wrote:
>>
>>> We started observing strange failures from the ReplicationHandler when we
>>> commit, on a master trunk version 4-5 days old.
>>> It works sometimes and sometimes not; we didn't dig deeper yet.
>>>
>>> Looks like the real culprit hides behind:
>>> org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed
>>>
>>> Looks familiar to somebody?
>>>
>>>
>>> 120222 154959 SEVERE SnapPull failed
>>> :org.apache.solr.common.SolrException: Error opening new searcher
>>>    at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1138)
>>>    at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1251)
>>>    at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1043)
>>>    at org.apache.solr.update.DirectUpdateHandler2.commit(Unknown Source)
>>>    at org.apache.solr.handler.SnapPuller.doCommit(SnapPuller.java:503)
>>>    at 
>>> org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:348)
>>>    at org.apache.solr.handler.ReplicationHandler.doFetch(Unknown Source)
>>>    at org.apache.solr.handler.SnapPuller$1.run(SnapPuller.java:163)
>>>    at 
>>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>>>    at 
>>> java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351)
>>>    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178)
>>>    at 
>>> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
>>>    at 
>>> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>>>    at 
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>>>    at 
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>>>    at java.lang.Thread.run(Thread.java:722)
>>> Caused by: org.apache.lucene.store.AlreadyClosedException: this
>>> IndexWriter is closed
>>>    at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:810)
>>>    at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:815)
>>>    at 
>>> org.apache.lucene.index.IndexWriter.nrtIsCurrent(IndexWriter.java:3984)
>>>    at 
>>> org.apache.lucene.index.StandardDirectoryReader.doOpenFromWriter(StandardDirectoryReader.java:254)
>>>    at 
>>> org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:233)
>>>    at 
>>> org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:223)
>>>    at 
>>> org.apache.lucene.index.DirectoryReader.openIfChanged(DirectoryReader.java:170)
>>>    at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1095)
>>>    ... 15 more
>>
>> - Mark Miller
>> lucidimagination.com


Re: SnapPull failed :org.apache.solr.common.SolrException: Error opening new searcher

2012-02-22 Thread eks dev
thanks Mark, I will give it a go and report back...

On Thu, Feb 23, 2012 at 1:31 AM, Mark Miller  wrote:
> Looks like an issue around replication IndexWriter reboot, soft commits and 
> hard commits.
>
> I think I've got a workaround for it:
>
> Index: solr/core/src/java/org/apache/solr/handler/SnapPuller.java
> ===
> --- solr/core/src/java/org/apache/solr/handler/SnapPuller.java  (revision 
> 1292344)
> +++ solr/core/src/java/org/apache/solr/handler/SnapPuller.java  (working copy)
> @@ -499,6 +499,17 @@
>
>       // reboot the writer on the new index and get a new searcher
>       solrCore.getUpdateHandler().newIndexWriter();
> +      Future[] waitSearcher = new Future[1];
> +      solrCore.getSearcher(true, false, waitSearcher, true);
> +      if (waitSearcher[0] != null) {
> +        try {
> +         waitSearcher[0].get();
> +       } catch (InterruptedException e) {
> +         SolrException.log(LOG,e);
> +       } catch (ExecutionException e) {
> +         SolrException.log(LOG,e);
> +       }
> +     }
>       // update our commit point to the right dir
>       solrCore.getUpdateHandler().commit(new CommitUpdateCommand(req, false));
>
> That should allow the searcher that the following commit command prompts to 
> see the *new* IndexWriter.
>
> On Feb 22, 2012, at 10:56 AM, eks dev wrote:
>
>> We started observing strange failures from the ReplicationHandler when we
>> commit, on a master trunk version 4-5 days old.
>> It works sometimes and sometimes not; we didn't dig deeper yet.
>>
>> Looks like the real culprit hides behind:
>> org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed
>>
>> Looks familiar to somebody?
>>
>>
>> 120222 154959 SEVERE SnapPull failed
>> :org.apache.solr.common.SolrException: Error opening new searcher
>>    at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1138)
>>    at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1251)
>>    at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1043)
>>    at org.apache.solr.update.DirectUpdateHandler2.commit(Unknown Source)
>>    at org.apache.solr.handler.SnapPuller.doCommit(SnapPuller.java:503)
>>    at 
>> org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:348)
>>    at org.apache.solr.handler.ReplicationHandler.doFetch(Unknown Source)
>>    at org.apache.solr.handler.SnapPuller$1.run(SnapPuller.java:163)
>>    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>>    at 
>> java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351)
>>    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178)
>>    at 
>> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
>>    at 
>> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>>    at 
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>>    at 
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>>    at java.lang.Thread.run(Thread.java:722)
>> Caused by: org.apache.lucene.store.AlreadyClosedException: this
>> IndexWriter is closed
>>    at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:810)
>>    at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:815)
>>    at org.apache.lucene.index.IndexWriter.nrtIsCurrent(IndexWriter.java:3984)
>>    at 
>> org.apache.lucene.index.StandardDirectoryReader.doOpenFromWriter(StandardDirectoryReader.java:254)
>>    at 
>> org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:233)
>>    at 
>> org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:223)
>>    at 
>> org.apache.lucene.index.DirectoryReader.openIfChanged(DirectoryReader.java:170)
>>    at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1095)
>>    ... 15 more
>
> - Mark Miller
> lucidimagination.com


dih and solr cloud

2012-02-22 Thread eks dev
Out of curiosity, trying to see if the new cloud features can replace what
I use now...

How is this (batch) update forwarding solved at the cloud level?

Imagine a simple one-shard, one-replica case: if I fire up a DIH
update, is this going to be replicated to the replica shard?
If yes, is it:
- going to be sent document by document (network-wise, imagine
100Mio+ update commands going to the replica for big batches)?
- somehow batched into "packages" to reduce load?
- distributed at the index level somehow?



This is an important case, handled today by master/slave Solr replication, but
it is not mentioned at http://wiki.apache.org/solr/SolrCloud


Re: Unusually long data import time?

2012-02-22 Thread eks dev
Devon, you ought to try updating from many threads (I do not know if
DIH can do it; check), but Lucene does a great job if fed from many
update threads...

It depends on where your time gets lost, but it is usually a) the analysis
chain or b) the database.

If it is a) and your server has spare CPU cores, you can scale at
roughly the rate of the number of cores.
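A minimal sketch of the multi-threaded feeding idea, assuming a hypothetical indexDoc() stand-in for the actual call that sends a document to Solr (e.g. via a SolrJ client):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Minimal sketch of feeding an index from several threads. indexDoc()
// is a hypothetical stand-in for whatever actually sends a document to
// Solr; with a CPU-bound analysis chain, one worker per core lets the
// server analyze documents in parallel.
public class ParallelFeeder {
    static final AtomicInteger indexed = new AtomicInteger();

    static void indexDoc(String doc) {
        // real code would push 'doc' to Solr here
        indexed.incrementAndGet();
    }

    public static void main(String[] args) throws InterruptedException {
        int threads = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < 1000; i++) {
            final String doc = "doc-" + i;
            pool.submit(() -> indexDoc(doc));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        System.out.println(indexed.get()); // prints 1000
    }
}
```

If the bottleneck is b), the database, parallel feeding helps less; the fix is then usually on the query/fetch side rather than the indexing side.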

On Wed, Feb 22, 2012 at 7:41 PM, Devon Baumgarten
 wrote:
> Ahmet,
>
> I do not. I commented autoCommit out.
>
> Devon Baumgarten
>
>
>
> -Original Message-
> From: Ahmet Arslan [mailto:iori...@yahoo.com]
> Sent: Wednesday, February 22, 2012 12:25 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Unusually long data import time?
>
>> Would it be unusual for an import of 160 million documents
>> to take 18 hours?  Each document is less than 1kb and I
>> have the DataImportHandler using the jdbc driver to connect
>> to SQL Server 2008. The full-import query calls a stored
>> procedure that contains only a select from my target table.
>>
>> Is there any way I can speed this up? I saw recently someone
>> on this list suggested a new user could get all their Solr
>> data imported in under an hour. I sure hope that's true!
>
> Do have autoCommit or autoSoftCommit configured in solrconfig.xml?


SnapPull failed :org.apache.solr.common.SolrException: Error opening new searcher

2012-02-22 Thread eks dev
We started observing strange failures from the ReplicationHandler when we
commit, on a master trunk version 4-5 days old.
It works sometimes and sometimes not; we didn't dig deeper yet.

Looks like the real culprit hides behind:
org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed

Looks familiar to somebody?


120222 154959 SEVERE SnapPull failed
:org.apache.solr.common.SolrException: Error opening new searcher
at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1138)
at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1251)
at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1043)
at org.apache.solr.update.DirectUpdateHandler2.commit(Unknown Source)
at org.apache.solr.handler.SnapPuller.doCommit(SnapPuller.java:503)
at org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:348)
at org.apache.solr.handler.ReplicationHandler.doFetch(Unknown Source)
at org.apache.solr.handler.SnapPuller$1.run(SnapPuller.java:163)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at 
java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:722)
Caused by: org.apache.lucene.store.AlreadyClosedException: this
IndexWriter is closed
at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:810)
at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:815)
at org.apache.lucene.index.IndexWriter.nrtIsCurrent(IndexWriter.java:3984)
at 
org.apache.lucene.index.StandardDirectoryReader.doOpenFromWriter(StandardDirectoryReader.java:254)
at 
org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:233)
at 
org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:223)
at 
org.apache.lucene.index.DirectoryReader.openIfChanged(DirectoryReader.java:170)
at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1095)
... 15 more


Re: reader/searcher refresh after replication (commit)

2012-02-22 Thread eks dev
Yes, I consciously let my slaves run away from the master in order to
reduce update latency, but every now and then they sync up with the master
that is doing the heavy lifting.

The price you pay is that the slaves do not see the same documents as the
master, but this is the case anyhow with replication; in my setup a
slave may go ahead of the master with updates, and this delta gets zeroed
after replication and the game starts again.

What you have to take into account with this is a very small time window
where you may "go back in time" on slaves (not seeing documents that
were already there), but we are talking about seconds, and a couple out
of 200Mio documents (only those documents that were soft-committed on the
slave during replication, between the commit on the master and the
postCommit on the slave).

Why do you think something is strange here?

> What are you expecting a BeforeCommitListener could do for you, if one
> would exist?
Why should I be expecting something?

I just need to read the userCommitData as soon as replication is done,
and I am looking for a proper/easy way to do it (a postCommit listener is
what I use now).

What makes me slightly nervous are those life-cycle questions, e.g.
when I issue an update command before and after the postCommit event, which
index gets updated: the one just replicated, or the one that was there
just before replication?

There are definitely ways to optimize this, for example forcing the
replication handler to copy only delta files if the index gets updated on
both slave and master (there is already a TODO somewhere on the Solr
replication wiki, I think). Currently the ReplicationHandler copies the
complete index if this gets detected ...

I am all ears if there are better proposals for low-latency
updates in a multi-server setup...


On Tue, Feb 21, 2012 at 11:53 PM, Em  wrote:
> Eks,
>
> that sounds strange!
>
> Am I getting you right?
> You have a master which indexes batch-updates from time to time.
> Furthermore you got some slaves, pulling data from that master to keep
> them up-to-date with the newest batch-updates.
> Additionally your slaves index own content in soft-commit mode that
> needs to be available as soon as possible.
> In consequence the slaves are not in sync with the master.
>
> I am not 100% certain, but chances are good that Solr's
> replication-mechanism only changes those segments that are not in sync
> with the master.
>
> What are you expecting a BeforeCommitListener could do for you, if one
> would exist?
>
> Kind regards,
> Em
>
> On 21.02.2012 21:10, eks dev wrote:
>> Thanks Mark,
>> Hmm, I would like to have this information asap, not wait until the
>> first search gets executed (which depends on the user). Is Solr going to
>> create a new searcher as part of the "replication transaction"?
>>
>> Just to make it clear why I need it...
>> I have a simple master, many slaves config where the master does "batch"
>> updates in big chunks (things users can wait longer to see on the search
>> side), but the slaves work in soft-commit mode internally, where I permit
>> them to run away slightly from the master; in order to know where the
>> "incremental update" should start, I read it from the UserData ...
>>
>> Basically, ideally, before commit (after successful replication is
>> finished) ends, I would like to read in these counters to let
>> "incremental update" run from the right point...
>>
>> I need to prevent updating the "replicated index" before I read this
>> information (duplicates can appear); are there any "IndexWriter"
>> listeners around?
>>
>>
>> Thanks again,
>> eks.
>>
>>
>>
>> On Tue, Feb 21, 2012 at 8:03 PM, Mark Miller  wrote:
>>> Post commit calls are made before a new searcher is opened.
>>>
>>> Might be easier to try to hook in with a new searcher listener?
>>>
>>> On Feb 21, 2012, at 8:23 AM, eks dev wrote:
>>>
>>>> Hi all,
>>>> I am a bit confused with IndexSearcher refresh lifecycles...
>>>> In a master slave setup, I override postCommit listener on slave
>>>> (solr trunk version) to read some user information stored in
>>>> userCommitData on master
>>>>
>>>> --
>>>> @Override
>>>> public final void postCommit() {
>>>> // This returns "stale" information that was present before
>>>> // replication finished
>>>> RefCounted<SolrIndexSearcher> refC = core.getNewestSearcher(true);
>>>> Map<String, String> userData =
>>>> refC.get().getIndexReader().getIndexCommit().getUserData();
>>>> }
>>>> 
>>>> I expected core.getNewestSearcher(true); to return refreshed
>>>> 

Re: reader/searcher refresh after replication (commit)

2012-02-21 Thread eks dev
And drinks on me to those who decoupled implicit commit from close...
this was a tricky trap

On Tue, Feb 21, 2012 at 9:10 PM, eks dev  wrote:
> Thanks Mark,
> Hmm, I would like to have this information asap, not to wait until the
> first search gets executed (depends on user) . Is solr going to create
> new searcher as a part of "replication transaction"...
>
> Just to make it clear why I need it...
> I have simple master, many slaves config where master does "batch"
> updates in big chunks (things user can wait longer to see on search
> side) but slaves work in soft commit mode internally where I permit
> them to run away slightly from master in order to know where
> "incremental update" should start, I read it from UserData 
>
> Basically, ideally, before commit (after successful replication is
> finished) ends, I would like to read in these counters to let
> "incremental update" run from the right point...
>
> I need to prevent updating "replicated index" before I read this
> information (duplicates can appear) are there any "IndexWriter"
> listeners around?
>
>
> Thanks again,
> eks.
>
>
>
> On Tue, Feb 21, 2012 at 8:03 PM, Mark Miller  wrote:
>> Post commit calls are made before a new searcher is opened.
>>
>> Might be easier to try to hook in with a new searcher listener?
>>
>> On Feb 21, 2012, at 8:23 AM, eks dev wrote:
>>
>>> Hi all,
>>> I am a bit confused with IndexSearcher refresh lifecycles...
>>> In a master slave setup, I override postCommit listener on slave
>>> (solr trunk version) to read some user information stored in
>>> userCommitData on master
>>>
>>> --
>>> @Override
>>> public final void postCommit() {
>>>   // This returns "stale" information that was present before
>>>   // replication finished
>>>   RefCounted<SolrIndexSearcher> refC = core.getNewestSearcher(true);
>>>   Map<String, String> userData =
>>>       refC.get().getIndexReader().getIndexCommit().getUserData();
>>> }
>>> 
>>> I expected core.getNewestSearcher(true); to return refreshed
>>> SolrIndexSearcher, but it didn't
>>>
>>> When is this information going to be refreshed to the status from the
>>> replicated index, I repeat this is postCommit listener?
>>>
>>> What is the way to get the information from the last commit point?
>>>
>>> Maybe like this?
>>> core.getDeletionPolicy().getLatestCommit().getUserData();
>>>
>>> Or I need to explicitly open new searcher (isn't solr does this behind
>>> the scenes?)
>>> core.openNewSearcher(false, false)
>>>
>>> Not critical, reopening new searcher works, but I would like to
>>> understand these lifecycles, when solr loads latest commit point...
>>>
>>> Thanks, eks
>>
>> - Mark Miller
>> lucidimagination.com
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>


Re: reader/searcher refresh after replication (commit)

2012-02-21 Thread eks dev
Thanks Mark,
Hmm, I would like to have this information asap, not to wait until the
first search gets executed (depends on the user). Is Solr going to create
a new searcher as part of the "replication transaction"...

Just to make it clear why I need it...
I have a simple master, many-slaves config where the master does "batch"
updates in big chunks (things users can wait longer to see on the search
side), but the slaves work in soft-commit mode internally, where I permit
them to run away slightly from the master. In order to know where the
"incremental update" should start, I read it from UserData 

Basically, ideally, before commit (after successful replication is
finished) ends, I would like to read in these counters to let
"incremental update" run from the right point...

I need to prevent updating the "replicated index" before I read this
information (duplicates can appear). Are there any "IndexWriter"
listeners around?


Thanks again,
eks.



On Tue, Feb 21, 2012 at 8:03 PM, Mark Miller  wrote:
> Post commit calls are made before a new searcher is opened.
>
> Might be easier to try to hook in with a new searcher listener?
>
> On Feb 21, 2012, at 8:23 AM, eks dev wrote:
>
>> Hi all,
>> I am a bit confused with IndexSearcher refresh lifecycles...
>> In a master slave setup, I override postCommit listener on slave
>> (solr trunk version) to read some user information stored in
>> userCommitData on master
>>
>> --
>> @Override
>> public final void postCommit() {
>>   // This returns "stale" information that was present before
>>   // replication finished
>>   RefCounted<SolrIndexSearcher> refC = core.getNewestSearcher(true);
>>   Map<String, String> userData =
>>       refC.get().getIndexReader().getIndexCommit().getUserData();
>> }
>> 
>> I expected core.getNewestSearcher(true); to return refreshed
>> SolrIndexSearcher, but it didn't
>>
>> When is this information going to be refreshed to the status from the
>> replicated index, I repeat this is postCommit listener?
>>
>> What is the way to get the information from the last commit point?
>>
>> Maybe like this?
>> core.getDeletionPolicy().getLatestCommit().getUserData();
>>
>> Or I need to explicitly open new searcher (isn't solr does this behind
>> the scenes?)
>> core.openNewSearcher(false, false)
>>
>> Not critical, reopening new searcher works, but I would like to
>> understand these lifecycles, when solr loads latest commit point...
>>
>> Thanks, eks
>
> - Mark Miller
> lucidimagination.com
>
>
>
>
>
>
>
>
>
>
>


reader/searcher refresh after replication (commit)

2012-02-21 Thread eks dev
Hi all,
I am a bit confused with IndexSearcher refresh lifecycles...
In a master-slave setup, I override the postCommit listener on a slave
(Solr trunk version) to read some user information stored in
userCommitData on the master

--
@Override
public final void postCommit() {
  // This returns "stale" information that was present before
  // replication finished
  RefCounted<SolrIndexSearcher> refC = core.getNewestSearcher(true);
  Map<String, String> userData =
      refC.get().getIndexReader().getIndexCommit().getUserData();
}

I expected core.getNewestSearcher(true) to return a refreshed
SolrIndexSearcher, but it didn't.

When is this information going to be refreshed to the state of the
replicated index? I repeat, this is a postCommit listener.

What is the way to get the information from the last commit point?

Maybe like this?
core.getDeletionPolicy().getLatestCommit().getUserData();

Or do I need to explicitly open a new searcher (doesn't Solr do this
behind the scenes?)
core.openNewSearcher(false, false)

Not critical, reopening a new searcher works, but I would like to
understand these lifecycles, i.e. when Solr loads the latest commit
point...

Thanks, eks
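The stale-snapshot effect described above can be sketched with a toy model. All class and field names here are illustrative stand-ins, not Solr or Lucene API: a searcher freezes the commit userData it saw at open time, so a hook that asks a cached searcher sees old data until a new searcher is opened against the latest commit point.

```java
import java.util.Map;

// Toy model: a cached "searcher" holds a snapshot of commit userData
// taken when it was opened; a later commit (e.g. landed by replication)
// is only visible to a freshly opened searcher.
public class CommitLifecycle {
    static class Index {
        Map<String, String> latestCommitUserData = Map.of();
        void commit(Map<String, String> userData) { latestCommitUserData = userData; }
    }

    static class Searcher {
        final Map<String, String> snapshot;   // frozen at open time
        Searcher(Index idx) { this.snapshot = idx.latestCommitUserData; }
    }

    public static void main(String[] args) {
        Index index = new Index();
        index.commit(Map.of("seq", "100"));
        Searcher cached = new Searcher(index);      // opened before replication

        index.commit(Map.of("seq", "200"));         // "replication" lands a new commit

        // the cached searcher still answers with the old snapshot...
        System.out.println(cached.snapshot.get("seq"));              // 100
        // ...while the latest commit point already carries the new value
        System.out.println(index.latestCommitUserData.get("seq"));   // 200
        // reopening (what openNewSearcher would do) picks up the new commit
        System.out.println(new Searcher(index).snapshot.get("seq")); // 200
    }
}
```

This matches the observation in the thread: reading from the deletion policy's latest commit gives the fresh userData, while the not-yet-reopened searcher does not.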


Re: codec="Pulsing" per field broken?

2011-12-11 Thread eks dev
Thanks Robert,

I've missed LUCENE-3490... Awesome!

On Sun, Dec 11, 2011 at 6:37 PM, Robert Muir  wrote:
> On Sun, Dec 11, 2011 at 11:34 AM, eks dev  wrote:
>> on the latest trunk, my schema.xml with field type declaration
>> containing //codec="Pulsing"// does not work any more (throws
>> exception from FieldType). It used to work with an approximately
>> month-old trunk version.
>>
>> I didn't dig deeper, can be that the old schema.xml  was broken and
>> worked by accident.
>>
>
> Hi,
>
> The short answer is, you should change this to //postingsFormat="Pulsing40"//
> See 
> http://svn.apache.org/repos/asf/lucene/dev/trunk/solr/core/src/test-files/solr/conf/schema_codec.xml
>
> The longer answer is that the Codec API in lucene trunk was extended recently:
> https://issues.apache.org/jira/browse/LUCENE-3490
>
> Previously "Codec" only allowed you to customize the format of the
> postings lists.
> We are working to have it cover the entire index segment (at the
> moment nearly everything except deletes and encoding of compound files
> can be customized).
>
> For example, look at SimpleText now:
> http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/src/java/org/apache/lucene/index/codecs/simpletext/
> As you see, it now implements plain-text stored fields, term vectors,
> norms, segments file, fieldinfos, etc.
> See Codec.java 
> (http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/src/java/org/apache/lucene/index/codecs/Codec.java)
> or LUCENE-3490 for more details.
>
> Because of this, what you had before is now just "PostingsFormat", as
> Pulsing is just a wrapper around a postings implementation that
> inlines low frequency terms.
> Lucene's default Codec uses a per-field postings setup, so you can
> still configure the postings per-field, just differently.
>
> --
> lucidimagination.com
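In schema.xml terms, the change Robert describes would look roughly like this. This is a hedged sketch: the field type name and class below are placeholders, not taken from the poster's actual config; only the attribute change (codec → postingsFormat="Pulsing40") is confirmed by the reply above.

```xml
<!-- old trunk: codec attribute on the field type (now rejected) -->
<fieldType name="string_pulsing" class="solr.StrField" codec="Pulsing"/>

<!-- current trunk after LUCENE-3490: per-field postings format -->
<fieldType name="string_pulsing" class="solr.StrField" postingsFormat="Pulsing40"/>
```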


codec="Pulsing" per field broken?

2011-12-11 Thread eks dev
on the latest trunk, my schema.xml with a field type declaration
containing //codec="Pulsing"// does not work any more (it throws an
exception from FieldType). It used to work with an approximately
month-old trunk version.

I didn't dig deeper; it could be that the old schema.xml was broken and
worked by accident.



org.apache.solr.common.SolrException: Plugin Initializing failure for
[schema.xml] fieldType
at 
org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:183)
at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:368)
at org.apache.solr.schema.IndexSchema.<init>(IndexSchema.java:107)
at org.apache.solr.core.CoreContainer.create(CoreContainer.java:651)
at org.apache.solr.core.CoreContainer.load(CoreContainer.java:409)
at 
org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:243)
at 
org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:93)
at org.mortbay.jetty.servlet.FilterHolder.doStart(FilterHolder.java:97)
at 
org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
at 
org.mortbay.jetty.servlet.ServletHandler.initialize(ServletHandler.java:713)
at org.mortbay.jetty.servlet.Context.startContext(Context.java:140)
at 
org.mortbay.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1282)
at 
org.mortbay.jetty.handler.ContextHandler.doStart(ContextHandler.java:518)
at 
org.mortbay.jetty.webapp.WebAppContext.doStart(WebAppContext.java:499)
at 
org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
at 
org.mortbay.jetty.handler.HandlerWrapper.doStart(HandlerWrapper.java:130)
at org.mortbay.jetty.Server.doStart(Server.java:224)
at 
org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
at runjettyrun.Bootstrap.main(Bootstrap.java:86)
Caused by: java.lang.RuntimeException: schema fieldtype
storableCity(X.StorableField) invalid
arguments:{codec=Pulsing}
at org.apache.solr.schema.FieldType.setArgs(FieldType.java:177)
at 
org.apache.solr.schema.FieldTypePluginLoader.init(FieldTypePluginLoader.java:127)
at 
org.apache.solr.schema.FieldTypePluginLoader.init(FieldTypePluginLoader.java:43)
at 
org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:180)
... 18 more


Re: capacity planning

2011-10-11 Thread eks dev
Re. "I have little experience with VM servers for search."

We had a huge performance penalty on VMs; CPU was the bottleneck.
We couldn't freely run measurements to figure out what the problem really
was (hosting was contracted by the customer...), but it was something pretty
scary, something like 8-10 times slower than the advertised dedicated
equivalent. For whatever it's worth, if you can afford it, keep Lucene
away from VMs. Lucene is a highly optimized machine, and someone twiddling
with context switches is not welcome there.

Of course, if you are IO bound, it makes no big difference anyhow.

This is just my singular experience; it might be that the hosting team did
not configure it right, or that something has changed in the meantime (the
experience is ~4 years old), but we burnt our fingers so badly that I
still remember it




On Tue, Oct 11, 2011 at 7:49 PM, Toke Eskildsen wrote:

> Travis Low [t...@4centurion.com] wrote:
> > Toke, thanks.  Comments embedded (hope that's okay):
>
> Inline or top-posting? Long discussion, but for mailing lists I clearly
> prefer the former.
>
> [Toke: Estimate characters]
>
> > Yes.  We estimate each of the 23K DB records has 600 pages of text for
> the
> > combined documents, 300 words per page, 5 characters per word.  Which
> > coincidentally works out to about 21GB, so good guessing there. :)
>
> Heh. Lucky Guess indeed, although the factors were off. Anyway, 21GB does
> not sound scary at all.
>
> > The way it works is we have researchers modifying the DB records during
> the
> > day, and they may upload documents at that time.  We estimate 50-60
> uploads
> > throughout the day.  If possible, we'd like to index them as they are
> > uploaded, but if that would negatively affect the search, then we can
> > rebuild the index nightly.
> >
> > Which is better?
>
> The analyzing part is only CPU and you're running multi-core so as long as
> you only analyze using one thread you're safe there. That leaves us with
> I/O: Even for spinning drives, a daily load of just 60 updates of 1MB of
> extracted text each shouldn't have any real effect - with the usual caveat
> that large merges should be avoided by either optimizing at night or
> tweaking merge policy to avoid large segments. With such a relatively small
> index, (re)opening and warm up should be painless too.
>
> Summary: 300GB is a fair amount of data and takes some power to crunch.
> However, in the Solr/Lucene end your index size and your update rates are
> nothing to worry about. Usual caveat for advanced use and all that applies.
>
> [Toke: i7, 8GB, 1TB spinning, 256GB SSD]
>
> > We have a very beefy VM server that we will use for benchmarking, but
> your
> > specs provide a starting point.  Thanks very much for that.
>
> I have little experience with VM servers for search. Although we use a lot
> of virtual machines, we use dedicated machines for our searchers, primarily
> to ensure low latency for I/O. They might be fine for that too, but we
> haven't tried it yet.
>
> Glad to be of help,
> Toke
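As a sanity check, the estimate quoted in the thread (23K records × 600 pages × 300 words × 5 characters) can be multiplied out; all figures are the thread's own guesses, not measurements:

```java
// Quick check of the corpus-size estimate discussed above.
public class SizeEstimate {
    // records * pagesPerRecord * wordsPerPage * charsPerWord
    static long totalChars(long records, long pages, long words, long chars) {
        return records * pages * words * chars;
    }

    public static void main(String[] args) {
        long total = totalChars(23_000, 600, 300, 5);
        // about 20.7 billion chars, i.e. the "about 21GB" from the thread
        System.out.println(total + " chars ~= " + (total / 1_000_000_000.0) + " GB");
    }
}
```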


Re: Update ingest rate drops suddenly

2011-09-26 Thread eks dev
Just to bring closure on this one: we were slurping data from the
wrong DB (hardly a desktop-class machine)...

Solr did not cough on 41 Mio records @ 34k updates/sec, single threaded.
Great!



On Sat, Sep 24, 2011 at 9:18 PM, eks dev  wrote:
> just looking for hints where to look for...
>
> We were testing single threaded ingest rate on solr, trunk version on
> atypical collection (a lot of small documents), and we noticed
> something we are not able to explain.
>
> Setup:
> We use defaults for index settings, windows 64 bit, jdk 7 U2. on SSD,
> machine with enough memory and 8 cores.   Schema has 5 stored fields,
> 4 of them indexed no positions no norms.
> Average net document size (optimized index size / number of documents)
> is around 100 bytes.
>
> On a test with 40 Mio document:
> - we had update ingest rate  on first 4,4Mio documents @  incredible
> 34k records / second...
> - then it dropped, suddenly to 20k records per second and this rate
> remained stable (variance 1k) until...
> - we hit 13Mio, where ingest rate dropped again really hard, from one
> instant in time to another to 10k records per second.
>
> it stayed there until we reached the end @40Mio (slightly reducing, to
> ca 9k, but this is not long enough to see trend).
>
> Nothing unusual happening with JVM memory (a fully regular sawtooth,
> 200-450M). CPU in turn was following the ingest rate trend, indicating
> that we were waiting on something. No searches, no commits, nothing.
>
> autoCommit was turned off. Updates were streaming directly from the database.
>
> -
> I did not expect something like this, knowing lucene merges in
> background. Also, having such sudden drops in ingest rate is
> indicative that we are not leaking something. (drop would have been
> much more gradual). It is some caches, but why two really significant
> drops? 33k/sec to 20k and then to 10k... We would love to keep it @34
> k/second :)
>
> I am not really acquainted with the new MergePolicy and flushing
> settings, but I suspect this is something there we could tweak.
>
> Could it be windows is somehow, hmm, quirky with solr default
> directory on win64/jvm (I think it is MMAP by default)... We did not
> saturate IO with such a small documents I guess, It is a just couple
> of Gig over 1-2 hours.
>
> All in all, it works good, but is having such hard update ingest rate
> drops normal?
>
> Thanks,
> eks.
>


Re: Update ingest rate drops suddenly

2011-09-25 Thread eks dev
Thanks Otis,
we will look into these issues again, slightly deeper. Network
problems are not likely, but DB, I do not know, this is huge select
... we will try to scan db, without indexing, just to see if it can
sustain... But gut feeling says, nope, this is not the one.

IO saturation would surprise me, but you never know. Might be very
well that SSD is somehow having problems with this sustained
throughput.

8 Core... no, this was single update thread.

we left default index settings (do not tweak if it works :)
<ramBufferSizeMB>32</ramBufferSizeMB>

32MB sounds like a lot for our documents (100b average on-disk size).
Assuming a RAM efficiency of 50% (?), we land at 100k buffered
documents. Yes, this is kind of smallish, as every ~3 seconds we
fill up the ramBuffer. (Our Analyzers surprised me with 30k+ records
per second.)

256 will do the job; ~24 seconds should be plenty of "idle" time for
IO-OS-JVM to sort out MMAP issues, if any (Windows was never an MMAP
performance champion when used from Java, but once you dance around
it, it works OK)...


The max JVM heap on this test was 768m; memory never went above 500m.
Using -XX:-UseParallelGC ... this is definitely not a GC problem.

cheers,
eks
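The buffer arithmetic in this message can be sanity-checked with a one-liner. The 100k-buffered-documents figure is the poster's own 50%-efficiency guess, and 30k docs/sec is the measured analyzer throughput from the thread; nothing here is an actual Solr/Lucene computation:

```java
// Rough flush-interval arithmetic for the ramBuffer discussion above.
public class BufferMath {
    // seconds until the buffer fills at the given ingest rate
    static double fillSeconds(long bufferedDocs, long docsPerSecond) {
        return (double) bufferedDocs / docsPerSecond;
    }

    public static void main(String[] args) {
        // ~100k docs buffered in a 32MB ramBuffer, at ~30k docs/sec:
        // flushes roughly every 3.3 seconds, matching "every ~3 seconds"
        System.out.println(fillSeconds(100_000, 30_000));
        // an 8x larger buffer (256MB -> ~800k docs): roughly every ~27 seconds,
        // in the ballpark of the "~24 seconds" mentioned above
        System.out.println(fillSeconds(800_000, 30_000));
    }
}
```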


On Sun, Sep 25, 2011 at 6:20 AM, Otis Gospodnetic
 wrote:
> eks,
>
> This is clear as day - you're using Winblows!  Kidding.
>
> I'd:
> * watch IO with something like vmstat 2 and see if the rate drops correlate 
> to increased disk IO or IO wait time
> * monitor the DB from which you were pulling the data - maybe the DB or the 
> server that runs it had issues
> * monitor the network over which you pull data from DB
>
> If none of the above reveals the problem I'd still:
> * grab all data you need to index and copy it locally
> * index everything locally
>
> Out of curiosity, how big is your ramBufferSizeMB and your -Xmx?
> And on that 8-core box you have ~8 indexing threads going?
>
> Otis
> 
> Sematext is Hiring -- http://sematext.com/about/jobs.html
>
>
>
>
>>
>>From: eks dev 
>>To: solr-user 
>>Sent: Saturday, September 24, 2011 3:18 PM
>>Subject: Update ingest rate drops suddenly
>>
>>just looking for hints where to look for...
>>
>>We were testing single threaded ingest rate on solr, trunk version on
>>atypical collection (a lot of small documents), and we noticed
>>something we are not able to explain.
>>
>>Setup:
>>We use defaults for index settings, windows 64 bit, jdk 7 U2. on SSD,
>>machine with enough memory and 8 cores.   Schema has 5 stored fields,
>>4 of them indexed no positions no norms.
>>Average net document size (optimized index size / number of documents)
>>is around 100 bytes.
>>
>>On a test with 40 Mio document:
>>- we had update ingest rate  on first 4,4Mio documents @  incredible
>>34k records / second...
>>- then it dropped, suddenly to 20k records per second and this rate
>>remained stable (variance 1k) until...
>>- we hit 13Mio, where ingest rate dropped again really hard, from one
>>instant in time to another to 10k records per second.
>>
>>it stayed there until we reached the end @40Mio (slightly reducing, to
>>ca 9k, but this is not long enough to see trend).
>>
>>Nothing unusual happening with JVM memory (a fully regular sawtooth,
>>200-450M). CPU in turn was following the ingest rate trend, indicating
>>that we were waiting on something. No searches, no commits, nothing.
>>
>>autoCommit was turned off. Updates were streaming directly from the database.
>>
>>-
>>I did not expect something like this, knowing lucene merges in
>>background. Also, having such sudden drops in ingest rate is
>>indicative that we are not leaking something. (drop would have been
>>much more gradual). It is some caches, but why two really significant
>>drops? 33k/sec to 20k and then to 10k... We would love to keep it @34
>>k/second :)
>>
>>I am not really acquainted with the new MergePolicy and flushing
>>settings, but I suspect this is something there we could tweak.
>>
>>Could it be windows is somehow, hmm, quirky with solr default
>>directory on win64/jvm (I think it is MMAP by default)... We did not
>>saturate IO with such a small documents I guess, It is a just couple
>>of Gig over 1-2 hours.
>>
>>All in all, it works good, but is having such hard update ingest rate
>>drops normal?
>>
>>Thanks,
>>eks.
>>
>>
>>


Update ingest rate drops suddenly

2011-09-24 Thread eks dev
just looking for hints where to look for...

We were testing single threaded ingest rate on solr, trunk version on
atypical collection (a lot of small documents), and we noticed
something we are not able to explain.

Setup:
We use defaults for index settings; Windows 64-bit, JDK 7u2, on SSD, a
machine with enough memory and 8 cores. The schema has 5 stored fields,
4 of them indexed, no positions, no norms.
Average net document size (optimized index size / number of documents)
is around 100 bytes.

On a test with 40 Mio documents:
- we had an update ingest rate on the first 4.4 Mio documents @ an
incredible 34k records/second...
- then it dropped suddenly to 20k records per second, and this rate
remained stable (variance 1k) until...
- we hit 13 Mio, where the ingest rate dropped again really hard, from
one instant in time to another, to 10k records per second.

It stayed there until we reached the end @ 40 Mio (slightly reducing to
ca. 9k, but this is not long enough to see a trend).

Nothing unusual was happening with JVM memory (a fully regular sawtooth,
200-450M). CPU in turn was following the ingest rate trend, indicating
that we were waiting on something. No searches, no commits, nothing.

autoCommit was turned off. Updates were streaming directly from the database.

-
I did not expect something like this, knowing Lucene merges in the
background. Also, having such sudden drops in ingest rate is an
indication that we are not leaking something (the drop would have been
much more gradual). It is some cache, but why two really significant
drops? 33k/sec to 20k and then to 10k... We would love to keep it
@ 34k/second :)

I am not really acquainted with the new MergePolicy and flushing
settings, but I suspect there is something there we could tweak.

Could it be that Windows is somehow, hmm, quirky with the Solr default
directory on win64/JVM (I think it is MMAP by default)... We did not
saturate IO with such small documents, I guess; it is just a couple
of Gig over 1-2 hours.

All in all, it works well, but are such hard update ingest rate
drops normal?

Thanks,
eks.



Which Solr / Lucene directory for ramfs?

2011-09-16 Thread eks dev
Probably a stupid question:

Which Directory implementation is best suited for an index
mounted on ramfs/tmpfs? I guess plain old FSDirectory (or mmap/nio?)


Re: DataImportHandler using new connection on each query

2011-09-02 Thread eks dev
Watch out: "running 10 hours" != "idling 10 seconds" and trying again.
Those are different cases.

It is not dropping *used* connections (good to know it works that
well, thanks for reporting!), just not reusing connections that have
been idle for more than 10 seconds.



On Fri, Sep 2, 2011 at 10:26 PM, Gora Mohanty  wrote:
> On Sat, Sep 3, 2011 at 1:38 AM, Shawn Heisey  wrote:
> [...]
>> I use DIH with MySQL.  When things are going well, a full rebuild will leave
>> connections open and active for over two hours.  This is the case with
>> 1.4.0, 1.4.1, 3.1.0, and 3.2.0.  Due to some kind of problem on the database
>> server, last night I had a rebuild going for more than 11 hours with no
>> problems, verified from the processlist on the server.
>
> Will second that. Have had DIH connections open to both
> mysql, and MS-SQL for 8-10h. Dropped connections could
> be traced to network issues, or some other exception.
>
> Regards,
> Gora
>



Re: DataImportHandler using new connection on each query

2011-09-02 Thread eks dev
I am not sure if the current version has this, but DIH used to reload
connections after some idle time:

if (currTime - connLastUsed > CONN_TIME_OUT) {
  synchronized (this) {
    Connection tmpConn = factory.call();
    closeConnection();
    connLastUsed = System.currentTimeMillis();
    return conn = tmpConn;
  }
}


Where CONN_TIME_OUT = 10 seconds
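The reconnect logic in that snippet can be sketched as a small self-contained class. The injected clock and the generic resource type are my additions for testability; the class and method names are illustrative, not DIH API. In DIH itself the clock is System.currentTimeMillis() and the timeout is the 10-second constant above.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Generic sketch of DIH's idle-timeout pattern: a cached resource is
// replaced if it has sat unused longer than the timeout.
public class IdleTimeoutHolder<T> {
    private final Supplier<T> factory;
    private final LongSupplier clock;   // injected so tests need not sleep
    private final long timeoutMs;
    private T resource;
    private long lastUsed;

    public IdleTimeoutHolder(Supplier<T> factory, LongSupplier clock, long timeoutMs) {
        this.factory = factory;
        this.clock = clock;
        this.timeoutMs = timeoutMs;
        this.resource = factory.get();
        this.lastUsed = clock.getAsLong();
    }

    // Returns the cached resource, first replacing it if it has been
    // idle longer than the timeout -- mirroring DIH's reconnect check.
    public synchronized T get() {
        long now = clock.getAsLong();
        if (now - lastUsed > timeoutMs) {
            resource = factory.get();   // idle too long: open a fresh one
        }
        lastUsed = now;
        return resource;
    }

    public static void main(String[] args) {
        long[] fakeNow = {0};
        IdleTimeoutHolder<String> h = new IdleTimeoutHolder<>(
                () -> "conn@" + System.nanoTime(), () -> fakeNow[0], 10_000);
        System.out.println(h.get());    // within the timeout: reused
        fakeNow[0] = 20_000;
        System.out.println(h.get());    // idle 20s > 10s: replaced
    }
}
```

This makes the behaviour in the thread concrete: a connection "running for 10 hours" is continuously used and never replaced, while one idle for more than the timeout is.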



On Fri, Sep 2, 2011 at 12:36 AM, Chris Hostetter
 wrote:
>
> : However, I tested this against a slower SQL Server and I saw
> : dramatically worse results. Instead of re-using their database, each of
> : the sub-entities is recreating a connection each time the query runs.
>
> are you seeing any specific errors logged before these new connections are
> created?
>
> I don't *think* there's anything in the DIH JDBC/SQL code that causes it
> to timeout existing connections -- is it possible this is sometihng
> specific to the JDBC Driver you are using?
>
> Or maybe you are using the DIH "threads" option along with a JNDI/JDBC
> based pool of connections that is configured to create new Connections on
> demand, and with the fast DB it can reuse them but on the slow DB it does
> enough stuff in parallel to keep asking for new connections to be created?
>
>
> If it's DIH creating new connections over and over then i'm pretty sure
> you should see an INFO level log message like this for each connection...
>
>        LOG.info("Creating a connection for entity "
>                + context.getEntityAttribute(DataImporter.NAME) + " with URL: "
>                + url);
>
> ...are those messages different against you fast DB and your slow DB?
>
> -Hoss
>


NRT in Master- Slave setup, crazy?

2011-08-11 Thread eks dev
Thinking aloud and grateful for sparing ..

I need to support high commit rate (low update latency) in a master
slave setup and I have a bad feelings about it, even with disabling
warmup and stripping everything down that slows down refresh.

I will try it anyway, but I started thinking about "backup plan", like
NRT on slaves.

An idea is to have the Master working on disk, doing commits in a
throughput-friendly manner (e.g. every 5-10 minutes), but let the slaves
apply the same updates with softCommit.

I am basically going to let the slaves "possibly run out of sync" with
the master, by issuing the same updates on all slaves with softCommit ...
every now and then syncing with the Master.

Could this work? The trick is, the index is big (it can fit in ca.
16-20G RAM), but the update rate is small and unevenly distributed in
time (a few documents every couple of seconds); one hard commit on the
master + a slave update would probably cost much more than an
add(document) with softCommit on every slave (2-5 of them).

So all in all, the master remains the real master and is there to ensure:
 a) seeding if a slave restarts
 b) an authoritative master index, if the slaves run out of sync (a small
diff is OK if it gets corrected once a day)

In general, do you find such an idea wrong for some reason? Should I
be doing something else/better to achieve low update latency in a
master-slave setup (for low update throughput)?

Is there anything I can do to make standard master-slave latency better,
apart from disabling warmup? Would loading an OS ramdisk (tmpfs, forced
into RAM) on the slaves bring much?

I am talking about a ca. 1 second (plus/minus) update latency target
from update to search on a slave... but not more than 0.5-2 updates
every second. And from what I have understood so far about how Solr
works, this is going to be possible only with NRT on the slaves
(analysis in my case is fast, so not an issue)...


Re: DIH on sequence (or any type that supports ordering) possible?

2011-08-06 Thread eks dev
Thanks Shawn, nice! I didn't notice you can pass more params all the
way to the SQL.

So you really do not care about the DIH incremental facility; you use it
just as a vehicle to provide:
- SQL import
- a transactional commit to Solr on updates...

But keeping DB/Solr in sync is externalized (I am trying to find a
simple/robust solution for this part as well...).

I am researching possibilities to get this information from the Lucene
index itself ("what was the last document added?"), and then read the
stored ID field from it to feed a DIH query like yours.

This should be an easy question for Solr/Lucene to answer, but I really
do not know a simple and fast way...


cheers,
eks


On Sat, Aug 6, 2011 at 8:32 PM, Shawn Heisey  wrote:
> On 8/6/2011 8:49 AM, eks dev wrote:
>>
>> I would appreciate some clarifications about DIH
>>
>> I do not have reliable timestamp, but I do have atomic sequence that
>> only grows on inserts/changes.
>
> I use DIH, but I don't use the built-in timestamp facility at all.  I have
> an autoincrement field in a MySQL database that tells me what's new.  Here
> are the three queries I have defined in dih-config.xml:
>
>      query="
>        SELECT * FROM ${dataimporter.request.dataView}
>        WHERE (
>          (
>            did > ${dataimporter.request.minDid}
>            AND did <= ${dataimporter.request.maxDid}
>          )
>          ${dataimporter.request.extraWhere}
>        ) AND (crc32(did) % ${dataimporter.request.numShards})
>          IN (${dataimporter.request.modVal})
>        "
>      deltaImportQuery="
>        SELECT * FROM ${dataimporter.request.dataView}
>        WHERE (
>          (
>            did > ${dataimporter.request.minDid}
>            AND did <= ${dataimporter.request.maxDid}
>          )
>          ${dataimporter.request.extraWhere}
>        ) AND (crc32(did) % ${dataimporter.request.numShards})
>          IN (${dataimporter.request.modVal})
>        "
>      deltaQuery="SELECT 1 AS did"
>
> If you look carefully, you'll notice that query and deltaImportQuery are
> identical, and deltaQuery is just something that always returns a value.  I
> keep track of did (the primary key for both dih-config and the database) in
> my build system, passing in minDid and maxDid parameters on the DIH URL to
> tell it what to index.  I include more parameters to handle sharding and
> special situations.  I actually use a different field (with its own unique
> MySQL index) as Solr's uniqueKey.
>
> Currently Solr does not support keeping track of arbitrary data, just the
> current timestamp ... but if you can track it outside of Solr and pass the
> appropriate parameters in with the full-import or delta-import request, you
> can do almost anything.
>
> This is on Solr 3.2, but I used a similar setup when I was running 1.4.1 as
> well.
>
> Shawn
>
>


DIH on sequence (or any type that supports ordering) possible?

2011-08-06 Thread eks dev
I would appreciate some clarifications about DIH

I do not have reliable timestamp, but I do have atomic sequence that
only grows on inserts/changes.
You can understand it as a timestamp in some funky timezone not
related to wall-clock time; it is an integer type.

Is DIH keeping track of the MAX(committed timestamp), or does it expect
the timestamp in the DB to be wall-clock time?
If it expects a wall-clock timestamp, casting the integer sequence value
to a timestamp (like the number of seconds since a constant point in
time) at read time would not work...

Ideally for my case, DIH should keep MAX(whatever_field_specified)...

Maybe an idea would be to modify DIH to support passing max(of the
specified field in dih config) to the Lucene
IndexWriter.commit(Map commitUserData)

Later, just read IndexReader.getCommitUserData() and pass it to SQL as
${last.committed.sequence}

This would have the charming property, in a master-slave setup, of
continuing to work after a master failover without touching anything;
every slave could take over at any time

Second question is related to the delta queries as well. I know I have
no deletes/modifications in my DataSource, only additions.
Can I prevent DIH from trying to resolve deletes... my delta is fully
qualified by:
select * from source_table
where my_sequence > ${last.committed.sequence}

I imagine this step takes a lot of time to look up every document ID in the index?

Thanks in advance,
eks
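Until DIH can track an arbitrary field, the sequence can be kept outside Solr and passed in on the import request, along the lines of Shawn's setup quoted above (a sketch; lastSeq is a hypothetical request parameter maintained by the caller, not a built-in):

```xml
<entity name="docs" pk="did"
        query="SELECT * FROM source_table
               WHERE my_sequence &gt; ${dataimporter.request.lastSeq}"/>
```

The caller reads the highest committed sequence from wherever it is tracked and appends &lastSeq=... to the full-import URL.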


Re: Matching queries on a per-element basis against a multivalued field

2011-08-02 Thread eks dev
Sure, I know...,
the point I was trying to make: "if someone serious like Lucid is
using solr 4.x as a core technology for its own customers, the trunk could
not be all that bad" => release date not as far off as 2012 :)


On Tue, Aug 2, 2011 at 11:33 PM, Smiley, David W.  wrote:
> "LucidWorks Enterprise" (which is more than Solr, and a modified Solr at 
> that) isn't free; so you can't extract the Solr part of that package and use 
> it unless you are willing to pay them.
>
> Lucid's "Certified Solr", on the other hand, is free.  But they have yet to 
> bump that to trunk/4.x; it was only recently updated to 3.2.
>
> On Aug 2, 2011, at 5:26 PM, eks dev wrote:
>
>> Well, Lucid released "LucidWorks Enterprise"
>> with  " Complete Apache Solr 4.x Release Integrated and tested with
>> powerful enhancements"
>>
>> Whatever it means for solr 4.0
>>
>>
>>
>> On Tue, Aug 2, 2011 at 11:10 PM, David Smiley (@MITRE.org)
>>  wrote:
>>> My best guess (and it is just a guess) is between December and March.
>>>
>>> The roots of Solr 4 which triggered the major version change is known as
>>> "flexible indexing" (or just "flex" for short amongst developers).  The
>>> genesis of it was posted to JIRA as a patch on 18 November 2008 --
>>> LUCENE-1458 (almost 3 years ago!). About a year later it was committed into
>>> a special flex branch that is probably gone now, and then around
>>> April/early-May 2010, it went into trunk whereas the pre-flex code on trunk
>>> went to a newly formed 3x branch. That is ancient history now, and there are
>>> some amazing performance improvements tied to flex that haven't seen the
>>> light of day in an official release. It's a shame, really. So it's been so
>>> long that, well, after it dawns on everyone that the code is 3
>>> friggin years old without a release -- it's time to get on with the show.
>>>
>>> ~ David Smiley
>>>
>>> -
>>>  Author: https://www.packtpub.com/solr-1-4-enterprise-search-server/book
>>> --
>>> View this message in context: 
>>> http://lucene.472066.n3.nabble.com/Matching-queries-on-a-per-element-basis-against-a-multivalued-field-tp3217432p3220242.html
>>> Sent from the Solr - User mailing list archive at Nabble.com.
>>>
>
>


Re: Matching queries on a per-element basis against a multivalued field

2011-08-02 Thread eks dev
Well, Lucid released "LucidWorks Enterprise"
with  " Complete Apache Solr 4.x Release Integrated and tested with
powerful enhancements"

Whatever it means for solr 4.0



On Tue, Aug 2, 2011 at 11:10 PM, David Smiley (@MITRE.org)
 wrote:
> My best guess (and it is just a guess) is between December and March.
>
> The roots of Solr 4 which triggered the major version change is known as
> "flexible indexing" (or just "flex" for short amongst developers).  The
> genesis of it was posted to JIRA as a patch on 18 November 2008 --
> LUCENE-1458 (almost 3 years ago!). About a year later it was committed into
> a special flex branch that is probably gone now, and then around
> April/early-May 2010, it went into trunk whereas the pre-flex code on trunk
> went to a newly formed 3x branch. That is ancient history now, and there are
> some amazing performance improvements tied to flex that haven't seen the
> light of day in an official release. It's a shame, really. So it's been so
> long that, well, after it dawns on everyone that the code is 3
> friggin years old without a release -- it's time to get on with the show.
>
> ~ David Smiley
>
> -
>  Author: https://www.packtpub.com/solr-1-4-enterprise-search-server/book
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Matching-queries-on-a-per-element-basis-against-a-multivalued-field-tp3217432p3220242.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: conditionally update document on unique id

2011-06-29 Thread eks dev
Hi Yonik,
as this recommendation comes from you, I am not going to test it, you
are well known as a speed junkie ;)

While we are at it (in SignatureUpdateProcessor), why is this code not
moved to the constructor, instead of remaining in processAdd:

...
Signature sig = (Signature)
req.getCore().getResourceLoader().newInstance(signatureClass);
sig.init(params);
...
Should we be expecting on-the-fly signatureClass/params changes? I
am still not all that familiar with solr life cycles... might be a
stupid question.

Thanks,
eks


On Wed, Jun 29, 2011 at 10:36 PM, Yonik Seeley
 wrote:
> On Wed, Jun 29, 2011 at 4:32 PM, eks dev  wrote:
>> req.getSearcher().getFirstMatch(t) != -1;
>
> Yep, this is currently the fastest option we have.
>
> -Yonik
> http://www.lucidimagination.com
>


Re: conditionally update document on unique id

2011-06-29 Thread eks dev
Thanks Shalin!

would you not expect

req.getSearcher().docFreq(t);

to be slightly faster? Or maybe even

req.getSearcher().getFirstMatch(t) != -1;

which one should be faster, any known side effects?




On Wed, Jun 29, 2011 at 1:45 PM, Shalin Shekhar Mangar
 wrote:
> On Wed, Jun 29, 2011 at 2:01 AM, eks dev  wrote:
>
>> Quick question,
>> Is there a way with solr to conditionally update document on unique
>> id? Meaning, default, add behavior if id is not already in index and
>> *not to touch index" if already there.
>>
>> Deletes are not important (no sync issues).
>>
>> I am asking because I noticed with deduplication turned on,
>> index-files get modified even if I update the same documents again
>> (same signatures).
>> I am facing very high dupes rate (40-50%), and setup is going to be
>> master-slave with high commit rate (requirement is to reduce
>> propagation latency for updates). Having unnecessary index
>> modifications is going to waste  "effort" to ship the same information
>> again and again.
>>
>> if there is no standard way, what would be the fastest way to check if
>> Term exists in index from UpdateRequestProcessor?
>>
>>
> I'd suggest that you use the searcher's getDocSet with a TermQuery.
>
> Use the SolrQueryRequest#getSearcher so you don't need to worry about ref
> counting.
>
> e.g. req.getSearcher().getDocSet(new TermQuery(new Term(signatureField,
> sigString))).size();
>
>
>
>> I intend to extend SignatureUpdateProcessor to prevent a document from
>> propagating down the chain if this happens?
>> Would that be a way to deal with it? I repeat, there are no deletes to
>> make headaches with synchronization
>>
>>
> Yes, that should be fine.
>
> --
> Regards,
> Shalin Shekhar Mangar.
>
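Stripped of Solr internals, the check the thread converges on is just "let the document continue down the update chain only if its signature term is absent from the index". A minimal stand-in sketch (a HashSet plays the role of the index here; in a real UpdateRequestProcessor the membership test would be req.getSearcher().getFirstMatch(new Term(sigField, sig)) != -1, and all names below are hypothetical):

```java
import java.util.HashSet;
import java.util.Set;

public class SkipIfExists {
    // Stand-in for the set of signatures already committed to the index.
    private final Set<String> indexedSignatures = new HashSet<>();

    // Returns true if the document should propagate down the chain,
    // i.e. its signature has not been seen before.
    boolean processAdd(String signature) {
        // add() returns false when the signature was already present,
        // which is exactly the "do not touch the index" case.
        return indexedSignatures.add(signature);
    }
}
```

The real version also has to worry about documents added but not yet committed, as Hoss notes elsewhere in the archive.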


Re: Using RAMDirectoryFactory in Master/Slave setup

2011-06-29 Thread eks dev
sure,  SSD or RAM disks fix these problems with IO.

Anyhow, I can really see no alternative for some in memory index for
slaves, especially for low latency master-slave apps (high commit rate
is a problem).

Having the possibility to run slaves in memory that slurp updates
from the master seems to me like the preferred method (no
twiddling with the OS; CPU and RAM are all you need for your slaves:
run a slave and point it at the master). I assume that update propagation
times could be better by having
some sexy ReadOnlySlaveRAMDirectorySlurpingUpdatesFromTheMaster that
does reload() directly from the master (maybe even uncommitted,
somehow NRT-likish).

Point being, lower update latency than the current 1-5 minutes (wiki
recommended values) is not going to be possible with the current
master-slave solution, due to the nature of it (commit to disk on
master, copy delta to slave disk, reload...). This is a lot of ping
pong... ES and Solandra are by nature better suited if you need update
propagation in the seconds range.

It is just thinking aloud, and slightly off-topic... solr/lucene as it
is today, rocks  anyhow.

On Wed, Jun 29, 2011 at 10:55 AM, Toke Eskildsen  
wrote:
> On Wed, 2011-06-29 at 09:35 +0200, eks dev wrote:
>> In MMAP, you need to have really smart warm up (MMAP) to beat IO
>> quirks, for RAMDir  you need to tune gc(), choose your poison :)
>
> Other alternatives are operating system RAM disks (avoids the GC
> problem) and using SSDs (nearly the same performance as RAM).
>
>


Re: Using RAMDirectoryFactory in Master/Slave setup

2011-06-29 Thread eks dev
...Using RAMDirectory really does not help performance...

I kind of agree,  but in my experience with lucene,  there are cases
where RAMDirectory helps a lot, with all its drawbacks (huge heap and
gc() tuning).

We had very good experience with MMAP on average, but moving to
RAMDirectory with properly tuned gc() reduced 95% of "slow performers"
in the upper range of response times (e.g. the slowest 5% of queries). On average
it made practically no difference.
Maybe this is mitigated by better warm-up in solr than our hand-tuned
warmup, maybe not, I do not really know.

In MMAP, you need to have really smart warm up (MMAP) to beat IO
quirks, for RAMDir  you need to tune gc(), choose your poison :)

I argue, in some cases it is very hard to tame IO quirks (e.g. this is
shared resource, you never know what going really on in shared app
setup!). Then, see only what is happening on major merge and all these
efforts with native linux directory to somehow get a grip on that...
If you have spare ram, you are probably safer with RAMDirectory.

From the theoretical perspective, in the ideal case, RAM ought to be
faster than disk (and more expensive). If this is not the case, we did
something wrong.  I have a feeling that the work Mike is doing with
in-memory Codecs (fst TermDictionary, pulsing codec & co) in Lucene 4,
native directory features ... will make RAMDirectory really obsolete
for production setups.


Cheers,
eks




On Wed, Jun 29, 2011 at 6:00 AM, Lance Norskog  wrote:
> Using RAMDirectory really does not help performance. Java garbage
> collection has to work around all of the memory taken by the segments.
> It works out that Solr works better (for most indexes) without using
> the RAMDirectory.
>
>
>
> On Sun, Jun 26, 2011 at 2:07 PM, nipunb  wrote:
>> PS: Sorry if this is a repost, I was unable to see my message in the mailing
>> list - this may have been due to my outgoing email different from the one I
>> used to subscribe to the list with.
>>
>> Overview – Trying to evaluate if keeping the index in memory using
>> RAMDirectoryFactory can help in query performance.I am trying to perform the
>> indexing on the master using solr.StandardDirectoryFactory and make those
>> indexes accesible to the slave using solr.RAMDirectoryFactory
>>
>> Details:
>> We have set-up Solr in a master/slave enviornment. The index is built on the
>> master and then replicated to slaves which are used to serve the query.
>> The replication is done using the in-built Java replication in Solr.
>> On the master, in solrconfig.xml we have
>> <directoryFactory name="DirectoryFactory"
>>        class="solr.StandardDirectoryFactory"/>
>>
>> On the slave, I tried to use the following:
>>
>> <directoryFactory name="DirectoryFactory"
>>         class="solr.RAMDirectoryFactory"/>
>>
>> My slave shows no data for any queries. In solrconfig.xml it is mentioned
>> that replication doesn’t work when using RAMDirectoryFactory, however this (
>> https://issues.apache.org/jira/browse/SOLR-1379) mentions that you can use
>> it to have the index on disk and then load into memory.
>>
>> To test the sanity of my set-up, I changed solrconfig.xml in the slave
>> and replicated:
>> <directoryFactory name="DirectoryFactory"
>>        class="solr.StandardDirectoryFactory"/>
>> I was able to see the results.
>>
>> Shouldn’t RAMDirectoryFactory be used for reading index from disk into
>> memory?
>>
>> Any help/pointers in the right direction would be appreciated.
>>
>> Thanks!
>>
>> --
>> View this message in context: 
>> http://lucene.472066.n3.nabble.com/Using-RAMDirectoryFactory-in-Master-Slave-setup-tp3111792p3111792.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>>
>
>
>
> --
> Lance Norskog
> goks...@gmail.com
>


conditionally update document on unique id

2011-06-28 Thread eks dev
Quick question,
Is there a way with solr to conditionally update a document on unique
id? Meaning: default add behavior if the id is not already in the index, and
*not touch the index* if it is already there.

Deletes are not important (no sync issues).

I am asking because I noticed with deduplication turned on,
index-files get modified even if I update the same documents again
(same signatures).
I am facing very high dupes rate (40-50%), and setup is going to be
master-slave with high commit rate (requirement is to reduce
propagation latency for updates). Having unnecessary index
modifications is going to waste  "effort" to ship the same information
again and again.

if there is no standard way, what would be the fastest way to check if
Term exists in index from UpdateRequestProcessor?

I intend to extend SignatureUpdateProcessor to prevent a document from
propagating down the chain if this happens?
Would that be a way to deal with it? I repeat, there are no deletes to
make headaches with synchronization


Thanks,
eks


overwrite if not already in index?

2011-06-28 Thread eks dev
Quick question,
Is there a way with solr to conditionally update a document on unique
id? Meaning: default add behavior if the id is not already in the index, and
*not touch the index* if it is already there.

Deletes are not important (no sync issues).

I am asking because I noticed with deduplication turned on,
index-files get modified even if I update the same documents again
(same signatures).
I am facing very high dupes rate (40-50%), and setup is going to be
master-slave with high commit rate (requirement is to reduce
propagation latency for updates). Having unnecessary index
modifications is going to waste  "effort" to ship the same information
again and again.

if there is no standard way, what would be the fastest way to check if
Term exists in index from UpdateRequestProcessor?

I intend to extend SignatureUpdateProcessor to prevent a document from
propagating down the chain if this happens?
Would that be a way to deal with it? I repeat, there are no deletes to
make headaches with synchronization


Thanks,
eks


Re: Using RAMDirectoryFactory in Master/Slave setup

2011-06-27 Thread eks dev
Your best bet is MMapDirectoryFactory; you can come very close to the
performance of the RAMDirectory. Unfortunately, this
Master_on_disk->Slaves_in_ram type of setup is not possible using
solr.

We are moving our architecture to solr at the moment, and this is one
of the "missings" we have to somehow figure out.

The problem is that MMap works fine on average, but it also has quirks
in the upper quantiles of the response times.

If you are using RAMDirectory, you do not need to be afraid that
occasionally slow IO will kill performance for some of your requests.
This happens with MMAP, and not all that rare, depending on your usage
pattern (high update/commit rate, for example). I repeat, RAMDirectory
is hard to beat when it comes to reducing the IO-caused "outliers".

We removed some 90% of the slowest response times by using
RAMDirectory instead of MMap...
Depending on what you want to optimize, MMap can work just fine for
you, and has some nice properties, e.g. you do not need to tune gc()
as much as when you manage a bigger heap (RAMDirectory...)

But, imo, it would make sense to have some possibility to do it in solr.







On Mon, Jun 27, 2011 at 10:50 AM, Shalin Shekhar Mangar
 wrote:
> On Mon, Jun 27, 2011 at 12:49 PM, nipunb  wrote:
>> I found a similar post -
>> http://lucene.472066.n3.nabble.com/Problems-with-RAMDirectory-in-Solr-td1575223.html
>> It mentions that Java based replication might work (This is what I have
>> used, but didn't work for me)
>
> Solr Replication does not work with non-file directory implementations.
>
> --
> Regards,
> Shalin Shekhar Mangar.
>
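For reference, the swap eks suggests is a one-line directoryFactory change in solrconfig.xml (a sketch; MMapDirectoryFactory keeps the index on disk, so the standard Java replication continues to work, unlike RAMDirectoryFactory):

```xml
<directoryFactory name="DirectoryFactory"
                  class="solr.MMapDirectoryFactory"/>
```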


Re: Question about http://wiki.apache.org/solr/Deduplication

2011-04-04 Thread eks dev
Thanks Hoss,

Externalizing this part is exactly the path we are exploring now, not
only for this reason.

We already started testing Hadoop SequenceFile for write ahead log for
updates/deletes.
SequenceFile supports append now (simply great!). It was a pain to
have to add hadoop into the mix for "mortal" collection
sizes of 200 Mio, but on the other side, having hadoop around offers
huge flexibility.
Write ahead log catches update commands (all solr slaves, fronting
clients accept updates but only to forward them to WAL). Solr master
is trying to catch up with update stream indexing in async fashion,
and finally solr slaves are chasing master index with standard solr
replication.
Overnight we run simple map reduce jobs to consolidate, normalize and
sort update stream and reindex at the end.
Deduplication and collection sorting are for us only an optimization
if done reasonably often, like once per day/week, but if we do not
do it, it doubles HW resources.

Imo, native WAL support in solr would definitely be a nice "nice to
have" (for HA, update scalability...). The charming thing about WAL is that
updates never wait/disappear; with too much traffic, we only have
slightly higher update latency, but updates definitely get processed.
Some basic primitives on WAL (consolidation, replaying update stream
on solr etc...)  should be supported in this case, sort of "smallish
hadoop features subset for solr clusters", but nothing oversized.

Cheers,
eks









On Sun, Apr 3, 2011 at 1:05 AM, Chris Hostetter
 wrote:
>
> : Is it possible in solr to have multivalued "id"? Or I need to make my
> : own "mv_ID" for this? Any ideas how to achieve this efficiently?
>
> This isn't something the SignatureUpdateProcessor is going to be able to
> help you with -- it does the deduplication by changing the low level
> "update" (implemented as a delete then add) so that the key used to delete
> the older documents is based on the signature field instead of the id
> field.
>
> in order to do what you are describing, you would need to query the index
> for matching signatures, then add the resulting ids to your document
> before doing that "update"
>
> You could possibly do this in a custom UpdateProcessor, but you'd have to
> do something tricky to ensure you didn't overlook docs that had been added
> but not yet committed when checking for dups.
>
> I don't have a good suggestion for how to do this internally in Solr -- it
> seems like the type of bulk processing logic that would be better suited
> for an external process before you ever start indexing (much like link
> analysis for back references)
>
> -Hoss
>


Deduplication questions

2011-03-25 Thread eks dev
Q1. Is it possible to pass *analyzed* content to the

public abstract class Signature {
  public void init(SolrParams nl) {  }
  public abstract String calculate(String content);
}


Q2. Method calculate() uses concatenated fields from name,features,cat.
Is there any mechanism to build "field-dependent signatures"?

Use case for this: I have two fields:
OWNER , TEXT
I need to disable *fuzzy* duplicates for one owner; one clean way
would be to make a prefixed signature "OWNER/FUZZY_SIGNATURE".

Is the idea of making two UpdateProcessors and chaining them OK? (It
is ugly, but would work)

  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <bool name="overwriteDupes">false</bool>
    <str name="signatureField">exact_signature</str>
    <str name="fields">OWNER</str>
    <str name="signatureClass">ExactSignature</str>
  </processor>

hard_signature should be a non-stored and non-indexed field

  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <bool name="overwriteDupes">true</bool>
    <str name="signatureField">mixed_signature</str>
    <str name="fields">exact_signature, TEXT</str>
    <str name="signatureClass">MixedSignature</str>
  </processor>

Assuming I know how long my exact_signature is, I could calculate
fuzzy part and mix it properly.

Possible, better ideas?

Thanks,
eks
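The OWNER-prefixed signature idea from Q2 can be sketched without Solr's Signature plumbing (a minimal stand-in; in a real Signature subclass this would be the body of calculate(), and the MD5-over-normalized-text "fuzzy" part is an assumption for illustration, not a real fuzzy signature):

```java
import java.math.BigInteger;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class OwnerPrefixedSignature {
    // Prefix the content signature with the OWNER field so deduplication
    // only collapses documents belonging to the same owner.
    static String calculate(String owner, String text) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            // crude stand-in for fuzziness: lowercase and collapse whitespace
            String normalized = text.toLowerCase().replaceAll("\\s+", " ").trim();
            byte[] digest = md5.digest(normalized.getBytes());
            return owner + "/" + new BigInteger(1, digest).toString(16);
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException(e);
        }
    }
}
```

Two near-identical texts from the same owner collide, while the same text under different owners stays distinct, which is the whole point of the prefix.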


Question about http://wiki.apache.org/solr/Deduplication

2011-03-24 Thread eks dev
Hi,
The use case I am trying to figure out is about preserving IDs without
re-indexing on duplicates, instead adding the new ID to a list of
document id "aliases".

Example:
Input collection:
"id":1, "text":"dummy text 1", "signature":"A"
"id":2, "text":"dummy text 1", "signature":"A"

I add the first document to an empty index; the text is going to be indexed,
the ID is going to be "1", so far so good

Now the question: if I add the second document with id == "2", instead of
deleting/indexing this new document, I would like to store id == 2 in
the multivalued field "id"

At the end, I would have one document less indexed and both IDs are
going to be "searchable" (and stored as well)...

Is it possible in solr to have a multivalued "id"? Or do I need to make my
own "mv_ID" for this? Any ideas how to achieve this efficiently?

My target is not to add new documents if the signature matches, but to
have the IDs indexed and stored.

Thanks,
eks


Re: filter query from external list of Solr unique IDs

2010-10-16 Thread eks dev
if your index is read-only in production, can you add a
unique_id -> Lucene docId mapping in your kv store and build filters externally?
That would make the unique key obsolete in your production index, as you would
work at the lucene doc id level.

That way, you push the problem offline to the update/optimize phase. The ugly
part is a lot of updates on your kv-store...

I am not really familiar with solr, but working directly with lucene
this is doable: e.g. having a parallel index that has the unique ID as a
stored field, and another index with the indexed fields on the update
master, and then having only the index with the indexed fields in production.





On Fri, Oct 15, 2010 at 8:59 PM, Burton-West, Tom wrote:

> Hi Jonathan,
>
> The advantages of the obvious approach you outline are that it is simple,
> it fits in to the existing Solr model, it doesn't require any customization
> or modification to Solr/Lucene java code.  Unfortunately, it does not scale
> well.  We originally tried just what you suggest for our implementation of
> Collection Builder.  For a user's personal collection we had a table that
> maps the collection id to the unique Solr ids.
> Then when they wanted to search their collection, we just took their search
> and added a filter query with the fq=(id:1 OR id:2 OR).   I seem to
> remember running into a limit on the number of OR clauses allowed. Even if
> you can set that limit larger, there are a number of efficiency issues.
>
> We ended up constructing a separate Solr index where we have a multi-valued
> collection number field. Unfortunately, until incremental field updating
> gets implemented, this means that every time someone adds a document to a
> collection, the entire document (including 700KB of OCR) needs to be
> re-indexed just to update the collection number field. This approach has
> allowed us to scale up to a total of something under 100,000 documents, but
> we don't think we can scale it much beyond that for various reasons.
>
> I was actually thinking of some kind of custom Lucene/Solr component that
> would for example take a query parameter such as &lookitUp=123 and the
> component might do a JDBC query against a database or kv store and return
> results in some form that would be efficient for Solr/Lucene to process. (Of
> course this assumes that a JDBC query would be more efficient than just
> sending a long list of ids to Solr).  The other part of the equation is
> mapping the unique Solr ids to internal Lucene ids in order to implement a
> filter query.   I was wondering if something like the unique id to Lucene id
> mapper in zoie might be useful or if that is too specific to zoie. So this
> may be totally off-base, since I haven't looked at the zoie code at all yet.
>
> In our particular use case, we might be able to build some kind of
> in-memory map after we optimize an index and before we mount it in
> production. In our workflow, we update the index and optimize it before we
> release it and once it is released to production there is no
> indexing/merging taking place on the production index (so the internal
> Lucene ids don't change.)
>
> Tom
>
>
>
> -Original Message-
> From: Jonathan Rochkind [mailto:rochk...@jhu.edu]
> Sent: Friday, October 15, 2010 1:07 PM
> To: solr-user@lucene.apache.org
> Subject: RE: filter query from external list of Solr unique IDs
>
> Definitely interested in this.
>
> The naive obvious approach would be just putting all the ID's in the query.
> Like fq=(id:1 OR id:2 OR).  Or making it another clause in the 'q'.
>
> Can you outline what's wrong with this approach, to make it more clear
> what's needed in a solution?
> 
>
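For completeness, the naive approach under discussion amounts to string-building one giant boolean filter query (a sketch; the OR-clause limit Tom ran into is Solr's maxBooleanClauses, 1024 by default, which is exactly why this does not scale):

```java
import java.util.List;

public class FilterQueryBuilder {
    // Builds the naive fq=id:(1 OR 2 OR ...) filter from a list of ids.
    static String buildFq(List<String> ids) {
        return "id:(" + String.join(" OR ", ids) + ")";
    }

    public static void main(String[] args) {
        System.out.println(buildFq(List.of("1", "2", "3"))); // prints id:(1 OR 2 OR 3)
    }
}
```

Beyond the clause limit, each term in the list costs a dictionary lookup at query time, so a collection of tens of thousands of ids produces both very long URLs and slow filters.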