Re: Swapping indexes on disk

2017-06-25 Thread Mike Lissner
–  at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
590125502 [Thread-0] INFO  org.eclipse.jetty.util.thread.QueuedThreadPool  –  at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
590125502 [Thread-0] INFO  org.eclipse.jetty.util.thread.QueuedThreadPool  –  at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
590125502 [Thread-0] INFO  org.eclipse.jetty.util.thread.QueuedThreadPool  –  at org.eclipse.jetty.server.Server.handle(Server.java:368)
590125502 [Thread-0] INFO  org.eclipse.jetty.util.thread.QueuedThreadPool  –  at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489)
590125502 [Thread-0] INFO  org.eclipse.jetty.util.thread.QueuedThreadPool  –  at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
590125502 [Thread-0] INFO  org.eclipse.jetty.util.thread.QueuedThreadPool  –  at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:942)
590125502 [Thread-0] INFO  org.eclipse.jetty.util.thread.QueuedThreadPool  –  at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1004)
590125503 [Thread-0] INFO  org.eclipse.jetty.util.thread.QueuedThreadPool  –  at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:640)
590125503 [Thread-0] INFO  org.eclipse.jetty.util.thread.QueuedThreadPool  –  at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
590125503 [Thread-0] INFO  org.eclipse.jetty.util.thread.QueuedThreadPool  –  at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
590125503 [Thread-0] INFO  org.eclipse.jetty.util.thread.QueuedThreadPool  –  at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
590125503 [Thread-0] INFO  org.eclipse.jetty.util.thread.QueuedThreadPool  –  at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
590125503 [Thread-0] INFO  org.eclipse.jetty.util.thread.QueuedThreadPool  –  at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
590125503 [Thread-0] INFO  org.eclipse.jetty.util.thread.QueuedThreadPool  –  at java.lang.Thread.run(Thread.java:745)

Those are the last lines in the log, after all of the other indexes shut
down properly.

After that, a new log file is started, and it cannot start the index,
complaining about missing files. So at that point, the index is gone.

I'd love to prevent this from happening a third time. It's super baffling.
Any ideas?

Mike

On Tue, Jun 20, 2017 at 12:38 PM Mike Lissner <
mliss...@michaeljaylissner.com> wrote:

> Thanks for the suggestions everybody.
>
> Some responses to Shawn's questions:
>
> > Does your solr.xml file contain core definitions, or is that information
> in a core.properties file in each instanceDir?
>
> We're using core.properties files.
>
> > How did you install Solr
>
> Solr is installed just by downloading and unzipping. From there, we use
> the example directories as a starting point.
>
>
> > and how are you starting it?
>
> Using a pretty simple init script. Nothing too exotic here.
>
> > Do you have the full error and stacktrace from those null pointer
> exceptions?
>
> I put a log of the startup here:
> https://www.courtlistener.com/tools/sample-data/misc/null_logs.txt
>
> I created this by doing `grep -C 1000 -i nullpointer`, then cleaning out
> any private queries. I looked through it a bit. It looks like the index was
> missing a file, and was therefore unable to start up. I won't say it's
> impossible that the index was deleted before I started Solr, but it seemed
> to be operating fine using the other name prior to stopping solr and
> putting in a symlink. In the real-world logs, our disks are named /sata and
> /sata8 instead of /old and /new.
>
>
> > In the context of that information, what exactly did you do at each step
> of your process?
>
> The process above was pretty boring really.
>
> 1. Create new index and populate it:
>
>  - copied an existing index configuration into a new directory
>  - tweaked the dataDir parameter in core.properties
>  - restarted solr
>  - re-indexed the database using usual HTTP API to populate the new index
>
> 2. stop solr: sudo service solr stop
>
> 3. make symlink:
>
>  - mv'ed the old index out of the way
>  - ln -s old new (or vice versa, I never remember which way ln goes)
>
> 4. start solr: sudo service solr start
>
> FWIW, I've got it working now using the SWAP index functionality, so the
> above is just in case somebody wants to try to track this down. I'll
> probably take those logs offline after a week or two.
>
> Mike
>
>
> On Tue, Jun 20, 2017 at 7:20 AM Shawn Heisey <apa...@elyograg.org> wrote:

Re: Swapping indexes on disk

2017-06-20 Thread Mike Lissner
Thanks for the suggestions everybody.

Some responses to Shawn's questions:

> Does your solr.xml file contain core definitions, or is that information
in a core.properties file in each instanceDir?

We're using core.properties files.

> How did you install Solr

Solr is installed just by downloading and unzipping. From there, we use the
example directories as a starting point.

> and how are you starting it?

Using a pretty simple init script. Nothing too exotic here.

> Do you have the full error and stacktrace from those null pointer
exceptions?

I put a log of the startup here:
https://www.courtlistener.com/tools/sample-data/misc/null_logs.txt

I created this by doing `grep -C 1000 -i nullpointer`, then cleaning out
any private queries. I looked through it a bit. It looks like the index was
missing a file, and was therefore unable to start up. I won't say it's
impossible that the index was deleted before I started Solr, but it seemed
to be operating fine using the other name prior to stopping solr and
putting in a symlink. In the real-world logs, our disks are named /sata and
/sata8 instead of /old and /new.

> In the context of that information, what exactly did you do at each step
of your process?

The process above was pretty boring really.

1. Create new index and populate it:

 - copied an existing index configuration into a new directory
 - tweaked the dataDir parameter in core.properties
 - restarted solr
 - re-indexed the database using usual HTTP API to populate the new index

2. stop solr: sudo service solr stop

3. make symlink:

 - mv'ed the old index out of the way
 - ln -s old new (or vice versa, I never remember which way ln goes)

4. start solr: sudo service solr start
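
In shell terms, steps 2-4 amounted to roughly the following (a simplified
sketch; the real paths were /sata and /sata8 rather than /old and /new):

# Stop Solr before touching anything on disk.
sudo service solr stop

# Move the old index aside and point the old path at the new drive.
# ln -s takes the target first, then the name of the link to create.
mv /old /old.bak
ln -s /new /old

# Start Solr again; it still looks under /old, which now resolves to /new.
sudo service solr start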

FWIW, I've got it working now using the SWAP index functionality, so the
above is just in case somebody wants to try to track this down. I'll
probably take those logs offline after a week or two.
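
For reference, the CoreAdmin SWAP call is a single HTTP request along these
lines (the host and core names here are invented for illustration):

# SWAP exchanges the names of two cores, so the "live" name immediately
# serves the freshly built index; the old core can be unloaded afterwards.
curl "http://localhost:8983/solr/admin/cores?action=SWAP&core=live&other=rebuild"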

Mike


On Tue, Jun 20, 2017 at 7:20 AM Shawn Heisey <apa...@elyograg.org> wrote:

> On 6/14/2017 12:26 PM, Mike Lissner wrote:
> > We are replacing a drive mounted at /old with one mounted at /new. Our
> > index currently lives on /old, and our plan was to:
> >
> > 1. Create a new index on /new
> > 2. Reindex from our database so that the new index on /new is properly
> > populated.
> > 3. Stop solr.
> > 4. Symlink /old to /new (Solr now looks for the index at /old/solr, which
> > redirects to /new/solr)
> > 5. Start solr
> > 6. (Later) Stop solr, swap the drives (old for new), and start solr.
> (Solr
> > now looks for the index at /old/solr again, and finds it there.)
> > 7. Delete the index pointing to /new created in step 1.
> >
> > The idea was that this would create a new index for solr, would populate
> it
> > with the right content, and would avoid having to touch our existing solr
> > configurations aside from creating one new index, which we could soon
> > delete.
> >
> > I just did steps 1-5, but I got null pointer exceptions when starting
> solr,
> > and it appears that the index on /new has been almost completely deleted
> by
> > Solr (this is a bummer, since it takes days to populate).
> >
> > Is this expected? Am I terribly crazy to try to swap indexes on disk? As
> > far as I know, the only difference between the indexes is their name.
> >
> > We're using Solr version 4.10.4.
>
> Solr should not delete indexes on startup.  The only time it should do
> that is when you explicitly request deletion.  Do you have the full
> error and stacktrace from those null pointer exceptions?  Something
> would have to be very wrong for it to behave like you describe.
>
> Does your solr.xml file contain core definitions, or is that information
> in a core.properties file in each instanceDir?  The latter is the only
> option supported in 5.0 and later, but the 4.10 version still supports
> both.
>
> How is Solr itself and the data directories laid out?  How did you
> install Solr, and how are you starting it?  In the context of that
> information, what exactly did you do at each step of your process?
>
> Thanks,
> Shawn
>
>


Re: Swapping indexes on disk

2017-06-14 Thread Mike Lissner
I figured Solr would have a native replication system built in, but since we don't use
it already, I didn't want to learn all of its ins and outs just for this
disk situation.

Ditto, essentially, applies for the swapping strategy. We don't have a Solr
expert, just me, a generalist, and sorting out these kinds of things can
take a while. The hope was to avoid that kind of complication with some
clever use of symlinks and minor downtime. Our front end has a retry
mechanism, so if solr is down for less than a minute, users will just have
delayed responses, which is fine.

The new strategy is to rsync the files while solr is live, stop solr, do an
rsync diff, then start solr again. That'll give a bit-for-bit copy with
very little downtime — it's the strategy postgres recommends for disk-based
backups, so it seems like a safer bet. We needed a re-index anyway due to
schema changes, which my first attempt included, but I guess that'll have
to wait.
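
Concretely, the copy will look something like this (paths simplified to /old
and /new again):

# First pass while Solr is still running; this moves the bulk of the data.
rsync -a /old/solr/ /new/solr/

# Brief downtime: stop Solr, copy only what changed since the first pass,
# then start Solr again pointed at the new disk.
sudo service solr stop
rsync -a --delete /old/solr/ /new/solr/
sudo service solr start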

Thanks for the replies. If anybody can explain why the first strategy
failed, I'd still be interested in learning.

Mike

On Wed, Jun 14, 2017 at 12:09 PM Chris Ulicny <culicny@iq.media> wrote:

> Are you physically swapping the disks to introduce the new index? Or having
> both disks mounted at the same time?
>
> If the disks are simultaneously available, can you just swap the cores and
> then delete the core on the old disk?
>
> https://cwiki.apache.org/confluence/display/solr/CoreAdmin+API#CoreAdminAPI-SWAP
>
> We periodically move cores to different drives using solr's replication
> functionality and core swapping (after stopping replication). However, I've
> never encountered solr deleting an index like that.
>
>
>
> On Wed, Jun 14, 2017 at 2:48 PM David Hastings <
> hastings.recurs...@gmail.com>
> wrote:
>
> > I dont have an answer to why the folder got cleared, however i am
> wondering
> > why you arent using basic replication to do this exact same thing, since
> > solr will natively take care of all this for you with no interruption to
> > the user and no stop/start routines etc.
> >
> > On Wed, Jun 14, 2017 at 2:26 PM, Mike Lissner <
> > mliss...@michaeljaylissner.com> wrote:
> >
> > > We are replacing a drive mounted at /old with one mounted at /new. Our
> > > index currently lives on /old, and our plan was to:
> > >
> > > 1. Create a new index on /new
> > > 2. Reindex from our database so that the new index on /new is properly
> > > populated.
> > > 3. Stop solr.
> > > 4. Symlink /old to /new (Solr now looks for the index at /old/solr,
> which
> > > redirects to /new/solr)
> > > 5. Start solr
> > > 6. (Later) Stop solr, swap the drives (old for new), and start solr.
> > (Solr
> > > now looks for the index at /old/solr again, and finds it there.)
> > > 7. Delete the index pointing to /new created in step 1.
> > >
> > > The idea was that this would create a new index for solr, would
> populate
> > it
> > > with the right content, and would avoid having to touch our existing
> solr
> > > configurations aside from creating one new index, which we could soon
> > > delete.
> > >
> > > I just did steps 1-5, but I got null pointer exceptions when starting
> > solr,
> > > and it appears that the index on /new has been almost completely
> deleted
> > by
> > > Solr (this is a bummer, since it takes days to populate).
> > >
> > > Is this expected? Am I terribly crazy to try to swap indexes on disk?
> As
> > > far as I know, the only difference between the indexes is their name.
> > >
> > > We're using Solr version 4.10.4.
> > >
> > > Thank you,
> > >
> > > Mike
> > >
> >
>


Swapping indexes on disk

2017-06-14 Thread Mike Lissner
We are replacing a drive mounted at /old with one mounted at /new. Our
index currently lives on /old, and our plan was to:

1. Create a new index on /new
2. Reindex from our database so that the new index on /new is properly
populated.
3. Stop solr.
4. Symlink /old to /new (Solr now looks for the index at /old/solr, which
redirects to /new/solr)
5. Start solr
6. (Later) Stop solr, swap the drives (old for new), and start solr. (Solr
now looks for the index at /old/solr again, and finds it there.)
7. Delete the index pointing to /new created in step 1.

The idea was that this would create a new index for solr, would populate it
with the right content, and would avoid having to touch our existing solr
configurations aside from creating one new index, which we could soon
delete.

I just did steps 1-5, but I got null pointer exceptions when starting solr,
and it appears that the index on /new has been almost completely deleted by
Solr (this is a bummer, since it takes days to populate).

Is this expected? Am I terribly crazy to try to swap indexes on disk? As
far as I know, the only difference between the indexes is their name.

We're using Solr version 4.10.4.

Thank you,

Mike


Result Grouping vs. Collapsing Query Parser -- Can one be deprecated?

2016-10-19 Thread Mike Lissner
Hi all,

I've had a rotten day today because of Solr. I want to share my experience
and perhaps see if we can do something to fix this particular situation in
the future.

Solr currently has two ways to get grouped results (so far!). You can
either use Result Grouping or you can use the Collapsing Query Parser.
Result grouping seems like the obvious way to go. It's well documented, the
parameters are clear, it doesn't use a bunch of weird syntax (i.e.,
{!collapse blah=foo}), and it uses the feature name from SQL (so it comes
up in Google).

OTOH, if you use faceting with result grouping, which I imagine many people
do, you get terrible performance. In our case it went from subsecond to
10-120 seconds for big queries. Insanely bad.

Collapsing Query Parser looks like a good way forward for us, and we'll be
investigating that, but it uses the Expand component that our library
doesn't support, to say nothing of the truly bizarre syntax. So this will
be a fair amount of effort to switch.
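
To make the comparison concrete, the two request styles look roughly like
this (the collection and field names are invented for illustration):

# Result Grouping: self-explanatory parameters, but slow when combined
# with faceting.
curl -G "http://localhost:8983/solr/collection1/select" \
     --data-urlencode "q=some query" \
     --data-urlencode "group=true" \
     --data-urlencode "group.field=cluster_id"

# Collapsing query parser: the same one-document-per-group result expressed
# as a filter query, with the expand component returning collapsed members.
curl -G "http://localhost:8983/solr/collection1/select" \
     --data-urlencode "q=some query" \
     --data-urlencode "fq={!collapse field=cluster_id}" \
     --data-urlencode "expand=true"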

I'm curious if there is anything we can do to clean up this situation. What
I'd really like to do is:

1. Put a HUGE warning on the Result Grouping docs directing people away
from the feature if they plan to use faceting (or perhaps directing them
away no matter what?)

2. Work towards eliminating one or the other of these features. They're
nearly completely compatible, except for their syntax and performance. The
collapsing query parser apparently was only written because the result
grouping had such bad performance -- in other words, it doesn't exist to
provide unique features, it exists to be faster than the old way. Maybe we
can get rid of one or the other of these, taking the best parts from each
(syntax from Result Grouping, and performance from Collapse Query Parser)?

Thanks,

Mike

PS -- For some extra context, I want to share some other reasons this is
frustrating:

1. I just spent a week upgrading a third-party library so it would support
grouped results, and another week implementing the feature in our code with
tests and everything. That was a waste.
2. It's hard to notice performance issues until after you deploy to a big
data environment. This creates a bad situation for users until you detect
it and revert the new features.
3. The documentation *could* say something about the fact that a new
feature was developed to provide better performance for grouping. It could
say that using facets with groups is an anti-feature. It says neither.

I only mention these because, like others, I've had a real rough time with
solr (again), and these are the kinds of seemingly small things that could
have made all the difference.


Re: Real Time Search and External File Fields

2016-10-10 Thread Mike Lissner
Thanks for the replies. I made the changes so that the external file field
is loaded per:

[solrconfig.xml snippet stripped by the list archive]
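The standard configuration Solr documents for this is the
ExternalFileFieldReloader event listener (shown here as a generic example,
not necessarily the exact snippet that was posted); the <listener> lines go
inside the <query> section of solrconfig.xml:

# Reload external file fields whenever a new searcher is opened (and once at
# startup for the first searcher), instead of lazily on first use.
cat <<'EOF'
<listener event="newSearcher" class="org.apache.solr.schema.ExternalFileFieldReloader"/>
<listener event="firstSearcher" class="org.apache.solr.schema.ExternalFileFieldReloader"/>
EOF
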
Re: Real Time Search and External File Fields

2016-10-08 Thread Mike Lissner
On Fri, Oct 7, 2016 at 8:18 PM Erick Erickson wrote:

> What you haven't mentioned is how often you add new docs. Is it once a
> day? Steadily
> from 8:00 to 17:00?
>

Alas, it's a steady trickle during business hours. We're ingesting court
documents as they're posted on court websites, then sending alerts as soon
as possible.


> Whatever, your soft commit really should be longer than your autowarm
> interval. Configure
> autowarming to reference queries (firstSearcher or newSearcher events
> or autowarm
> counts in queryResultCache and filterCache. Say 16 in each of these
> latter for a start) such
> that they cause the external file to load. That _should_ prevent any
> queries from being
> blocked since the autowarming will happen in the background and while
> it's happening
> incoming queries will be served by the old searcher.
>

I want to make sure I understand this properly and document this for future
people that may find this thread. Here's what I interpret your advice to be:

0. Slacken my auto soft commit interval to something more like a minute.

1. Set up a query in the newSearcher listener that uses my external file
field.
1a. Do the same in firstSearcher if I want newly started solr to warm up
before getting queries (this doesn't matter to me, so I'm skipping this).

and/or

2. Set autowarmCount in queryResultCache and filterCache to 16 so that the
top 16 query results from the previous searcher are regenerated in the new
searcher.

Doing #1 seems like a safe strategy since it's guaranteed to hit the
external file field. #2 feels like a bonus.
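
In solrconfig.xml terms, my reading of that advice is roughly the sketch
below (the field name, cache sizes, and the exact warming query are just
examples, not anything Erick wrote):

# Printed with cat purely for display; the <autoSoftCommit> element belongs
# in <updateHandler>, the listener and caches in the <query> section.
cat <<'EOF'
<!-- 0. Soft commits once a minute instead of once a second. -->
<autoSoftCommit>
  <maxTime>60000</maxTime>
</autoSoftCommit>

<!-- 1. A newSearcher warming query that references the external file field,
     so the ~30 second file load happens in the background while the old
     searcher keeps serving requests. -->
<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst>
      <str name="q">{!boost b=external_pagerank}*:*</str>
      <str name="rows">10</str>
    </lst>
  </arr>
</listener>

<!-- 2. Autowarm a handful of entries from the previous searcher's caches. -->
<filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="16"/>
<queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="16"/>
EOF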

I'm a bit confused about the example autowarmCount for the caches, which is
0. Why not set this to something higher? I guess it's a RAM utilization vs.
speed tradeoff? A low number like 16 seems like it'd have minimal impact on
RAM?

Thanks for all the great replies and for everything you do for Solr. I
truly appreciate your efforts.

Mike


Re: Real Time Search and External File Fields

2016-10-08 Thread Mike Lissner
On Sat, Oct 8, 2016 at 8:46 AM Shawn Heisey  wrote:

> > Most soft commit documentation talks about setting up soft commits
> > with a maxTime of about a second.
>
> IMHO any documentation that recommends autoSoftCommit with a maxTime of
> one second is bad documentation, and needs to be fixed.  Where have you
> seen such a recommendation?


You know, I must have made that up, sorry. But the documentation you linked
to (on the Lucid Works blog) and the example file says 15 seconds for hard
commits, so it I think that got me thinking that soft commits could be more
frequent.

Should soft commits be less frequent than hard commits
(openSearcher=false)? If so, I didn't find that to be at all clear.


> right now Solr/Lucene has no
> way of knowing that your external file has not changed, so it must read
> the file every time it builds a searcher.


Is it crazy to file a feature request asking that Solr/Lucene keep the
modtime of this file and only reload it if it has changed? Seems like an easy
win.


>  I doubt this feature was
> designed to deal well with an extremely large external file like yours.
>

Perhaps not. It's probably worth mentioning that part of the reason the
file is so large is that pagerank uses very small, high-precision floats.
So a typical line is:

1=9.50539603222e-08

Not something smaller like:

1=3.2

Pagerank also provides a value for every item in the index, so that makes
the file long. I'd suspect that anybody with a pagerank boosted index of
moderate size would have a similarly-sized file.


> If the info changes that infrequently, can you just incorporate it
> directly into the index with a standard field, with the info coming in
> as a part of your normal indexing process?


We've considered that, but whenever you re-run pagerank, it updates EVERY
value. So I guess we could try updating every doc in our index whenever we
run pagerank, but that's a nasty solution.


> It seems unlikely that Solr would stop serving queries while setting up
> a new searcher.  The old searcher should continue to serve requests
> until the new searcher is ready.  If this is happening, that definitely
> seems like a bug.
>

I'm positive I've observed this, though you're right, some queries still
seem to come through. Is it possible that queries relying on the field are
stopped while the field is loading? I've observed this two ways:

1. From the front end, things were stalling every time I was doing a hard
commit (openSearcher=true). I had hard commits coming in every ten minutes
via cron job, and sure enough, at ten, twenty, thirty...minutes after every
hour, I'd see stalls.

2. Watching the logs, I saw a flood of queries come through after the line:

Loaded external value source external_pagerank

Some queries were coming through before this line, but I think none of
those queries use the external file field (external_pagerank).

Mike


Real Time Search and External File Fields

2016-10-07 Thread Mike Lissner
I have an index of about 4M documents with an external file field
configured to do boosting based on pagerank scores of each document. The
pagerank file is about 93MB as of today -- it's pretty big.
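
For anyone unfamiliar with the feature, an external file field is defined
along these lines in schema.xml (a generic sketch, not our exact schema), and
its values live in a file named external_<fieldname> in the index data
directory, one docid=value line per document:

cat <<'EOF'
<fieldType name="externalFileFloat" keyField="id" defVal="0"
           stored="false" indexed="false"
           class="solr.ExternalFileField" valType="pfloat"/>
<field name="external_pagerank" type="externalFileFloat"/>
EOF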

Each day, I add about 1,000 new documents to the index, and I need them to
be available as soon as possible so that I can send out alerts to our users
about new content (this is Google Alerts, essentially).

Soft commits seem to be exactly the thing for this, but whenever I open a
new searcher (which soft commits seem to do), the external file is
reloaded, and all queries are halted until it finishes loading. When I just
measured, this took about 30 seconds to complete. Most soft commit
documentation talks about setting up soft commits with a maxTime of about a
second.

Is there anything I can do to make the external file field not get reloaded
constantly? It only changes about once a month, and I want to use soft
commits to power the alerts feature.

Thanks,

Mike