EmbeddedSolrServer and BinaryRequestWriter

2010-01-14 Thread Phil Hagelberg

I'm trying to reduce memory usage when indexing, and I see that using
the binary format may be a good way to do this. Unfortunately I can't
see a way to do this using the EmbeddedSolrServer since only the
CommonsHttpSolrServer has a setRequestWriter method. If I'm running out
of memory constructing XML request documents, does that mean I just have
to switch away from the EmbeddedSolrServer?

I understand I can stream requests if I'm just indexing files already on
disk, but I'm constructing documents on the fly, and I run out of memory
while constructing the XML document to submit to Solr, not during the
actual indexing, so it seems writing the documents to disk would run
into the same problems.
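
For concreteness, here's the kind of switch I'm trying to avoid; a
minimal sketch of the HTTP path, assuming a stock SolrJ 1.4 setup (the
URL and field names are made up):

  import org.apache.solr.client.solrj.impl.BinaryRequestWriter;
  import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
  import org.apache.solr.common.SolrInputDocument;

  public class BinaryIndexing {
      public static void main(String[] args) throws Exception {
          // Only the HTTP client exposes setRequestWriter, so using the
          // binary format apparently means leaving the embedded server.
          CommonsHttpSolrServer server =
              new CommonsHttpSolrServer("http://localhost:8983/solr");
          server.setRequestWriter(new BinaryRequestWriter());

          SolrInputDocument doc = new SolrInputDocument();
          doc.addField("id", "1");
          doc.addField("body", "a document constructed on the fly");
          server.add(doc);
          server.commit();
      }
  }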

thanks,
Phil


Date ranges for indexes constructed outside Solr

2009-11-25 Thread Phil Hagelberg

I'm working on an application that will build indexes directly using the
Lucene API, but will expose them to clients using Solr. I'm seeing
plenty of documentation on how to support date range fields in Solr,
but they all assume that you are inserting documents through Solr rather
than merging already-generated indexes.

Where can I find details about the Lucene-level field operations that
can be used to generate date fields that Solr will work with? In
particular, the date resolution settings are unclear.
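
For reference, here's my best guess at the Lucene side, assuming Solr's
DateField just wants its canonical ISO-8601 form in an un-analyzed
indexed field (the field name is made up, and I'd love confirmation that
this is actually what Solr expects):

  import java.text.SimpleDateFormat;
  import java.util.Date;
  import java.util.Locale;
  import java.util.TimeZone;

  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;

  public class DateFields {
      // Guess: yyyy-MM-dd'T'HH:mm:ss'Z', always UTC, one string per term.
      // (SimpleDateFormat isn't thread-safe; fine for a single writer.)
      private static final SimpleDateFormat ISO_8601 =
          new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'", Locale.US);
      static {
          ISO_8601.setTimeZone(TimeZone.getTimeZone("UTC"));
      }

      public static void addTimestamp(Document doc, Date value) {
          // Un-analyzed so range queries compare the raw lexicographic terms.
          doc.add(new Field("timestamp", ISO_8601.format(value),
                            Field.Store.YES, Field.Index.NOT_ANALYZED));
      }
  }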

On a similar note: how much of schema.xml is relevant in cases where
Solr is not performing insertions? Obviously defaultSearchField is, as
is the solrQueryParser defaultOperator attribute, but it seems like
most of the field declarations might not matter.

thanks,
Phil


core size

2009-11-16 Thread Phil Hagelberg

I'm planning out a system with large indexes and wondering what kind
of performance boost I'd see if I split out documents into many cores
rather than using a single core and splitting by a field. I've got about
500GB worth of indexes ranging from 100MB to 50GB each.
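
From the client side, the two designs I'm weighing look roughly like
this (core and field names are hypothetical):

  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

  public class QueryStyles {
      public static void main(String[] args) throws Exception {
          // Option 1: one big core, partitioned by a field via a filter query.
          CommonsHttpSolrServer shared =
              new CommonsHttpSolrServer("http://localhost:8983/solr");
          SolrQuery q = new SolrQuery("some terms");
          q.addFilterQuery("dataset:logs-2009");
          shared.query(q);

          // Option 2: one core per dataset; the partition moves into the
          // core URL and each search only touches that core's smaller index.
          CommonsHttpSolrServer perDataset =
              new CommonsHttpSolrServer("http://localhost:8983/solr/logs-2009");
          perDataset.query(new SolrQuery("some terms"));
      }
  }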

I'm assuming that if we split them out into multiple cores, we would see
the most dramatic benefit in searches on the smaller cores, but I'm just
wondering what level of speedup I should expect. Eventually the cores
will be split up anyway; I'm just trying to determine how to prioritize
it.

thanks,
Phil


Re: no .war with ubuntu release ?

2009-06-18 Thread Phil Hagelberg
On Thu, Jun 18, 2009 at 4:00 PM, Jonathan Vanasco <jvana...@2xlp.com> wrote:
> can anyone give me a suggestion? I haven't touched Java / Jetty /
> Tomcat / whatever in at least a good 8 years and am lost.

I spent a lot of time trying to get this working too. My conclusion
was simply that the .deb packages for Solr are unmaintained and have
fallen victim to bitrot. You'll have a much easier time getting it
from a maven repository or just downloading a binary release.

If it isn't going to be fixed, though, I wish it would be removed from
the Ubuntu repositories; its presence there seems to cause more harm
than good.

-Phil


Re: Replication problems on 1.4

2009-06-16 Thread Phil Hagelberg
Phil Hagelberg <p...@hagelb.org> writes:

> Noble Paul നോബിള്‍  नोब्ळ् <noble.p...@corp.aol.com> writes:
>
>> if you removed the files while the slave is running, then the slave
>> will not know that you removed the files (assuming it is a *nix box)
>> and it will serve the search requests. But if you restart the slave,
>> it should have automatically picked up the current index.
>>
>> if it doesn't it is a bug
>
> I did restart the slave server in my case. If I can confirm this with
> the latest build from trunk, I will submit an issue.

Hmm... I can't reproduce it with a fresh checkout and my indices
recreated from that. Maybe it was something specifically misconfigured
in my last setup.

-Phil


Re: Replication problems on 1.4

2009-06-13 Thread Phil Hagelberg
Noble Paul നോബിള്‍  नोब्ळ् <noble.p...@corp.aol.com> writes:

> if you removed the files while the slave is running, then the slave
> will not know that you removed the files (assuming it is a *nix box)
> and it will serve the search requests. But if you restart the slave,
> it should have automatically picked up the current index.
>
> if it doesn't it is a bug

I did restart the slave server in my case. If I can confirm this with
the latest build from trunk, I will submit an issue.

-Phil


Replication problems on 1.4

2009-06-12 Thread Phil Hagelberg

I'm trying out the replication features on 1.4 (trunk) with multiple
indices using a setup based on the example multicore config.

The first time I tried it (replicating through the admin web
interface), it worked fine. I was a little surprised that telling one
core to replicate caused both to replicate since the docs seem to imply
that replication is done on a per-core basis, but I was happy to see
that it worked.

I wanted to replay my steps, so on the slave machine I deleted
core0/data/* and core1/data/* and restarted the server. I restarted the
server on master just to be sure. Now replication doesn't work at
all. I've tried it both through the admin interface and by curl:

  curl "http://localhost:8983/solr/core0/replication?command=snappull"

The response from curl indicates that the replication was successful,
but nothing happened; my slave index is still empty.

My only guess as to what's going wrong here is that deleting the
coreN/data directory is not a good way to reset a core back to its
initial condition. Maybe there's a bit of state somewhere that's making
the slave think that it's already up-to-date with this master and so it
doesn't need to do any replicating? But this is a wild conjecture; I'd
appreciate any tips on where to look for what's going wrong.

As to why the replication claims to be successful, I've no idea. Am I
missing some crucial log file that explains what's going wrong?
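
For the record, the closest thing I've found to introspection here is
asking the replication handler to describe itself; a SolrJ sketch,
assuming the details command behaves the way the wiki suggests (core
URL matches my setup above):

  import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
  import org.apache.solr.client.solrj.request.QueryRequest;
  import org.apache.solr.common.params.ModifiableSolrParams;
  import org.apache.solr.common.util.NamedList;

  public class ReplicationDetails {
      public static void main(String[] args) throws Exception {
          CommonsHttpSolrServer slave =
              new CommonsHttpSolrServer("http://localhost:8983/solr/core0");
          ModifiableSolrParams params = new ModifiableSolrParams();
          params.set("command", "details");
          QueryRequest req = new QueryRequest(params);
          req.setPath("/replication");
          // Should dump index version, generation, and last-pull status.
          NamedList<Object> details = slave.request(req);
          System.out.println(details);
      }
  }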

It's also possible that this stuff is still in a heavy state of
development such that it shouldn't be expected to work by casual users;
if that is the case, I can go back to the external-script-based
replication features of 1.3.

thanks,
Phil Hagelberg
http://technomancy.us


Re: Replication problems on 1.4

2009-06-12 Thread Phil Hagelberg
Phil Hagelberg <p...@hagelb.org> writes:

> My only guess as to what's going wrong here is that deleting the
> coreN/data directory is not a good way to reset a core back to its
> initial condition. Maybe there's a bit of state somewhere that's making
> the slave think that it's already up-to-date with this master and so it
> doesn't need to do any replicating? But this is a wild conjecture; I'd
> appreciate any tips on where to look for what's going wrong.

OK, so I inserted some more documents into the master, and now
replication works. I get the feeling it may be due to this line in the
master's solrconfig.xml:

  <str name="replicateAfter">commit</str>

Now this is confusing, since it seems that the timing of replication is
not up to the master but up to the slave. The slave's config has
settings for the interval at which to replicate, and you POST to the
slave to force a replication. So why is there a setting on the master to
control when replication happens?

My only interpretation from the config files is that the master has some
sort of "you may not replicate from me unless ..." condition. This seems
pretty undesirable, since you may have a slave that needs to replicate
from the master immediately; it shouldn't have to wait for a commit on
the master. Am I misunderstanding what's going on here? It certainly
isn't clear from the documents on the wiki, so I'm kind of grasping in
the dark. Perhaps I'm missing something.

thanks,
Phil Hagelberg
http://technomancy.us


Re: Replication problems on 1.4

2009-06-12 Thread Phil Hagelberg
Shalin Shekhar Mangar <shalinman...@gmail.com> writes:

> You are right. In Solr/Lucene, a commit exposes updates to searchers. So you
> need to call commit on the master for the slave to pick up the changes.
> Replicating changes from the master and then not exposing new documents to
> searchers does not make sense. However, there is a lot of work going on in
> Lucene to enable near real-time search (exposing documents to searchers as
> soon as possible). Once those features are mature enough, Solr's
> replication will follow suit.

I understand that; it's totally reasonable.

What it doesn't explain is what happened in my case: the master added a
bunch of docs, committed, and then the slave replicated fine. Then the
slave lost all its data (due to me issuing an rm -rf of the data
directory, but let's say it happened due to a disk failure or something)
and tried to replicate again, but got zero docs. Once the master had
another commit issued, the slave could now replicate properly.

I would expect in this case the slave should be able to replicate after
losing its data but before the second commit. I can see why the master
would not expose uncommitted documents, but I can't see why it would
refuse to let _any_ of its existing index be replicated at all.

I feel like I'm missing a piece of the picture here.

-Phil