Re: DataImport using last_indexed_id or getting max(id) quickly

2012-07-11 Thread avenka
Thanks. Can you explain the first TermsComponent option for obtaining
max(id) in more detail? Do I have to modify schema.xml to add a new field? How
exactly do I query for the lowest value of 1 - id?
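
For concreteness, would it be something like the following (assuming the
/terms handler from the example solrconfig.xml), with a hypothetical companion
field, say reverse_id, that stores a reversed value of id so that its lowest
term in index order corresponds to max(id)?

  http://localhost:8983/solr/terms?terms.fl=reverse_id&terms.limit=1&terms.sort=index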



Re: SolrCloud replication question

2012-07-09 Thread avenka
Erick, thanks. I now do see segment files in an index.timestamp directory at 
the replicas. Not sure why they were not getting populated earlier. 

I have a couple more questions, the second is more elaborate - let me know if I 
should move it to a separate thread.

(1) Adding documents in SolrCloud is excruciatingly slow. It takes about 30-50
seconds to add a batch of 100 documents (and about twice that to add 200,
etc.) to the primary, but just ~10 seconds to add 5K documents in batches of
200 on a standalone solr 4 server. The log files indicate that the primary is
timing out with messages like the one below, and the Cloud graph in the UI
shows the other two replicas in orange after starting out green.
  org.apache.solr.client.solrj.SolrServerException: Timeout occured while
waiting response from server at: http://localhost:7574/solr

Any idea why?
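
In case it matters, the batches are added with solrj roughly as in this
schematic sketch (field names and values are placeholders):

  import java.util.ArrayList;
  import java.util.List;

  import org.apache.solr.client.solrj.impl.HttpSolrServer;
  import org.apache.solr.common.SolrInputDocument;

  public class BatchAddTiming {
    public static void main(String[] args) throws Exception {
      // Primary node of the SolrCloud setup from the wiki example.
      HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr");
      List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
      for (int i = 0; i < 100; i++) {          // batch of 100, as in the timings above
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", String.valueOf(i)); // placeholder field names/values
        doc.addField("title", "document " + i);
        batch.add(doc);
      }
      long start = System.currentTimeMillis();
      server.add(batch);                       // the call that takes 30-50 seconds here
      server.commit();
      System.out.println("added " + batch.size() + " docs in "
          + (System.currentTimeMillis() - start) + " ms");
    }
  }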

(2) I am seriously considering using symbolic links for a replicated solr
setup with completely independent instances on a *single machine*. Tell me if
I am thinking about this incorrectly. Here is my reasoning:

(a) Master/slave replication in 3.6 simply seems old school, as it doesn't
have the nice consistency properties of SolrCloud. Polling, say, every 20
seconds means I don't know exactly how up-to-date each replica is, which will
complicate my request re-distribution.

(b) SolrCloud seems like a great alternative to master/slave replication. But
it seems slow (see 1), and having played with it, I don't feel comfortable
with the maturity of the ZK integration (or my comprehension of it) in solr 4
alpha.

(c) Symbolic links seem like the fastest and most space-efficient solution
*provided* there is only a single writer, which is just fine for me. I plan to
run completely separate solr instances with one designated as the primary and
do the following operations in sequence (a sketch of steps 2 and 3 follows
below):
  1. Add a batch to the primary and commit.
  2. From each replica's index directory, remove all symlinks and re-create
     symlinks to the segment files in the primary (but not the write.lock
     file).
  3. Call update?commit=true on each replica to force it to re-load its
     in-memory index.
  4. Do whatever read-only processing is required on the batch using the
     primary and all replicas, manually (randomly) distributing read requests.
  5. Repeat the sequence.
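
Schematically, steps 2 and 3 above would be something like the following
untested sketch (using Java 7's java.nio.file; the paths and the replica URL
are placeholders):

  import java.nio.file.DirectoryStream;
  import java.nio.file.Files;
  import java.nio.file.Path;
  import java.nio.file.Paths;

  import org.apache.solr.client.solrj.impl.HttpSolrServer;

  public class SymlinkRefresh {
    public static void main(String[] args) throws Exception {
      // Placeholder paths for the primary's and one replica's index directories.
      Path primary = Paths.get("/data/solr-primary/data/index");
      Path replica = Paths.get("/data/solr-replica1/data/index");

      // Step 2: remove the replica's existing symlinks ...
      try (DirectoryStream<Path> old = Files.newDirectoryStream(replica)) {
        for (Path p : old) {
          if (Files.isSymbolicLink(p)) {
            Files.delete(p);
          }
        }
      }
      // ... and re-create symlinks to the primary's current segment files,
      // skipping the write.lock file.
      try (DirectoryStream<Path> current = Files.newDirectoryStream(primary)) {
        for (Path p : current) {
          if (!"write.lock".equals(p.getFileName().toString())) {
            Files.createSymbolicLink(replica.resolve(p.getFileName()), p);
          }
        }
      }
      // Step 3: force the replica to re-open its in-memory index.
      new HttpSolrServer("http://localhost:7574/solr").commit();
    }
  }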

Is there any downside to 2(c) (other than maintaining a trivial script to
manage the symlinks and call commit)? I tested it on small index sizes and it
seems to work fine. The throughput improves with more replicas (for 2-4
replicas), as a single replica is not enough to saturate the machine (due to
high query latency). Am I overlooking something in this setup?

Overall, I need high throughput and minimal latency from the time a document is
added to the time it is available at a replica. SolrCloud's automated request
redirection, consistency, and fault-tolerance are awesome for a physically
distributed setup, but I don't see how it beats 2(c) in a single-writer,
single-machine, replicated setup.

AV

On Jul 9, 2012, at 9:43 AM, Erick Erickson [via Lucene] wrote:

 No, you're misunderstanding the setup. Each replica has a complete 
 index. Updates get automatically forwarded to _both_ nodes for a 
 particular shard. So, when a doc comes in to be indexed, it gets 
 sent to the leader for, say, shard1. From there: 
 1) it gets indexed on the leader 
 2) it gets forwarded to the replica(s) where it gets indexed locally. 
 
 Each replica has a complete index (for that shard). 
 
 There is no master/slave setup any more. And you do 
 _not_ have to configure replication. 
 
 Best 
 Erick 
 
 On Sun, Jul 8, 2012 at 1:03 PM, avenka [hidden email] wrote:
 
  I am trying to wrap my head around replication in SolrCloud. I tried the 
  setup at http://wiki.apache.org/solr/SolrCloud/. I mainly need replication 
  for high query throughput. The setup at the URL above appears to maintain 
  just one copy of the index at the primary node (instead of a replicated 
  index as in a master/slave configuration). Will I still get roughly an 
  n-fold increase in query throughput with n replicas? And if so, why would 
  one do master/slave replication with multiple copies of the index at all? 
  

Re: SolrCloud replication question

2012-07-09 Thread avenka
Hmm, never mind my question about replicating using symlinks. Given that
replication on a single machine improves throughput, I should be able to get
a similar improvement by simply sharding on a single machine, as also
observed at:
http://carsabi.com/car-news/2012/03/23/optimizing-solr-7x-your-search-speed/

I am now benchmarking my workload to compare replication vs. sharding
performance on a single machine.



SolrCloud error while propagating update to primary ZK node

2012-07-08 Thread avenka
I get a JSON parse error (pasted below) when I send an update to a replica
node. I downloaded solr 4 alpha, followed the instructions at
http://wiki.apache.org/solr/SolrCloud/, and set up numShards=1 with 3 total
servers managed by a zookeeper ensemble: the primary at 8983 and the other
two at 7574 and 8900 respectively.

The error below shows up in the primary's log when I try to add a document
to either replica. The document add fails. I am able to add documents
successfully by sending them directly to the primary. How do I correctly add
documents to replicas?

SEVERE: org.apache.noggit.JSONParser$ParseException: JSON Parse Error:
char=<,position=0 BEFORE='' AFTER='<add><doc boost=1.0><field name=id>2'
  at org.apache.noggit.JSONParser.err(JSONParser.java:221)
  at org.apache.noggit.JSONParser.next(JSONParser.java:620)
  at org.apache.noggit.JSONParser.nextEvent(JSONParser.java:661)
  at org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.processUpdate(JsonLoader.java:105)
  at org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.load(JsonLoader.java:95)
  at org.apache.solr.handler.loader.JsonLoader.load(JsonLoader.java:59)
  at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
  at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
  at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
  at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:240)
  at org.apache.solr.core.SolrCore.execute(SolrCore.java:1561)
  at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:442)
  at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:263)
  at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1337)
... [snip]
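
Separately, should I instead be using solrj's CloudSolrServer so that updates
are routed via zookeeper? Something like this sketch (assuming the embedded
zookeeper from the wiki example at localhost:9983; the collection name is the
wiki default):

  import org.apache.solr.client.solrj.impl.CloudSolrServer;
  import org.apache.solr.common.SolrInputDocument;

  public class CloudAdd {
    public static void main(String[] args) throws Exception {
      // Talk to the cluster via zookeeper instead of a fixed node.
      CloudSolrServer server = new CloudSolrServer("localhost:9983");
      server.setDefaultCollection("collection1"); // default collection in the wiki example
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", "2");                    // matches the doc in the error above
      server.add(doc);
      server.commit();
    }
  }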





SolrCloud replication question

2012-07-08 Thread avenka
I am trying to wrap my head around replication in SolrCloud. I tried the
setup at http://wiki.apache.org/solr/SolrCloud/. I mainly need replication
for high query throughput. The setup at the URL above appears to maintain
just one copy of the index at the primary node (instead of a replicated
index as in a master/slave configuration). Will I still get roughly an
n-fold increase in query throughput with n replicas? And if so, why would
one do master/slave replication with multiple copies of the index at all?



DataImport using last_indexed_id or getting max(id) quickly

2012-07-08 Thread avenka
My understanding is that the DIH in solr only records last_index_time in
dataimport.properties, but not, say, a last_indexed_id for a primary key 'id'.
How can I efficiently get max(id) ('id' is an auto-increment field in the
database)? Maintaining max(id) outside of solr is brittle, and querying for
max(id) before each dataimport can take several minutes when the index has
several hundred million records.

How can I either import based on ID or get max(id) quickly? I cannot use
timestamp-based imports because I get out-of-memory errors if/when solr falls
behind, and the suggested fixes online did not work for me.
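
For example, it would be ideal if the last indexed id could be passed in as a
request parameter and referenced from the query, along these lines
(hypothetical db-data-config.xml sketch; the table and column names are made
up):

  <entity name="doc"
          query="SELECT id, title, added_at FROM docs
                 WHERE id > ${dataimporter.request.lastid}">

invoked with something like
/dataimport?command=full-import&clean=false&lastid=12345. Is
${dataimporter.request.*} supported for this kind of thing?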



Re: SolrCloud error while propagating update to primary ZK node

2012-07-08 Thread avenka
I tried adding in two ways, with the same outcome: (1) using solrj to call
HttpSolrServer.add(docList) with a BinaryRequestWriter; (2) using
DataImportHandler to import directly from a database through a
db-data-config.xml file.

The document I'm adding has a long primary key id field and a few other string
and timestamp fields. I also added a long _version_ field because the wiki
page above said so. I've been using this schema without problems on 3.6 for a
while, and it works fine when adding to the primary in 4.0.
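
i.e., following the example schema, the field I added is:

  <field name="_version_" type="long" indexed="true" stored="true"/>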

Mark Miller-3 [via Lucene] ml-node+s472066n3993780...@n3.nabble.com wrote:

Can you show us exactly how you are adding the document? 

Eg, what update handler are you using, and what is the document you are adding? 

On Jul 8, 2012, at 12:52 PM, avenka wrote: 


 I get a JSON parse error (pasted below) when I send an update to a replica 
 node. I downloaded solr 4 alpha, followed the instructions at 
 http://wiki.apache.org/solr/SolrCloud/, and set up numShards=1 with 3 total 
 servers managed by a zookeeper ensemble: the primary at 8983 and the other 
 two at 7574 and 8900 respectively. 
 
 The error below shows up in the primary's log when I try to add a document 
 to either replica. The document add fails. I am able to add documents 
 successfully by sending them directly to the primary. How do I correctly add 
 documents to replicas? 
 
 SEVERE: org.apache.noggit.JSONParser$ParseException: JSON Parse Error: 
 char=<,position=0 BEFORE='' AFTER='<add><doc boost=1.0><field name=id>2' 
   at org.apache.noggit.JSONParser.err(JSONParser.java:221) 
   at org.apache.noggit.JSONParser.next(JSONParser.java:620) 
   at org.apache.noggit.JSONParser.nextEvent(JSONParser.java:661) 
   at org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.processUpdate(JsonLoader.java:105) 
   at org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.load(JsonLoader.java:95) 
   at org.apache.solr.handler.loader.JsonLoader.load(JsonLoader.java:59) 
   at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92) 
   at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74) 
   at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129) 
   at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:240) 
   at org.apache.solr.core.SolrCore.execute(SolrCore.java:1561) 
   at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:442) 
   at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:263) 
   at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1337) 
 ... [snip] 
 
 
 


- Mark Miller 
lucidimagination.com 

Re: solr java.lang.NullPointerException on select queries

2012-06-26 Thread avenka
So, I tried 'optimize', but it failed because of lack of space on the first
machine. I then moved the whole thing to a different machine where the index
was pretty much the only thing on disk, using about 37% of it, but optimize
still failed with a "No space left on device" IOException. Also, the size of
the index has since doubled to roughly 74% of the disk on this second
machine, and the number of files has increased from 3289 to 3329. Actually,
even the 3289 files on the first machine were after I had tried optimize on
it once, so the original size must have been even smaller.

I don't think I can afford any more space and am close to giving up and
reclaiming space on the two machines. A couple more questions before that:

1) I am tempted to try editing the binary (the "magnetic needle" option).
Could you elaborate on this? Would there be a way to go back to an index of
the original size from its super-sized current form(s)?

2) Will CheckIndex also need more than twice the space? Would there be a way
to bring the size back down to the original without running 'optimize' if I
try that route? Also, how exactly do I run CheckIndex, e.g., what is the
exact URL I need to hit?
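
(If CheckIndex turns out to be a command-line tool rather than a URL, I am
guessing the invocation would be something like this, run against a copy of
the index directory with solr stopped; the jar name is a guess:

  java -cp lucene-core-3.6.0.jar org.apache.lucene.index.CheckIndex /path/to/solr/data/index

and, as I understand it, the optional -fix flag drops unreadable segments
along with their documents, so it should only ever be run on a copy.)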




Re: solr java.lang.NullPointerException on select queries

2012-06-21 Thread avenka
Erick, much thanks for detailing these options. I am currently trying the
second one, as it seems a little easier and quicker to me.

I successfully deleted documents with IDs after the problem time, which I know
to within a couple of hours. Now, the stats are:
  numDocs : 2132454075
  maxDoc : -2130733352
The former is nicely below 2^31, but I can't seem to get the latter to
decrease and become positive by deleting further.

Should I just run an optimize at this point? I have never manually run an
optimize and plan to just hit
  http://machine_name/solr/update?optimize=true
Can you confirm this?



Re: solr java.lang.NullPointerException on select queries

2012-06-20 Thread avenka
For the first install, I copied over all the files in the 'example' directory
into, let's call it, install1. I did the same for install2. The two installs
run on different ports, use different jar files, and are not really related to
each other in any way as far as I can see. In particular, they are not
multicore. They have the same access control setup via jetty. I did a diff on
the config files and confirmed that only the port numbers differ.

Both had been running fine in parallel, importing from a common database, for
several weeks. The documents indexed by install1, the currently problematic
one, are a vastly bigger (~2.5B) superset of those indexed by install2
(~250M).

At this point, select queries on install1 incur the NullPointerException
irrespective of whether install2 is running or not. The log file looks like it
is indexing normally, as always, and the index is also growing at the usual
rate each day. Just the select queries fail. :(



Re: solr java.lang.NullPointerException on select queries

2012-06-20 Thread avenka
Erick, thanks for pointing that out. I was going to say in my original post
that it is almost as if some limit on max documents got violated all of a
sudden, but the rest of the symptoms didn't seem to quite match. Now that I
think about it, though, the problem probably happened at 2B (corresponding
roughly to the size of the signed int space), as my ID space in the database
has roughly 85% holes and the problem probably happened when the ID hit
around 2.4B.

It is still odd that indexing appears to proceed normally and that the select
queries know which IDs are used: the error happens only for queries with
non-empty results, e.g., searching for an ID that doesn't exist gives a valid
response with numFound=0. Is this because solr uses 'long' or wider for
indexing (given that the schema supports long) but not in the querying
modules?

I hadn't used solr sharding because I really needed rolling partitions,
where I keep a small index of recent documents and throw the rest into a
slow archive index. So maintaining the smaller install2 (usually < 50M docs)
and replicating it if needed was my homebrewed sharding approach. But I
guess it is time to shard the archive after all.

AV



Re: solr java.lang.NullPointerException on select queries

2012-06-20 Thread avenka
Yes, wonky indeed. 
  numDocs : -2006905329
  maxDoc : -1993357870 

And yes, I meant that the holes are in the database auto-increment ID space,
nothing to do with lucene IDs.

I will set up sharding. But is there any way to retrieve most of the current
index? Currently, all select queries, even over ID ranges in the hundreds of
millions, return the NullPointerException. It would suck to lose all of this.
:(



Re: solr java.lang.NullPointerException on select queries

2012-06-20 Thread avenka
Thanks. Do you know if the tons of index files with names like '_zxt.tis' in
the data/index/ directory have the lucene IDs embedded in the binaries? The
files look good to me and are partly readable even though they are binary. I
am wondering if I could just set up a new solr instance, move these index
files there, and hope to use them (or most of them) as-is without shards. If
so, I will just set up a separate sharded index for the documents indexed
henceforth, but won't bother splitting the huge existing index.




Re: solr java.lang.NullPointerException on select queries

2012-06-20 Thread avenka
Erick, thanks for the advice, but let me make sure you haven't misunderstood
what I was asking.

I am not trying to split the huge existing index in install1 into shards. I
am also not trying to make the huge install1 index one shard of a sharded
solr setup. I plan to use a sharded setup only for future docs.

I do want to avoid re-indexing the docs in install1, and would rather think
of it as a slow "tape archive" index server in case I ever need to query
past documents. So I was wondering if I could somehow use the existing
segment files to run an isolated (unsharded) solr server that lets me query
roughly the first 2B docs from before the wraparound problem happened. If the
negative internal doc IDs have pervasively corrupted the segment files, this
would not be possible, but I find it hard to imagine an underlying lucene
design that would cause such a problem. Is my only option to re-index the
past 2B docs if I want to be able to query them at this point, or is there
some way to use the existing segment files?



solr java.lang.NullPointerException on select queries

2012-06-16 Thread avenka
I have recently started getting the error pasted below with solr-3.6 on
/select queries. I don't know of anything that changed in the config that
would start causing this error. I am also running a second independent solr
server on the same machine, which continues to run fine and has the same
configuration as the first one except for the port number. The first one
seems to be doing dataimport operations fine and updating index files as
usual, but fails on select queries. An example of a failing query (that used
to run fine) is:
http://machine_name/solr/select/?q=title%3Afoo&version=2.2&start=0&rows=10&indent=on

I am stupefied. Any ideas?

HTTP ERROR 500

Problem accessing /solr/select/. Reason:

null
java.lang.NullPointerException
  at org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:398)
  at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:186)
  at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
  at org.apache.solr.core.SolrCore.execute(SolrCore.java:1376)
  at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:365)
  at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:260)
  at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
  at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
  at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
  at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
  at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
  at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
  at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
  at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
  at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
  at org.mortbay.jetty.Server.handle(Server.java:326)
  at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
  at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
  at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
  at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
  at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
  at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
  at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)



Rolling partitions with solr shards

2012-05-27 Thread avenka
Is there a simple way to get solr to maintain shards as rolling partitions by
date, e.g., the last day's documents in one shard, the week before that in
the next shard, the month before that in the next, and so on? I really don't
need querying to be fast on the entire index, but it is critical that it be
blazing fast on recent documents.

A related but different question: in which config file can I change the
default hash function that assigns documents to shards? This outdated post
  http://wiki.apache.org/solr/NewSolrCloudDesign
seems to suggest that you can define your own hash functions as well as
assign hash ranges to partitions, but I am not sure whether or how solr 3.6
supports this. For that matter, I don't know whether or how SolrCloud (which
I understand is available only in solr 4) supports this.
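
For what it's worth, if I end up maintaining such shards myself, I imagine
keeping queries on recent documents fast by hitting only the newest shard,
and using the standard distributed-search parameter for full-archive queries,
along these lines (the host names are placeholders):

  http://day-host:8983/solr/select?q=title:foo

  http://day-host:8983/solr/select?q=title:foo&shards=day-host:8983/solr,week-host:8983/solr,month-host:8983/solr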
