Lucene test framework documentation?

2015-01-08 Thread TK Solr
Is there any good document about Lucene Test Framework?
I can only find API docs.
Mimicking the unit test I've found in Lucene trunk, I tried to write
a unit test that tests a TokenFilter I am writing. But it is failing
with an error message like:
java.lang.AssertionError: close() called in wrong state: SETREADER
at 
__randomizedtesting.SeedInfo.seed([2899FF2F02A64CCB:47B7F94117CE7067]:0)
at 
org.apache.lucene.analysis.MockTokenizer.close(MockTokenizer.java:261)
at org.apache.lucene.analysis.TokenFilter.close(TokenFilter.java:58)

During a few rounds of trial and error, I got an error message saying that
the Test Framework JAR has to be before Lucene Core on the classpath. And the
above stack trace indicates that the Test Framework has its own Analyzer
implementation, which makes a certain assumption, but it is not clear what
that assumption is.

This exception was thrown from one of these lines, I believe:
TokenStream ts = deuAna.tokenStream(text, new StringReader(testText));
TokenStreamToDot tstd = new TokenStreamToDot(testText, ts, new PrintWriter(System.out));
ts.close();

(I'm not too sure what TokenStreamToDot is about. I was hoping it would
dump a token stream.)
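For what it's worth, the assumption MockTokenizer enforces is the documented TokenStream consumer contract: reset() before the first incrementToken(), then end(), and only then close(). The snippet above closes the stream while it is still in the SETREADER state (reset() was never called), which is exactly what the assertion reports. A minimal consumer sketch that follows the contract, reusing deuAna, text and testText from above:

```java
TokenStream ts = deuAna.tokenStream(text, new StringReader(testText));
CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
try {
  ts.reset();                   // required: SETREADER -> ready to increment
  while (ts.incrementToken()) { // one token per call
    System.out.println(term.toString());
  }
  ts.end();                     // record end-of-stream attributes
} finally {
  ts.close();                   // legal now that the stream was reset/ended
}
```

If TokenStreamToDot consumes the stream itself, the same reset()/end()/close() ordering still applies around it.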

Kuro





Re: UUIDUpdateProcessorFactory causes repeated documents when uploading csv files?

2015-01-08 Thread jia gu
Problem solved - it's caused by a system outside of Solr. Thank you all for
the prompt replies! :)

On Thu, Jan 8, 2015 at 12:40 PM, Chris Hostetter hossman_luc...@fucit.org
wrote:

 : Thank you for your reply Chris :)  Solr is producing the correct result
 on
 : its own. The problem is that I am calling a dataload class to call Solr,
 : which worked for assigned ID and composite ID, but not for UUID. Is
 there a

 Sorry -- still confused: are you confirming that you've tracked down the
 problem you are having to a system outside of Solr?  that the problem (of
 duplicate documents) is introduced by your dataload class prior to
 sending the docs to Solr?

 : place to delete my question on the mailing list?

 nope - once the emails have gone out, they've gone out -- just replying
 back and confirming the resolution to the problem you saw is good enough.



 -Hoss
 http://www.lucidworks.com/



Re: Lucene test framework documentation?

2015-01-08 Thread Alexandre Rafalovitch
(semi-relevant aside) We do happen to ship this test framework with
Solr distribution (in dist/test-framework).

Why, I don't know!

Regards,
Alex.

Sign up for my Solr resources newsletter at http://www.solr-start.com/


On 8 January 2015 at 23:23, Shawn Heisey apa...@elyograg.org wrote:
 On 1/8/2015 8:31 PM, TK Solr wrote:
 Is there any good document about Lucene Test Framework?
 I can only find API docs.
 Mimicking the unit test I've found in Lucene trunk, I tried to write
 a unit test that tests a TokenFilter I am writing. But it is failing
 with an error message like:
 java.lang.AssertionError: close() called in wrong state: SETREADER
   at
 __randomizedtesting.SeedInfo.seed([2899FF2F02A64CCB:47B7F94117CE7067]:0)
   at 
 org.apache.lucene.analysis.MockTokenizer.close(MockTokenizer.java:261)
   at org.apache.lucene.analysis.TokenFilter.close(TokenFilter.java:58)

 During a few rounds of trial and error, I got an error message that the Test
 Framework
 JAR has to be before Lucene Core. And the above stack trace indicates
 that the Test Framework has its own Analyzer implementation, and it has
 a certain assumption but it is not clear what the assumption is.

 This exception was thrown from one of these lines, I believe:
   TokenStream ts = deuAna.tokenStream(text, new
 StringReader(testText));
   TokenStreamToDot tstd = new TokenStreamToDot(testText, ts, new
 PrintWriter(System.out));
   ts.close();

 (I'm not too sure what TokenStreamToDot is about. I was hoping it would
 dump a token stream.)

 This question is probably more appropriate for the dev list than the
 solr-user list, especially since it has more to do with Lucene than Solr.

 If the javadocs for the classes you want to use are not providing enough
 info, then you may be able to learn more by looking into the tests
 included in the Lucene source code that use the framework classes you'd
 like to try.

 Thanks,
 Shawn



Re: Lucene test framework documentation?

2015-01-08 Thread Shawn Heisey
On 1/8/2015 8:31 PM, TK Solr wrote:
 Is there any good document about Lucene Test Framework?
 I can only find API docs.
 Mimicking the unit test I've found in Lucene trunk, I tried to write
 a unit test that tests a TokenFilter I am writing. But it is failing
 with an error message like:
 java.lang.AssertionError: close() called in wrong state: SETREADER
   at 
 __randomizedtesting.SeedInfo.seed([2899FF2F02A64CCB:47B7F94117CE7067]:0)
   at 
 org.apache.lucene.analysis.MockTokenizer.close(MockTokenizer.java:261)
   at org.apache.lucene.analysis.TokenFilter.close(TokenFilter.java:58)
 
 During a few rounds of trial and error, I got an error message that the Test
 Framework
 JAR has to be before Lucene Core. And the above stack trace indicates
 that the Test Framework has its own Analyzer implementation, and it has
 a certain assumption but it is not clear what the assumption is.
 
 This exception was thrown from one of these lines, I believe:
   TokenStream ts = deuAna.tokenStream(text, new
 StringReader(testText));
   TokenStreamToDot tstd = new TokenStreamToDot(testText, ts, new
 PrintWriter(System.out));
   ts.close();
 
 (I'm not too sure what TokenStreamToDot is about. I was hoping it would
 dump a token stream.)

This question is probably more appropriate for the dev list than the
solr-user list, especially since it has more to do with Lucene than Solr.

If the javadocs for the classes you want to use are not providing enough
info, then you may be able to learn more by looking into the tests
included in the Lucene source code that use the framework classes you'd
like to try.

Thanks,
Shawn



GC tuning question - can improving GC pauses cause indexing to slow down?

2015-01-08 Thread Shawn Heisey
Is it possible that tuning garbage collection to achieve much better
pause characteristics might actually *decrease* index performance?

Rebuilds that I did while still using a tuned CMS config would take
between 5.5 and 6 hours, sometimes going slightly over 6 hours.

A rebuild that I did recently with G1 took 6.82 hours.  A rebuild that I
did yesterday with further tuned G1 settings (which seemed to result in
much smaller pauses than the previous G1 settings) took 8.97 hours, and
that was on slightly faster hardware than the rebuild that took 6.82 hours.

These rebuilds are done with DIH from MySQL.

It seems completely counter-intuitive that settings which show better GC
pause characteristics would result in indexing performance going down
... so can anyone shed light on this, tell me whether I'm out of my mind?

Thanks,
Shawn



Re: GC tuning question - can improving GC pauses cause indexing to slow down?

2015-01-08 Thread Boogie Shafer
In the abstract, it sounds like you are seeing the difference between tuning
for latency vs tuning for throughput.

My hunch would be that you are seeing more (albeit individually quicker) GC
events with your new settings during the rebuild.

I imagine that in most cases a Solr rebuild is relatively rare compared to the
amount of time where lower latency requests are desired. If the rebuild times
are problematic for you, use tunings specific to that workload during the times
you need them, and then switch back to your low-latency settings afterwards. If
you are doing that, you can probably run with a bigger heap temporarily during
the rebuild, as you aren't likely to be fielding queries and don't benefit from
having a larger OS cache available.



Sent from my iPhone

 On Jan 8, 2015, at 20:54, Shawn Heisey apa...@elyograg.org wrote:
 
 Is it possible that tuning garbage collection to achieve much better
 pause characteristics might actually *decrease* index performance?
 
 Rebuilds that I did while still using a tuned CMS config would take
 between 5.5 and 6 hours, sometimes going slightly over 6 hours.
 
 A rebuild that I did recently with G1 took 6.82 hours.  A rebuild that I
 did yesterday with further tuned G1 settings (which seemed to result in
 much smaller pauses than the previous G1 settings) took 8.97 hours, and
 that was on slightly faster hardware than the rebuild that took 6.82 hours.
 
 These rebuilds are done with DIH from MySQL.
 
 It seems completely counter-intuitive that settings which show better GC
 pause characteristics would result in indexing performance going down
 ... so can anyone shed light on this, tell me whether I'm out of my mind?
 
 Thanks,
 Shawn
 


Re: GC tuning question - can improving GC pauses cause indexing to slow down?

2015-01-08 Thread Walter Underwood
I would not be surprised at all. Optimizing for minimum pauses usually 
increases overhead that decreases overall throughput. This is a pretty common 
tradeoff.

For maximum throughput, when you don’t care about pauses, the simplest 
non-concurrent GC is often the best. That might be the right choice for running 
big map-reduce jobs, for example.

Low-pause GCs do lots of extra work in parallel. Some of that work is making 
guesses which get thrown away, or doing “just in case” analysis.

To quote Oracle:

“When you evaluate or tune any garbage collection, there is always a latency
versus throughput trade-off. The G1 GC is an incremental garbage collector with
uniform pauses, but also more overhead on the application threads. The
throughput goal for the G1 GC is 90 percent application time and 10 percent
garbage collection time. When you compare this to Java HotSpot VM's throughput
collector, the goal there is 99 percent application time and 1 percent garbage
collection time.”

http://www.oracle.com/technetwork/articles/java/g1gc-1984535.html
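To make the two ends of that tradeoff concrete, an illustrative pair of HotSpot flag sets might look like the following; the specific values are assumptions for illustration, not tuned recommendations:

```shell
# Illustrative HotSpot flag sets only -- the pause target and ratio values
# are examples, not tuned recommendations.

# Latency-oriented (serving queries): incremental collector, short pause goal.
GC_LATENCY="-XX:+UseG1GC -XX:MaxGCPauseMillis=100"

# Throughput-oriented (e.g. a full rebuild): the parallel "throughput
# collector"; -XX:GCTimeRatio=99 targets roughly 1% of time spent in GC.
GC_THROUGHPUT="-XX:+UseParallelGC -XX:GCTimeRatio=99"
```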

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/


On Jan 8, 2015, at 8:53 PM, Shawn Heisey apa...@elyograg.org wrote:

 Is it possible that tuning garbage collection to achieve much better
 pause characteristics might actually *decrease* index performance?
 
 Rebuilds that I did while still using a tuned CMS config would take
 between 5.5 and 6 hours, sometimes going slightly over 6 hours.
 
 A rebuild that I did recently with G1 took 6.82 hours.  A rebuild that I
 did yesterday with further tuned G1 settings (which seemed to result in
 much smaller pauses than the previous G1 settings) took 8.97 hours, and
 that was on slightly faster hardware than the rebuild that took 6.82 hours.
 
 These rebuilds are done with DIH from MySQL.
 
 It seems completely counter-intuitive that settings which show better GC
 pause characteristics would result in indexing performance going down
 ... so can anyone shed light on this, tell me whether I'm out of my mind?
 
 Thanks,
 Shawn
 



Re: GC tuning question - can improving GC pauses cause indexing to slow down?

2015-01-08 Thread Shawn Heisey
On 1/8/2015 11:05 PM, Boogie Shafer wrote:
 In the abstract, it sounds like you are seeing the difference between tuning 
 for latency vs tuning for throughput 
 
 My hunch would be you are seeing more (albeit individually quicker) GC events 
 with your new settings during the rebuild
 
 I imagine that in most cases a solr rebuild is relatively rare compared to 
 the amount of times where a lower latency request is desired. If the rebuild 
 times are problematic for you, use tunings specific to that workload during 
 the times you need it and then switch back to your low latency settings 
 after. If you are doing that you can probably run with a bigger heap 
 temporarily during the rebuild as you aren't likely to be fielding queries 
 and don't benefit from having a larger OS cache available

Full rebuilds are indeed relatively rare.  Avoiding long pauses and
keeping query latency low are usually a lot more important than how
quickly the index rebuilds.  Quick rebuilds are nice, but not strictly
necessary.

We do incremental updates that start at the top of every minute, unless
an update is already running.  Exactly how long those updates take is of
little importance, unless that time is easier to measure in minutes
rather than seconds.

If I ever find myself in a situation where completing a rebuild as fast
as possible becomes extremely important, does anyone have suggestions
for GC tuning options that will optimize for throughput?

Thanks,
Shawn



Re: How to return child documents with parent

2015-01-08 Thread Mikhail Khludnev
Did you check [child] at
https://cwiki.apache.org/confluence/display/solr/Transforming+Result+Documents
?
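A SolrJ sketch of that approach (a block-join query selects the parents and the [child] transformer re-attaches each parent's children); the doc_type marker field and its values are assumptions about the schema:

```java
// Sketch only -- assumes a marker field "doc_type" whose value is
// "parent" on parent documents; adjust to the real schema.
SolrQuery q = new SolrQuery("{!parent which='doc_type:parent'}");
// [child] nests each parent's children under the parent in the results
q.setFields("*", "[child parentFilter=doc_type:parent limit=100]");
QueryResponse rsp = server.query(q);  // 'server' is an existing SolrServer
for (SolrDocument parent : rsp.getResults()) {
    // documents added via addChildDocuments() come back nested here
    List<SolrDocument> children = parent.getChildDocuments();
}
```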

On Thu, Jan 8, 2015 at 5:53 PM, yliu y...@mathworks.com wrote:

 Hi,

 What is the best way to return both parent document and child documents in
 one query?  I used SolrJ to create a document and added a few child
 documents using addChildDocuments() method and indexed the parent document.
 All documents are indexed successfully (parent and children).

 When I tried to retrieve the parent document along with the child documents,
 I used expand=true&expand.field=_root_.  I was able to get the parent back
 in the result section and the children in the expandedResults section.  Is there
 some other type of query I should use so I can get the child documents back
 as children instead of expanded result?

 Thanks,

 Y



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/How-to-return-child-documents-with-parent-tp4178081.html
 Sent from the Solr - User mailing list archive at Nabble.com.




-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

http://www.griddynamics.com
mkhlud...@griddynamics.com


Re: Determining the Number of Solr Shards

2015-01-08 Thread Nishanth S
Thanks, guys, for your inputs. I would be looking at around 100 TB of total
index size with 5,100 million documents for a period of 30 days before we
purge the indexes. I had estimated it slightly on the higher side of things,
but that's where I feel we would be.

Thanks,
Nishanth

On Wed, Jan 7, 2015 at 7:50 PM, Shawn Heisey apa...@elyograg.org wrote:

 On 1/7/2015 7:14 PM, Nishanth S wrote:
  Thanks Shawn and Walter. Yes, those are 12,000 writes/second. Reads for the
  moment would be in the 1000 reads/second. Guess finding out the right
  number of shards would be my starting point.

 I don't think indexing 12000 docs per second would be too much for Solr
 to handle, as long as you architect the indexing application properly.
 You would likely need to have several indexing threads or processes that
 index in parallel.  Solr is fully thread-safe and can handle several
 indexing requests at the same time.  If the indexing application is
 single-threaded, indexing speed will not reach its full potential.

 Be aware that indexing at the same time as querying will reduce the
 number of queries per second that you can handle.  In an environment
 where both reads and writes are heavy like you have described, more
 shards and/or more replicas might be required.

 For the query side ... even 1000 queries per second is a fairly heavy
 query rate.  You're likely to need at least a few replicas, possibly
 several, to handle that.  The type and complexity of the queries you do
 will make a big difference as well.  To handle that query level, I would
 still recommend only running one shard replica on each server.  If you
 have three shards and three replicas, that means 9 Solr servers.

 How many documents will you have in total?  You said they are about 6KB
 each ... but depending on the fieldType definitions (and the analysis
 chain for TextField types), 6KB might be very large or fairly small.

 Do you have any idea how large the Solr index will be with all your
 documents?  Estimating that will require indexing a significant
 percentage of your documents with the actual schema and config that you
 will use in production.

 If I know how many documents you have, how large the full index will be,
 and can see an example of the more complex queries you will do, I can
 make *preliminary* guesses about the number of shards you might need.  I
 do have to warn you that it will only be a guess.  You'll have to
 experiment to see what works best.

 Thanks,
 Shawn




Re: Determining the Number of Solr Shards

2015-01-08 Thread Jack Krupansky
My final advice would be my standard proof of concept implementation advice
- test a configuration with 10% (or 5%) of the target data size and 10% (or
5%) of the estimated resource requirements (maybe 25% of the estimated RAM)
and see how well it performs.

Take the actual index size and multiply by 10 (or 20 for a 5% load) to get
a closer estimate of total storage required.

If a 10% load fails to perform well with 25% of the total estimated RAM,
then you can be sure that you'll have problems with 10x the data and only
4x the RAM. Increase the RAM for that 10% load until you get acceptable
performance for both indexing and a full range of queries, and then use 10x
that RAM for the 100% load. That's the OS system memory for file caching,
not the total system RAM.
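As a back-of-the-envelope sketch of that scaling rule (the "measured" POC numbers below are invented for illustration; only the multiply-by-1/fraction rule comes from the advice above):

```java
// Sketch of the fractional-POC extrapolation described above. The two
// "measured" inputs are hypothetical, not real benchmark data.
public class PocExtrapolation {
    // A quantity measured on a fractional sample scales by 1/fraction,
    // e.g. a 10% sample is multiplied by 10 (or by 20 for a 5% sample).
    static double scaleToFull(double measuredAtFraction, double fraction) {
        return measuredAtFraction / fraction;
    }

    public static void main(String[] args) {
        double fraction = 0.10;     // indexed 10% of the target corpus
        double pocIndexGb = 120.0;  // observed index size of the sample (assumed)
        double pocCacheGb = 32.0;   // OS file cache that made the POC perform (assumed)

        System.out.printf("estimated full index: %.0f GB%n",
                scaleToFull(pocIndexGb, fraction));
        System.out.printf("estimated OS cache for file caching: %.0f GB%n",
                scaleToFull(pocCacheGb, fraction));
    }
}
```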

-- Jack Krupansky

On Thu, Jan 8, 2015 at 4:55 PM, Nishanth S nishanth.2...@gmail.com wrote:

 Thanks, guys, for your inputs. I would be looking at around 100 TB of total
 index size with 5,100 million documents for a period of 30 days before we
 purge the indexes. I had estimated it slightly on the higher side of things,
 but that's where I feel we would be.

 Thanks,
 Nishanth

 On Wed, Jan 7, 2015 at 7:50 PM, Shawn Heisey apa...@elyograg.org wrote:

  On 1/7/2015 7:14 PM, Nishanth S wrote:
   Thanks Shawn and Walter. Yes, those are 12,000 writes/second. Reads for the
   moment would be in the 1000 reads/second. Guess finding out the right
   number of shards would be my starting point.
 
  I don't think indexing 12000 docs per second would be too much for Solr
  to handle, as long as you architect the indexing application properly.
  You would likely need to have several indexing threads or processes that
  index in parallel.  Solr is fully thread-safe and can handle several
  indexing requests at the same time.  If the indexing application is
  single-threaded, indexing speed will not reach its full potential.
 
  Be aware that indexing at the same time as querying will reduce the
  number of queries per second that you can handle.  In an environment
  where both reads and writes are heavy like you have described, more
  shards and/or more replicas might be required.
 
  For the query side ... even 1000 queries per second is a fairly heavy
  query rate.  You're likely to need at least a few replicas, possibly
  several, to handle that.  The type and complexity of the queries you do
  will make a big difference as well.  To handle that query level, I would
  still recommend only running one shard replica on each server.  If you
  have three shards and three replicas, that means 9 Solr servers.
 
  How many documents will you have in total?  You said they are about 6KB
  each ... but depending on the fieldType definitions (and the analysis
  chain for TextField types), 6KB might be very large or fairly small.
 
  Do you have any idea how large the Solr index will be with all your
  documents?  Estimating that will require indexing a significant
  percentage of your documents with the actual schema and config that you
  will use in production.
 
  If I know how many documents you have, how large the full index will be,
  and can see an example of the more complex queries you will do, I can
  make *preliminary* guesses about the number of shards you might need.  I
  do have to warn you that it will only be a guess.  You'll have to
  experiment to see what works best.
 
  Thanks,
  Shawn
 
 



Re: How large is your solr index?

2015-01-08 Thread Bram Van Dam

On 01/07/2015 05:42 PM, Erick Erickson wrote:

True, and you can do this if you take explicit control of the document
routing, but...
that's quite tricky. You forever after have to send any _updates_ to the same
shard you did the first time, whereas SPLITSHARD will do the right thing.


Hmm. That is a good point. I wonder if there's some kind of middle 
ground here? Something that lets me send an update (or new document) to 
an arbitrary node/shard but which is still routed according to my 
specific requirements? Maybe this can already be achieved by messing 
with the routing?
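One existing middle ground worth checking, sketched here under the assumption that a routing key can be derived from the documents themselves (key and id values below are made up):

```java
// Sketch of Solr's default compositeId routing convention: the part of
// the id before '!' is hashed to pick the shard, so all documents (and
// later updates) sharing a routing key land on the same shard no matter
// which node receives them. Key and id values here are illustrative.
public class RoutableIds {
    static String routableId(String routingKey, String docId) {
        return routingKey + "!" + docId;
    }

    public static void main(String[] args) {
        // e.g. route all of customerA's documents together
        System.out.println(routableId("customerA", "doc123"));
    }
}
```

This only controls shard placement, though; whether it satisfies the custom routing requirements above depends on how the routing key is chosen.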



<snip> there are some components that don't do the right thing in
distributed mode, joins for instance. The list is actually quite small and
is getting smaller all the time.



That's fine. We have a lot of query (pre-)processing outside of Solr. 
It's no problem for us to send a couple of queries to a couple of shards 
and aggregate the result ourselves. It would, of course, be nice if 
everything worked in distributed mode, but at least for us it's not an 
issue. This is a side effect of our complex reporting requirements -- we 
do aggregation, filtering and other magic on data that is partially in 
Solr and partially elsewhere.



Not true if the other shards have had any indexing activity. The commit is
usually forwarded to all shards. If the individual index on a
particular shard is
unchanged then it should be a no-op though.


I think a no-op commit no longer clears the caches either, so that's great.


But the usage pattern here is its own bit of a trap. If all your
indexing is going
to a single shard, then also the entire indexing _load_ is happening on that
shard. So the CPU utilization will be higher on that shard than the older ones.
Since distributed requests need to get a response from every shard before
returning to the client, the response time will be bounded by the response from
the slowest shard and this may actually be slower. Probably only noticeable
when the CPU is maxed anyway though.


This is a very good point. But I don't think SPLITSHARD is the magical
answer here. If you have N shards on N boxes, and they are all getting
nearly full and you decide to split one and move half to a new box,
you'll end up with N-1 nearly full boxes and 2 half-full boxes. What
happens if the disks fill up further? Do I have to split each shard?
That sounds pretty nightmarish!


 - Bram


Re: How large is your solr index?

2015-01-08 Thread Toke Eskildsen
On Wed, 2015-01-07 at 22:26 +0100, Joseph Obernberger wrote:
 Thank you Toke - yes - the data is indexed throughout the day.  We are
 handling very few searches - probably 50 a day; this is an R&D system.

If your searches are in small bundles, you could pause the indexing flow
while the searches are executed, for better performance. 

 Our HDFS cache, I believe, is too small at 10GBytes per shard.

That depends a lot on your corpus, your searches and underlying storage.
But with our current level of information, it is a really good bet:
Having 10GB cache per 130GB (270GB?) data is not a lot with spinning
drives.

 Current parameters for running each shard are:
 JAVA_OPTS=-XX:MaxDirectMemorySize=10g -XX:+UseLargePages -XX:NewRatio=3 
[...]
 -Xmx10752m

One Solr/shard? You could probably win a bit by having one Solr/machine
instead. Anyway, it's quite a high Xmx, but I presume you have measured
the memory needs.

 I'd love to try SSDs, but don't have the budget at present to go that 
 route.

We find the price/performance for SSD + moderate RAM to be quite a
better deal than spinning drives + a lot of RAM, even when buying
enterprise hardware. For consumer SSDs (used in our large server) it is
even cheaper to use SSDs. It all depends on use pattern of course, but
your setup with non-concurrent searches seems like it would fit well.

Note: I am sure that the RAM == index size would deliver very high
performance. With enough RAM you can use tape to hold the index. Whether
it is cost effective is another matter.

 I'd really like to get the HDFS option to work well as it 
 reduces system complexity.

That is very understandable. We examined the option of networked storage
(Isilon) with underlying spindles, and it performed adequately for our
needs up to 2-3TB of index data. Unfortunately the heavy random read
load from Solr meant a noticeable degradation of other services using
the networked storage. I am sure it could be solved with more
centralized hardware, but in the end we found it cheaper and simpler to
use local storage for search. This will of course differ across
organizations and setups.

- Toke Eskildsen




Re: Solr: IndexNotFoundException: no segments* file HdfsDirectoryFactory

2015-01-08 Thread xinwu
Hi, did you solve this problem?
I met the same problem when I set up Solr + HDFS.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-IndexNotFoundException-no-segments-file-HdfsDirectoryFactory-tp4138737p4178034.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: leader split-brain at least once a day - need help

2015-01-08 Thread Thomas Lamy

Hi Alan,
thanks for the pointer, I'll look at our gc logs

Am 07.01.2015 um 15:46 schrieb Alan Woodward:

I had a similar issue, which was caused by 
https://issues.apache.org/jira/browse/SOLR-6763.  Are you getting long GC 
pauses or similar before the leader mismatches occur?

Alan Woodward
www.flax.co.uk


On 7 Jan 2015, at 10:01, Thomas Lamy wrote:


Hi there,

we are running a 3 server cloud serving a dozen 
single-shard/replicate-everywhere collections. The 2 biggest collections are 
~15M docs, and about 13GiB / 2.5GiB size. Solr is 4.10.2, ZK 3.4.5, Tomcat 
7.0.56, Oracle Java 1.7.0_72-b14

10 of the 12 collections (the small ones) get filled by DIH full-import once a
day starting at 1am. The second biggest collection is updated using DIH
delta-import every 10 minutes, the biggest one gets bulk json updates with
commits once in 5 minutes.

On a regular basis, we have a leader information mismatch:
"org.apache.solr.update.processor.DistributedUpdateProcessor; Request says it is
coming from leader, but we are the leader"
or the opposite:
"org.apache.solr.update.processor.DistributedUpdateProcessor; ClusterState says
we are the leader, but locally we don't think so"

One of these pops up once a day at around 8am, making either some cores go into
"recovery failed" state, or all cores of at least one cloud node into state "gone".
This started out of the blue about 2 weeks ago, without changes to neither 
software, data, or client behaviour.

Most of the time, we get things going again by restarting solr on the current 
leader node, forcing a new election - can this be triggered while keeping solr 
(and the caches) up?
But sometimes this doesn't help; we had an incident last weekend where our
admins didn't restart in time, creating millions of entries in
/solr/overseer/queue, making zk close the connection, and leader re-elect
fails. I had to flush zk, and re-upload collection config to get solr up again
(just like in https://gist.github.com/isoboroff/424fcdf63fa760c1d1a7).

We have a much bigger cloud (7 servers, ~50GiB Data in 8 collections, 1500 
requests/s) up and running, which does not have these problems since upgrading 
to 4.10.2.


Any hints on where to look for a solution?

Kind regards
Thomas

--
Thomas Lamy
Cytainment AG & Co KG
Nordkanalstrasse 52
20097 Hamburg

Tel.: +49 (40) 23 706-747
Fax: +49 (40) 23 706-139
Sitz und Registergericht Hamburg
HRA 98121
HRB 86068
Ust-ID: DE213009476






--
Thomas Lamy
Cytainment AG & Co KG
Nordkanalstrasse 52
20097 Hamburg

Tel.: +49 (40) 23 706-747
Fax: +49 (40) 23 706-139

Sitz und Registergericht Hamburg
HRA 98121
HRB 86068
Ust-ID: DE213009476



Re: Solr startup script in version 4.10.3

2015-01-08 Thread Ramkumar R. Aiyengar
Versions 4.10.3 and beyond already use "server" rather than "example", which
still finds a reference in the script purely for back compat. A major
release 5.0 is coming soon; perhaps the back compat can be removed for that.
On 6 Jan 2015 09:30, Dominique Bejean dominique.bej...@eolya.fr wrote:

 Hi,

 In release 4.10.3, the following lines were removed from solr starting
 script (bin/solr)

 # TODO: see SOLR-3619, need to support server or example
 # depending on the version of Solr
  if [ -e "$SOLR_TIP/server/start.jar" ]; then
    DEFAULT_SERVER_DIR="$SOLR_TIP/server"
  else
    DEFAULT_SERVER_DIR="$SOLR_TIP/example"
  fi

  However, the usage message always says

    -d <dir>  Specify the Solr server directory; defaults to server


 Either the usage have to be fixed or the removed lines put back to the
 script.

 Personally, I like the default to server directory.

  My installation process in order to have a clean empty solr instance is to
  copy examples into server and remove directories like example-DIH,
  example-schemaless, multicore and solr/collection1

 Solr server (or node) can be started without the -d parameter.

 If this makes sense, a Jira issue could be open.

 Dominique
 http://www.eolya.fr/



Re: Solr startup script in version 4.10.3

2015-01-08 Thread Anshum Gupta
Things have changed reasonably for the 5.0 release.
In case of a standalone mode, it still defaults to the server directory. So
you'd find your logs in server/logs.
In case of solrcloud mode e.g. if you ran

bin/solr -e cloud -noprompt

this would default to stuff being copied into example directory (leaving
server directory untouched) and everything would run from there.

You will also have the option of just creating a new SOLR home and using
that instead. See the following:

https://cwiki.apache.org/confluence/display/solr/Getting+Started+with+SolrCloud

The link above is for the upcoming Solr 5.0 and is still work in progress
but should give you more information.
Hope that helps.


On Tue, Jan 6, 2015 at 1:29 AM, Dominique Bejean dominique.bej...@eolya.fr
wrote:

 Hi,

 In release 4.10.3, the following lines were removed from solr starting
 script (bin/solr)

 # TODO: see SOLR-3619, need to support server or example
 # depending on the version of Solr
  if [ -e "$SOLR_TIP/server/start.jar" ]; then
    DEFAULT_SERVER_DIR="$SOLR_TIP/server"
  else
    DEFAULT_SERVER_DIR="$SOLR_TIP/example"
  fi

  However, the usage message always says

    -d <dir>  Specify the Solr server directory; defaults to server


 Either the usage have to be fixed or the removed lines put back to the
 script.

 Personally, I like the default to server directory.

  My installation process in order to have a clean empty solr instance is to
  copy examples into server and remove directories like example-DIH,
  example-schemaless, multicore and solr/collection1

 Solr server (or node) can be started without the -d parameter.

 If this makes sense, a Jira issue could be open.

 Dominique
 http://www.eolya.fr/




-- 
Anshum Gupta
http://about.me/anshumgupta


Re: UUIDUpdateProcessorFactory causes repeated documents when uploading csv files?

2015-01-08 Thread jia gu
Thank you for your reply Chris :)  Solr is producing the correct result on
its own. The problem is that I am calling a dataload class to call Solr,
which worked for assigned ID and composite ID, but not for UUID. Is there a
place to delete my question on the mailing list?
Thank you,
Jia

On Wed, Jan 7, 2015 at 8:47 PM, Chris Hostetter hossman_luc...@fucit.org
wrote:


 : It's a single Solr Instance, and in my files, I used 'doc_key'
 everywhere,
 : but I changed it to id in the email I sent out wanting to make it
 easier
 : to read, sorry don't mean to confuse you :)

 https://wiki.apache.org/solr/UsingMailingLists

 - what version of solr?
 - how exactly are you doing the update? curl? post.jar?
 - what exactly is the HTTP response from your update?
 - what does your log file show during the update?
  - what exactly do all of your configs look like (you said you made a
  mistake in your email by trying to make the data easier to read; that
  could easily be masking some other mistake in your actual configs)

 I did my best to try and reproduce what you describe, but i had no
 luck -- here's exactly what i did...


 hossman@frisbee:~/lucene/lucene-4.10.3_tag$ svn diff
 Index: solr/example/solr/collection1/conf/solrconfig.xml
 ===
 --- solr/example/solr/collection1/conf/solrconfig.xml   (revision 1650199)
 +++ solr/example/solr/collection1/conf/solrconfig.xml   (working copy)
 @@ -1076,7 +1076,17 @@
        <str name="update.chain">dedupe</str>
      </lst>
      -->
 +    <lst name="defaults">
 +      <str name="update.chain">autoGenId</str>
 +    </lst>
    </requestHandler>
 +  <updateRequestProcessorChain name="autoGenId">
 +    <processor class="solr.UUIDUpdateProcessorFactory">
 +      <str name="fieldName">id</str>
 +    </processor>
 +    <processor class="solr.LogUpdateProcessorFactory" />
 +    <processor class="solr.RunUpdateProcessorFactory" />
 +  </updateRequestProcessorChain>

    <!-- The following are implicitly added
    <requestHandler name="/update/json" class="solr.UpdateRequestHandler"
 hossman@frisbee:~/lucene/lucene-4.10.3_tag$ curl -X POST \
   'http://localhost:8983/solr/collection1/update?commit=true' \
   -H 'Content-Type: application/csv' --data-binary 'foo_s,bar_s
 aaa,cat
 bbb,dog
 ccc,yak
 '
 <?xml version="1.0" encoding="UTF-8"?>
 <response>
 <lst name="responseHeader"><int name="status">0</int><int name="QTime">350</int></lst>
 </response>
 hossman@frisbee:~/lucene/lucene-4.10.3_tag$ curl \
   'http://localhost:8983/solr/collection1/select?q=*:*&wt=json&indent=true'
 {
   "responseHeader":{
     "status":0,
     "QTime":7,
     "params":{
       "indent":"true",
       "q":"*:*",
       "wt":"json"}},
   "response":{"numFound":3,"start":0,"docs":[
       {
         "foo_s":"aaa",
         "bar_s":"cat",
         "id":"025c69cd-6407-4c70-903b-dfde170d373b",
         "_version_":1489692576651935744},
       {
         "foo_s":"bbb",
         "bar_s":"dog",
         "id":"5c7b3d65-1274-4bad-a671-4d643531e2ae",
         "_version_":1489692576673955840},
       {
         "foo_s":"ccc",
         "bar_s":"yak",
         "id":"25a3893f-c538-4b47-aa79-1f4268d66c39",
         "_version_":1489692576673955841}]
   }}







 -Hoss
 http://www.lucidworks.com/



Solr with Tomcat - enabling SSL problem

2015-01-08 Thread Tali Finelt
Hi,

I am using Solr 4.10.2 with tomcat and embedded Zookeeper.
I followed 
https://cwiki.apache.org/confluence/display/solr/Enabling+SSL#EnablingSSL-SolrCloud
 
to enable SSL.

I am currently doing the following:

Starting tomcat
Running:
../scripts/cloud-scripts/zkcli.sh -zkhost localhost:9983 -cmd put \
/clusterprops.json '{"urlScheme":"https"}'
Restarting tomcat
Accessing Solr from my client using 
org.apache.solr.client.solrj.impl.CloudSolrServer.

And this works. 
If I don't restart tomcat again after running zkcli.sh, I get the 
following error:
IOException occured when talking to server at: 
http://ip:port/solr/ (http, not https).

Is it possible to do this without the second restart?
Thanks,
Tali


Re: How large is your solr index?

2015-01-08 Thread Shawn Heisey
On 1/8/2015 9:39 AM, Joseph Obernberger wrote:
 Yes - it would be 20GBytes of cache per 270GBytes of data.

That's not a lot of cache.  One rule of thumb is that you should have at
least 50% of the index size available as cache, with 100% being a lot
better.  The caching should happen on the Solr server itself so there
isn't a network bottleneck.  This is one of several reasons why local
storage on regular filesystems is preferred for Solr.

 We've tried lower Xmx but we get OOM errors during faceting of large
 datasets.  Right now we're running two JVMs per physical box (2 shards
 per box), but we're going to be changing that to on JVM and one shard
 per box.

This wiki page has some info on what can cause high heap requirements
and some general ideas for what you can do about it:

http://wiki.apache.org/solr/SolrPerformanceProblems#Java_Heap

If you want to discuss your specific situation, we can use the list,
direct email, or the #solr IRC channel.

http://wiki.apache.org/solr/IRCChannels

Thanks,
Shawn



RE: Solr Cloud, 100 shards, shards progressively become slower

2015-01-08 Thread Andrew Butkus

 Extrapolating what Jack was saying on his reply ... with 100 shards and
 4 replicas, you have 400 cores that are each about 2.8GB.  That results in a 
 total index size of just over a terabyte, with 140GB of index data on each of 
 the eight servers.

 Assuming you have only one Solr instance per server, an ideal setup would 
 have enough RAM for that 140GB of index plus the 16GB max heap, so 156GB of 
 RAM.  Because the ideal setup is rarely a strict requirement unless the query 
 load is high, if you have 128 GB of RAM per server, then I would not be 
 worried about performance.  If you have less than that, then I would be 
 worried.

We have less than this :/ :( - with not much likelihood of an upgrade anytime 
soon. Just out of curiosity: if performance is proportional to RAM, why am I 
seeing such good query times for the initial shard queries? (They are all under 
100ms.)

 The behavior with the same shard listed multiple times is a little strange.  
 That behavior could indicate problems with garbage collection pauses -- as 
 Solr is building the memory structures necessary to compose the final 
 response, it might fill up one of the heap generations to its current size 
 limit and each subsequent allocation might require a significant garbage 
 collection, stopping the world while it happens, but not freeing up any 
 significant amount of memory in that particular heap generation.

 Have you tuned your garbage collection?  If not, that is a likely suspect.  
 If you run with the latest Oracle Java, you can use my settings and probably 
 see good GC performance:

 https://wiki.apache.org/solr/ShawnHeisey

 Further down on the page is a good set of CMS parameters for earlier Java 
 versions, if you can't run the latest.

We will look into this, thank you. If it can decrease the last few shards' 
qtime, then we should still see reasonable speeds (perhaps not the fastest if a 
shard has to load from disk, but hopefully faster than the 50 seconds we have 
been seeing).

The weird thing is, if I query each shard individually with distrib=false, the 
query time never goes over 100ms (I concurrently hammer one shard like I did in 
the test in my previous email, but without using shards=, and I never get a 
query over 100ms) ... which leads me to believe there is some bottleneck with 
the distrib=/shards= parameters.
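One way to script that per-shard sanity check is sketched below. The host and core names are illustrative (taken from the trace in this thread) and would need adjusting; the live curl call is left commented out.

```shell
# Query a few cores directly with distrib=false and report each core's own
# response time; hosts/core names are illustrative, from the trace above.
HOST="1.1.1.16:8985"
for CORE in Collection_shard78_replica2 Collection_shard24_replica2; do
  URL="http://${HOST}/solr/${CORE}/select?q=*:*&rows=0&distrib=false"
  echo "$URL"
  # Uncomment against a live cluster:
  # curl -s -o /dev/null -w "${CORE}: %{time_total}s\n" "$URL"
done
```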


Re: UUIDUpdateProcessorFactory causes repeated documents when uploading csv files?

2015-01-08 Thread Chris Hostetter
: Thank you for your reply Chris :)  Solr is producing the correct result on
: its own. The problem is that I am calling a dataload class to call Solr,
: which worked for assigned ID and composite ID, but not for UUID. Is there a

Sorry -- still confused: are you confirming that you've tracked down the 
problem you are having to a system outside of Solr?  that the problem (of 
duplicate documents) is introduced by your dataload class prior to 
sending the docs to Solr?

: place to delete my question on the mailing list?

nope - once the emails have gone out, they've gone out -- just replying 
back and confirming the resolution to the problem you saw is good enough.



-Hoss
http://www.lucidworks.com/


RE: Solr Cloud, 100 shards, shards progressively become slower

2015-01-08 Thread Toke Eskildsen
Andrew Butkus [andrew.but...@c6-intelligence.com] wrote:

[Shawn/Jack: Ideal amount of RAM]

 Have less than this :/ :( - with not much likelihood to upgrade anytime soon

The right amount of RAM is what satisfies your requirements and is tightly 
correlated to the speed of your underlying storage. We have yet to build a 
machine with anywhere near the same amount of RAM as index size, and do have 
requirements of hundreds of searches/second on two of them.

 - just out of curiosity, if the performance is proportional to the RAM, why
 am I seeing such good query times for the initial shard queries? (they are
 all under 100ms).

That is the real mystery here and does not seem to be related to overall 
performance. Guessing wildly: maybe the last reported time is the total time 
spent?

As your test works when you specify the same shard over and over again, perhaps 
you could specify the same shard A 30 times, followed by shard B 1 time and see 
if shard B reports a QTime of 100ms or 50,000ms?
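A quick way to build the shards parameter for that experiment is sketched below; the replica addresses are illustrative placeholders, and the actual query is left as a commented-out curl.

```shell
# Build a shards= value naming shard A 30 times and shard B once, so that
# shard B's reported QTime can be compared against the total elapsed time.
SHARD_A="1.1.1.16:8985/solr/Collection_shard78_replica2"
SHARD_B="1.1.1.16:8985/solr/Collection_shard24_replica2"
SHARDS="$(printf "${SHARD_A},%.0s" $(seq 1 30))${SHARD_B}"
echo "shards=${SHARDS}"
# Then: curl ".../select?q=*:*&rows=0&shards.info=true&shards=${SHARDS}"
```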

- Toke Eskildsen


Re: Solr Cloud, 100 shards, shards progressively become slower

2015-01-08 Thread Shawn Heisey
On 1/8/2015 8:57 AM, Andrew Butkus wrote:
 We have 4gb usage (because the shards are split by 100 each shard is approx. 
 2.8gb on disk), we have allocated 14gb min and 16gb max of ram to solr, so it 
 has plenty to use (the ram in the dashboard never goes above about 8gb - so 
 still plenty ).

 I've managed to reproduce the issue with shards= parameter, and think I have 
 proven the disk cache issue to not be the problem 

 I'm simply querying the same shard, on the same server, multiple times (so 
 the shards index should always be in memory and never loaded from disk)?

 All but the last query are low ms ... 

Extrapolating what Jack was saying on his reply ... with 100 shards and
4 replicas, you have 400 cores that are each about 2.8GB.  That results
in a total index size of just over a terabyte, with 140GB of index data
on each of the eight servers.

Assuming you have only one Solr instance per server, an ideal setup
would have enough RAM for that 140GB of index plus the 16GB max heap, so
156GB of RAM.  Because the ideal setup is rarely a strict requirement
unless the query load is high, if you have 128 GB of RAM per server,
then I would not be worried about performance.  If you have less than
that, then I would be worried.

The behavior with the same shard listed multiple times is a little
strange.  That behavior could indicate problems with garbage collection
pauses -- as Solr is building the memory structures necessary to compose
the final response, it might fill up one of the heap generations to its
current size limit and each subsequent allocation might require a
significant garbage collection, stopping the world while it happens, but
not freeing up any significant amount of memory in that particular heap
generation.

Have you tuned your garbage collection?  If not, that is a likely
suspect.  If you run with the latest Oracle Java, you can use my
settings and probably see good GC performance:

https://wiki.apache.org/solr/ShawnHeisey

Further down on the page is a good set of CMS parameters for earlier
Java versions, if you can't run the latest.

Thanks,
Shawn
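For readers following the wiki pointer above: CMS-style JVM tuning for Solr generally takes the shape of the fragment below. The flag values are common starting points, not Shawn's exact maintained settings - consult the linked wiki page for those.

```shell
# Illustrative CMS garbage-collection flags for a Solr JVM (values are
# common starting points, not tuned recommendations for any specific index).
JAVA_OPTS="$JAVA_OPTS \
  -XX:+UseConcMarkSweepGC \
  -XX:+UseParNewGC \
  -XX:CMSInitiatingOccupancyFraction=70 \
  -XX:+UseCMSInitiatingOccupancyOnly \
  -XX:+CMSParallelRemarkEnabled \
  -XX:+ParallelRefProcEnabled"
echo "$JAVA_OPTS"
```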



Re: Solr on HDFS in a Hadoop cluster

2015-01-08 Thread Charles VALLEE
Thanks a lot Otis,

While reading the SolrCloud documentation to understand how SolrCloud 
could run on HDFS, I got confused by leaders, replicas, non-replica 
shards, cores, indexes, and collections.
First it is specified that one cannot add shards, then that one can add 
replica-only shards, and then the last Shard Splitting paragraph states 
that something changed starting with Solr 4.3.
But it doesn't state whether splitting a shard can end in a new non-replica 
shard on a just-added node, thus increasing the amount of storage 
available to the index / collection. It states that the split action 
effectively makes two copies of the data as new shards, which 
sounds a lot like replica-style shards.
So does it?
Could there be some sort of tutorial describing how to add storage 
capacity to an index / collection, i.e. adding a node / shard / core 
that one can send new documents to for indexing? (Of course, load balancing 
would be triggered, so it looks like documents would be added to shards out 
of a set of nodes.)
Thanks,



 
Charles VALLEE
Centre de compétence Big data
EDF – DSP - CSP IT-O
DATACENTER - Expertise en Energie Informatique (EEI)
32 avenue Pablo Picasso
92000 Nanterre
 
charles.val...@edf.fr
Tél. : + (0) 1 78 66 69 81





From: otis.gospodne...@gmail.com
To: solr-user@lucene.apache.org
Date: 06/01/2015 18:55
Subject: Re: Solr on HDFS in a Hadoop cluster



Oh, and https://issues.apache.org/jira/browse/SOLR-6743

Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/


On Tue, Jan 6, 2015 at 12:52 PM, Otis Gospodnetic 
otis.gospodne...@gmail.com wrote:

 Hi Charles,

 See http://search-lucene.com/?q=solr+hdfs and
 https://cwiki.apache.org/confluence/display/solr/Running+Solr+on+HDFS

 Otis
 --
 Monitoring * Alerting * Anomaly Detection * Centralized Log Management
  Solr & Elasticsearch Support * http://sematext.com/


 On Tue, Jan 6, 2015 at 11:02 AM, Charles VALLEE charles.val...@edf.fr
 wrote:

 I am considering using *Solr* to extend *Hortonworks Data Platform*
 capabilities to search.

 - I found tutorials to index documents into a Solr instance from 
*HDFS*,
 but I guess this solution would require a Solr cluster distinct to the
 Hadoop cluster. Is it possible to have a Solr integrated into the 
Hadoop
 cluster instead? - *With the index stored in HDFS?*

 - Where would the processing take place (could it be handed down to
 Hadoop)? Is there a way to garantee a level of service (CPU, RAM) - to
 integrate with *Yarn*?

 - What about *SolrCloud*: what does it bring regarding Hadoop based
 use-cases? Does it stand for a Solr-only cluster?

 - Well, if that could lead to something working with a roles-based
 authorization-compliant *Banana*, it would be Christmass again!

 Thanks a lot for any help!

 Charles







Re: How large is your solr index?

2015-01-08 Thread Erick Erickson
bq: you'll end up with N-2 nearly full boxes and 2 half-full boxes.

True, you'd have to repeat the process N times. At that point, though,
as Shawn mentions it's often easier to just re-index the whole thing.

Do note that one strategy is to create more shards than you need at
the beginning. Say you determine that 10 shards will work fine, but
you expect to grow your corpus by 2x. _Start_ with 20 shards
(multiple shards can be hosted in the same JVM, no problem; see
maxShardsPerNode in the collections API CREATE action). Then,
as your corpus grows, you can move the shards to their own
boxes.

This just kicks the can down the road, of course; if your corpus grows
by 5x instead of 2x, you're back to this discussion...

Best,
Erick
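As a concrete sketch of the oversharding idea, the Collections API call could look like the following. Collection name and counts are made up for illustration; the block only builds the URL, with the live call left as a comment.

```shell
# 20 shards on 4 nodes from day one: maxShardsPerNode=5 lets each node
# host 5 shards until the corpus grows and shards move to their own boxes.
SOLR="localhost:8983"
URL="http://${SOLR}/solr/admin/collections?action=CREATE&name=mycollection&numShards=20&replicationFactor=1&maxShardsPerNode=5"
echo "$URL"
# Run with: curl "$URL"
```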

On Thu, Jan 8, 2015 at 7:08 AM, Shawn Heisey apa...@elyograg.org wrote:
 On 1/8/2015 4:37 AM, Bram Van Dam wrote:
 Hmm. That is a good point. I wonder if there's some kind of middle
 ground here? Something that lets me send an update (or new document) to
 an arbitrary node/shard but which is still routed according to my
 specific requirements? Maybe this can already be achieved by messing
 with the routing?

 snip

 That's fine. We have a lot of query (pre-)processing outside of Solr.
 It's no problem for us to send a couple of queries to a couple of shards
 and aggregate the result ourselves. It would, of course, be nice if
 everything worked in distributed mode, but at least for us it's not an
 issue. This is a side effect of our complex reporting requirements -- we
 do aggregation, filtering and other magic on data that is partially in
 Solr and partially elsewhere.

 SolrCloud, when you do fully automatic document routing, does handle
 everything for you.  You can query any node and send updates to any
 node, and they will end up in the right place.  There is currently a
 strong caveat: Indexing performance sucks when updates are initially
 sent to the wrong node.  The performance hit is far larger than we
 expected it to be, so there is an issue in Jira to try and make that
 better.  No visible work has been done on the issue yet:

 https://issues.apache.org/jira/browse/SOLR-6717

 The Java client (SolrJ, specifically CloudSolrServer) sends all updates
 to the correct nodes, because it can access the clusterstate and knows
 where updates need to go and where the shard leaders are.

 This is a very good point. But I don't think SPLITSHARD is the magical
 answer here. If you have N shards on N boxes, and they are all getting
 nearly full and you decide to split one and move half to a new box,
 you'll end up with N-2 nearly full boxes and 2 half-full boxes. What
 happens if the disks fill up further? Do I have to split each shard?
 That sounds pretty nightmareish!

 Planning ahead for growth is critical with SolrCloud, but there is
 something you can do if you discover that you need to radically
 re-shard:  Create a whole new collection with the number of shards you
 want, likely using the original set of Solr servers plus some new ones.
  Rebuild the index into that collection.  Delete the old collection, and
 create a collection alias pointing the original name at the new
 collection.  The alias will work for both queries and updates.

 Thanks,
 Shawn
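The rebuild-then-alias fallback described in the quoted message can be sketched with the Collections API as below; collection names are hypothetical, and only the URLs are built here.

```shell
# After rebuilding into "products_v2": delete the old collection first,
# then alias the original name to the new collection.
SOLR="localhost:8983"
DELETE_OLD="http://${SOLR}/solr/admin/collections?action=DELETE&name=products"
CREATE_ALIAS="http://${SOLR}/solr/admin/collections?action=CREATEALIAS&name=products&collections=products_v2"
echo "$DELETE_OLD"
echo "$CREATE_ALIAS"
# Queries and updates against "products" then hit products_v2.
```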



Re: Solr Cloud, 100 shards, shards progressively become slower

2015-01-08 Thread Shawn Heisey
On 1/8/2015 7:26 AM, Andrew Butkus wrote:
 Hi, we have 8 solr servers, split 4x4 across 2 data centers.
 
 We have a collection of around ½ billion documents, split over 100 shards, 
 each is replicated 4 times on separate nodes (evenly distributed across both 
 data centers).
 
 The problem we have is that when we use cursormark (and also when we don't 
 use cursormark the pattern below is the same but just shorter in time) the 
 time it takes to query each shard gets progressively longer when distrib=true 
 , I have tried to query shards directly (with shards=) and select my own 
 shards to query to see if it was a bandwidth bottleneck and the performance 
 is normal / fine - when using pre-defined shards.
 
 Does anyone know why the shards become progressively slower when 
 distrib=true? Or any suggestions on how I can fix, or how to debug the 
 problem further?
 
 I have monitored the performance of CPU and it never goes above 10% on each 
 server, so its not cpu, also the memory usage is about 4gb out of 16gb so its 
 not a memory issue either.
 
 I have tried all shard shuffling strategies incase it was a bottleneck at a 
 server being over used but as above, the cpu never goes above 10%, and when I 
 use shards= there are never any querytime bottlenecks.

The part about memory usage is not clear.  That 4GB and 16GB could refer
to the operating system view of memory, or the view of memory within the
JVM.  I'm curious about how much total RAM each machine has, how large
the Java heap is, and what the total size of the indexes that live on
each machine is.

Even if they are individually very small, 500 million documents will
result in a very large index, so I'm guessing that you don't have enough
RAM on each server for your index size.

What can happen with a highly sharded index that is too large for
available RAM:  Index data for the initial queries gets read from the OS
disk cache, but as those queries run, the information required for the
shards that come later in the distributed query gets pushed out of the
disk cache, so Solr must actually read the disk to do those later
queries.  Disks are slow, so if the machine has to actually read from
the disk, Solr will be slow.

http://wiki.apache.org/solr/SolrPerformanceProblems#RAM

Thanks,
Shawn



RE: Solr Cloud, 100 shards, shards progressively become slower

2015-01-08 Thread Andrew Butkus
Hi, we have 8 solr servers, split 4x4 across 2 data centers.

We have a collection of around ½ billion documents, split over 100 shards, each 
is replicated 4 times on separate nodes (evenly distributed across both data 
centers).

The problem we have is that when we use cursorMark (the pattern below is the 
same without cursorMark, just shorter in time), the time it takes to query each 
shard gets progressively longer when distrib=true. I have tried querying shards 
directly (with shards=), selecting my own shards to query to see if it was a 
bandwidth bottleneck, and the performance is normal / fine when using 
pre-defined shards.

Does anyone know why the shards become progressively slower when distrib=true? 
Or any suggestions on how I can fix, or how to debug the problem further?

I have monitored the CPU and it never goes above 10% on each server, so it's 
not CPU; also the memory usage is about 4GB out of 16GB, so it's not a memory 
issue either.

I have tried all shard shuffling strategies in case it was a bottleneck of a 
server being overused, but as above, the CPU never goes above 10%, and when I 
use shards= there are never any query-time bottlenecks.

Around 

http://2.2.213:8985/solr/Collection_shard16_replica1/|http://1.1.1.16:8985/solr/Collection_shard16_replica2/|http://1.1.1.17:8985/solr/Collection_shard16_replica3/|http://2.2.216:8985/solr/Collection_shard16_replica4/:{
  numFound:242899,
  maxScore:null,
  shardAddress:http://1.1.1.17:8985/solr/Collection_shard16_replica3;,
  time:134},

The timings get progressively worse. (There is a pattern: the time it takes to 
run queries on shards increasingly worsens after about the first 60 entries, 
even though the earlier ones took only a few milliseconds.)

Here is my trace output

{
  responseHeader:{
    status:0,
    QTime:50093,
    params:{
  shard.shuffling.strategy:query,
  sort:id ASC,
  indent:true,
  q:spec_country:\United Kingdom\,
  shards.info:true,
  distrib:true,
  cursorMark:*,
  wt:json,
  rows:0}},
  shards.info:{
    
http://1.1.1.17:8985/solr/Collection_shard78_replica3/|http://2.2.216:8985/solr/Collection_shard78_replica4/|http://2.2.213:8985/solr/Collection_shard78_replica1/|http://1.1.1.16:8985/solr/Collection_shard78_replica2/:{
  numFound:243009,
  maxScore:null,
  shardAddress:http://1.1.1.16:8985/solr/Collection_shard78_replica2;,
  time:24},
    
http://2.2.213:8985/solr/Collection_shard24_replica1/|http://1.1.1.16:8985/solr/Collection_shard24_replica2/|http://1.1.1.17:8985/solr/Collection_shard24_replica3/|http://2.2.216:8985/solr/Collection_shard24_replica4/:{
  numFound:242309,
  maxScore:null,
  shardAddress:http://1.1.1.16:8985/solr/Collection_shard24_replica2;,
  time:23},
    
http://1.1.1.17:8985/solr/Collection_shard70_replica3/|http://2.2.213:8985/solr/Collection_shard70_replica1/|http://1.1.1.16:8985/solr/Collection_shard70_replica2/|http://2.2.216:8985/solr/Collection_shard70_replica4/:{
 numFound:242727,
  maxScore:null,
  shardAddress:http://1.1.1.16:8985/solr/Collection_shard70_replica2;,
  time:23},
    
http://2.2.216:8985/solr/Collection_shard76_replica4/|http://1.1.1.17:8985/solr/Collection_shard76_replica3/|http://2.2.213:8985/solr/Collection_shard76_replica1/|http://1.1.1.16:8985/solr/Collection_shard76_replica2/:{
  numFound:243324,
  maxScore:null,
  shardAddress:http://2.2.216:8985/solr/Collection_shard76_replica4;,
  time:26},
    
http://2.2.214:8985/solr/Collection_shard29_replica1/|http://1.1.1.18:8985/solr/Collection_shard29_replica2/|http://1.1.1.15:8985/solr/Collection_shard29_replica3/|http://2.2.215:8985/solr/Collection_shard29_replica4/:{
  numFound:242559,
  maxScore:null,
  shardAddress:http://2.2.214:8985/solr/Collection_shard29_replica1;,
  time:25},
    
http://1.1.1.17:8985/solr/Collection_shard74_replica3/|http://2.2.213:8985/solr/Collection_shard74_replica1/|http://2.2.216:8985/solr/Collection_shard74_replica4/|http://1.1.1.16:8985/solr/Collection_shard74_replica2/:{
  numFound:242419,
  maxScore:null,
  shardAddress:http://2.2.216:8985/solr/Collection_shard74_replica4;,
  time:24},
    
http://1.1.1.18:8985/solr/Collection_shard33_replica2/|http://1.1.1.15:8985/solr/Collection_shard33_replica3/|http://2.2.214:8985/solr/Collection_shard33_replica1/|http://2.2.215:8985/solr/Collection_shard33_replica4/:{
  numFound:242571,
  maxScore:null,
  shardAddress:http://2.2.214:8985/solr/Collection_shard33_replica1;,
  time:25},
    
http://1.1.1.18:8985/solr/Collection_shard77_replica2/|http://2.2.215:8985/solr/Collection_shard77_replica4/|http://1.1.1.15:8985/solr/Collection_shard77_replica3/|http://2.2.214:8985/solr/Collection_shard77_replica1/:{
  numFound:242901,
  maxScore:null,
  shardAddress:http://2.2.215:8985/solr/Collection_shard77_replica4;,
  time:27},
    

How to return child documents with parent

2015-01-08 Thread yliu
Hi,

What is the best way to return both parent document and child documents in
one query?  I used SolrJ to create a document and added a few child
documents using addChildDocuments() method and indexed the parent document. 
All documents are indexed successfully (parent and children).  

When I tried to retrieve the parent document along with the child documents, 
I used expand=true&expand.field=_root_.  I was able to get the parent back
in the result section and the children in the expandedResults section.  Is there
some other type of query I should use so I can get the child documents back
as children instead of as expanded results?

Thanks,

Y
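One alternative worth trying (a sketch only - the field names here are hypothetical, and this assumes Solr 4.9+ where the [child] doc transformer is available) is a block-join parent query combined with the child transformer, instead of expand:

```shell
# Match parents via a block-join query and attach their children inline.
PARENT_FILTER='is_parent:true'   # hypothetical field marking parent docs
Q='{!parent which="'"$PARENT_FILTER"'"}child_name:foo'
FL='*,[child parentFilter='"$PARENT_FILTER"']'
echo "q=${Q}"
echo "fl=${FL}"
# curl -G "http://localhost:8983/solr/collection1/select" \
#      --data-urlencode "q=${Q}" --data-urlencode "fl=${FL}"
```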



--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-to-return-child-documents-with-parent-tp4178081.html
Sent from the Solr - User mailing list archive at Nabble.com.


Solr Cloud, 100 shards, shards progressively become slower

2015-01-08 Thread Andrew Butkus
Hi, we have 8 solr servers, split 4x4 across 2 data centers.



We have a collection of around ½ billion documents, split over 100 shards, each 
is replicated 4 times on separate nodes (evenly distributed across both data 
centers).



The problem we have is that when we use cursorMark (the pattern below is the 
same without cursorMark, just shorter in time), the time it takes to query each 
shard gets progressively longer when distrib=true. I have tried querying shards 
directly (with shards=), selecting my own shards to query to see if it was a 
bandwidth bottleneck, and the performance is normal / fine when using 
pre-defined shards.



Does anyone know why the shards become progressively slower when distrib=true? 
Or any suggestions on how I can fix, or how to debug the problem further?



I have monitored the CPU and it never goes above 10% on each server, so it's 
not CPU; also the memory usage is about 4GB out of 16GB, so it's not a memory 
issue either.



I have tried all shard shuffling strategies in case it was a bottleneck of a 
server being overused, but as above, the CPU never goes above 10%, and when I 
use shards= there are never any query-time bottlenecks.



Around



http://2.2.213:8985/solr/Collection_shard16_replica1/|http://1.1.1.16:8985/solr/Collection_shard16_replica2/|http://1.1.1.17:8985/solr/Collection_shard16_replica3/|http://2.2.216:8985/solr/Collection_shard16_replica4/:{

  numFound:242899,

  maxScore:null,

  shardAddress:http://1.1.1.17:8985/solr/Collection_shard16_replica3;,

  time:134},



The timings get progressively worse. (There is a pattern: the time it takes to 
run queries on shards increasingly worsens after about the first 60 entries, 
even though the earlier ones took only a few milliseconds.)



Here is my trace output



{

  responseHeader:{

status:0,

QTime:50093,

params:{

  shard.shuffling.strategy:query,

  sort:id ASC,

  indent:true,

  q:spec_country:\United Kingdom\,

  shards.info:true,

  distrib:true,

  cursorMark:*,

  wt:json,

  rows:0}},

  shards.info:{


http://1.1.1.17:8985/solr/Collection_shard78_replica3/|http://2.2.216:8985/solr/Collection_shard78_replica4/|http://2.2.213:8985/solr/Collection_shard78_replica1/|http://1.1.1.16:8985/solr/Collection_shard78_replica2/:{

  numFound:243009,

  maxScore:null,

  shardAddress:http://1.1.1.16:8985/solr/Collection_shard78_replica2;,

  time:24},


http://2.2.213:8985/solr/Collection_shard24_replica1/|http://1.1.1.16:8985/solr/Collection_shard24_replica2/|http://1.1.1.17:8985/solr/Collection_shard24_replica3/|http://2.2.216:8985/solr/Collection_shard24_replica4/:{

  numFound:242309,

  maxScore:null,

  shardAddress:http://1.1.1.16:8985/solr/Collection_shard24_replica2;,

  time:23},


http://1.1.1.17:8985/solr/Collection_shard70_replica3/|http://2.2.213:8985/solr/Collection_shard70_replica1/|http://1.1.1.16:8985/solr/Collection_shard70_replica2/|http://2.2.216:8985/solr/Collection_shard70_replica4/:{

 numFound:242727,

  maxScore:null,

  shardAddress:http://1.1.1.16:8985/solr/Collection_shard70_replica2;,

  time:23},


http://2.2.216:8985/solr/Collection_shard76_replica4/|http://1.1.1.17:8985/solr/Collection_shard76_replica3/|http://2.2.213:8985/solr/Collection_shard76_replica1/|http://1.1.1.16:8985/solr/Collection_shard76_replica2/:{

  numFound:243324,

  maxScore:null,

  shardAddress:http://2.2.216:8985/solr/Collection_shard76_replica4;,

  time:26},



Re: leader split-brain at least once a day - need help

2015-01-08 Thread Yonik Seeley
It's worth noting that those messages alone don't necessarily signify
a problem with the system (and it wouldn't be called split brain).
The async nature of updates (and thread scheduling) along with
stop-the-world GC pauses that can change leadership, cause these
little windows of inconsistencies that we detect and log.

-Yonik
http://heliosearch.org - native code faceting, facet functions,
sub-facets, off-heap data


On Wed, Jan 7, 2015 at 5:01 AM, Thomas Lamy t.l...@cytainment.de wrote:
 Hi there,

 we are running a 3 server cloud serving a dozen
 single-shard/replicate-everywhere collections. The 2 biggest collections are
 ~15M docs, and about 13GiB / 2.5GiB size. Solr is 4.10.2, ZK 3.4.5, Tomcat
 7.0.56, Oracle Java 1.7.0_72-b14

 10 of the 12 collections (the small ones) get filled by DIH full-import once
 a day starting at 1am. The second biggest collection is updated using DIH
 delta-import every 10 minutes, the biggest one gets bulk json updates with
 commits once in 5 minutes.

 On a regular basis, we have a leader information mismatch:
 org.apache.solr.update.processor.DistributedUpdateProcessor; Request says it
 is coming from leader, but we are the leader
 or the opposite
 org.apache.solr.update.processor.DistributedUpdateProcessor; ClusterState
 says we are the leader, but locally we don't think so

 One of these pops up once a day at around 8am, making either some cores go
 into "recovery failed" state, or all cores of at least one cloud node go into
 state "gone".
 This started out of the blue about 2 weeks ago, without changes to neither
 software, data, or client behaviour.

 Most of the time, we get things going again by restarting solr on the
 current leader node, forcing a new election - can this be triggered while
 keeping solr (and the caches) up?
 But sometimes this doesn't help, we had an incident last weekend where our
 admins didn't restart in time, creating millions of entries in
 /solr/overseer/queue, making zk close the connection, and leader re-elect
 fails. I had to flush zk, and re-upload collection config to get solr up
 again (just like in https://gist.github.com/isoboroff/424fcdf63fa760c1d1a7).
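A rough sketch of that recovery, using Solr's own zkcli.sh as in the gist above (ZooKeeper host, config path, and config name are placeholders; these are 4.x-era commands, so check them against your version -- and note that `clear` deletes the given path recursively, which requires the overseer to be down or re-elected afterwards):

```shell
# Flush the runaway overseer queue, then re-upload the collection config.
./zkcli.sh -zkhost zk1:2181 -cmd clear /overseer/queue
./zkcli.sh -zkhost zk1:2181 -cmd upconfig \
    -confdir /path/to/collection/conf -confname myconf
```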

 We have a much bigger cloud (7 servers, ~50GiB Data in 8 collections, 1500
 requests/s) up and running, which does not have these problems since
 upgrading to 4.10.2.


 Any hints on where to look for a solution?

 Kind regards
 Thomas

 --
 Thomas Lamy
 Cytainment AG  Co KG
 Nordkanalstrasse 52
 20097 Hamburg

 Tel.: +49 (40) 23 706-747
 Fax: +49 (40) 23 706-139
 Sitz und Registergericht Hamburg
 HRA 98121
 HRB 86068
 Ust-ID: DE213009476



Re: How large is your solr index?

2015-01-08 Thread Shawn Heisey
On 1/8/2015 4:37 AM, Bram Van Dam wrote:
 Hmm. That is a good point. I wonder if there's some kind of middle
 ground here? Something that lets me send an update (or new document) to
 an arbitrary node/shard but which is still routed according to my
 specific requirements? Maybe this can already be achieved by messing
 with the routing?

snip

 That's fine. We have a lot of query (pre-)processing outside of Solr.
 It's no problem for us to send a couple of queries to a couple of shards
 and aggregate the result ourselves. It would, of course, be nice if
 everything worked in distributed mode, but at least for us it's not an
 issue. This is a side effect of our complex reporting requirements -- we
 do aggregation, filtering and other magic on data that is partially in
 Solr and partially elsewhere.

SolrCloud, when you do fully automatic document routing, does handle
everything for you.  You can query any node and send updates to any
node, and they will end up in the right place.  There is currently a
strong caveat: Indexing performance sucks when updates are initially
sent to the wrong node.  The performance hit is far larger than we
expected it to be, so there is an issue in Jira to try and make that
better.  No visible work has been done on the issue yet:

https://issues.apache.org/jira/browse/SOLR-6717

The Java client (SolrJ, specifically CloudSolrServer) sends all updates
to the correct nodes, because it can access the clusterstate and knows
where updates need to go and where the shard leaders are.
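A minimal sketch of that SolrJ usage (collection, field names, and the ZooKeeper ensemble are placeholders; this needs the SolrJ 4.x jars and a running cluster, so it is illustrative only):

```java
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class CloudUpdateSketch {
    public static void main(String[] args) throws Exception {
        // Connect via ZooKeeper, not via any single Solr node.
        CloudSolrServer solr = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
        solr.setDefaultCollection("mycollection");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1");
        doc.addField("title_s", "example");

        // CloudSolrServer reads the clusterstate and sends the update
        // straight to the correct shard leader, avoiding the wrong-node
        // forwarding penalty described above.
        solr.add(doc);
        solr.commit();
        solr.shutdown();
    }
}
```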

 This is a very good point. But I don't think SPLITSHARD is the magical
 answer here. If you have N shards on N boxes, and they are all getting
 nearly full and you decide to split one and move half to a new box,
 you'll end up with N-2 nearly full boxes and 2 half-full boxes. What
 happens if the disks fill up further? Do I have to split each shard?
 That sounds pretty nightmareish!

Planning ahead for growth is critical with SolrCloud, but there is
something you can do if you discover that you need to radically
re-shard:  Create a whole new collection with the number of shards you
want, likely using the original set of Solr servers plus some new ones.
 Rebuild the index into that collection.  Delete the old collection, and
create a collection alias pointing the original name at the new
collection.  The alias will work for both queries and updates.
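That workflow can be sketched with the Collections API (host, collection, and config names are placeholders; shard/replica counts are examples only, and these calls need a running cluster):

```shell
# 1. Create the new, wider collection on the expanded set of servers.
curl 'http://localhost:8983/solr/admin/collections?action=CREATE&name=coll_v2&numShards=16&replicationFactor=2&collection.configName=myconf'

# 2. Rebuild/reindex all documents into coll_v2 (application-specific step).

# 3. Drop the old collection, then point its name at the new one.
curl 'http://localhost:8983/solr/admin/collections?action=DELETE&name=coll'
curl 'http://localhost:8983/solr/admin/collections?action=CREATEALIAS&name=coll&collections=coll_v2'
```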

Thanks,
Shawn



Re: Solr with Tomcat - enabling SSL problem

2015-01-08 Thread Shawn Heisey
On 1/8/2015 6:25 AM, Tali Finelt wrote:
 I am using Solr 4.10.2 with tomcat and embedded Zookeeper.
 I followed 
 https://cwiki.apache.org/confluence/display/solr/Enabling+SSL#EnablingSSL-SolrCloud
  
 to enable SSL.
 
 I am currently doing the following:
 
 Starting tomcat
 Running:
 ../scripts/cloud-scripts/zkcli.sh -zkhost localhost:9983 -cmd put 
  /clusterprops.json '{"urlScheme":"https"}' 
 Restarting tomcat
 Accessing Solr from my client using 
 org.apache.solr.client.solrj.impl.CloudSolrServer.
 
 And this works. 
 If I don't restart tomcat again after running zkcli.sh, I get the 
 following error:
 IOException occured when talking to server at: 
 http://ip:port/solr/ (http, not https).
 
 Is it possible to do this without the second restart?

Solr will only read parameters like the urlScheme at startup.  Once it's
running, that information is never accessed again, so in order to get it
to change those parameters, a restart is required.

It might be possible to change the code so a re-read of these parameters
takes place ... but writing code to make fundamental changes to program
operation can be risky.  Restarting the program is much safer.

Thanks,
Shawn



Re: Solr with Tomcat - enabling SSL problem

2015-01-08 Thread Shawn Heisey
On 1/8/2015 8:50 AM, Tali Finelt wrote:
 Thanks for clarifying this. 
 Is there a different way to set the embedded Zookeeper urlScheme parameter 
 before ever starting tomcat? (some configuration file etc.)
 This way I won't need to start tomcat twice.

Most of the cloud options can be specified with system properties on the
java commandline.  I believe you would use this:

-DurlScheme=https

I had thought maybe urlScheme could be specified in solr.xml, but I
can't find any examples, so it might not be possible.
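If the system-property route works, it would be set where Tomcat builds its JVM options, e.g. in setenv.sh (a sketch assuming a standard Tomcat layout; the property name follows the guess above and is not a documented Solr option, so verify it against your version):

```shell
# $CATALINA_HOME/bin/setenv.sh -- sourced by catalina.sh at startup.
# "-DurlScheme=https" is an assumption from the discussion above.
JAVA_OPTS="$JAVA_OPTS -DurlScheme=https"
export JAVA_OPTS
```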

Thanks,
Shawn



Re: How large is your solr index?

2015-01-08 Thread Joseph Obernberger


On 1/8/2015 3:16 AM, Toke Eskildsen wrote:

On Wed, 2015-01-07 at 22:26 +0100, Joseph Obernberger wrote:

Thank you Toke - yes - the data is indexed throughout the day.  We are
handling very few searches - probably 50 a day; this is an RD system.

If your searches are in small bundles, you could pause the indexing flow
while the searches are executed, for better performance.


Our HDFS cache, I believe, is too small at 10GBytes per shard.

That depends a lot on your corpus, your searches and underlying storage.
But with our current level of information, it is a really good bet:
Having 10GB cache per 130GB (270GB?) data is not a lot with spinning
drives.

Yes - it would be 20GBytes of cache per 270GBytes of data.



Current parameters for running each shard are:
JAVA_OPTS=-XX:MaxDirectMemorySize=10g -XX:+UseLargePages -XX:NewRatio=3

[...]

-Xmx10752m

One Solr/shard? You could probably win a bit by having one Solr/machine
instead. Anyway, it's quite a high Xmx, but I presume you have measured
the memory needs.
We've tried lower Xmx but we get OOM errors during faceting of large 
datasets.  Right now we're running two JVMs per physical box (2 shards 
per box), but we're going to be changing that to one JVM and one shard 
per box.



I'd love to try SSDs, but don't have the budget at present to go that
route.

We find the price/performance for SSD + moderate RAM to be quite a
better deal than spinning drives + a lot of RAM, even when buying
enterprise hardware. For consumer SSDs (used in our large server) it is
even cheaper to use SSDs. It all depends on use pattern of course, but
your setup with non-concurrent searches seems like it would fit well.

Note: I am sure that the RAM == index size would deliver very high
performance. With enough RAM you can use tape to hold the index. Whether
it is cost effective is another matter.
Ha!  Yes - our index is accessible via a 2400 baud modem, but we have 
lots of cache!  ;)



I'd really like to get the HDFS option to work well as it
reduces system complexity.

That is very understandable. We examined the option of networked storage
(Isilon) with underlying spindles, and it performed adequately for our
needs up to 2-3TB of index data. Unfortunately the heavy random read
load from Solr meant a noticeable degradation of other services using
the networked storage. I am sure it could be solved with more
centralized hardware, but in the end we found it cheaper and simpler to
use local storage for search. This will of course differ across
organizations and setups.


We're going to experiment with the one shard per box and more RAM cache 
per shard and see where that gets us; we'll also be adding more shards.

Thanks for the tips!
Interesting that you mention Isilon as we're planning on doing an eval 
with their product this year where we'll be testing out their HDFS 
layer.  It's a potential way to balance compute and storage since you 
can add HDFS storage without adding compute.




- Toke Eskildsen




-Joe


Re: Solr: IndexNotFoundException: no segments* file HdfsDirectoryFactory

2015-01-08 Thread praneethvarma
I missed Norgorn's reply above, but in the past, and as also suggested
there, the following lock type solved the problem for me.

<lockType>${solr.lock.type:hdfs}</lockType> in your indexConfig in
solrconfig.xml
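For context, a minimal sketch of the relevant solrconfig.xml sections (the directory factory settings and HDFS path are illustrative placeholders and must match your own setup):

```xml
<!-- Use the HDFS directory factory when the index lives in HDFS. -->
<directoryFactory name="DirectoryFactory"
                  class="solr.HdfsDirectoryFactory">
  <str name="solr.hdfs.home">hdfs://namenode:8020/solr</str>
</directoryFactory>

<indexConfig>
  <!-- hdfs lock type avoids the stale-lock / missing-segments problem. -->
  <lockType>${solr.lock.type:hdfs}</lockType>
</indexConfig>
```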



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-IndexNotFoundException-no-segments-file-HdfsDirectoryFactory-tp4138737p4178098.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr with Tomcat - enabling SSL problem

2015-01-08 Thread Tali Finelt
Hi Shawn,

Thanks for clarifying this. 
Is there a different way to set the embedded Zookeeper urlScheme parameter 
before ever starting tomcat? (some configuration file etc.)
This way I won't need to start tomcat twice.

Thanks,
Tali





From:   Shawn Heisey apa...@elyograg.org
To: solr-user@lucene.apache.org
Date:   08/01/2015 05:14 PM
Subject:Re: Solr with Tomcat - enabling SSL problem



On 1/8/2015 6:25 AM, Tali Finelt wrote:
 I am using Solr 4.10.2 with tomcat and embedded Zookeeper.
 I followed 
 
https://cwiki.apache.org/confluence/display/solr/Enabling+SSL#EnablingSSL-SolrCloud
 

 to enable SSL.
 
 I am currently doing the following:
 
 Starting tomcat
 Running:
 ../scripts/cloud-scripts/zkcli.sh -zkhost localhost:9983 -cmd put 
  /clusterprops.json '{"urlScheme":"https"}' 
 Restarting tomcat
 Accessing Solr from my client using 
 org.apache.solr.client.solrj.impl.CloudSolrServer.
 
 And this works. 
 If I don't restart tomcat again after running zkcli.sh, I get the 
 following error:
 IOException occured when talking to server at: 
 http://ip:port/solr/ (http, not https).
 
 Is it possible to do this without the second restart?

Solr will only read parameters like the urlScheme at startup.  Once it's
running, that information is never accessed again, so in order to get it
to change those parameters, a restart is required.

It might be possible to change the code so a re-read of these parameters
takes place ... but writing code to make fundamental changes to program
operation can be risky.  Restarting the program is much safer.

Thanks,
Shawn




RE: Solr Cloud, 100 shards, shards progressively become slower

2015-01-08 Thread Andrew Butkus
Hi Shawn,

Thank you for your reply

The part about memory usage is not clear.  That 4GB and 16GB could refer to 
the operating system view of memory, or the view of memory within the JVM.  
I'm curious about how much total RAM each machine has, how large the Java 
heap is, and what the total size of the indexes that live on each machine is.

Even if they are individually very small, 500 million documents will result in 
a very large index, so I'm guessing that you don't have enough RAM on each 
server for your index size.

What can happen with a highly sharded index that is too large for available 
RAM:  Index data for the initial queries gets read from the OS disk cache, but 
as those queries run, the information required for the shards that come later 
in the distributed query gets pushed out of the disk cache, so Solr must 
actually read the disk to do those later queries.  Disks are slow, so if the 
machine has to actually read from the disk, Solr will be slow.

http://wiki.apache.org/solr/SolrPerformanceProblems#RAM


We have 4GB of usage per shard (the index is split across 100 shards, and each 
shard is approx. 2.8GB on disk). We have allocated 14GB min and 16GB max of RAM 
to Solr, so it has plenty to use (RAM usage in the dashboard never goes above 
about 8GB, so there is still plenty).

I've managed to reproduce the issue with the shards= parameter, and I think I 
have shown that the disk cache is not the problem:

I'm simply querying the same shard, on the same server, multiple times (so that 
shard's index should always be in memory and never loaded from disk).

All but the last query are low ms ... 

{
  "responseHeader":{
    "status":0,
    "QTime":50190,
    "params":{
      "sort":"id ASC",
shards:1.1.1.18:8985/solr/Collection,1.1.1.18:8985/solr/Collection,1.1.1.18:8985/solr/Collection,1.1.1.18:8985/solr/Collection,1.1.1.18:8985/solr/Collection,1.1.1.18:8985/solr/Collection,1.1.1.18:8985/solr/Collection,1.1.1.18:8985/solr/Collection,1.1.1.18:8985/solr/Collection,1.1.1.18:8985/solr/Collection,1.1.1.18:8985/solr/Collection,1.1.1.18:8985/solr/Collection,1.1.1.18:8985/solr/Collection,1.1.1.18:8985/solr/Collection,1.1.1.18:8985/solr/Collection,1.1.1.18:8985/solr/Collection,1.1.1.18:8985/solr/Collection,1.1.1.18:8985/solr/Collection,1.1.1.18:8985/solr/Collection,1.1.1.18:8985/solr/Collection,1.1.1.18:8985/solr/Collection,1.1.1.18:8985/solr/Collection,1.1.1.18:8985/solr/Collection,1.1.1.18:8985/solr/Collection,1.1.1.18:8985/solr/Collection,1.1.1.18:8985/solr/Collection,1.1.1.18:8985/solr/Collection,1.1.1.18:8985/solr/Collection,1.1.1.18:8985/solr/Collection,1.1.1.18:8985/solr/Collection,1.1.1.18:8985/solr/Collection,1.1.1.18:8985/solr/Collection,1.1.1.18:8985/solr/Collection,1.1.1.18:8985/solr/Collection,1.1.1.18:8985/solr/Collection,1.1.1.18:8985/solr/Collection,1.1.1.18:8985/solr/Collection,1.1.1.18:8985/solr/Collection,1.1.1.18:8985/solr/Collection,
      "indent":"true",
      "q":"spec_country:\"United Kingdom\"",
      "shards.info":"true",
      "distrib":"false",
      "cursorMark":"*",
      "wt":"json",
      "rows":"10"}},
  "shards.info":{
    "1.1.1.18:8985/solr/Collection":{
      "numFound":242731,
      "maxScore":null,
      "shardAddress":"http://1.1.1.18:8985/solr/Collection",
      "time":24},
    "1.1.1.18:8985/solr/Collection":{
      "numFound":242731,
      "maxScore":null,
      "shardAddress":"http://1.1.1.18:8985/solr/Collection",
      "time":35},
    "1.1.1.18:8985/solr/Collection":{
      "numFound":242731,
      "maxScore":null,
      "shardAddress":"http://1.1.1.18:8985/solr/Collection",
      "time":35},
    "1.1.1.18:8985/solr/Collection":{
      "numFound":242731,
      "maxScore":null,
      "shardAddress":"http://1.1.1.18:8985/solr/Collection",
      "time":52},
    "1.1.1.18:8985/solr/Collection":{
      "numFound":242731,
      "maxScore":null,
      "shardAddress":"http://1.1.1.18:8985/solr/Collection",
      "time":55},
    "1.1.1.18:8985/solr/Collection":{
      "numFound":242731,
      "maxScore":null,
      "shardAddress":"http://1.1.1.18:8985/solr/Collection",
      "time":59},
    "1.1.1.18:8985/solr/Collection":{
      "numFound":242731,
      "maxScore":null,
      "shardAddress":"http://1.1.1.18:8985/solr/Collection",
      "time":58},
    "1.1.1.18:8985/solr/Collection":{
      "numFound":242731,
      "maxScore":null,
      "shardAddress":"http://1.1.1.18:8985/solr/Collection",
      "time":75},
    "1.1.1.18:8985/solr/Collection":{
      "numFound":242731,
      "maxScore":null,
      "shardAddress":"http://1.1.1.18:8985/solr/Collection",
      "time":79},
    "1.1.1.18:8985/solr/Collection":{
      "numFound":242731,
      "maxScore":null,
      "shardAddress":"http://1.1.1.18:8985/solr/Collection",
      "time":78},
    "1.1.1.18:8985/solr/Collection":{
      "numFound":242731,
      "maxScore":null,
      "shardAddress":"http://1.1.1.18:8985/solr/Collection",
      "time":79},
    "1.1.1.18:8985/solr/Collection":{
      "numFound":242731,
      "maxScore":null,
      "shardAddress":"http://1.1.1.18:8985/solr/Collection",
      "time":80},
    "1.1.1.18:8985/solr/Collection":{
      "numFound":242731,
      "maxScore":null,

Re: ignoring bad documents during index

2015-01-08 Thread Chris Hostetter

I don't have specific answers to all of your questions, but you should 
probably look at SOLR-445, where a lot of this has already been discussed 
and multiple patches with different approaches have been started...

https://issues.apache.org/jira/browse/SOLR-445

: Date: Wed, 7 Jan 2015 12:38:47 -0700 (MST)
: From: SolrUser1543 osta...@gmail.com
: Reply-To: solr-user@lucene.apache.org
: To: solr-user@lucene.apache.org
: Subject: Re: ignoring bad documents during index
: 
: I have implemented an update processor as described above. 
: 
: On a single Solr instance it works fine. 
: 
: When I test it on Solr cloud with several nodes and try to index a few
: documents, some of which are incorrect, each instance creates its own
: response, but it is not aggregated by the instance which got the request. 
: 
: I also tried to use QueryResponseWriter, but it also was not aggregated. 
: 
: The questions are : 
: 1.  how to make it be aggregated ? 
: 2. what kind of update processor it should be : UpdateRequestProcessor or
: DistributedUpdateRequestProcessor ? 
: 
: 
: 
: 
: --
: View this message in context: 
http://lucene.472066.n3.nabble.com/ignoring-bad-documents-during-index-tp4176947p4177911.html
: Sent from the Solr - User mailing list archive at Nabble.com.
: 

-Hoss
http://www.lucidworks.com/