Parallel Indexing With Solr?

2013-03-29 Thread Furkan KAMACI
Does Solr allow parallelism (parallel computing) for indexing?


Re: Parallel Indexing With Solr?

2013-03-29 Thread Gora Mohanty
On 29 March 2013 14:56, Furkan KAMACI furkankam...@gmail.com wrote:
 Does Solr allow parallelism (parallel computing) for indexing?

What do you mean by parallel computing in this context?

Solr can use multiple threads for indexing if that is what
you are asking.

Regards,
Gora


Re: Parallel Indexing With Solr?

2013-03-29 Thread Otis Gospodnetic
Yes. You can index from any app that can hit Solr with multiple
threads. You can use StreamingUpdateSolrServer, at least in older
Solr versions, to handle multi-threading for you. You can also index
from a MapReduce job.
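
A minimal sketch of the "multiple threads" approach, written in Python for brevity. The `send_batch` function is a hypothetical stand-in for whatever actually posts documents to Solr (for example an HTTP request to the update handler); it is not a Solr or SolrJ API, and here it only counts documents so the sketch runs without a server.

```python
from concurrent.futures import ThreadPoolExecutor


def send_batch(batch):
    """Hypothetical stand-in for posting one batch of documents to Solr.

    A real client would issue an HTTP request here; this stub just
    reports how many documents it was given.
    """
    return len(batch)


def index_in_parallel(batches, workers=4):
    """Send batches concurrently and return the total number of documents sent."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() hands each batch to a worker thread; results come back in order.
        return sum(pool.map(send_batch, batches))


# Five batches of ten small documents each.
batches = [[{"id": str(i * 10 + j)} for j in range(10)] for i in range(5)]
total_sent = index_in_parallel(batches)
```

The number of workers is an assumption to tune; past some point more client threads only add queueing on the server side.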

Otis
--
Solr & ElasticSearch Support
http://sematext.com/







Re: Parallel Indexing With Solr?

2013-03-29 Thread Furkan KAMACI
Can you tell me more about "You can index from a MapReduce job"? I use
Nutch, and it tells Solr to index and reindex. I know that I can use
MapReduce jobs on the Nutch side; however, can I use MapReduce jobs on
the Solr side (i.e. for indexing etc.)?





Re: Parallel indexing in Solr

2012-02-07 Thread Sami Siren
On Mon, Feb 6, 2012 at 5:55 PM, Per Steffensen st...@designware.dk wrote:
 Sami Siren wrote:

 On Mon, Feb 6, 2012 at 2:53 PM, Per Steffensen st...@designware.dk wrote:

 Actually, right now I am trying to find out what my bottleneck is. The
 setup is more complex than I would bother you with, but basically I have
 servers with 80-90% IO-wait and only 5-10% real CPU usage. It might not
 be a Solr-related problem, I am investigating different things, but I just
 wanted to know a little more about how Jetty/Solr works in order to make
 a qualified guess.

 What kind of/how many discs do you have for your shards? ..also what
 kind of server are you experimenting with?

 Grrr, that's where I have a little fight with operations. For now they gave
 me one (fairly big) machine with XenServer. I create my machines as Xen
 VMs on top of that. One of the things I don't like about this (besides that
 I don't trust Xen to do its virtualization right, or at least not to provide
 me with correct readings on IO) is that disk space is assigned from an
 iSCSI-connected SAN that they all share (including the line out there). But
 for now it actually doesn't look like a disk IO problem. It looks like
 network bottlenecks (but to some extent they all also share the network)
 among all the components in our setup - our client plus the Lily stack
 (HDFS, HBase, ZK, Lily Server, Solr etc). Well, it is complex, but anyway ...


You could try to isolate the bottleneck by testing the indexing speed
from the local machine hosting Solr. Also tools like iostat or sar
might give you more details about the disk side.
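
One way to make that comparison concrete, sketched in Python: time the same indexing call once from the Solr host and once from the remote client, then compare the docs/sec figures. `fake_index` is a hypothetical placeholder for the real indexing call and does no I/O, so the number it produces here is meaningless beyond showing that the harness runs.

```python
import time


def measure_docs_per_sec(index_fn, docs):
    """Time one call to index_fn(docs) and return documents per second."""
    start = time.perf_counter()
    index_fn(docs)
    elapsed = time.perf_counter() - start
    # Guard against a zero reading on very coarse clocks.
    return len(docs) / elapsed if elapsed > 0 else float("inf")


def fake_index(docs):
    """Hypothetical stand-in for a real indexing call; only iterates the docs."""
    for _ in docs:
        pass


docs = [{"id": str(i)} for i in range(10_000)]
rate = measure_docs_per_sec(fake_index, docs)
```

If the local figure is much higher than the remote one, the network path is the more likely suspect than Solr or the disks.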

--
 Sami Siren


Re: Parallel indexing in Solr

2012-02-07 Thread Per Steffensen



You could try to isolate the bottleneck by testing the indexing speed
from the local machine hosting Solr. Also tools like iostat or sar
might give you more details about the disk side.
  
Yes, I am doing different things to isolate the bottleneck. I'm also profiling
the JVM. And I am using iostat, top and sar already. Thanks.
This question was originally just to get an early indication of whether
or not Jetty was at all designed for parallel production-like
processing. Now I believe it is, until I prove that it does not live up
to my requirements.

Thanks!

--
 Sami Siren

  




Re: Parallel indexing in Solr

2012-02-06 Thread Per Steffensen

See response below

Erick Erickson wrote:

Unfortunately, the answer is it depends(tm).

First question: How are you indexing things? SolrJ? post.jar?
  

SolrJ, CommonsHttpSolrServer

But some observations:

1 sure, using multiple cores will have some parallelism. So will
using a single core but using something like SolrJ and
StreamingUpdateSolrServer.
So SolrJ with CommonsHttpSolrServer will not support handling several 
requests concurrently?

 Especially with trunk (4.0)
 and the Document Writer Per Thread stuff.
We are using trunk (4.0). Can you provide me with a little more info on
this Document Writer Per Thread stuff? A link or something?

 In 3.x, you'll
 see some pauses when segments are merged that you
 can't get around (per core). See:
 
http://www.searchworkings.org/blog/-/blogs/gimme-all-resources-you-have-i-can-use-them!/
 for an excellent writeup. But whether or not you use several
 cores should be determined by your problem space, certainly
 not by trying to increase the throughput. Indexing usually
 takes a back seat to search performance.
  

We will have few searches, but a lot of indexing.

2 general settings are hard to come by. If you're sending
  structured documents that use Tika to parse the data
  behind the scenes, your performance will be much
  different (slower) than sending SolrInputDocuments
 (SolrJ).
  

We are sending SolrInputDocuments

3 The recommended servlet container is, generally,
  The one you're most comfortable with. Tomcat is
  certainly popular. That said, use whatever you're
  most comfortable with until you see a performance
 problem. Odds are you'll find your load on Solr is
  at its limit before your servlet container has problems.
  

So Jetty is not an easy-to-use but non-performant container?

4 Monitor your CPU, fire more requests at it until it
 hits 100%. Note that there are occasions where the
servlet container limits the number of outstanding
 requests it will allow and queues ones over that
 limit (find the magic setting to increase this if it's a
 problem, it differs by container). If you start to see
 your response times lengthen but the CPU not being
fully utilized, that may be the cause.
  
Actually, right now I am trying to find out what my bottleneck is. The
setup is more complex than I would bother you with, but basically I
have servers with 80-90% IO-wait and only 5-10% real CPU usage. It 
might not be a Solr-related problem, I am investigating different 
things, but just wanted to know a little more about how Jetty/Solr works 
in order to make a qualified guess.

5 How high is high performance? On a stock solr
 with the Wikipedia dump (11M docs), all running on
 my laptop, I see 7K docs/sec indexed. I know of
 installations that see 60 docs/sec or even less. I'm
sending simple docs with SolrJ locally and they're
 sending huge documents over the wire that Tika
 handles. There are just so many variables it's hard
 to say anything except try it and see..
  
Well, eventually we need to be able to index and delete about 50 million
documents per day. We will need to keep a history of 2 years of data
in our system; deletion will not start before we have been in production
for 2 years. At that point in time the system needs to contain 2 years *
365 days/year * 50 million docs/day = 36.5 billion documents. At that point
50 million documents need to be deleted and indexed per day - before that we
only need to index 50 million documents per day. We are aware that we are
probably going to need a certain amount of hardware for this, but the most
important thing is that we make a scalable setup so that we can get to
these kinds of numbers at all. Right now I am focusing on getting the most
out of one Solr instance, potentially with several cores, though.
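
The arithmetic above can be checked with a few lines of Python. The only assumptions are that a year is counted as 365 days, as in the message, and that the daily load is spread evenly over 24 hours, which real ingestion rarely is.

```python
DOCS_PER_DAY = 50_000_000       # 50 million documents per day, from the message
RETENTION_DAYS = 2 * 365        # two years of history, ignoring leap days
SECONDS_PER_DAY = 24 * 60 * 60

# Documents held once the two-year window is full.
steady_state_docs = DOCS_PER_DAY * RETENTION_DAYS

# Average indexing rate if the load were perfectly even over the day.
avg_index_rate = DOCS_PER_DAY / SECONDS_PER_DAY

print(f"{steady_state_docs:,} documents retained")
print(f"{avg_index_rate:.0f} docs/sec on average")
```

This prints 36,500,000,000 documents retained and an average of 579 docs/sec. Once deletion starts after two years, the same average rate applies to deletes as well.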

Best
Erick




  




Re: Parallel indexing in Solr

2012-02-06 Thread Erick Erickson
Right. See below.

On Mon, Feb 6, 2012 at 7:53 AM, Per Steffensen st...@designware.dk wrote:
 See response below

 Erick Erickson wrote:

 Unfortunately, the answer is it depends(tm).

 First question: How are you indexing things? SolrJ? post.jar?


 SolrJ, CommonsHttpSolrServer

 But some observations:

 1 sure, using multiple cores will have some parallelism. So will
    using a single core but using something like SolrJ and
    StreamingUpdateSolrServer.

 So SolrJ with CommonsHttpSolrServer will not support handling several
 requests concurrently?


Nope. Use StreamingUpdateSolrServer, it should be just a drop-in with
a different constructor.

  Especially with trunk (4.0)
     and the Document Writer Per Thread stuff.

 We are using trunk (4.0). Can you provide me with a little more info on this
 Document Writer Per Thread stuff. A link or something?


I already did, follow the link I provided.

  In 3.x, you'll
     see some pauses when segments are merged that you
     can't get around (per core). See:

 http://www.searchworkings.org/blog/-/blogs/gimme-all-resources-you-have-i-can-use-them!/
     for an excellent writeup. But whether or not you use several
     cores should be determined by your problem space, certainly
     not by trying to increase the throughput. Indexing usually
     takes a back seat to search performance.


 We will have few searches, but a lot of indexing.


Hmmm, this is the inverse of most installations, so it's good to know.

 2 general settings are hard to come by. If you're sending
      structured documents that use Tika to parse the data
      behind the scenes, your performance will be much
      different (slower) than sending SolrInputDocuments
     (SolrJ).


 We are sending SolrInputDocuments

 3 The recommended servlet container is, generally,
      The one you're most comfortable with. Tomcat is
      certainly popular. That said, use whatever you're
      most comfortable with until you see a performance
     problem. Odds are you'll find your load on Solr is
      at its limit before your servlet container has problems.


 So Jetty is not an easy-to-use but non-performant container?


Again, test and see. Lots of commercial systems use Jetty. Consider
that you're just sending sets of documents to Solr; the container
is doing very little work. You are batching up your Solr documents,
aren't you?
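
For readers unsure what batching means here, a minimal Python sketch: group documents into fixed-size lists so that each update request carries many documents instead of one. The batch size of 1000 is an arbitrary assumption, not a recommendation from this thread.

```python
from itertools import islice


def batched(docs, batch_size):
    """Yield lists of at most batch_size documents from any iterable."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    it = iter(docs)
    while True:
        # islice pulls the next batch_size items without loading everything.
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


# 2500 small documents produced lazily, grouped into batches of 1000.
docs = ({"id": str(i)} for i in range(2500))
batch_sizes = [len(b) for b in batched(docs, 1000)]
```

Each yielded list would then go out as a single update request, which cuts the per-request overhead on both the client and the container.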

 4 Monitor your CPU, fire more requests at it until it
     hits 100%. Note that there are occasions where the
    servlet container limits the number of outstanding
     requests it will allow and queues ones over that
     limit (find the magic setting to increase this if it's a
     problem, it differs by container). If you start to see
     your response times lengthen but the CPU not being
    fully utilized, that may be the cause.


 Actually, right now I am trying to find out what my bottleneck is. The setup
 is more complex than I would bother you with, but basically I have servers
 with 80-90% IO-wait and only 5-10% real CPU usage. It might not be a
 Solr-related problem, I am investigating different things, but just wanted
 to know a little more about how Jetty/Solr works in order to make a
 qualified guess.

You should see this differ with StreamingUpdateSolrServer assuming your
client can feed documents fast enough. You can consider having multiple
clients feed the same solr indexer if necessary.



 5 How high is high performance? On a stock solr
     with the Wikipedia dump (11M docs), all running on
     my laptop, I see 7K docs/sec indexed. I know of
     installations that see 60 docs/sec or even less. I'm
    sending simple docs with SolrJ locally and they're
     sending huge documents over the wire that Tika
     handles. There are just so many variables it's hard
     to say anything except try it and see..


 Well, eventually we need to be able to index and delete about 50 million
 documents per day. We will need to keep a history of 2 years of data in
 our system; deletion will not start before we have been in production for 2
 years. At that point in time the system needs to contain 2 years * 365
 days/year * 50 million docs/day = 36.5 billion documents. At that point 50
 million documents need to be deleted and indexed per day - before that we only
 need to index 50 million documents per day. We are aware that we are probably
 going to need a certain amount of hardware for this, but the most important
 thing is that we make a scalable setup so that we can get to these kinds of
 numbers at all. Right now I am focusing on getting the most out of one Solr
 instance, potentially with several cores, though.

My off-the-top-of-my-head feeling is that this will be a LOT of hardware. You'll
without doubt be sharding the index. NOTE: Shards are cores, just special-purpose
ones, i.e. they all use the same schema. When Solr folks see "cores",
we assume several cores that may have different schemas and handle
unrelated queries. It sounds like you're talking about a sharded system rather
than independent cores, is that so?

Re: Parallel indexing in Solr

2012-02-06 Thread Sami Siren
On Mon, Feb 6, 2012 at 2:53 PM, Per Steffensen st...@designware.dk wrote:


 Actually, right now I am trying to find out what my bottleneck is. The setup
 is more complex than I would bother you with, but basically I have servers
 with 80-90% IO-wait and only 5-10% real CPU usage. It might not be a
 Solr-related problem, I am investigating different things, but just wanted
 to know a little more about how Jetty/Solr works in order to make a
 qualified guess.

What kind of/how many discs do you have for your shards? ..also what
kind of server are you experimenting with?

--
 Sami Siren


Re: Parallel indexing in Solr

2012-02-06 Thread Per Steffensen



So SolrJ with CommonsHttpSolrServer will not support handling several
requests concurrently?




Nope. Use StreamingUpdateSolrServer, it should be just a drop-in with
a different constructor.
  
I will try to do that. It is a little bit difficult for me, as we are
actually not dealing with Solr ourselves. We are using Lily, but I will
modify Lily, compile it and see how it goes.
  

 Especially with trunk (4.0)
and the Document Writer Per Thread stuff.
  

We are using trunk (4.0). Can you provide me with a little more info on this
Document Writer Per Thread stuff. A link or something?




I already did, follow the link I provided.
  

Ahh OK, I didn't get the first time that the link below was about that.



http://www.searchworkings.org/blog/-/blogs/gimme-all-resources-you-have-i-can-use-them!/

  

So Jetty is not an easy-to-use but non-performant container?




Again, test and see. Lots of commercial systems use Jetty. Consider
that you're just sending sets of documents at Solr, the container
is doing very little work. You are batching up your Solr documents
aren't you?
  
I haven't looked into Lily to see whether or not documents are batched, but
I will. I didn't expect Jetty to be the problem; I basically just wanted to
know that it was not a stupid everything-in-a-single-thread container,
almost designed not to perform (because the focus might be different,
e.g. providing an easy-to-use/understand container for testing etc.)
  

Actually, right now I am trying to find out what my bottleneck is.



You should see this differ with StreamingUpdateSolrServer assuming your
client can feed documents fast enough. You can consider having multiple
clients feed the same solr indexer if necessary.
  

Thanks!



5 How high is high performance? On a stock solr
with the Wikipedia dump (11M docs), all running on
my laptop, I see 7K docs/sec indexed. I know of
installations that see 60 docs/sec or even less. I'm
   sending simple docs with SolrJ locally and they're
sending huge documents over the wire that Tika
handles. There are just so many variables it's hard
to say anything except try it and see..

  

50 million documents need to be deleted and indexed per day. 2 years of
history = 36.5 billion docs in store



My off-the-top-of-my-head feeling is that this will be a LOT of hardware.
Well, it takes what it takes. Someone else will buy the hardware. My
first concern is to make sure we have a system that scales, so that we
can buy our way out of problems by buying more hardware. On the other hand,
of course I want to provide a system that makes the most of the hardware.

 You'll
without doubt be sharding the index. NOTE: Shards are cores, just special-purpose
ones, i.e. they all use the same schema. When Solr folks see "cores",
we assume several cores that may have different schemas and handle
unrelated queries. It sounds like you're talking about a sharded system rather
than independent cores, is that so?
  
Yes that is correct. We only have one single schema/config shared by all 
cores through ZK. So the many cores are just for sharding, because I do 
not expect that it will work very well with 20 billion docs in the same 
core/shard :-)

You should have no trouble indexing 50M documents/day, even assuming that the
ingestion rate is not evenly distributed. The link I referenced talks
about indexing 10M documents in a little over 6 minutes. YMMV however. I think
you're going along the right path when trying to push a single indexer to
the max. My setup uses Jetty and is getting 5-7K docs/second so I doubt it's
inherently a Jetty problem, although there may be configuration tweaks getting
in your way.
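
To put those figures side by side, a small Python sketch. The 6-minute figure is taken as exactly 360 seconds, which slightly overstates the blog post's rate since the text says "a little over 6 minutes"; the 5K and 7K docs/sec values are the ones quoted above.

```python
DOCS_PER_DAY = 50_000_000


def hours_to_index(docs, docs_per_sec):
    """Hours needed to index `docs` documents at a constant rate."""
    return docs / docs_per_sec / 3600


# Rate implied by "10M documents in a little over 6 minutes" (upper bound).
blog_rate = 10_000_000 / (6 * 60)

# Time to get through one day's documents at the quoted single-node rates.
hours_at_5k = hours_to_index(DOCS_PER_DAY, 5_000)
hours_at_7k = hours_to_index(DOCS_PER_DAY, 7_000)
```

At 5K docs/sec a day's worth of documents takes under three hours, which leaves room for ingestion that is not evenly distributed over the day.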

Bottom line: I doubt it's a Jetty issue at this point but I've been wrong on
too many occasions to count. I'd be looking other places first though. Start
with the streaming update solr server though, and also whether your clients can
spit out documents fast enough...
  

I will have a look at all that. Thanks!

Best
Erick
  




Re: Parallel indexing in Solr

2012-02-06 Thread Per Steffensen

Sami Siren wrote:

On Mon, Feb 6, 2012 at 2:53 PM, Per Steffensen st...@designware.dk wrote:


  

Actually, right now I am trying to find out what my bottleneck is. The setup
is more complex than I would bother you with, but basically I have servers
with 80-90% IO-wait and only 5-10% real CPU usage. It might not be a
Solr-related problem, I am investigating different things, but just wanted
to know a little more about how Jetty/Solr works in order to make a
qualified guess.



What kind of/how many discs do you have for your shards? ..also what
kind of server are you experimenting with?
  
Grrr, that's where I have a little fight with operations. For now they
gave me one (fairly big) machine with XenServer. I create my machines
as Xen VMs on top of that. One of the things I don't like about this
(besides that I don't trust Xen to do its virtualization right, or at
least not to provide me with correct readings on IO) is that disk space is
assigned from an iSCSI-connected SAN that they all share (including the
line out there). But for now it actually doesn't look like a disk IO
problem. It looks like network bottlenecks (but to some extent they
all also share the network) among all the components in our setup - our
client plus the Lily stack (HDFS, HBase, ZK, Lily Server, Solr etc). Well, it
is complex, but anyway ...

--
 Sami Siren

  




Re: Parallel indexing in Solr

2012-02-06 Thread Erick Erickson
grin. I've had recurring discussions with executive level folks that no
matter how many VMs you host on a machine, and no matter how big that
machine is, there really, truly, *is* some hardware underlying it all that
really, truly, *does* have some limits.

And adding more VMs doesn't somehow get around those limits.

Good Luck!
Erick







Parallel indexing in Solr

2012-02-03 Thread Per Steffensen

Hi

This topic has probably been covered before, but I haven't had the luck to
find the answer.


We are running Solr instances with several cores inside, with Solr running
out-of-the-box on top of Jetty. I believe Jetty receives all the
HTTP requests about indexing new documents and forwards them to the Solr
engine. What kind of parallelism does this setup provide? Can more than
one index request get processed concurrently? How many? How can I increase
the number of index requests that can be handled in parallel? Will I get
better parallelism by running on another web container than Jetty, e.g.
Tomcat? What is the recommended web container for high-performance
production systems?


Thanks!

Regards, Per Steffensen


Re: Parallel indexing in Solr

2012-02-03 Thread Erick Erickson
Unfortunately, the answer is "it depends"(tm).

First question: How are you indexing things? SolrJ? post.jar?

But some observations:

1 Sure, using multiple cores will give you some parallelism. So will
  using a single core with something like SolrJ and
  StreamingUpdateSolrServer, especially with trunk (4.0)
  and the Document Writer Per Thread stuff. In 3.x, you'll
  see some pauses when segments are merged that you
  can't get around (per core). See:
  http://www.searchworkings.org/blog/-/blogs/gimme-all-resources-you-have-i-can-use-them!/
  for an excellent writeup. But whether or not you use several
  cores should be determined by your problem space, certainly
  not by trying to increase the throughput. Indexing usually
  takes a back seat to search performance.
2 General settings are hard to come by. If you're sending
  structured documents that use Tika to parse the data
  behind the scenes, your performance will be much
  different (slower) than sending SolrInputDocuments
  (SolrJ).
3 The recommended servlet container is, generally,
  the one you're most comfortable with. Tomcat is
  certainly popular. That said, use whatever you're
  most comfortable with until you see a performance
  problem. Odds are you'll find your load on Solr is
  at its limit before your servlet container has problems.
4 Monitor your CPU, fire more requests at it until it
  hits 100%. Note that there are occasions where the
  servlet container limits the number of outstanding
  requests it will allow and queues ones over that
  limit (find the magic setting to increase this if it's a
  problem, it differs by container). If you start to see
  your response times lengthen but the CPU not being
  fully utilized, that may be the cause.
5 How high is high performance? On a stock Solr
  with the Wikipedia dump (11M docs), all running on
  my laptop, I see 7K docs/sec indexed. I know of
  installations that see 60 docs/sec or even less. I'm
  sending simple docs with SolrJ locally and they're
  sending huge documents over the wire that Tika
  handles. There are just so many variables it's hard
  to say anything except try it and see.

Best
Erick
