Re: Solr - Index Concurrency - Is it possible to have multiple threads write to same index?
SolrCloud supports this dynamic addition. SolrCloud makes copies of the source documents, and every Solr instance does its own indexing. With replication, you only create the indexes once; when storing very large documents, this is worthwhile. The only use case I have seen for EmbeddedSolrServer that really makes sense is as Hadoop output.

On Mon, Aug 27, 2012 at 8:28 PM, KnightRider wrote:
> One other thing I forgot to mention: the multicore setup we have requires us
> to be able to add cores dynamically, and I am not sure whether that is
> supported by HTTP Solr out of the box.
>
> -
> Thanks
> -K'Rider
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Solr-Index-Concurrency-Is-it-possible-to-have-multiple-threads-write-to-same-index-tp4002544p4003623.html
> Sent from the Solr - User mailing list archive at Nabble.com.

--
Lance Norskog
goks...@gmail.com
Re: Solr - Index Concurrency - Is it possible to have multiple threads write to same index?
One other thing I forgot to mention: the multicore setup we have requires us to be able to add cores dynamically, and I am not sure whether that is supported by HTTP Solr out of the box.

-
Thanks
-K'Rider
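For what it's worth, stock HTTP Solr does expose dynamic core creation through the CoreAdmin handler's CREATE action (provided solr.xml is set to persist new cores). A rough sketch of building that request from Java; the host, core name, and directory layout below are invented examples, not values from this thread:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class CoreAdminCreate {
    // Build a CoreAdmin CREATE URL. Host, core name, and directory
    // layout here are invented examples.
    public static String createCoreUrl(String solrBase, String coreName, String instanceDir)
            throws UnsupportedEncodingException {
        return solrBase + "/admin/cores"
                + "?action=CREATE"
                + "&name=" + URLEncoder.encode(coreName, "UTF-8")
                + "&instanceDir=" + URLEncoder.encode(instanceDir, "UTF-8")
                + "&config=solrconfig.xml&schema=schema.xml";
    }

    public static void main(String[] args) throws Exception {
        // A real app would then issue a GET on this URL with an HTTP client.
        System.out.println(createCoreUrl("http://localhost:8983/solr", "client42", "cores/client42"));
    }
}
```

The same handler also supports UNLOAD and STATUS, so a webapp can manage per-client cores without restarting Solr.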
Re: Solr - Index Concurrency - Is it possible to have multiple threads write to same index?
Thanks for the reply, Lance.

From your post, my understanding is that the Solr committers are more focused on HTTP Solr than on EmbeddedSolrServer, and that EmbeddedSolrServer may not be tested for all the features HTTP Solr supports. That said, can you please tell me whether there is any justification/use case for using EmbeddedSolrServer? The reason I am asking is: if EmbeddedSolrServer is not advised by the Solr committers, then why don't they deprecate it and force users down the HTTP Solr route instead? I am just trying to understand whether there is any valid use case for EmbeddedSolrServer.

We currently have EmbeddedSolrServer with a multicore setup (one core per client; each core/index is in the range of 20-70 GB) integrated into our web application, and it has been working fine for us. After reading the responses, though, I am wondering whether we should move to HTTP Solr and what benefit we might get if EmbeddedSolrServer were replaced with HTTP Solr. For replication we have been using rsync, and it has been working fine for us.

Also, for our needs (below), do you suggest HTTP Solr or EmbeddedSolrServer?

1) Indexing speed is more important than flexibility.
2) We have huge text articles/blog files (>2 MB) that need to be parsed from the filesystem and indexed. Our index size will be in the range of 20-70 GB per core, and there is a core for each client.
3) We need to store all the data in the index because we absolutely need the highlighter feature working, and reading the Solr documentation I found that the highlighter can be used only when the data is stored.
4) We also need to store positions and offsets, because we need to be able to use phrase queries and also need the positions of the terms in the search result documents.
Thanks
-K'Rider
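For points (3) and (4) in the message above, the relevant knobs live in schema.xml. A sketch (the field name and type are placeholders, not from this thread): highlighting requires stored="true"; phrase queries need positions, which any indexed, analyzed field records by default; and enabling term vectors with positions and offsets lets the highlighter avoid re-analyzing a ~2 MB stored value at query time:

```xml
<!-- Placeholder field; substitute the type/analyzers you already use. -->
<field name="body" type="text" indexed="true" stored="true"
       termVectors="true" termPositions="true" termOffsets="true"/>
```

The trade-off, as Lance notes elsewhere in this thread, is that stored data and term vectors get copied around during segment merges, so they cost indexing speed.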
Re: Solr - Index Concurrency - Is it possible to have multiple threads write to same index?
A few other things:

Support: many of the Solr committers do not like the Embedded server. It does not get much attention, so if you find problems with it, you may have to fix them yourself and get someone to review and commit the fixes. I'm not saying they sabotage it; there is just not much interest in making it first-class.

Replication: you can replicate from the Embedded server with the old rsync-based replicator. The Java replication tool requires servlets. If you are Unix-savvy, the rsync tool is fine.

Indexing speed:
1) You can use shards to split the index into pieces. This divides the indexing work among the shards.
2) Do not store the giant data. A lot of sites instead archive the datafile and index a link to the file. Giant stored fields cause indexing speed to drop dramatically, because stored data is not saved just once: it is copied repeatedly during merging as new documents are added. Index data is also copied around, but this tends to grow sub-linearly since documents share terms.
3) Do not store positions and offsets. They allow you to do phrase queries because they record the position of each word, but they take a lot of memory and have to be copied around during merging.

On Thu, Aug 23, 2012 at 1:31 AM, Mikhail Khludnev wrote:
> I know the following drawbacks of EmbServer:
>
> - org.apache.solr.client.solrj.request.UpdateRequest.getContentStreams(),
>   which is called when handling an update request, creates a lot of garbage
>   in memory and bloats it with expensive XML.
> - org.apache.solr.response.BinaryResponseWriter.getParsedResponse(SolrQueryRequest,
>   SolrQueryResponse) does something similar on the response side - it just
>   bloats your heap.
>
> For me your task is covered by Multiple Cores. Anyway, if you are OK with
> EmbeddedServer, let it be. Just be aware of the stream updates feature:
> http://wiki.apache.org/solr/ContentStream
>
> My average indexing speed estimate is for fairly small docs of less than 1K
> (which are always used for micro-benchmarking).
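Lance's suggestion above to archive the datafile and index a link can be sketched roughly like this; a plain Map stands in for a SolrInputDocument, and the field names are invented:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.HashMap;
import java.util.Map;

public class ArchiveAndLink {
    // Copy the big source file to an archive area, then build a small
    // document: the text goes into an indexed-but-not-stored field,
    // and only a link to the archived file is stored.
    public static Map<String, String> buildDoc(Path source, Path archiveDir) throws IOException {
        Files.createDirectories(archiveDir);
        Path archived = archiveDir.resolve(source.getFileName());
        Files.copy(source, archived, StandardCopyOption.REPLACE_EXISTING);

        Map<String, String> doc = new HashMap<>();
        doc.put("id", source.getFileName().toString());
        // schema would declare this field indexed="true" stored="false":
        doc.put("body", new String(Files.readAllBytes(source), "UTF-8"));
        // small stored field instead of a ~2 MB stored body:
        doc.put("file_link", archived.toString());
        return doc;
    }
}
```

The search result then carries the link, and the application fetches the original file only when a user actually opens it.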
Re: Solr - Index Concurrency - Is it possible to have multiple threads write to same index?
I know the following drawbacks of EmbServer:

- org.apache.solr.client.solrj.request.UpdateRequest.getContentStreams(), which is called when handling an update request, creates a lot of garbage in memory and bloats it with expensive XML.
- org.apache.solr.response.BinaryResponseWriter.getParsedResponse(SolrQueryRequest, SolrQueryResponse) does something similar on the response side - it just bloats your heap.

For me your task is covered by Multiple Cores. Anyway, if you are OK with EmbeddedServer, let it be. Just be aware of the stream updates feature: http://wiki.apache.org/solr/ContentStream

My average indexing speed estimate is for fairly small docs of less than 1K (which are always used for micro-benchmarking). Heavy analysis is the key argument for invoking updates in multiple threads. What is your CPU utilization during indexing?

On Thu, Aug 23, 2012 at 7:52 AM, ksu wildcats wrote:
> Thanks for the reply Mikhail.
>
> For our needs the speed is more important than flexibility, and we have
> huge text files (e.g. blogs/articles of ~2 MB) that need to be read from
> our filesystem and stored into the index.
> [...]

--
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics
<http://www.griddynamics.com>
Re: Solr - Index Concurrency - Is it possible to have multiple threads write to same index?
Thanks for the reply, Mikhail.

For our needs, speed is more important than flexibility, and we have huge text files (e.g. blogs/articles of ~2 MB) that need to be read from our filesystem and stored into the index.

Our app creates a separate core per client (dynamically), and there is one instance of EmbeddedSolrServer for each core that is used for adding documents to the index. Each document has about 10 fields, and one of the fields has ~2 MB of data stored (stored=true, analyzed=true). We also have logic built into our webapp to dynamically create the Solr config files for each core (solrconfig and schema per core; filter/analyzer/handler values can differ between cores) before creating an instance of EmbeddedSolrServer for that core. Another reason to go with EmbeddedSolrServer is to reduce the overhead of transporting large data (~2 MB) over HTTP/XML.

We use this setup for building our master index, which then gets replicated to slave servers using the replication scripts provided by Solr. We also have the Solr admin UI integrated into our webapp (using the admin JSPs and handlers from the Solr admin UI).

We have been using this multicore setup for more than a year now, and so far we haven't run into any issues with EmbeddedSolrServer integrated into our webapp. However, I am now trying to figure out the impact of allowing multiple threads to send requests to the same EmbeddedSolrServer core for adding documents simultaneously.

Our understanding was that EmbeddedSolrServer would give us better performance than HTTP Solr for our needs. It is quite possible that we are wrong and HTTP Solr would have given us similar or better performance.

Also, based on the documentation from SolrWiki, I am assuming that the EmbeddedSolrServer API is the same as the one used by HTTP Solr. That said, can you please tell me whether there is any specific downside to using EmbeddedSolrServer that could cause issues for us down the line?

I am also interested in your comment below about indexing 1 million docs in a few minutes. Ideally we would like to get to that speed. I am assuming this depends on the size of the docs and the type of analyzers/tokenizers/filters being used - correct? Can you please share (or point me to documentation on) how to get to that speed for 1 million docs?

>> - one million is a fairly small amount, in average it should be indexed
>> in few mins. I doubt that you really need to distribute indexing

Thanks
-K
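The multi-threaded question in the message above usually comes down to one shared server instance per core fed by a bounded thread pool. Here is a pure-Java sketch of that pattern; the StubServer below is my stand-in, not SolrJ API, so verify thread-safety of add() against your Solr version before relying on it:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class ConcurrentIndexing {
    // Thread-safe stand-in for the per-core server. The key point:
    // one shared instance per core, never one instance per thread.
    static class StubServer {
        private final AtomicInteger added = new AtomicInteger();
        void add(String doc) { added.incrementAndGet(); }
        int count() { return added.get(); }
    }

    public static int indexAll(int nDocs, int nThreads) throws InterruptedException {
        StubServer server = new StubServer();            // ONE shared instance
        ExecutorService pool = Executors.newFixedThreadPool(nThreads);
        for (int i = 0; i < nDocs; i++) {
            final int id = i;
            pool.submit(() -> server.add("doc-" + id));  // concurrent adds
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return server.count();
    }
}
```

A bounded pool also keeps memory in check when each document carries ~2 MB of text, since at most nThreads documents are in flight at once.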
Re: Solr - Index Concurrency - Is it possible to have multiple threads write to same index?
Hello,

- The embedded server is usually not the best way.
- Lucene indexes perfectly well from multiple threads concurrently; the single writer per directory can be called concurrently.
- With SolrJ you can use ConcurrentUpdateSolrServer, or call StreamingUpdateSolrServer from multiple threads, or just update docs in parallel through a plain SolrServer.
- There is also SOLR-3585, which adds server-side concurrency for handling long single-threaded requests (it is intended to work with StreamingUpdateSolrServer).
- If you want to distribute your indexes, that is what SolrCloud is for; you can then search these indices in parallel.
- Somewhat esoteric to me: after you build indexes distributed, you can try to merge them into a single solid one: http://wiki.apache.org/solr/MergingSolrIndexes
- NFS almost never provides enough consistency, i.e. it is hardly useful for indexing.
- One million is a fairly small amount; on average it should be indexed in a few minutes. I doubt that you really need to distribute indexing.

On Wed, Aug 22, 2012 at 8:53 AM, ksu wildcats wrote:
> We have a webapp that has embedded Solr integrated in it.
> It essentially handles creating a separate index (core) per client, and it
> is currently set up such that there can only be one index write operation
> per core.
> [...]

--
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics
<http://www.griddynamics.com>
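The merge route mentioned in the list above (MergingSolrIndexes) is exposed through the CoreAdmin handler as the mergeindexes action. A sketch of building that request; the core name and directory paths are invented examples:

```java
import java.net.URLEncoder;

public class MergeIndexesUrl {
    // Build a CoreAdmin mergeindexes request that merges the given source
    // index directories into the target core. The target core's Solr
    // instance must be able to read the source directories directly.
    public static String mergeUrl(String solrBase, String targetCore, String... indexDirs)
            throws Exception {
        StringBuilder url = new StringBuilder(solrBase)
                .append("/admin/cores?action=mergeindexes&core=")
                .append(URLEncoder.encode(targetCore, "UTF-8"));
        for (String dir : indexDirs) {
            url.append("&indexDir=").append(URLEncoder.encode(dir, "UTF-8"));
        }
        return url.toString();
    }
}
```

This fits the "each server builds a sub-index, then merge" plan from the original question, without any writer ever sharing a directory over NFS.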
Solr - Index Concurrency - Is it possible to have multiple threads write to same index?
We have a webapp that has embedded Solr integrated in it. It essentially handles creating a separate index (core) per client, and it is currently set up such that there can only be one index write operation per core. Say we have 1 million documents that need to be indexed: our app reads each document and writes it to the index (using the embedded Solr library).

I am looking into ways to speed up indexing time, and I was wondering if it would be possible to have our app run on multiple servers with each server indexing docs concurrently. I was thinking of having the index storage on NFS so it can be accessed by all servers.

I am not entirely sure, but reading through the documentation my understanding is that we cannot have multiple index writers (even if they are running on different servers) write to the same index directory simultaneously. Is that correct?

If there is a limitation on concurrent writes to the same index directory, do I need to have each server build a separate index (more like cores within a core) and merge all the sub-indexes into the main index to speed up indexing?

Please let me know if I am heading down the correct path, or if there are better alternatives to speed up indexing time.
Re: Index Concurrency
On 5/10/07, joestelmach <[EMAIL PROTECTED]> wrote:
> > Yes, coordination between the main index searcher, the index writer,
> > and the index reader needed to delete other documents.
>
> Can you point me to any documentation/code that describes this implementation?

Look at SolrCore.getSearcher() and DirectUpdateHandler2.

-Yonik
Re: Index Concurrency
> Yes, coordination between the main index searcher, the index writer,
> and the index reader needed to delete other documents.

Can you point me to any documentation/code that describes this implementation?

> That's weird... I've never seen that.
> The lucene write lock is only obtained when the IndexWriter is created.
> Can you post the relevant part of the log file where the exception
> happens?

After doing some more testing, I believe it was a stale lock file that was causing the lock issues yesterday - sorry for the false alarm :)

> Also, unless you have at least 6 CPU cores or so, you are unlikely to
> see greater throughput with 10 threads. If you add multiple documents
> per HTTP-POST (such that HTTP latency is minimized), the best setting
> would probably be nThreads == nCores. For a single doc per POST, more
> threads will serve to cover the latency and keep Solr busy.

I agree with your thinking here. My requirement for a large number of threads is somewhat of an artifact of my current system design: I'm trying not to serialize the system's processing at the point of indexing.
Re: Index Concurrency
Though, isn't there a recent patch in JIRA to allow multiple indices under a single Solr instance?

Otis
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/ - Tag - Search - Share

----- Original Message ----
From: Yonik Seeley <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Wednesday, May 9, 2007 6:32:33 PM
Subject: Re: Index Concurrency

On 5/9/07, joestelmach <[EMAIL PROTECTED]> wrote:
> My first intuition is to give each user their own index. My thinking here is
> that querying would be faster (since each user's index would be much smaller
> than one big index), and, more importantly, that I would dodge any
> concurrency issues stemming from multiple threads trying to update the same
> index simultaneously. I realize that Lucene implements a locking mechanism
> to protect against concurrent access, but I seem to hit the lock access
> timeout quite easily with only a couple threads.
>
> After looking at solr, I would really like to take advantage of the many
> features it adds to Lucene, but it doesn't look like I'll be able to achieve
> multiple indexes.

No, not currently. Start your implementation with just a single index... unless it is very large, it will likely be fast enough. Solr also handles all the concurrency issues, and you should never hit "lock access timeout" when updating from multiple threads.

-Yonik
Re: Index Concurrency
On 5/9/07, joestelmach <[EMAIL PROTECTED]> wrote:
> Does solr provide any additional concurrency control over what Lucene
> provides?

Yes, coordination between the main index searcher, the index writer, and the index reader needed to delete other documents.

> In my simple testing of indexing 2,000 messages, solr would issue lock
> access timeouts with as little as 10 threads.

That's weird... I've never seen that. The Lucene write lock is only obtained when the IndexWriter is created. Can you post the relevant part of the log file where the exception happens?

Also, unless you have at least 6 CPU cores or so, you are unlikely to see greater throughput with 10 threads. If you add multiple documents per HTTP-POST (such that HTTP latency is minimized), the best setting would probably be nThreads == nCores. For a single doc per POST, more threads will serve to cover the latency and keep Solr busy.

-Yonik
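The sizing advice above (nThreads == nCores once each POST carries a batch of documents) can be sketched like this; the batch size of 100 is an arbitrary example, not a recommendation from this thread:

```java
import java.util.ArrayList;
import java.util.List;

public class BatchSizing {
    // Split documents into multi-doc batches so each HTTP POST carries
    // many documents and per-request latency is amortized.
    public static <T> List<List<T>> batches(List<T> docs, int batchSize) {
        List<List<T>> out = new ArrayList<>();
        for (int i = 0; i < docs.size(); i += batchSize) {
            out.add(new ArrayList<>(docs.subList(i, Math.min(i + batchSize, docs.size()))));
        }
        return out;
    }

    // With multi-doc batches, size the indexing pool to the CPU count,
    // per Yonik's nThreads == nCores rule of thumb.
    public static int indexingThreads() {
        return Runtime.getRuntime().availableProcessors();
    }
}
```

With one doc per POST, you would instead oversize the pool somewhat so threads cover the HTTP latency.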
Re: Index Concurrency
Yonik,

Thanks for your fast reply.

> No, not currently. Start your implementation with just a single
> index... unless it is very large, it will likely be fast enough.

My index will get quite large.

> Solr also handles all the concurrency issues, and you should never hit
> "lock access timeout" when updating from multiple threads.

Does Solr provide any additional concurrency control over what Lucene provides? In my simple testing of indexing 2,000 messages, Solr would issue lock access timeouts with as few as 10 threads. Running all 2,000 messages through sequentially yields no problems at all. In fact, I'm able to churn through over 100,000 messages when no threads are involved. Am I missing some concurrency settings?

Thanks,
Joe
Re: Index Concurrency
On 5/9/07, joestelmach <[EMAIL PROTECTED]> wrote:
> My first intuition is to give each user their own index. My thinking here is
> that querying would be faster (since each user's index would be much smaller
> than one big index), and, more importantly, that I would dodge any
> concurrency issues stemming from multiple threads trying to update the same
> index simultaneously. I realize that Lucene implements a locking mechanism
> to protect against concurrent access, but I seem to hit the lock access
> timeout quite easily with only a couple threads.
>
> After looking at solr, I would really like to take advantage of the many
> features it adds to Lucene, but it doesn't look like I'll be able to achieve
> multiple indexes.

No, not currently. Start your implementation with just a single index... unless it is very large, it will likely be fast enough.

Solr also handles all the concurrency issues, and you should never hit "lock access timeout" when updating from multiple threads.

-Yonik
Index Concurrency
Hello,

I'm a bit new to search indexing, and I'm hoping some of you here can help me with an e-mail application I'm working on. I have a mail retrieval program that accesses multiple POP accounts in parallel and parses each message into a database. I would like to add a new document to a Solr index each time I process a message.

My first intuition is to give each user their own index. My thinking here is that querying would be faster (since each user's index would be much smaller than one big index), and, more importantly, that I would dodge any concurrency issues stemming from multiple threads trying to update the same index simultaneously. I realize that Lucene implements a locking mechanism to protect against concurrent access, but I seem to hit the lock access timeout quite easily with only a couple of threads.

After looking at Solr, I would really like to take advantage of the many features it adds to Lucene, but it doesn't look like I'll be able to achieve multiple indexes. Am I completely off in thinking that I need multiple indexes? Is there some best practice for this sort of thing that I haven't stumbled upon? Any advice would be greatly appreciated.

Thanks,
Joe