RE: Solr performance issue on indexing
> Also we will try to decouple Tika from Solr.

+1

-----Original Message-----
From: tstusr [mailto:ulfrhe...@gmail.com]
Sent: Friday, March 31, 2017 4:31 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr performance issue on indexing

Hi, thanks for the feedback.

Yes, it is about OOM; in fact, the Solr instance itself becomes unavailable. As I was saying, I can't find any more relevant information in the logs.

We are able to increase the JVM memory, so that is the first thing we'll do.

As far as I know, the number of documents is bounded at that amount (14K); only the processing could change. We are running some indexing tests, and it seems to work without concurrent threads. We will also try to decouple Tika from Solr.

By the way, would running it on SolrCloud improve performance, or would there be no perceptible improvement?
Re: Solr performance issue on indexing
If, by chance, the docs you're sending get routed to different Solr nodes, then all the processing happens in parallel. I don't know if there's a good way to ensure that the docs get sent to different replicas on different Solr instances. You could try addressing specific Solr replicas, something like "blah blah/solr/collection1_shard1_replica1/export", but I'm not totally sure that'll do what you want either.

But that still doesn't decouple Tika from the Solr instances running those replicas. So if Tika has a problem, it has the potential to bring the Solr node down.

Best,
Erick

On Fri, Mar 31, 2017 at 1:31 PM, tstusr <ulfrhe...@gmail.com> wrote:
> Hi, thanks for the feedback.
>
> Yes, it is about OOM; in fact, the Solr instance itself becomes unavailable.
> As I was saying, I can't find any more relevant information in the logs.
>
> We are able to increase the JVM memory, so that is the first thing we'll do.
>
> As far as I know, the number of documents is bounded at that amount (14K);
> only the processing could change. We are running some indexing tests, and it
> seems to work without concurrent threads. We will also try to decouple Tika
> from Solr.
>
> By the way, would running it on SolrCloud improve performance, or would
> there be no perceptible improvement?
Re: Solr performance issue on indexing
Hi, thanks for the feedback.

Yes, it is about OOM; in fact, the Solr instance itself becomes unavailable. As I was saying, I can't find any more relevant information in the logs.

We are able to increase the JVM memory, so that is the first thing we'll do.

As far as I know, the number of documents is bounded at that amount (14K); only the processing could change. We are running some indexing tests, and it seems to work without concurrent threads. We will also try to decouple Tika from Solr.

By the way, would running it on SolrCloud improve performance, or would there be no perceptible improvement?

--
View this message in context: http://lucene.472066.n3.nabble.com/Solr-performance-issue-on-indexing-tp4327886p4327914.html
Sent from the Solr - User mailing list archive at Nabble.com.
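The observation above that indexing works without concurrent threads suggests making the client's concurrency an explicit, adjustable setting. The following is a minimal sketch under that assumption, not the poster's actual client: `index_all` and its `send` callback are hypothetical names, and `send` stands in for whatever function posts one file to Solr (or to Tika first).

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def index_all(paths, send, max_workers=1):
    """Call send(path) for every path and return (succeeded, failed).

    max_workers=1 reproduces the sequential run that was reported to work;
    raise it one step at a time while watching Solr's memory usage.
    `send` is a hypothetical callback that uploads a single file and raises
    an exception on failure.
    """
    succeeded, failed = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Only the paths are queued here; at most max_workers uploads run at once.
        futures = {pool.submit(send, path): path for path in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                future.result()
                succeeded.append(path)
            except Exception as error:
                # Record the failure and keep going instead of aborting the batch.
                failed.append((path, repr(error)))
    return succeeded, failed
```

Collecting failures per file instead of stopping at the first error also makes it easier to spot which PDFs trigger the problem.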
Re: Solr performance issue on indexing
First, running multiple threads with PDF files against a Solr running with 4G of JVM memory is... ambitious. You say it crashes; how? OOMs?

Second, while the extracting request handler is a fine way to get up and running, any problems with Tika will affect Solr. Tika does a great job of extraction, but there are so many variants of so many file formats that this scenario isn't recommended for production.

Consider extracting the PDF on a client and sending the docs to Solr. Tika can also run as a server, so you aren't coupling Solr and Tika.

For a sample SolrJ program, see: https://lucidworks.com/2012/02/14/indexing-with-solrj/

Best,
Erick

On Fri, Mar 31, 2017 at 10:44 AM, tstusr <ulfrhe...@gmail.com> wrote:
> Hi there.
>
> We are currently indexing some PDF files. The main handler for indexing is
> /extract, where we perform simple processing (extract the relevant fields
> and store them in some fields).
>
> The PDF files are about 10 MB to 100 MB in size, and we need to have the
> extracted text available. Everything works correctly in the test stages,
> but when we try to index all 14K files (around 120 GB) from a client
> application that only sends HTTP requests with curl through 3-4 concurrent
> threads to the /extract handler, it crashes. I can't find any relevant
> information about it in the Solr logs (we checked server/logs and
> core_dir/tlog).
>
> My question is about performance. I think we are processing a small amount
> of data. The deployment is a Docker container with 4 GB of JVM memory and
> ~50 GB of physical memory (as reported by the dashboard), and we are using
> a single instance.
>
> I don't think it is normal behaviour for the handler to crash. So, what are
> some general tips for improving performance in this scenario?
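As a rough illustration of the client-side extraction suggested above, here is a minimal Python sketch (the linked article uses SolrJ instead). It assumes a standalone Tika Server listening on its default port 9998 and a Solr collection named `pdfs`; the collection name, the field names `id` and `content_txt`, and all function names are hypothetical and would need to match your own setup and schema.

```python
import json
import urllib.request
from pathlib import Path

# Assumed addresses: adjust to your environment.
TIKA_URL = "http://localhost:9998/tika"
# "pdfs" is a hypothetical collection; commitWithin avoids a hard commit per file.
SOLR_UPDATE_URL = "http://localhost:8983/solr/pdfs/update?commitWithin=10000"


def extract_text(pdf_path, tika_url=TIKA_URL, timeout=300):
    """PUT the raw PDF bytes to Tika Server and return the extracted plain text."""
    request = urllib.request.Request(
        tika_url,
        data=Path(pdf_path).read_bytes(),
        headers={"Accept": "text/plain", "Content-Type": "application/pdf"},
        method="PUT",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def build_solr_doc(pdf_path, text):
    """Build the JSON document for Solr; the field names are hypothetical."""
    return {"id": Path(pdf_path).name, "content_txt": text.strip()}


def send_to_solr(docs, solr_url=SOLR_UPDATE_URL, timeout=60):
    """POST a list of documents to Solr's JSON update handler."""
    request = urllib.request.Request(
        solr_url,
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


def index_pdf(pdf_path):
    """Extract outside Solr, then index only the resulting text."""
    text = extract_text(pdf_path)
    return send_to_solr([build_solr_doc(pdf_path, text)])
```

With this split, a PDF that makes Tika run out of memory takes down the Tika process rather than the Solr node, and Solr only ever receives plain text.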
Solr performance issue on indexing
Hi there.

We are currently indexing some PDF files. The main handler for indexing is /extract, where we perform simple processing (extract the relevant fields and store them in some fields).

The PDF files are about 10 MB to 100 MB in size, and we need to have the extracted text available. Everything works correctly in the test stages, but when we try to index all 14K files (around 120 GB) from a client application that only sends HTTP requests with curl through 3-4 concurrent threads to the /extract handler, it crashes. I can't find any relevant information about it in the Solr logs (we checked server/logs and core_dir/tlog).

My question is about performance. I think we are processing a small amount of data. The deployment is a Docker container with 4 GB of JVM memory and ~50 GB of physical memory (as reported by the dashboard), and we are using a single instance.

I don't think it is normal behaviour for the handler to crash. So, what are some general tips for improving performance in this scenario?

--
View this message in context: http://lucene.472066.n3.nabble.com/Solr-performance-issue-on-indexing-tp4327886.html
Sent from the Solr - User mailing list archive at Nabble.com.