RE: Solr performance issue on indexing

2017-04-04 Thread Allison, Timothy B.
>  Also we will try to decouple tika to solr.
+1


-Original Message-
From: tstusr [mailto:ulfrhe...@gmail.com] 
Sent: Friday, March 31, 2017 4:31 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr performance issue on indexing

Hi, thanks for the feedback.

Yes, it is about OOM, indeed even solr instance makes unavailable. As I was 
saying I can't find more relevant information on logs.

We're are able to increment JVM amout, so, the first thing we'll do will be 
that.

As far as I know, all documents are bounded to that amount (14K), just the 
processing could change. We are making some tests on indexing and it seems it 
works without concurrent threads. Also we will try to decouple tika to solr.

By the way, make it available with solr cloud will improve performance? Or 
there will be no perceptible improvement?




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-performance-issue-on-indexing-tp4327886p4327914.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr performance issue on indexing

2017-03-31 Thread Erick Erickson
If, by chance, the docs you're sending get routed to different Solr
nodes then all the processing is in parallel. I don't know if there's
a good way to insure that the docs get sent to different replicas on
different Solr instances. You could try addressing specific Solr
replicas, something like "blah
blah/solr/collection1_shard1_replica1/export" but I'm not totally sure
that'll do what you want either.

 But that still doesn't decouple Tika from the Solr instances running
those replicas. So if Tika has a problem it has the potential to bring
the Solr node down.

Best,
Erick

On Fri, Mar 31, 2017 at 1:31 PM, tstusr <ulfrhe...@gmail.com> wrote:
> Hi, thanks for the feedback.
>
> Yes, it is about OOM, indeed even solr instance makes unavailable. As I was
> saying I can't find more relevant information on logs.
>
> We're are able to increment JVM amout, so, the first thing we'll do will be
> that.
>
> As far as I know, all documents are bounded to that amount (14K), just the
> processing could change. We are making some tests on indexing and it seems
> it works without concurrent threads. Also we will try to decouple tika to
> solr.
>
> By the way, make it available with solr cloud will improve performance? Or
> there will be no perceptible improvement?
>
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Solr-performance-issue-on-indexing-tp4327886p4327914.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr performance issue on indexing

2017-03-31 Thread tstusr
Hi, thanks for the feedback.

Yes, it is about OOM, indeed even solr instance makes unavailable. As I was
saying I can't find more relevant information on logs.

We're are able to increment JVM amout, so, the first thing we'll do will be
that.

As far as I know, all documents are bounded to that amount (14K), just the
processing could change. We are making some tests on indexing and it seems
it works without concurrent threads. Also we will try to decouple tika to
solr.

By the way, make it available with solr cloud will improve performance? Or
there will be no perceptible improvement?




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-performance-issue-on-indexing-tp4327886p4327914.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr performance issue on indexing

2017-03-31 Thread Erick Erickson
First, running multiple threads with PDF files to a Solr running 4G of
JVM is...ambitious. You say it crashes; how? OOMs?

Second while the extracting request handler is a fine way to get up
and running, any problems with Tika will affect Solr. Tika does a
great job of extraction, but there are so many variants of so many
file formats that this scenario isn' recommended for production.
Consider extracting the PDF on a client and sending the docs to Solr.
Tika can run as a server also so you aren't coupling Solr and Tika.

For a sample SolrJ program, see:
https://lucidworks.com/2012/02/14/indexing-with-solrj/

Best,
Erick

On Fri, Mar 31, 2017 at 10:44 AM, tstusr <ulfrhe...@gmail.com> wrote:
> Hi there.
>
> We are currently indexing some PDF files, the main handler to index is
> /extract where we perform simple processing (extract relevant fields and
> store on some fields).
>
> The PDF files are about 10M~100M size and we have to have available the text
> extracted. So, everything works correct on test stages, but when we try to
> index all the 14K files (around 120Gb) on a client application that only
> sends http curls through 3-4 concurrent threads to /extract handler it
> crashes. I can't find some relevant information about on solr logs (We
> checked in server/logs & in core_dir/tlog).
>
> My question is about performance. I think it is a small amount of info we
> are processing, the deploy scenario is in a docker container with 4gb of JVM
> Memory and ~50gb of physical memory (reported through dashboard) we are
> using a single instance.
>
> I don't think is a normal behaviour that handler crashes. So, what are some
> general tips about improving performance for this scenario?
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Solr-performance-issue-on-indexing-tp4327886.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Solr performance issue on indexing

2017-03-31 Thread tstusr
Hi there.

We are currently indexing some PDF files, the main handler to index is
/extract where we perform simple processing (extract relevant fields and
store on some fields). 

The PDF files are about 10M~100M size and we have to have available the text
extracted. So, everything works correct on test stages, but when we try to
index all the 14K files (around 120Gb) on a client application that only
sends http curls through 3-4 concurrent threads to /extract handler it
crashes. I can't find some relevant information about on solr logs (We
checked in server/logs & in core_dir/tlog).

My question is about performance. I think it is a small amount of info we
are processing, the deploy scenario is in a docker container with 4gb of JVM
Memory and ~50gb of physical memory (reported through dashboard) we are
using a single instance. 

I don't think is a normal behaviour that handler crashes. So, what are some
general tips about improving performance for this scenario?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-performance-issue-on-indexing-tp4327886.html
Sent from the Solr - User mailing list archive at Nabble.com.