Re: Solr 7.3 debug/explain with boost applied

2018-04-25 Thread Ryan Yacyshyn
Typically you would use a function query there to manipulate the score
rather than a constant of 2, which does nothing more than multiply all
scores by that value. You could do something like boost=sqrt(popularity) if
you wanted to boost on the popularity field, for example. In both cases,
however, the explain is the same in that it doesn't show the full
explain when a boost function (or constant) is applied.
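
For illustration, here's a minimal sketch of the kind of request I mean,
against the techproducts example (host assumed; the techproducts sample data
includes a popularity field):

```
# Hypothetical sketch only: edismax over the name field, multiplying each
# document's score by sqrt(popularity), with debug output enabled.
curl "http://localhost:8983/solr/techproducts/select" \
  --data-urlencode "q=samsung" \
  --data-urlencode "defType=edismax" \
  --data-urlencode "qf=name" \
  --data-urlencode "boost=sqrt(popularity)" \
  --data-urlencode "debug=true"
```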

Wondering if I'm missing something here?

Ryan







On Wed, 25 Apr 2018 at 04:21 Nawab Zada Asad Iqbal  wrote:

> I didn't know you can add boosts like that (boost=2). Are you boosting on
> a field or document by using that syntax?
>
> On Sun, Apr 22, 2018 at 10:51 PM, Ryan Yacyshyn 
> wrote:
>
> > Hi all,
> >
> > When viewing the explain under debug=true in Solr 7.3.0 using
> > the edismax query parser with a boost, I only see the "boost" part of the
> > explain. Without applying a boost I see the full explain. Is this the
> > expected behaviour?
> >
> > Here's how to check using the techproducts example..
> >
> > bin/solr -e techproducts
> >
> > ```
> > http://localhost:8983/solr/techproducts/select?q={!edismax}samsung&qf=name&debug=true
> > ```
> >
> > returns:
> >
> > ```
> > "debug": {
> > "rawquerystring": "{!edismax}samsung",
> > "querystring": "{!edismax}samsung",
> > "parsedquery": "+DisjunctionMaxQuery((name:samsung))",
> > "parsedquery_toString": "+(name:samsung)",
> > "explain": {
> >   "SP2514N": "\n2.3669035 = weight(name:samsung in 1) [SchemaSimilarity], result of:\n  2.3669035 = score(doc=1,freq=1.0 = termFreq=1.0\n), product of:\n2.6855774 = idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:\n  1.0 = docFreq\n 21.0 = docCount\n0.8813388 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:\n 1.0 = termFreq=1.0\n  1.2 = parameter k1\n  0.75 = parameter b\n 7.5238094 = avgFieldLength\n  10.0 = fieldLength\n"
> > },
> > "QParser": "ExtendedDismaxQParser",
> > ...
> > ```
> >
> > If I just add boost=2 to this, I get this explain back:
> >
> > ```
> > "debug": {
> > "rawquerystring": "{!edismax}samsung",
> > "querystring": "{!edismax}samsung",
> > "parsedquery":
> "FunctionScoreQuery(FunctionScoreQuery(+(name:samsung),
> > scored by boost(const(2",
> > "parsedquery_toString": "FunctionScoreQuery(+(name:samsung), scored
> by
> > boost(const(2)))",
> > "explain": {
> >   "SP2514N": "\n4.733807 = product of:\n  1.0 = boost\n  4.733807 =
> > boost(const(2))\n"
> > },
> > "QParser": "ExtendedDismaxQParser",
> > ...
> > ```
> >
> > Is this normal? I was expecting to see more like the first example, with
> > the addition of the boost applied.
> >
> > Thanks,
> > Ryan
> >
>


Re: Modify data-conf.xml on the runtime

2018-04-25 Thread Shawn Heisey

On 4/25/2018 4:12 AM, rameshkjes wrote:

Actually I am trying to approach this problem in another way.
I am taking user input from a GUI, which is the directory of the dataset, and
saving that path in a properties file. Since I am using Maven, I am able to
access that path in my pom file using the properties tag. So, is it now
possible to use that properties variable from pom.xml in data-conf.xml? Do you
think that is the right way?


Maven is used to build a program from source.  Maven is not involved 
when running Solr unless you're doing something non-standard.  It's 
difficult for this list to support a non-standard setup.


What are you trying to do?  Perhaps there is a way to do it that doesn't 
involve using unexpected tools.
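
For reference, one Solr-native way to pass a runtime value such as a directory
path into data-config.xml, without involving Maven, is a DIH request parameter
referenced as ${dataimporter.request.paramName}. A rough sketch, with entity
and parameter names assumed:

```
# data-config.xml can reference a request parameter, for example:
#   <dataSource type="FileDataSource"/>
#   <entity name="files" processor="FileListEntityProcessor"
#           baseDir="${dataimporter.request.dataDir}" fileName=".*" recursive="true">
#     ...
#   </entity>
# The value is then supplied when the import is triggered:
curl "http://localhost:8983/solr/mycollection/dataimport" \
  --data-urlencode "command=full-import" \
  --data-urlencode "dataDir=/path/to/user/selected/dataset"
```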


If you must use Maven, you may be on your own in trying to get it to 
work with Solr.  I have never heard of such an integration before except 
at build time.  The Maven build for Lucene/Solr is not the standard 
build; the standard build uses Ant.


Thanks,
Shawn



Re: solr cell: write entire file content binary to index along with metadata

2018-04-25 Thread Rahul Singh
Lucene (the major underlying technology in Solr) can handle any data, but it's 
optimized to be an index, not a file store. It's better to put that content in 
another DB or file system like Cassandra, S3, etc., rather than in Solr.

In our experience, leveraging the Tika binary / microservice as a pre-index 
process can improve the overall stability of the Solr service.
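
For illustration, a rough sketch of that pre-index flow using the standalone
Tika server (ports, field names, and the use of jq are assumptions here, not a
prescribed setup):

```
# Run Tika as its own service, outside Solr:
#   java -jar tika-server.jar      # listens on port 9998 by default

# 1. Extract plain text from the file with Tika:
curl -s -T report.pdf -H "Accept: text/plain" http://localhost:9998/tika > report.txt

# 2. Index the extracted text plus a pointer to the original file into Solr:
jq -Rs '{id: "report-1", path_s: "/archive/report.pdf", content_txt: .}' report.txt |
  curl "http://localhost:8983/solr/mycollection/update/json/docs?commitWithin=10000" \
       -H 'Content-Type: application/json' --data-binary @-
```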


--
Rahul Singh
rahul.si...@anant.us

Anant Corporation

On Apr 25, 2018, 12:49 PM -0400, Shawn Heisey , wrote:
> On 4/25/2018 4:02 AM, Lee Carroll wrote:
> > *We don't recommend using solr-cell for production indexing.*
> >
> > Ok. Are the reasons for:
> >
> > Performance. I think we have rather modest index requirement (1000 a day...
> > on a busy day)
> >
> > Security. The index workflow is, upload files to public facing server with
> > auth. Files written to disk, scanned and copied to internal server and
> > ingested into index via here.
> >
> > other reasons we should worry about ?
>
> Tika is the underlying technology in solr-cell.  Tika is a separate
> Apache product designed for parsing common rich-text formats, like
> Microsoft, PDF, etc.
>
> http://tika.apache.org/
>
> The problems that can result are related to running Tika inside of Solr,
> which is what solr-cell does.
>
> The Tika authors try very hard to make sure that Tika doesn't misbehave,
> but the very nature of what Tika does means it is somewhat prone to
> misbehaving.  Many of the file formats that Tika processes are
> undocumented, or any documentation that is available is not available to
> open source developers.  Also, sometimes documents in those formats will
> be constructed in a way that the Tika authors have never seen before, or
> they may completely violate what conventions the authors DO know about.
>
> Long story short -- Tika can encounter documents that can cause it to
> crash, or to consume all the memory in the system, or misbehave in other
> ways.  If Tika is running inside Solr, then when it has a problem, Solr
> itself can blow up and have a problem too.
>
> For this reason, and because Tika can sometimes use a lot of resources
> even when it is working correctly, we recommend running it outside of
> Solr in another program that takes its output and sends it to Solr.
> Ideally, it will be running on a completely different machine than Solr
> is running on.
>
> Thanks,
> Shawn
>


Re: CDCR broken for Mixed Replica Collections

2018-04-25 Thread Amrit Sarkar
Pardon, * I have added extensive tests for both the use-cases.

Amrit Sarkar
Search Engineer
Lucidworks, Inc.
415-589-9269
www.lucidworks.com
Twitter http://twitter.com/lucidworks
LinkedIn: https://www.linkedin.com/in/sarkaramrit2
Medium: https://medium.com/@sarkaramrit2

On Thu, Apr 26, 2018 at 3:50 AM, Amrit Sarkar 
wrote:

> Webster,
>
> I have patches uploaded for both CDCR supporting TLOG replicas:
> https://issues.apache.org/jira/browse/SOLR-12057 and for cores failing to
> initialize for PULL-type replicas:
> https://issues.apache.org/jira/browse/SOLR-12071
> and I am awaiting feedback from the open source community. The solution for
> PULL-type replicas can be designed better; apart from that, if this is an
> urgent need for you, please apply the patches to your packages and give them
> a shot. I have added extensive tests for both use-cases.
>
> Amrit Sarkar
> Search Engineer
> Lucidworks, Inc.
> 415-589-9269
> www.lucidworks.com
> Twitter http://twitter.com/lucidworks
> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
> Medium: https://medium.com/@sarkaramrit2
>
> On Thu, Apr 26, 2018 at 2:46 AM, Erick Erickson 
> wrote:
>
>> CDCR won't really ever make sense for PULL replicas since the PULL
>> replicas have no tlog and don't do any indexing and can't ever become
>> a leader seamlessly.
>>
>> As for plans to address TLOG replicas, patches are welcome if you have
>> a need. That's really how open source works, people add functionality
>> as they have use-cases they need to support and contribute them back.
>> So far this isn't a high-demand topic.
>>
>> Best,
>> Erick
>>
>> On Wed, Apr 25, 2018 at 8:03 AM, Webster Homer 
>> wrote:
>> > I was looking at SOLR-12057
>> >
>> > According to the comment on the ticket, CDCR can not work when a
>> collection
>> > has PULL Replicas. That seems like a MAJOR limitation to CDCR and PULL
>> > Replicas. Is this likely to be addressed in the future?
>> > CDCR currently is broken for TLOG replicas too.
>> >
>> > https://issues.apache.org/jira/browse/SOLR-12057?focusedCommentId=16391558&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16391558
>> >
>> > Thanks
>> >
>>
>
>


Re: CDCR broken for Mixed Replica Collections

2018-04-25 Thread Amrit Sarkar
Webster,

I have patches uploaded for both CDCR supporting TLOG replicas:
https://issues.apache.org/jira/browse/SOLR-12057 and for cores failing to
initialize for PULL-type replicas:
https://issues.apache.org/jira/browse/SOLR-12071 and I am awaiting feedback
from the open source community. The solution for PULL-type replicas can be
designed better; apart from that, if this is an urgent need for you, please
apply the patches to your packages and give them a shot. I have added
extensive tests for both use-cases.

Amrit Sarkar
Search Engineer
Lucidworks, Inc.
415-589-9269
www.lucidworks.com
Twitter http://twitter.com/lucidworks
LinkedIn: https://www.linkedin.com/in/sarkaramrit2
Medium: https://medium.com/@sarkaramrit2

On Thu, Apr 26, 2018 at 2:46 AM, Erick Erickson 
wrote:

> CDCR won't really ever make sense for PULL replicas since the PULL
> replicas have no tlog and don't do any indexing and can't ever become
> a leader seamlessly.
>
> As for plans to address TLOG replicas, patches are welcome if you have
> a need. That's really how open source works, people add functionality
> as they have use-cases they need to support and contribute them back.
> So far this isn't a high-demand topic.
>
> Best,
> Erick
>
> On Wed, Apr 25, 2018 at 8:03 AM, Webster Homer 
> wrote:
> > I was looking at SOLR-12057
> >
> > According to the comment on the ticket, CDCR can not work when a
> collection
> > has PULL Replicas. That seems like a MAJOR limitation to CDCR and PULL
> > Replicas. Is this likely to be addressed in the future?
> > CDCR currently is broken for TLOG replicas too.
> >
> > https://issues.apache.org/jira/browse/SOLR-12057?focusedCommentId=16391558&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16391558
> >
> > Thanks
> >
>


Re: CDCR broken for Mixed Replica Collections

2018-04-25 Thread Erick Erickson
CDCR won't really ever make sense for PULL replicas since the PULL
replicas have no tlog and don't do any indexing and can't ever become
a leader seamlessly.

As for plans to address TLOG replicas, patches are welcome if you have
a need. That's really how open source works, people add functionality
as they have use-cases they need to support and contribute them back.
So far this isn't a high-demand topic.

Best,
Erick

On Wed, Apr 25, 2018 at 8:03 AM, Webster Homer  wrote:
> I was looking at SOLR-12057
>
> According to the comment on the ticket, CDCR can not work when a collection
> has PULL Replicas. That seems like a MAJOR limitation to CDCR and PULL
> Replicas. Is this likely to be addressed in the future?
> CDCR currently is broken for TLOG replicas too.
>
> https://issues.apache.org/jira/browse/SOLR-12057?focusedCommentId=16391558&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16391558
>
> Thanks
>


Re: Preventing solr cache flush when committing

2018-04-25 Thread Erick Erickson
Had this typed up yesterday and forgot to send.

"Is there no way to ensure that the top level filter caches are not
expunged when some documents are added to the index and have the
changes available at the same time?"

No. And it's not something that you can do without major architectural
changes. When you commit, background merging kicks in, which can renumber
the _internal_ Lucene document IDs. These IDs range from 0 to maxDoc and are
used as the bits to set in the filterCache entries. So if you
preserved the filterCache, the bits would be wrong. The
queryResultCache is


"If that is the case, then do I need to always have to rely on warmup
of caches to get some documents in caches?"

Yes, that's exactly what the "autowarm" feature is on the caches. Also
the newSearcher event can be used to hand-craft warmup searches where
you know certain things about the index and you specifically want to
ensure certain warming.

Please start out with modest numbers for autowarm, as in 20-30. It's
very often the case that you don't need much more than that. What
those numbers do in filterCache and queryResultCache is re-execute the
associated fq or q clause, respectively.
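
For illustration, a sketch of setting modest autowarm counts through the Config
API (collection name and exact values are assumptions; the same settings can
live directly in solrconfig.xml):

```
# Hedged sketch: small autowarm counts on the two caches discussed above.
curl "http://localhost:8983/solr/mycollection/config" \
  -H 'Content-Type: application/json' -d '{
    "set-property": {
      "query.filterCache.autowarmCount": 20,
      "query.queryResultCache.autowarmCount": 20
    }
  }'
```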

"Are there any other approaches then warmup which folks usually do to
avoid this; if they want to build a fast searchable product and having
some write throughput as well?" and " I can't afford to get my cached
flushed".

What evidence do you have for this last statement?

"Currently I do commits via my indexing application (after every batch
of documents)"

Please, please, please do _not_ do this. It's especially egregious
because you do it after every batch of docs. So rather than flushing
your caches every 5 minutes (say), you hammer Solr with commit after
commit after commit. Configure your soft commit interval to your
latency requirements and forget about it. Or just configure hard
commit with openSearcher set to true. Or perhaps even just specify
commitWithin when you send docs to Solr. At a guess you may have seen
warnings about "too many on deck searchers" if your commit interval is
shorter than your autowarm time.
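
For illustration, a sketch of the two alternatives just mentioned (collection
name, field names, and intervals assumed):

```
# Let Solr manage visibility: soft commit every 5 minutes via the Config API
# (the equivalent of autoSoftCommit/maxTime in solrconfig.xml).
curl "http://localhost:8983/solr/mycollection/config" \
  -H 'Content-Type: application/json' \
  -d '{"set-property": {"updateHandler.autoSoftCommit.maxTime": 300000}}'

# Or have the indexing client request visibility within a bound, instead of
# issuing explicit commits:
curl "http://localhost:8983/solr/mycollection/update?commitWithin=300000" \
  -H 'Content-Type: application/json' \
  -d '[{"id": "doc-1", "name": "example"}]'
```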

I'll bend a little bit if the client only issues a commit at the very
end of the run and there's precisely one client running at a time and
you can _guarantee_ there's only one commit, but it's usually much
easier and more reliable to use the solr config settings.

Perhaps you're not entirely familiar with how openSearcher works, so
here's a brief review. This applies to either hard commit
(openSearcher=true) or soft commit.
1> a commit happens
2> a new searcher is being opened and autowarming kicks off
3> incoming searches are served by the _old_ searcher, using all the
_old_ caches.
4> autowarming completes
5a> incoming requests are routed to the new searcher
5b> the old searcher finishes serving the outstanding requests
received before <4> and closes
6> the old caches are flushed.

So having high read throughput

On Tue, Apr 24, 2018 at 10:36 AM, Lee Carroll
 wrote:
> From memory, try the following:
> Don't manually commit from the client after batch indexing.
> Set the soft commit to a long interval, as long as is acceptable to run
> stale, say 5 mins or longer if you can.
> Set the hard commit to be short (seconds) to keep everything neat and tidy
> with regard to updates and to avoid transaction logs backing up.
> Set openSearcher=false on the hard commit.
>
> I'm pretty sure that works for at least one of our indices. It's worth a go.
>
> Lee C
>
> On 24 April 2018 at 06:56, Papa Pappu  wrote:
>
>> Hi,
>> I've written up my question on Stack Overflow. Here is the link:
>> https://stackoverflow.com/questions/49993681/preventing-solr-cache-flush-when-commiting
>>
>> In short, I am facing troubles maintaining my solr caches when commits
>> happen and the question provides detailed description of the same.
>>
>> Based on my use-case if someone can recommend what settings I should use or
>> practices I should follow it'll be really helpful.
>>
>> Thanks and regards,
>> Dmitri
>>


System collection - lazy loading mechanism not working for custom UpdateProcessors?

2018-04-25 Thread Johannes Brucher
Hi all,

I'm facing an issue with custom code inside the .system collection when 
starting up a SolrCloud cluster.
I thought, as stated in the documentation, that when using the .system 
collection custom code is lazily loaded, because it can happen that a collection 
that uses custom code is initialized before the .system collection is up and 
running.

I did all the necessary configuration, and while debugging I can see that the 
custom code is wrapped via a PluginBag$LazyPluginHolder. So far that seems good, 
but I still get exceptions when starting the SolrCloud cluster with the following 
errors:

SolrException: Blob loading failed: .no active replica available for .system 
collection...

In my case I'm using custom code for a couple of UpdateProcessors. So it seems 
that this lazy mechanism is not working well for UpdateProcessors.
Inside the class LazyPluginHolder the comment says:

"A class that loads plugins Lazily. When the get() method is invoked the Plugin 
is initialized and returned."

When a core is initialized and you have a custom UpdateProcessor, the 
get method is invoked directly and the lazy loading mechanism tries to get the 
custom class from the MemClassLoader, but in most scenarios the .system 
collection is not up yet and the above exception is thrown...
So maybe, for UpdateProcessors, the routine used while initializing a core is 
not implemented optimally for the lazy loading mechanism?
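
(For reference, a sketch of the kind of configuration in play here, following
the .system blob store / runtimeLib approach; the jar, blob, and collection
names below are assumptions, not the actual setup:)

```
# Upload the custom jar to the .system collection's blob store:
curl -X POST -H 'Content-Type: application/octet-stream' \
     --data-binary @my-update-processors.jar \
     "http://localhost:8983/solr/.system/blob/my-update-processors"

# Reference the blob as a runtime lib in the target collection:
curl "http://localhost:8983/solr/mycollection/config" \
  -H 'Content-Type: application/json' \
  -d '{"add-runtimelib": {"name": "my-update-processors", "version": 1}}'

# The update processor factory itself is then declared with runtimeLib="true",
# e.g. in the update request processor chain:
#   <processor class="com.example.MyUpdateProcessorFactory" runtimeLib="true" version="1"/>
```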

Please let me know if it would help for me to share my configuration!

Many thanks,

Johannes




Re: Modify data-conf.xml on the runtime

2018-04-25 Thread rameshkjes
Actually I am trying to approach this problem in another way.
I am taking user input from a GUI, which is the directory of the dataset, and
saving that path in a properties file. Since I am using Maven, I am able to
access that path in my pom file using the properties tag. So, is it now
possible to use that properties variable from pom.xml in data-conf.xml? Do you
think that is the right way?



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: solr cell: write entire file content binary to index along with metadata

2018-04-25 Thread Shawn Heisey

On 4/25/2018 4:02 AM, Lee Carroll wrote:

*We don't recommend using solr-cell for production indexing.*

Ok. Are the reasons for:

Performance. I think we have rather modest index requirement (1000 a day...
on a busy day)

Security. The index workflow is, upload files to public facing server with
auth. Files written to disk, scanned and copied to internal server and
ingested into index via here.

  other reasons we should worry about ?


Tika is the underlying technology in solr-cell.  Tika is a separate 
Apache product designed for parsing common rich-text formats, like 
Microsoft, PDF, etc.


http://tika.apache.org/

The problems that can result are related to running Tika inside of Solr, 
which is what solr-cell does.


The Tika authors try very hard to make sure that Tika doesn't misbehave, 
but the very nature of what Tika does means it is somewhat prone to 
misbehaving.  Many of the file formats that Tika processes are 
undocumented, or any documentation that is available is not available to 
open source developers.  Also, sometimes documents in those formats will 
be constructed in a way that the Tika authors have never seen before, or 
they may completely violate what conventions the authors DO know about.


Long story short -- Tika can encounter documents that can cause it to 
crash, or to consume all the memory in the system, or misbehave in other 
ways.  If Tika is running inside Solr, then when it has a problem, Solr 
itself can blow up and have a problem too.


For this reason, and because Tika can sometimes use a lot of resources 
even when it is working correctly, we recommend running it outside of 
Solr in another program that takes its output and sends it to Solr.  
Ideally, it will be running on a completely different machine than Solr 
is running on.


Thanks,
Shawn



How does the stopwords file work?

2018-04-25 Thread lina Zhang
Hello,

I am trying to create a domain-specific search engine. As most of the collected 
information contains ‘NDIS’, I added 'NDIS' to stopwords_en.txt. The field type 
is text_en, so it should use StopFilterFactory based on the schema. If I search 
for ‘ipad’, I get the results I need. When I try to search for ‘NDIS ipad’, 
I get no results. The final query string is as below:
   "parsedquery_toString":"Synonym(_text_:ipad _text_:ipads)"
If I search for ‘ipad NDIS’, I also get what I need. But the final query string 
is this:
   "parsedquery_toString":"Synonym(Question:ipad Question:ipad) _text_:ndis"


Could someone explain the differences between the three queries? Will the first 
word being a stop word cause a problem?
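
(For reference, one way to verify how the stop filter treats the term is the
field analysis handler, sketched here with an assumed host and collection name:)

```
# Shows each analysis stage for the given text against the text_en field type.
# If StopFilterFactory is picking up the edited stopwords_en.txt, 'ndis' should
# be dropped at that stage while 'ipad' survives.
curl "http://localhost:8983/solr/mycollection/analysis/field" \
  --data-urlencode "analysis.fieldtype=text_en" \
  --data-urlencode "analysis.fieldvalue=NDIS ipad"
```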

Thanks,
Lina



CDCR broken for Mixed Replica Collections

2018-04-25 Thread Webster Homer
I was looking at SOLR-12057

According to the comment on the ticket, CDCR can not work when a collection
has PULL Replicas. That seems like a MAJOR limitation to CDCR and PULL
Replicas. Is this likely to be addressed in the future?
CDCR currently is broken for TLOG replicas too.

https://issues.apache.org/jira/browse/SOLR-12057?focusedCommentId=16391558&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16391558

Thanks



Re: SolrCloud DIH (Data Import Handler) MySQL 404

2018-04-25 Thread Mikhail Khludnev
Can you share more log lines around this odd NPE?
It might be necessary to restart the JVM with -verbose:class and look through
its output to find out why it can't load this class.
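
For example, a sketch of how that could be done with the standard start script
(port and SolrCloud flags assumed):

```
# Restart the node with class-loading tracing enabled; the output normally ends
# up in the console log (e.g. solr-8983-console.log) under the Solr logs dir.
bin/solr stop -p 8983
bin/solr start -c -p 8983 -a "-verbose:class"
```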

On Wed, Apr 25, 2018 at 11:42 AM, msaunier  wrote:

> Hello Shawn,
>
> I have installed SolrCloud 7.3 on another server and the problem does not appear.
> Should I create a Jira ticket?
>
> But I have another problem:
>
> Full Import 
> failed:org.apache.solr.handler.dataimport.DataImportHandlerException:
> Unable to PropertyWriter implementation:ZKPropertiesWriter
> at org.apache.solr.handler.dataimport.DataImporter.
> createPropertyWriter(DataImporter.java:330)
> at org.apache.solr.handler.dataimport.DataImporter.
> doFullImport(DataImporter.java:411)
> at org.apache.solr.handler.dataimport.DataImporter.
> runCmd(DataImporter.java:474)
> at org.apache.solr.handler.dataimport.DataImporter.
> lambda$runAsync$0(DataImporter.java:457)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.NullPointerException
> at org.apache.solr.handler.dataimport.DocBuilder.
> loadClass(DocBuilder.java:935)
> at org.apache.solr.handler.dataimport.DataImporter.
> createPropertyWriter(DataImporter.java:326)
> ... 4 more
>
> I am looking into how to solve the problem.
>
> Regards,
>
>
>
>
>
> -Message d'origine-
> De : Shawn Heisey [mailto:elyog...@elyograg.org]
> Envoyé : mardi 24 avril 2018 17:39
> À : solr-user@lucene.apache.org
> Objet : Re: SolrCloud DIH (Data Import Handler) MySQL 404
>
> On 4/24/2018 2:03 AM, msaunier wrote:
> > If I access to the interface, I have a null pointer exception:
> >
> > null:java.lang.NullPointerException
> >   at
> > org.apache.solr.handler.RequestHandlerBase.getVersion(RequestHandlerBa
> > se.java:233)
>
> The line of code where this exception occurred uses fundamental Java
> methods. Based on the error, either the getClass method common to all java
> objects, or the getPackage method on the class, is returning null. That
> shouldn't be possible.  This has me wondering whether there is something
> broken in your particular Solr installation -- corrupt jars, or something
> like that.  Or maybe something broken in your Java.
>
> Thanks,
> Shawn
>
>
>


-- 
Sincerely yours
Mikhail Khludnev


System collection - lazy loading mechanism not working for custom UpdateProcessors

2018-04-25 Thread Johannes Brucher
Hi all,

I'm facing an issue with custom code inside the .system collection when 
starting up a SolrCloud cluster.
I thought, as stated in the documentation, that when using the .system 
collection custom code is lazily loaded, because it can happen that a collection 
that uses custom code is initialized before the .system collection is up and 
running.

I did all the necessary configuration, and while debugging I can see that the 
custom code is wrapped via a PluginBag$LazyPluginHolder. So far that seems good, 
but I still get exceptions when starting the SolrCloud cluster with the following 
errors:

SolrException: Blob loading failed: .no active replica available for .system 
collection...

In my case I'm using custom code for a couple of UpdateProcessors. So it seems 
that this lazy mechanism is not working well for UpdateProcessors.
Inside the class LazyPluginHolder the comment says:

"A class that loads plugins Lazily. When the get() method is invoked the Plugin 
is initialized and returned."

When a core is initialized and you have a custom UpdateProcessor, the 
get method is invoked directly and the lazy loading mechanism tries to get the 
custom class from the MemClassLoader, but in most scenarios the .system 
collection is not up yet and the above exception is thrown...
So maybe, for UpdateProcessors, the routine used while initializing a core is 
not implemented optimally for the lazy loading mechanism?

Please let me know if it would help for me to share my configuration!

Many thanks,

Johannes




Re: SolrCloud cluster does not accept new documents for indexing

2018-04-25 Thread Denis Demichev
Shawn, Mikhail, Chris,

Thank you all for your feedback.
Unfortunately I cannot try your recommendations right away - this week is
busy.
Will post my results here next week.

Regards,
Denis


On Tue, Apr 24, 2018 at 11:33 AM Shawn Heisey  wrote:

> On 4/24/2018 6:30 AM, Chris Ulicny wrote:
> > I haven't worked with AWS, but recently we tried to move some of our solr
> > instances to a cloud in Google's Cloud offering, and it did not go well.
> > All of our problems ended up stemming from the fact that the I/O is
> > throttled. Any complicated enough query would require too many disk reads
> > to return the results in a reasonable time when being throttled. SSDs
> were
> > better but not a practical cost and not as performant as our own bare
> metal.
>
> If there's enough memory installed beyond what is required for the Solr
> heap, then Solr will rarely need to actually read the disk to satisfy a
> query.  That is the secret to stellar performance.  If switching to
> faster disks made a big difference in query performance, adding memory
> would yield an even greater improvement.
>
> https://wiki.apache.org/solr/SolrPerformanceProblems#RAM
>
> > When we were doing the initial indexing, the indexing processes would get
> > to a point where the updates were taking minutes to complete and the
> cause
> > was throttled write ops.
>
> Indexing speed is indeed affected by disk speed, and adding memory can't
> fix that particular problem.  Using a storage controller with a large
> amount of battery-backed cache memory can improve it.
>
> > -- set the max threads and max concurrent merges of the mergeScheduler to
> > be 1 (or very low). This prevented excessive IO during indexing.
>
> The max threads should be at 1 in the merge scheduler, but the max
> merges should actually be *increased*.  I use a value of 6 for that.
> With SSD disks, the max threads can be increased, but I wouldn't push it
> very high.
>
> Thanks,
> Shawn
>
>
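
(For reference, a sketch of the indexConfig block the merge-scheduler advice
above refers to, with assumed values, as it would appear in solrconfig.xml:)

```
# Sketch only: print the example snippet; the values follow the quoted advice.
cat <<'EOF'
<indexConfig>
  <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
    <int name="maxMergeCount">6</int>
    <int name="maxThreadCount">1</int>
  </mergeScheduler>
</indexConfig>
EOF
```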


Re: solr cell: write entire file content binary to index along with metadata

2018-04-25 Thread Lee Carroll
>
>
>
>
> *That's not usually the kind of information you want to have in a
> Solrindex.  Most of the time, there will be an entry in the Solr index
> thattells the system making queries how to locate the actual data --
> afilename, a URL, a database lookup key, etc.*


 Agreed. The app will have a few implementations for storing the binary
file. The easiest for a user to configure for prototyping would be a
store-in-index implementation. A live implementation would probably use the
file system.

   *We don't recommend using solr-cell for production indexing.*


Ok. Are the reasons for:

Performance. I think we have rather modest index requirement (1000 a day...
on a busy day)

Security. The index workflow is, upload files to public facing server with
auth. Files written to disk, scanned and copied to internal server and
ingested into index via here.

 other reasons we should worry about ?

Cheers Lee C

On 25 April 2018 at 00:37, Shawn Heisey  wrote:

> On 4/24/2018 10:26 AM, Lee Carroll wrote:
> > Does the solr cell contrib give access to the files raw content  along
> with
> > the extracted metadata?\
>
> That's not usually the kind of information you want to have in a Solr
> index.  Most of the time, there will be an entry in the Solr index that
> tells the system making queries how to locate the actual data -- a
> filename, a URL, a database lookup key, etc.
>
> I have no idea whether solr-cell can put the info in the index.  My best
> guess would be that it can't, since putting the entire binary content
> into the index isn't recommended.
>
> We don't recommend using solr-cell for production indexing.  If you
> follow recommendations and write your own indexing program using Tika,
> then you can do pretty much anything you want, including writing the
> full content into the index.
>
> Thanks,
> Shawn
>
>


Re: SolrCloud cluster does not accept new documents for indexing

2018-04-25 Thread Emir Arnautović
Hi Denis,
Merging works on segments and, depending on the merge strategy, is triggered 
separately, so there is no queue between the update executor and the merge threads.

Re SPM - I am using it on a daily basis for most of my consulting work, and if 
you have an SPM app you can invite me to it and I'll take a quick look to see if 
there are some obvious bottlenecks.

Thanks,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 23 Apr 2018, at 23:37, Denis Demichev  wrote:
> 
> I conducted another experiment today with local SSD drives, but this did not 
> seem to fix my problem.
> Don't see any extensive I/O in this case:
> 
> Device:            tps    kB_read/s    kB_wrtn/s    kB_read     kB_wrtn
> xvda              1.76        88.83         5.52    1256191       77996
> xvdb             13.95       111.30     56663.93    1573961   801303364
> 
> xvdb - is the device where SolrCloud is installed and data files are kept.
> 
> What I see:
> - There are 17 "Lucene Merge Thread #..." running. Some of them are blocked, 
> some of them are RUNNING
> - updateExecutor-N-thread-M threads are in parked mode and number of docs 
> that I am able to submit is still low
> - Tried to change maxIndexingThreads, set it to something high. This seems to 
> prolong the time when cluster is accepting new indexing requests and keeps 
> CPU utilization a lot higher while the cluster is merging indexes
> 
> Could anyone please point me to the right direction (documentation or Java 
> classes) where I can read about how data is passed from updateExecutor thread 
> pool to Merge Threads? I assume there should be some internal blocking queue 
> or something similar.
> Still cannot wrap my head around how Solr blocks incoming connections. Non 
> merged indexes are not kept in memory so I don't clearly understand why Solr 
> cannot keep writing index file to HDD while other threads are merging indexes 
> (since this is a continuous process anyway).
> 
> Does anyone use SPM monitoring tool for that type of problems? Is it of any 
> use at all?
> 
> 
> Thank you in advance.
> 
> 
> 
> 
> Regards,
> Denis
> 
> 
> On Fri, Apr 20, 2018 at 1:28 PM Denis Demichev  > wrote:
> Mikhail,
> 
> Sure, I will keep everyone posted. Moving to non-HVM instance may take some 
> time, so hopefully I will be able to share my observations in the next couple 
> of days or so.
> Thanks again for all the help.
> 
> Regards,
> Denis
> 
> 
> On Fri, Apr 20, 2018 at 6:02 AM Mikhail Khludnev  > wrote:
> Denis, please let me know what it ends up with. I'm really curious regarding 
> this case and AWS instace flavours. fwiw since 7.4 we'll have 
> ioThrottle=false option. 
> 
> On Thu, Apr 19, 2018 at 11:06 PM, Denis Demichev  > wrote:
> Mikhail, Erick,
> 
> Thank you.
> 
> What just occurred to me - we don't use local SSD but instead we're using EBS 
> volumes.
> This was a wrong instance type that I looked at.
> Will try to set up a cluster with SSD nodes and retest.
> 
> Regards,
> Denis
> 
> 
> On Thu, Apr 19, 2018 at 2:56 PM Mikhail Khludnev  > wrote:
> I'm not sure it's the right context, but here is one case showing a really low 
> throttle boundary: 
> https://issues.apache.org/jira/browse/SOLR-11200?focusedCommentId=16115348&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16115348
>  
> 
> 
> 
> On Thu, Apr 19, 2018 at 8:37 PM, Mikhail Khludnev  > wrote:
> Threads are hanging on merge I/O throttling: 
> at 
> org.apache.lucene.index.MergePolicy$OneMergeProgress.pauseNanos(MergePolicy.java:150)
> at 
> org.apache.lucene.index.MergeRateLimiter.maybePause(MergeRateLimiter.java:148)
> at 
> org.apache.lucene.index.MergeRateLimiter.pause(MergeRateLimiter.java:93)
> at 
> org.apache.lucene.store.RateLimitedIndexOutput.checkRate(RateLimitedIndexOutput.java:78)
> It seems odd. Please confirm that you don't commit on every update request. 
> The only way to monitor I/O throttling is to enable infoStream and read a lot 
> of logs.
>
> 
> On Thu, Apr 19, 2018 at 7:59 PM, Denis Demichev  > wrote:
> Erick,
> 
> Thank you for your quick response.
> 
> I/O bottleneck: Please see another screenshot attached, as you can see disk 
> r/w operations are pretty low or not significant.
> iostat==
> Device:   rrqm/s  wrqm/s   r/s   w/s   rkB/s   wkB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
> xvda        0.00    0.00  0.00  0.00    0.00    0.00

RE: SolrCloud DIH (Data Import Handler) MySQL 404

2018-04-25 Thread msaunier
Hello Shawn,

I have installed SolrCloud 7.3 on another server and the problem does not appear. 
Should I create a Jira ticket?

But I have another problem:

Full Import 
failed:org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to 
PropertyWriter implementation:ZKPropertiesWriter
at 
org.apache.solr.handler.dataimport.DataImporter.createPropertyWriter(DataImporter.java:330)
at 
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:411)
at 
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:474)
at 
org.apache.solr.handler.dataimport.DataImporter.lambda$runAsync$0(DataImporter.java:457)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NullPointerException
at 
org.apache.solr.handler.dataimport.DocBuilder.loadClass(DocBuilder.java:935)
at 
org.apache.solr.handler.dataimport.DataImporter.createPropertyWriter(DataImporter.java:326)
... 4 more

I am looking into how to solve the problem.

Regards,





-Message d'origine-
De : Shawn Heisey [mailto:elyog...@elyograg.org] 
Envoyé : mardi 24 avril 2018 17:39
À : solr-user@lucene.apache.org
Objet : Re: SolrCloud DIH (Data Import Handler) MySQL 404

On 4/24/2018 2:03 AM, msaunier wrote:
> If I access to the interface, I have a null pointer exception:
>
> null:java.lang.NullPointerException
>   at 
> org.apache.solr.handler.RequestHandlerBase.getVersion(RequestHandlerBa
> se.java:233)

The line of code where this exception occurred uses fundamental Java methods. 
Based on the error, either the getClass method common to all java objects, or 
the getPackage method on the class, is returning null. That shouldn't be 
possible.  This has me wondering whether there is something broken in your 
particular Solr installation -- corrupt jars, or something like that.  Or maybe 
something broken in your Java.

Thanks,
Shawn