Re: RegexReplaceProcessorFactory pattern to detect multiple \n

2019-03-20 Thread Zheng Lin Edwin Yeo
Hi Paul,

Would like to check: is there any difference in performance between the two
different patterns below?

(\n\W*){2,}

[ \t\x0b\f]*\r?\n
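
(One rough way to compare the two patterns outside Solr is a small java.util.regex loop like the sketch below — the sample text and iteration count are made-up assumptions, and a proper JMH benchmark on your real field values would be more reliable:)

  import java.util.regex.Pattern;

  public class PatternCompare {
      public static void main(String[] args) {
          // Made-up sample field value with runs of newlines, tabs and spaces
          String sample = "line one\n\n\t\nline two\n \nline three\n";
          Pattern p1 = Pattern.compile("(\\n\\W*){2,}");
          Pattern p2 = Pattern.compile("[ \\t\\x0b\\f]*\\r?\\n");

          for (Pattern p : new Pattern[] { p1, p2 }) {
              long start = System.nanoTime();
              for (int i = 0; i < 100_000; i++) {        // arbitrary iteration count
                  p.matcher(sample).replaceAll("<br><br>");
              }
              long elapsedMs = (System.nanoTime() - start) / 1_000_000;
              System.out.println(p.pattern() + " -> " + elapsedMs + " ms");
          }
      }
  }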

Regards,
Edwin

On Thu, 14 Mar 2019 at 09:36, Zheng Lin Edwin Yeo 
wrote:

> Hi Paul,
>
> Thanks for your reply.
>
> So far we have not found any cases of punctuation being removed.
>
> Our aim is to reduce the runs of newlines (\n) down to 2 <br>, and they are not
> likely to have any punctuation in between.
>
> Do you know if this pattern (\n\W*){2,} that
> we are using is ok?
> Or would the other pattern, [ \t\x0b\f]*\r?\n, be better?
>
> Regards,
> Edwin
>
> On Wed, 13 Mar 2019 at 20:08,  wrote:
>
>> Hi Edwin,
>> With \W you will also replace non-word characters such as punctuation. If
>> that's OK, fine. Otherwise you need to identify the whitespace characters
>> that are causing the problem.
>> 
>> From: Zheng Lin Edwin Yeo 
>> Sent: Wednesday, 13 March 2019 03:25:39
>> To: solr-user@lucene.apache.org
>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>>
>> Hi,
>>
>> We have managed to resolve the issue by changing the \s to \W. The reason
>> could be that some of the characters are other whitespace instead of just a
>> space. Using \s alone did not remove those characters, but
>> using \W removes them as well.
>>
>> We have used this config, and it works.
>>
>> <processor class="solr.RegexReplaceProcessorFactory">
>>    <str name="fieldName">content</str>
>>    <str name="pattern">(\n\W*){2,}</str>
>>    <str name="replacement"><br><br></str>
>>    <bool name="literalReplacement">true</bool>
>> </processor>
>> <processor class="solr.RegexReplaceProcessorFactory">
>>    <str name="fieldName">content</str>
>>    <str name="pattern">(\n\W*){1,}</str>
>>    <str name="replacement"><br></str>
>>    <bool name="literalReplacement">true</bool>
>> </processor>
>>
>> Regards,
>> Edwin
>>
>> On Tue, 12 Mar 2019 at 10:49, Zheng Lin Edwin Yeo 
>> wrote:
>>
>> > Hi,
>> >
>> > Has anyone else faced the same issue before?
>> > So far all the regex patterns that we tried in this thread are not able
>> to
>> > resolve the issue.
>> >
>> > Regards,
>> > Edwin
>> >
>> > On Fri, 8 Mar 2019 at 12:17, Zheng Lin Edwin Yeo 
>> > wrote:
>> >
>> >> Hi Paul,
>> >>
>> >> Sorry, I realized there is an extra ']' in the pattern provided, which
>> is
>> >> why there are so many <br> in the output.
>> >>
>> >> The output is exactly the same as previously (previous index result) if
>> >> we remove the extra ']', as shown in the configuration below.
>> >>
>> >>  <processor class="solr.RegexReplaceProcessorFactory">
>> >>    <str name="fieldName">content</str>
>> >>    <str name="pattern">[ \t\x0b\f]*\r?\n</str>
>> >>    <str name="replacement"><br></str>
>> >>    <bool name="literalReplacement">true</bool>
>> >>  </processor>
>> >>  <processor class="solr.RegexReplaceProcessorFactory">
>> >>    <str name="fieldName">content</str>
>> >>    <str name="pattern">(<br>[ \t\x0b\f]*){3,}</str>
>> >>    <str name="replacement"><br><br></str>
>> >>    <bool name="literalReplacement">true</bool>
>> >>  </processor>
>> >>
>> >> Regards,
>> >> Edwin
>> >>
>> >>
>> >>
>> >> On Thu, 7 Mar 2019 at 22:51, Zheng Lin Edwin Yeo > >
>> >> wrote:
>> >>
>> >>> Hi Paul,
>> >>>
>> >>> Thanks for the reply.
>> >>>
>> >>> For the 2nd pattern, if we put this pattern
>> >>> <str name="pattern">(<br>[ \t\x0b\f]]*){3,}</str>, which is like the
>> >>> configurations below:
>> >>>
>> >>> <processor class="solr.RegexReplaceProcessorFactory">
>> >>>    <str name="fieldName">content</str>
>> >>>    <str name="pattern">[ \t\x0b\f]*\r?\n</str>
>> >>>    <str name="replacement"><br></str>
>> >>>    <bool name="literalReplacement">true</bool>
>> >>> </processor>
>> >>> <processor class="solr.RegexReplaceProcessorFactory">
>> >>>    <str name="fieldName">content</str>
>> >>>    <str name="pattern">(<br>[ \t\x0b\f]]*){3,}</str>
>> >>>    <str name="replacement"><br><br></str>
>> >>>    <bool name="literalReplacement">true</bool>
>> >>> </processor>
>> >>>
>> >>> It will not be able to change all those with more than 3 <br> to 2 <br>.
>> >>>
>> >>> We will end up with many <br> in the output, like the example below:
>> >>>
>> >>>  http://www.concorded.com/
>> 
>> On Tue, Dec 18, 2018
>> >>>
>> >>>
>> >>> Regards,
>> >>> Edwin
>> >>>
>> >>>
>> >>>
>> >>>
>> >>> On Thu, 7 Mar 2019 at 20:44,  wrote:
>> >>>
>>  Hi Edwin
>> 
>> 
>> 
>>  I can’t understand why the pattern is not working and where the
>> spaces
>>  between the <br> are coming from. It should be possible to allow for
>> spaces
>>  between the <br> in the second match pattern however, i.e. 2nd pattern
>> 
>> 
>> 
>>  (<br>[ \t\x0b\f]]*){3,}
>> 
>> 
>> 
>>  /Paul
>> 
>> 
>> 
>>  Sent from Mail for Windows 10
>> 
>> 
>> 
>>  From: Zheng Lin Edwin Yeo
>>  Sent: Wednesday, 6 March 2019 16:28
>>  To: solr-user@lucene.apache.org
>>  Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple
>> \n
>> 
>> 
>> 
>>  Hi Paul,
>> 
>>  I have tried with the first match pattern to be [
>>  \t\x0b\f]*\r?\n, like the configuration below:
>> 
>>  <processor class="solr.RegexReplaceProcessorFactory">
>>     <str name="fieldName">content</str>
>>     <str name="pattern">[ \t\x0b\f]*\r?\n</str>
>>     <str name="replacement"><br></str>
>>     <bool name="literalReplacement">true</bool>
>>  </processor>
>>  <processor class="solr.RegexReplaceProcessorFactory">
>>     <str name="fieldName">content</str>
>>     <str name="pattern">(<br>){3,}</str>
>>     <str name="replacement"><br><br></str>
>>     <bool name="literalReplacement">true</bool>
>>  </processor>
>> 
>>  However, the result is still the same as before (previous index
>>  results),
>>  with the 4 <br>.
>> 
>>  Regards,
>>  Edwin
>> 
>> 
>>  On Wed, 6 Mar 2019 at 18:23,  wrote:
>> 
>>  > Hi Edwin
>>  >
>>  >
>>  >
>>  > You are correct re the 2nd pattern – my bad. Looking at the 4
>> <br>,
>>  it’s
>>  > actually the sequence «  »? So perhaps the first
>> match
>>  > pattern could be [ \t\x0b\f]*\r?\n
>>  >
>>  >
>>  >
>>  > i.e. [space tab vertical-tab formfeed]
>>  >
>>  >
>> 

Re: Gather Nodes Streaming

2019-03-20 Thread Zheng Lin Edwin Yeo
Hi,

What is the fieldType of your 'to' field? Which tokenizers/filters is it
using?

Also, which Solr version are you using?

Regards,
Edwin

On Thu, 21 Mar 2019 at 01:57, Susmit Shukla  wrote:

> Hi,
>
> Trying to use solr streaming 'gatherNodes' function. It is for extracting
> email graph based on from and to fields.
> It requires 'to' field to be a single value field with docvalues enabled
> since it is used internally for sorting and unique streams
>
> The 'to' field can contain multiple email addresses - each being a node.
> How to map multiple comma separated email addresses from the 'to' fields as
> separate graph nodes?
>
> Thanks
>
>
>
> >
> >
>


ClassCastException on partial update TrieDateField Solr 7.7.1

2019-03-20 Thread damienk
Hi,

I've upgraded a collection from Solr 6 to Solr 7.7.1, and now when I do a
partial update on a doc and set a TrieDateField I'm seeing a
ClassCastException. I understand TrieDateFields are deprecated and I am
planning to re-index using a DatePointField, but I was expecting this to
work. Has anyone else seen this? Are there any other limitations around
Trie fields?
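
(For context, a minimal SolrJ sketch of the kind of partial, i.e. "atomic", update being described — the document id, field name and client URL are illustrative assumptions; only the collection name is taken from the log below:)

  import java.util.Date;
  import java.util.HashMap;
  import java.util.Map;
  import org.apache.solr.client.solrj.SolrClient;
  import org.apache.solr.client.solrj.impl.HttpSolrClient;
  import org.apache.solr.common.SolrInputDocument;

  public class AtomicDateUpdate {
      public static void main(String[] args) throws Exception {
          // Placeholder URL -- point it at one of the 7.7.1 nodes
          SolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr").build();

          SolrInputDocument doc = new SolrInputDocument();
          doc.addField("id", "some-doc-id");

          // Atomic "set" on a field declared as TrieDateField (field name is illustrative)
          Map<String, Object> setOp = new HashMap<>();
          setOp.put("set", new Date());
          doc.addField("my_tdate_field", setOp);

          client.add("i_0_2017_q1_old", doc);   // collection name taken from the log below
          client.commit("i_0_2017_q1_old");
          client.close();
      }
  }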

ERROR - 2019-03-21 11:39:49.518; [c:i_0_2017_q1_old s:shard2 r:core_node14
x:i_0_2017_q1_old_shard2_replica2] org.apache.solr.servlet.HttpSolrCall;
null:java.lang.ClassCastException:
org.apache.solr.common.util.ByteArrayUtf8CharSequence cannot be cast to
java.lang.String
at
org.apache.solr.schema.TrieDateField.toNativeType(TrieDateField.java:100)
at
org.apache.solr.update.processor.AtomicUpdateDocumentMerger.doSet(AtomicUpdateDocumentMerger.java:319)
at
org.apache.solr.update.processor.AtomicUpdateDocumentMerger.merge(AtomicUpdateDocumentMerger.java:108)
at
org.apache.solr.update.processor.DistributedUpdateProcessor.getUpdatedDocument(DistributedUpdateProcessor.java:1422)
at
org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:1106)
at
org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:693)
at
org.apache.solr.update.processor.LogUpdateProcessorFactory$LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:103)
at
org.apache.solr.handler.loader.JavabinLoader$1.update(JavabinLoader.java:110)
at
org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$StreamingCodec.readOuterMostDocIterator(JavaBinUpdateRequestCodec.java:327)
at
org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$StreamingCodec.readIterator(JavaBinUpdateRequestCodec.java:280)
at
org.apache.solr.common.util.JavaBinCodec.readObject(JavaBinCodec.java:333)
at
org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:278)
at
org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$StreamingCodec.readNamedList(JavaBinUpdateRequestCodec.java:235)
at
org.apache.solr.common.util.JavaBinCodec.readObject(JavaBinCodec.java:298)
at
org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:278)
at
org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:191)
at
org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec.unmarshal(JavaBinUpdateRequestCodec.java:126)
at
org.apache.solr.handler.loader.JavabinLoader.parseAndLoadDocs(JavabinLoader.java:123)
at
org.apache.solr.handler.loader.JavabinLoader.load(JavabinLoader.java:70)
at
org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:97)
at
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:68)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:199)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:2551)
at
org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:710)
at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:516)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:395)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:341)

Regards,
Damien


Re: Upgrading tika

2019-03-20 Thread Geoffrey Willis
Thanks for the explanation, it makes sense. I have noticed that sometimes a pdf 
document with spaces in the name can kill Tika and, as a result, Solr, so I get your 
point. I was trying to keep my webApp all Javascript/Typescript, so I went with the 
exposed extract/update handler. In my looking around I did see some python 
wrappers around a Tika service that might be a better solution. I can’t afford 
to crash the Solr server, and it’s been many years since I wrote any java code! 
I remember we did use Solrj back in the day (I think the Lucene version was 3.5 
and the Solr was similar). May have to dust off that code. Anyway, thanks for 
the thorough explanation. Gonna have to rethink my design.
Geoff

> On Mar 20, 2019, at 12:51 PM, Geoffrey Willis  
> wrote:
> 
> Could you expand on that please? I’m currently building a webApp that submits 
> documents to Solr/Tika via the update/extract handler and it’s working fine. 
> What do you mean when you say “You do not want to have your Solr instance 
> processing via Tika”? If that’s a bad design choice please elaborate. 
> Thanks,
> Geoff
> 
> 
>> On Mar 19, 2019, at 5:15 PM, Phil Scadden  wrote:
>> 
>> As per Erick's advice, I would strongly recommend that you do anything tika in 
>> a separate solrj programme. You do not want to have your solr instance 
>> processing via tika.
>> 
>> -Original Message-
>> From: Tannen, Lev (USAEO) [Contractor] 
>> Sent: Wednesday, 20 March 2019 08:17
>> To: solr-user@lucene.apache.org
>> Subject: RE: Upgrading tika
>> 
>> Sorry Erick,
>> Please disregard my previous message. Somehow I downloaded the version 
>> without those two files. I am going to download the latest version solr 
>> 8.0.0 and try it.
>> Best
>> Lev Tannen
>> 
>> -Original Message-
>> From: Erick Erickson 
>> Sent: Tuesday, March 19, 2019 2:48 PM
>> To: solr-user 
>> Subject: Re: Upgrading tika
>> 
>> Yes, Solr is distributed with Tika. Look in:
>> ./solr/contrib/extraction/lib
>> 
>> Tika is upgraded when new versions come out, so the underlying files are 
>> whatever are current at the time.
>> 
>> The integration is a fairly loose coupling, if you're using some external 
>> program (say a SolrJ program) to parse the files, there's no requirement to 
>> use the jars distributed with Solr, use whatever suits your fancy. An 
>> external program just constructs a SolrDocument to send to Solr. What you 
>> use to create that document is irrelevant. See:
>> https://lucidworks.com/2012/02/14/indexing-with-solrj/ for some background.
>> 
>> If you're using the ExtractingRequestHandler, where you just send the 
>> semi-structured docs to Solr (PDFs, Word or whatever), then needing to know 
>> anything about individual Tika-related jar files is kind of strange.
>> 
>> If your predecessors wrote some custom code that runs as part of Solr, I 
>> don't know what to say...
>> 
>> Best,
>> Erick
>> 
>> On Tue, Mar 19, 2019 at 10:47 AM Tannen, Lev (USAEO) [Contractor] 
>>  wrote:
>>> 
>>> Thank you Shawn.
>>> I assumed that tika has been integrated with solr. In the project written 
>>> before me they used two tika files taken from the solr distribution. I am 
>>> trying to do the same with solr 7.7.1. However, this version contains a 
>>> different set of tika related files. So I am confused. Does solr no longer 
>>> have integrated tika, or can I just not recognize the files?
>>> 
>>> -Original Message-
>>> From: Shawn Heisey 
>>> Sent: Tuesday, March 19, 2019 11:11 AM
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: Upgrading tika
>>> 
>>> On 3/19/2019 9:03 AM, levtannen wrote:
 Could anybody suggest me what files do I need to use the latest
 version of Tika and where to find them?
>>> 
>>> This mailing list is solr-user.  Tika is an entirely separate project from 
>>> Solr within the Apache Foundation.  To get help with Tika, you'll need to 
>>> ask that project.
>>> 
>>> https://tika.apache.org/mail-lists.html
>>> 
>>> Thanks,
>>> Shawn
>> Notice: This email and any attachments are confidential and may not be used, 
>> published or redistributed without the prior written consent of the 
>> Institute of Geological and Nuclear Sciences Limited (GNS Science). If 
>> received in error please destroy and immediately notify GNS Science. Do not 
>> copy or disclose the contents.
> 



RE: Upgrading tika

2019-03-20 Thread Phil Scadden
While using the update/extract handler is good for testing, tika is a heavyweight 
component, with the risk that a bad document could compromise the solr instance, and 
tika is a resource hog even with ordinary docs.

I wrote code with solrj to do the indexing and run it on a completely different 
machine from the solr instance. It just sends SolrDocuments (created from 
analysis by tika) to the server, as Erick says. It becomes even more important 
if you are going to incorporate inline OCR into the tika processing (the 
default). The Solr docs give you the outline for the solrj process. I don’t do 
inline OCR.

My workflow is something like this:
Find document to add.
If it is an image PDF, convert it to a searchable PDF via OCR, as a searchable PDF is a 
more useful document to deliver as a search result.
Submit the document to the solrj-based solr indexer.

The core of my indexer is:
  File f = new File(filename);
  ContentHandler textHandler = new BodyContentHandler(Integer.MAX_VALUE);
  Metadata metadata = new Metadata();
  Parser parser = new AutoDetectParser();
  ParseContext context = new ParseContext();
  if (filename.toLowerCase().contains("pdf")) {
    // this special setup of pdf processing is only required to switch OCR off
    PDFParserConfig pdfConfig = new PDFParserConfig();
    pdfConfig.setExtractInlineImages(false);
    pdfConfig.setOcrStrategy(PDFParserConfig.OCR_STRATEGY.NO_OCR);
    context.set(PDFParserConfig.class, pdfConfig);
    context.set(Parser.class, parser);
  }
  InputStream input = new FileInputStream(f);
  try {
    parser.parse(input, textHandler, metadata, context);
  } catch (Exception e) {
    // exception handling
  }
  // the extracted text; this assignment is implied but was missing from the original snippet
  String content = textHandler.toString();
  SolrInputDocument up = new SolrInputDocument();
  up.addField("id", f.getCanonicalPath());
  // other addField calls for items extracted from metadata etc.
  up.addField("_text_", content);
  UpdateRequest req = new UpdateRequest();
  req.add(up);
  req.setBasicAuthCredentials("solrAdmin", password);
  UpdateResponse ur = req.process(solr, "myindex");
  req.commit(solr, "myindex");
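
(The snippet above assumes a SolrClient named "solr" and a "password" value already exist; a minimal sketch of that setup, with a placeholder URL and credential source, might look like:)

  import org.apache.solr.client.solrj.SolrClient;
  import org.apache.solr.client.solrj.impl.HttpSolrClient;

  // Placeholder values -- substitute your own Solr URL and credential source
  String password = System.getenv("SOLR_PASSWORD");
  SolrClient solr = new HttpSolrClient.Builder("https://solr-host:8983/solr").build();
  // ... run the indexing code above, then solr.close() when done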

-Original Message-
From: Geoffrey Willis 
Sent: Thursday, 21 March 2019 06:52
To: solr-user@lucene.apache.org
Subject: Re: Upgrading tika

Could you expand on that please? I’m currently building a webApp that submits 
documents to Solr/Tika via the update/extract handler and it’s working fine. 
What do you mean when you say “You do not want to have your Solr instance 
processing via Tika”? If that’s a bad design choice please elaborate.
Thanks,
Geoff


> On Mar 19, 2019, at 5:15 PM, Phil Scadden  wrote:
>
> As per Erick's advice, I would strongly recommend that you do anything tika in 
> a separate solrj programme. You do not want to have your solr instance 
> processing via tika.
>
> -Original Message-
> From: Tannen, Lev (USAEO) [Contractor] 
> Sent: Wednesday, 20 March 2019 08:17
> To: solr-user@lucene.apache.org
> Subject: RE: Upgrading tika
>
> Sorry Erick,
> Please disregard my previous message. Somehow I downloaded the version 
> without those two files. I am going to download the latest version solr 8.0.0 
> and try it.
> Best
> Lev Tannen
>
> -Original Message-
> From: Erick Erickson 
> Sent: Tuesday, March 19, 2019 2:48 PM
> To: solr-user 
> Subject: Re: Upgrading tika
>
> Yes, Solr is distributed with Tika. Look in:
> ./solr/contrib/extraction/lib
>
> Tika is upgraded when new versions come out, so the underlying files are 
> whatever are current at the time.
>
> The integration is a fairly loose coupling, if you're using some external 
> program (say a SolrJ program) to parse the files, there's no requirement to 
> use the jars distributed with Solr, use whatever suits your fancy. An 
> external program just constructs a SolrDocument to send to Solr. What you use 
> to create that document is irrelevant. See:
> https://lucidworks.com/2012/02/14/indexing-with-solrj/ for some background.
>
> If you're using the ExtractingRequestHandler, where you just send the 
> semi-structured docs to Solr (PDFs, Word or whatever), then needing to know 
> anything about individual Tika-related jar files is kind of strange.
>
> If your predecessors wrote some custom code that runs as part of Solr, I 
> don't know what to say...
>
> Best,
> Erick
>
> On Tue, Mar 19, 2019 at 10:47 AM Tannen, Lev (USAEO) [Contractor] 
>  wrote:
>>
>> Thank you Shawn.
>> I assumed that tika has been integrated with solr. I the project written 
>> before me they used two tika files taken from solr distribution. I am trying 
>> to do the same with solr 7.7.1. However this version contains a different 
>> set of tika related files. So I am confused. Does  solr does not have 
>> integrated tika anymore, or I just cannot recognize them?
>>
>> -Original 

Re:BM25F in Solr

2019-03-20 Thread Diego Ceccarelli (BLOOMBERG/ LONDON)
If you want a 'global' IDF across different fields, maybe one solution is to 
use a copyField to copy all the fields into a common field (e.g. title, authors, 
body, footer all copied into a field called text), and then you should be able 
to use it with a function query or by implementing your own similarity score, 
retrieving the idf on the defined copyField target...
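
(Sketching that suggestion as schema.xml snippets — the field and type names here are illustrative assumptions, not taken from an actual schema:)

  <field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>

  <copyField source="title"   dest="text"/>
  <copyField source="authors" dest="text"/>
  <copyField source="body"    dest="text"/>
  <copyField source="footer"  dest="text"/>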

Cheers,
Diego


From: solr-user@lucene.apache.org  At: 03/20/19 16:26:08  To: solr-user@lucene.apache.org
Subject: BM25F in Solr

Hi

There have been several discussions in the past on how to do BM25F scoring in 
Solr.
People have mentioned BlendedTermQuery and in Lucene 8.0 we got a new 
BM25FQuery.

What I mainly want is to normalize the doc freq (IDF) across fields, so that
e.g. title field uses same doc-freq as body field. And ideally it should work
in any query parser, including edismax.

Have any of you succeeded in this, alternatively some other workaround achieving
a normalized IDF across fields?

An approximation could be to always use doc-freq from the largest field in the 
index,
e.g. body, but not sure if you can do that in Similarity?

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com




Re: Re: Re: obfuscated password error

2019-03-20 Thread Branham, Jeremy (Experis)
Hard to see in email, particularly because my email server strips urls, but a 
few things I would suggest –

Be sure there aren’t any spaces after your line continuation characters ‘\’. 
This has bitten me before.
Check the running process's JVM args and compare: `ps -ef | grep solr`
Also, I’d recommend changes be made only in solr.in.sh, and leave 
‘./bin/solr’ original.

 
Jeremy Branham
jb...@allstate.com


On 3/20/19, 10:24 AM, "Satya Marivada"  wrote:

Sending again, with highlighted text in yellow.

So I got a chance to do a diff of the environments solr-6.3.0 folder within
contents.

solr-6.3.0/bin/solr file has the difference highlighted in yellow. Any idea
of what is going on in that if else in solr file?

*The working configuration file contents are (ssl.properties below has the
keystore path and password repeated):*

SOLR_SSL_OPTS=""

if [ -n "$SOLR_SSL_KEY_STORE" ]; then

  SOLR_JETTY_CONFIG+=("--module=https")

  SOLR_URL_SCHEME=https

  SOLR_SSL_OPTS=" -Dsolr.jetty.keystore=$SOLR_SSL_KEY_STORE \

-Dsolr.jetty.keystore.password=$SOLR_SSL_KEY_STORE_PASSWORD \

-Dsolr.jetty.truststore=$SOLR_SSL_TRUST_STORE \

-Dsolr.jetty.truststore.password=$SOLR_SSL_TRUST_STORE_PASSWORD \

-Dsolr.jetty.ssl.needClientAuth=$SOLR_SSL_NEED_CLIENT_AUTH \

-Dsolr.jetty.ssl.wantClientAuth=$SOLR_SSL_WANT_CLIENT_AUTH"

  if [ -n "$SOLR_SSL_CLIENT_KEY_STORE" ]; then

SOLR_SSL_OPTS+=" -Djavax.net.ssl.keyStore=$SOLR_SSL_CLIENT_KEY_STORE \

  -Djavax.net.ssl.keyStorePassword=$SOLR_SSL_CLIENT_KEY_STORE_PASSWORD \

  -Djavax.net.ssl.trustStore=$SOLR_SSL_CLIENT_TRUST_STORE \


-Djavax.net.ssl.trustStorePassword=$SOLR_SSL_CLIENT_TRUST_STORE_PASSWORD"
 else
SOLR_SSL_OPTS+="

-Dcom.sun.management.jmxremote.ssl.config.file=/sanfs/mnt/vol01/solr/solr-6.3.0/server/etc/ssl.properties"
  fi

else

  SOLR_JETTY_CONFIG+=("--module=http")

Fi


*Not working one (basically overriding again and is causing the incorrect
password):*



SOLR_SSL_OPTS=""

if [ -n "$SOLR_SSL_KEY_STORE" ]; then

  SOLR_JETTY_CONFIG+=("--module=https")

  SOLR_URL_SCHEME=https

  SOLR_SSL_OPTS=" -Dsolr.jetty.keystore=$SOLR_SSL_KEY_STORE \

-Dsolr.jetty.keystore.password=$SOLR_SSL_KEY_STORE_PASSWORD \

-Dsolr.jetty.truststore=$SOLR_SSL_TRUST_STORE \

-Dsolr.jetty.truststore.password=$SOLR_SSL_TRUST_STORE_PASSWORD \

-Dsolr.jetty.ssl.needClientAuth=$SOLR_SSL_NEED_CLIENT_AUTH \

-Dsolr.jetty.ssl.wantClientAuth=$SOLR_SSL_WANT_CLIENT_AUTH"

  if [ -n "$SOLR_SSL_CLIENT_KEY_STORE" ]; then

SOLR_SSL_OPTS+=" -Djavax.net.ssl.keyStore=$SOLR_SSL_CLIENT_KEY_STORE \

  -Djavax.net.ssl.keyStorePassword=$SOLR_SSL_CLIENT_KEY_STORE_PASSWORD \

  -Djavax.net.ssl.trustStore=$SOLR_SSL_CLIENT_TRUST_STORE \


-Djavax.net.ssl.trustStorePassword=$SOLR_SSL_CLIENT_TRUST_STORE_PASSWORD"

  else

SOLR_SSL_OPTS+=" -Djavax.net.ssl.keyStore=$SOLR_SSL_KEY_STORE \

  -Djavax.net.ssl.keyStorePassword=$SOLR_SSL_KEY_STORE_PASSWORD \

  -Djavax.net.ssl.trustStore=$SOLR_SSL_TRUST_STORE \

  -Djavax.net.ssl.trustStorePassword=$SOLR_SSL_TRUST_STORE_PASSWORD"

  fi



On Wed, Mar 20, 2019 at 10:45 AM Satya Marivada 
wrote:

> So I got a chance to do a diff of the environments solr-6.3.0 folder
> within contents.
>
> solr-6.3.0/bin/solr file has the difference highlighted in yellow. Any
> idea of what is going on in that if else in solr file?
>
> *The working configuration file contents are (ssl.properties below has the
> keystore path and password repeated):*
>
> SOLR_SSL_OPTS=""
>
> if [ -n "$SOLR_SSL_KEY_STORE" ]; then
>
>   SOLR_JETTY_CONFIG+=("--module=https")
>
>   SOLR_URL_SCHEME=https
>
>   SOLR_SSL_OPTS=" -Dsolr.jetty.keystore=$SOLR_SSL_KEY_STORE \
>
> -Dsolr.jetty.keystore.password=$SOLR_SSL_KEY_STORE_PASSWORD \
>
> -Dsolr.jetty.truststore=$SOLR_SSL_TRUST_STORE \
 

Re: Range query syntax on a polygon field is returning all documents

2019-03-20 Thread David Smiley
Hi Mitchell,

Seems like there's a bug based on what you've shown.
* Can you please try RptWithGeometrySpatialField instead
of SpatialRecursivePrefixTreeFieldType to see if the problem goes away?
This could point to a precision issue; though still what you've seen is
suspicious.
* Can you try one other query syntax e.g. bbox query parser to see if the
problem goes away?  I doubt this is it but you seem to point to the syntax
being related.
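
(For reference, the two query forms being compared might look roughly like this — the field name, point and distance are made-up placeholders:)

  fq=geo:[53.4,-113.6 TO 53.6,-113.4]          (rectangle range syntax)
  fq={!bbox sfield=geo pt=53.5,-113.5 d=10}    (bbox query parser)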

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Mon, Mar 18, 2019 at 12:24 AM Mitchell Bösecke <
mitchell.bose...@forcorp.com> wrote:

> Hi everyone,
>
> I'm trying to index geodetic polygons and then query them out using an
> arbitrary rectangle. When using the Geo3D spatial context factory, the data
> indexes just fine but using a range query (as per the solr documentation)
> does not seem to filter the results appropriately (I get all documents
> back).
>
> When I switch it to JTS, everything works as expected. However, it
> significantly slowed down the initial indexing time. A sample size of 3000
> documents took 3 seconds with Geo3D and 50 seconds with JTS.
>
> I've documented my journey in detail on stack overflow:
> https://stackoverflow.com/q/55212622/1017571
>
>1. Can I not use the range query syntax with Geo3D? I.e. am I
>misreading the documentation?
>2. Is it expected that using JTS will *significantly* slow down the
>indexing time?
>
> Thanks for any insight.
>
> --
> Mitchell Bosecke, B.Sc.
> Senior Application Developer
>
> FORCORP
> Suite 200, 15015 - 123 Ave NW,
> Edmonton, AB, T5V 1J7
> www.forcorp.com
> (d) 780.733.0494
> (o) 780.452.5878 ext. 263
> (f) 780.453.3986
>


Gather Nodes Streaming

2019-03-20 Thread Susmit Shukla
Hi,

Trying to use solr streaming 'gatherNodes' function. It is for extracting
email graph based on from and to fields.
It requires 'to' field to be a single value field with docvalues enabled
since it is used internally for sorting and unique streams
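
(For reference, a gatherNodes call of the kind being described might look roughly like this — the collection name, field names and starting address are illustrative assumptions, not from the actual setup:)

  gatherNodes(emails,
              walk="alice@example.com->from",
              gather="to")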

The 'to' field can contain multiple email addresses - each being a node.
How to map multiple comma separated email addresses from the 'to' fields as
separate graph nodes?

Thanks



>
>


Re: Upgrading tika

2019-03-20 Thread Geoffrey Willis
Could you expand on that please? I’m currently building a webApp that submits 
documents to Solr/Tika via the update/extract handler and it’s working fine. 
What do you mean when you say “You do not want to have your Solr instance 
processing via Tika”? If that’s a bad design choice please elaborate. 
Thanks,
Geoff


> On Mar 19, 2019, at 5:15 PM, Phil Scadden  wrote:
> 
> As per Erick's advice, I would strongly recommend that you do anything tika in 
> a separate solrj programme. You do not want to have your solr instance 
> processing via tika.
> 
> -Original Message-
> From: Tannen, Lev (USAEO) [Contractor] 
> Sent: Wednesday, 20 March 2019 08:17
> To: solr-user@lucene.apache.org
> Subject: RE: Upgrading tika
> 
> Sorry Erick,
> Please disregard my previous message. Somehow I downloaded the version 
> without those two files. I am going to download the latest version solr 8.0.0 
> and try it.
> Best
> Lev Tannen
> 
> -Original Message-
> From: Erick Erickson 
> Sent: Tuesday, March 19, 2019 2:48 PM
> To: solr-user 
> Subject: Re: Upgrading tika
> 
> Yes, Solr is distributed with Tika. Look in:
> ./solr/contrib/extraction/lib
> 
> Tika is upgraded when new versions come out, so the underlying files are 
> whatever are current at the time.
> 
> The integration is a fairly loose coupling, if you're using some external 
> program (say a SolrJ program) to parse the files, there's no requirement to 
> use the jars distributed with Solr, use whatever suits your fancy. An 
> external program just constructs a SolrDocument to send to Solr. What you use 
> to create that document is irrelevant. See:
> https://lucidworks.com/2012/02/14/indexing-with-solrj/ for some background.
> 
> If you're using the ExtractingRequestHandler, where you just send the 
> semi-structured docs to Solr (PDFs, Word or whatever), then needing to know 
> anything about individual Tika-related jar files is kind of strange.
> 
> If your predecessors wrote some custom code that runs as part of Solr, I 
> don't know what to say...
> 
> Best,
> Erick
> 
> On Tue, Mar 19, 2019 at 10:47 AM Tannen, Lev (USAEO) [Contractor] 
>  wrote:
>> 
>> Thank you Shawn.
>> I assumed that tika has been integrated with solr. I the project written 
>> before me they used two tika files taken from solr distribution. I am trying 
>> to do the same with solr 7.7.1. However this version contains a different 
>> set of tika related files. So I am confused. Does  solr does not have 
>> integrated tika anymore, or I just cannot recognize them?
>> 
>> -Original Message-
>> From: Shawn Heisey 
>> Sent: Tuesday, March 19, 2019 11:11 AM
>> To: solr-user@lucene.apache.org
>> Subject: Re: Upgrading tika
>> 
>> On 3/19/2019 9:03 AM, levtannen wrote:
>>> Could anybody suggest me what files do I need to use the latest
>>> version of Tika and where to find them?
>> 
>> This mailing list is solr-user.  Tika is an entirely separate project from 
>> Solr within the Apache Foundation.  To get help with Tika, you'll need to 
>> ask that project.
>> 
>> https://tika.apache.org/mail-lists.html
>> 
>> Thanks,
>> Shawn
> Notice: This email and any attachments are confidential and may not be used, 
> published or redistributed without the prior written consent of the Institute 
> of Geological and Nuclear Sciences Limited (GNS Science). If received in 
> error please destroy and immediately notify GNS Science. Do not copy or 
> disclose the contents.



Re: Nested geofilt query for LTR feature

2019-03-20 Thread David Smiley
Hi,

I've never used the LTR module, but I suspect I might know what the error
is.  I think that the "query" Function Query has parsing limitations on
what you pass to it.  At least it used to.  Try to put the embedded query
onto another parameter and then refer to it with a dollar-sign.  See the
examples here:
https://builds.apache.org/job/Solr-reference-guide-master/javadoc/function-queries.html#query-function
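
(A sketch of what that might look like for the feature in question — untested, and the extra parameter name "geoq" is just an illustration:)

  {
    "name" : "twoDist",
    "class" : "org.apache.solr.ltr.feature.SolrFeature",
    "params" : {
      "q"    : "{!func}product(2,query($geoq))",
      "geoq" : "{!geofilt sfield=latlon score=kilometers filter=false pt=${ltrpt} d=5000}"
    },
    "store" : "ltrFeatureStore"
  }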

Also, I think it's a bit inefficient to wrap a query function query around
a geofilt query that exposes a distance as a score.  If you want the
distance then call the "geodist" function query.

Additionally if you dump the full stack trace here, it might be helpful.
Getting a RuntimeException suggests we need to do a better job of
wrapping/cleaning errors internally.

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Thu, Mar 14, 2019 at 11:43 PM Kamuela Lau  wrote:

> Hello,
>
> I'm currently using Solr 7.2.2 and trying to use the LTR contrib module to
> rerank queries.
> For my LTR model, I would like to use a feature that is essentially a
> "normalized distance," a value between 0 and 1 which is based on distance.
>
> When using geodist() to define a feature in the feature store, I received a
> "failed to parse feature query" error, and thus I am using the below
> geofilt query for distance.
>
> {
>   "name":"dist",
>   "class":"org.apache.solr.ltr.feature.SolrFeature",
>   "params":{"q":"{!geofilt sfield=latlon score=kilometers filter=false
> pt=${ltrpt} d=5000}"},
>   "store":"ltrFeatureStore"
> }
>
> This feature correctly returns the distance between ltrpt and the sfield
> latlon (LatLonPointSpatialField).
> As I mentioned previously, I would like a feature which uses this distance
> in another function. To test this functionality, I tried to define a
> feature which multiplies the distance by two:
>
> {
>   "name":"twoDist",
>   "class":"org.apache.solr.ltr.feature.SolrFeature",
>   "params":{"q":"{!func}product(2,query({!geofilt v= sfield=latlon
> score=kilometers filter=false pt=${ltrpt} d=5000},0.0))"},
>   "store":"ltrFeatureStore"
> }
>
> When trying to extract this feature, I receive the following error:
>
> java.lang.RuntimeException: Exception from createWeight for SolrFeature
> [name=multDist, params={q={!func}product(2,query({!geofilt v= sfield=latlon
> score=kilometers filter=false pt=${ltrpt} d=5000},0.0))}]  missing sfield
> for spatial request
>
> However, when I define the following in fl for a regular, non-reranked
> query, I find that it is correctly parsed and I receive the correct value,
> which is twice the value of geodist() (pt2 is defined in a different part
> of the query):
> fl=score,geodist(),{!func}product(2,query({!geofilt v= sfield=latlon
> score=kilometers filter=false pt=${pt2} d=5},0.0))
>
> For reference, below is what I have defined in my schema:
>
>
>  docValues="true"/>
>
> Is this the correct, intended behavior? If so, is my query for this
> correct, or should I go about extracting this sort of feature a different
> way?
>


CDCR one source multiple targets

2019-03-20 Thread Arnold Bronley
Hi,

is it possible to use CDCR with one source SolrCloud cluster and multiple
target SolrCloud clusters? I tried to edit the zkHost setting in source
cluster's solrconfig file by adding multiple comma separated values for
target zkhosts for multiple target clusters. But the CDCR replication
happens only to one of the zkhosts and not all. If this is not supported
then how should I go about implementing something like this?


BM25F in Solr

2019-03-20 Thread Jan Høydahl
Hi

There have been several discussions in the past on how to do BM25F scoring in 
Solr.
People have mentioned BlendedTermQuery and in Lucene 8.0 we got a new 
BM25FQuery.

What I mainly want is to normalize the doc freq (IDF) across fields, so that
e.g. title field uses same doc-freq as body field. And ideally it should work
in any query parser, including edismax.

Have any of you succeeded in this, alternatively some other workaround achieving
a normalized IDF across fields?

An approximation could be to always use doc-freq from the largest field in the 
index,
e.g. body, but not sure if you can do that in Similarity?

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com



Re: Re: obfuscated password error

2019-03-20 Thread Satya Marivada
Sending again, with highlighted text in yellow.

So I got a chance to do a diff of the environments solr-6.3.0 folder within
contents.

solr-6.3.0/bin/solr file has the difference highlighted in yellow. Any idea
of what is going on in that if else in solr file?

*The working configuration file contents are (ssl.properties below has the
keystore path and password repeated):*

SOLR_SSL_OPTS=""

if [ -n "$SOLR_SSL_KEY_STORE" ]; then

  SOLR_JETTY_CONFIG+=("--module=https")

  SOLR_URL_SCHEME=https

  SOLR_SSL_OPTS=" -Dsolr.jetty.keystore=$SOLR_SSL_KEY_STORE \

-Dsolr.jetty.keystore.password=$SOLR_SSL_KEY_STORE_PASSWORD \

-Dsolr.jetty.truststore=$SOLR_SSL_TRUST_STORE \

-Dsolr.jetty.truststore.password=$SOLR_SSL_TRUST_STORE_PASSWORD \

-Dsolr.jetty.ssl.needClientAuth=$SOLR_SSL_NEED_CLIENT_AUTH \

-Dsolr.jetty.ssl.wantClientAuth=$SOLR_SSL_WANT_CLIENT_AUTH"

  if [ -n "$SOLR_SSL_CLIENT_KEY_STORE" ]; then

SOLR_SSL_OPTS+=" -Djavax.net.ssl.keyStore=$SOLR_SSL_CLIENT_KEY_STORE \

  -Djavax.net.ssl.keyStorePassword=$SOLR_SSL_CLIENT_KEY_STORE_PASSWORD \

  -Djavax.net.ssl.trustStore=$SOLR_SSL_CLIENT_TRUST_STORE \


-Djavax.net.ssl.trustStorePassword=$SOLR_SSL_CLIENT_TRUST_STORE_PASSWORD"
 else
SOLR_SSL_OPTS+="
-Dcom.sun.management.jmxremote.ssl.config.file=/sanfs/mnt/vol01/solr/solr-6.3.0/server/etc/ssl.properties"
  fi

else

  SOLR_JETTY_CONFIG+=("--module=http")

Fi


*Not working one (basically overriding again and is causing the incorrect
password):*



SOLR_SSL_OPTS=""

if [ -n "$SOLR_SSL_KEY_STORE" ]; then

  SOLR_JETTY_CONFIG+=("--module=https")

  SOLR_URL_SCHEME=https

  SOLR_SSL_OPTS=" -Dsolr.jetty.keystore=$SOLR_SSL_KEY_STORE \

-Dsolr.jetty.keystore.password=$SOLR_SSL_KEY_STORE_PASSWORD \

-Dsolr.jetty.truststore=$SOLR_SSL_TRUST_STORE \

-Dsolr.jetty.truststore.password=$SOLR_SSL_TRUST_STORE_PASSWORD \

-Dsolr.jetty.ssl.needClientAuth=$SOLR_SSL_NEED_CLIENT_AUTH \

-Dsolr.jetty.ssl.wantClientAuth=$SOLR_SSL_WANT_CLIENT_AUTH"

  if [ -n "$SOLR_SSL_CLIENT_KEY_STORE" ]; then

SOLR_SSL_OPTS+=" -Djavax.net.ssl.keyStore=$SOLR_SSL_CLIENT_KEY_STORE \

  -Djavax.net.ssl.keyStorePassword=$SOLR_SSL_CLIENT_KEY_STORE_PASSWORD \

  -Djavax.net.ssl.trustStore=$SOLR_SSL_CLIENT_TRUST_STORE \


-Djavax.net.ssl.trustStorePassword=$SOLR_SSL_CLIENT_TRUST_STORE_PASSWORD"

  else

SOLR_SSL_OPTS+=" -Djavax.net.ssl.keyStore=$SOLR_SSL_KEY_STORE \

  -Djavax.net.ssl.keyStorePassword=$SOLR_SSL_KEY_STORE_PASSWORD \

  -Djavax.net.ssl.trustStore=$SOLR_SSL_TRUST_STORE \

  -Djavax.net.ssl.trustStorePassword=$SOLR_SSL_TRUST_STORE_PASSWORD"

  fi



On Wed, Mar 20, 2019 at 10:45 AM Satya Marivada 
wrote:

> So I got a chance to do a diff of the environments solr-6.3.0 folder
> within contents.
>
> solr-6.3.0/bin/solr file has the difference highlighted in yellow. Any
> idea of what is going on in that if else in solr file?
>
> *The working configuration file contents are (ssl.properties below has the
> keystore path and password repeated):*
>
> SOLR_SSL_OPTS=""
>
> if [ -n "$SOLR_SSL_KEY_STORE" ]; then
>
>   SOLR_JETTY_CONFIG+=("--module=https")
>
>   SOLR_URL_SCHEME=https
>
>   SOLR_SSL_OPTS=" -Dsolr.jetty.keystore=$SOLR_SSL_KEY_STORE \
>
> -Dsolr.jetty.keystore.password=$SOLR_SSL_KEY_STORE_PASSWORD \
>
> -Dsolr.jetty.truststore=$SOLR_SSL_TRUST_STORE \
>
> -Dsolr.jetty.truststore.password=$SOLR_SSL_TRUST_STORE_PASSWORD \
>
> -Dsolr.jetty.ssl.needClientAuth=$SOLR_SSL_NEED_CLIENT_AUTH \
>
> -Dsolr.jetty.ssl.wantClientAuth=$SOLR_SSL_WANT_CLIENT_AUTH"
>
>   if [ -n "$SOLR_SSL_CLIENT_KEY_STORE" ]; then
>
> SOLR_SSL_OPTS+=" -Djavax.net.ssl.keyStore=$SOLR_SSL_CLIENT_KEY_STORE \
>
>   -Djavax.net.ssl.keyStorePassword=$SOLR_SSL_CLIENT_KEY_STORE_PASSWORD
> \
>
>   -Djavax.net.ssl.trustStore=$SOLR_SSL_CLIENT_TRUST_STORE \
>
>
> -Djavax.net.ssl.trustStorePassword=$SOLR_SSL_CLIENT_TRUST_STORE_PASSWORD"
>
>   else
>
> SOLR_SSL_OPTS+="
> -Dcom.sun.management.jmxremote.ssl.config.file=/sanfs/mnt/vol01/solr/solr-6.3.0/server/etc/ssl.properties"
>
>   fi
>
> else
>
>   SOLR_JETTY_CONFIG+=("--module=http")
>
> Fi
>
>
> *Not working one (basically overriding again and is causing the incorrect
> password):*
>
>
>
> SOLR_SSL_OPTS=""
>
> if [ -n "$SOLR_SSL_KEY_STORE" ]; then
>
>   SOLR_JETTY_CONFIG+=("--module=https")
>
>   SOLR_URL_SCHEME=https
>
>   SOLR_SSL_OPTS=" -Dsolr.jetty.keystore=$SOLR_SSL_KEY_STORE \
>
> -Dsolr.jetty.keystore.password=$SOLR_SSL_KEY_STORE_PASSWORD \
>
> -Dsolr.jetty.truststore=$SOLR_SSL_TRUST_STORE \
>
> -Dsolr.jetty.truststore.password=$SOLR_SSL_TRUST_STORE_PASSWORD \
>
> -Dsolr.jetty.ssl.needClientAuth=$SOLR_SSL_NEED_CLIENT_AUTH \
>
> -Dsolr.jetty.ssl.wantClientAuth=$SOLR_SSL_WANT_CLIENT_AUTH"
>
>   if [ -n "$SOLR_SSL_CLIENT_KEY_STORE" ]; then
>
> SOLR_SSL_OPTS+=" -Djavax.net.ssl.keyStore=$SOLR_SSL_CLIENT_KEY_STORE \
>
>   

RE: Upgrading tika

2019-03-20 Thread Tannen, Lev (USAEO) [Contractor]
Thank you Shawn and Erick,
 I truly did not want to dive into the Tika and CXF worlds, but it looks like I have no 
choice.

-Original Message-
From: Shawn Heisey  
Sent: Wednesday, March 20, 2019 11:09 AM
To: solr-user@lucene.apache.org
Subject: Re: Upgrading tika

On 3/20/2019 8:24 AM, Tannen, Lev (USAEO) [Contractor] wrote:
> I still need your advice. The program I have to fix uses class 
> AutoDetectParser along with Solrj for parsing PDF files before sending the 
> result to the solr server. To do this it linked two tika jar files taken from 
> the solr distribution. Namely: tika-core and tika-parsers. Maybe it used some 
> other tika related files but I have problems to identify them among a lot of 
> other jar files linked. The program worked more or less OK, but it gave too 
> many warnings of kind "Font not found". I had a rumor that this was fixed in 
> the next tika distribution.

Solr does include a subset of Tika - just enough to make the Extracting Request 
Handler work.

Since you're writing your own program that uses Tika, the dependencies you need 
could be very different than what Solr needs for its Tika integration.

It is strongly recommended, as Erick mentioned, to never use Solr's Tika 
integration in production.  Tika has a tendency to crash with some input files, 
especially PDF, and if it crashes when it is running inside Solr, then Solr 
will crash too.  No more search engine.

>   I switched from Solr 7.5 to solr 7.7.1 in a hope that will solve that 
> problem. However when I switched I encountered an  another problem:
> java.lang.NoClassDefFoundError: 
> org/apache/cxf/jaxrs/ext/multipart/ContentDisposition.
> Apparently I have not included some necessary jars. Those jars supposed to 
> come from a different project  called cxf, but because they are related to 
> tika I expected them be distributed with solr. However I did not find them in 
> the solr 7.7.1 (in the solr 7.5 as well).

I had never heard of CXF before.  It is not included with Solr.  The Extracting 
Request Handler must not use the part of Tika that needs CXF, so we don't 
include it.

> So could you please advise what is the best way to proceed.

If you want to know how to use Tika in your program and what you need for your 
particular use case, talk to the Tika project.  There is at least one person 
from the Tika project subscribed, but questions about that project are 
off-topic on this list.

Thanks,
Shawn


Re: Use of ShingleFilter causing very large BooleanQuery structures in Solr 7.1

2019-03-20 Thread Erick Erickson
The Apache mail server aggressively strips attachments, so yours didn’t come 
through. People often provide links to images stored somewhere else.

As to why this is behaving this way, I’m pretty clueless. A _complete_ shot in 
the dark: the query parser changed its default for split-on-whitespace from 
true to false, so perhaps try specifying "sow=true". Here’s some background: 
https://lucidworks.com/2017/04/18/multi-word-synonyms-solr-adds-query-time-support/
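
(If it helps, that experiment would just mean adding the parameter to the search request, e.g. — the query value below is a placeholder:)

  q=<your query>&sow=true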

I have no actual, you know, _knowledge_ that it’s related but it’d be 
super-easy to try and might give a clue.

Best,
Erick

> On Mar 20, 2019, at 2:00 AM, Hubert-Price, Neil  
> wrote:
> 
> Hello All,
>  
> We have a recently upgraded system that went from Solr 4.6 to Solr 7.1 (used 
> as part of an ecommerce application).  In the upgraded version we are seeing 
> frequent issues with very high Solr memory usage for certain types of query, 
> but the older 4.6 version does not produce the same response.
>  
> Having taken a heap dump and investigated, we can see instances of individual 
> Solr threads where the retained set is 4GB to 5GB in size.  Drilling into 
> this we can see a particular subquery with over 500,000 clauses.  Screenshots 
> below are from Eclipse MAT viewing a heap dump from the SOLR process. 
> Observing the 4.6 version, we see memory increments of 100-200 MB 
> for the same query, rather than 4-5 GB.
>  
> In both systems the index has around 2 million documents, with average size 
> around 8KB.
>  
> [heap-dump screenshots from Eclipse MAT appeared here but were stripped by the mailing list]
>  
> The subquery with a very large set of clauses relates to a particular field 
> setup to use ShingleFilter (with maxShingleSize=30, and outputUnigrams=true). 
> Schema.xml definitions for this field are:
>  
> <fieldType name="lowercase_tokens" class="solr.TextField" positionIncrementGap="100">
>   <analyzer>
>     <tokenizer class="solr.WhitespaceTokenizerFactory" />
>     <filter class="solr.StandardFilterFactory" />
>     <filter class="solr.LowerCaseFilterFactory" />
>     <filter class="solr.ShingleFilterFactory" maxShingleSize="30" outputUnigrams="true"/>
>   </analyzer>
> </fieldType>
>  
> <field name="productdetails_tokens_en" type="lowercase_tokens" indexed="true" stored="false" multiValued="true"/>
>  
> <copyField source="…" dest="productdetails_tokens_en" />
> <copyField source="…" dest="productdetails_tokens_en" />
> <copyField source="…" dest="productdetails_tokens_en" />
> <copyField source="…" dest="productdetails_tokens_en" />
>  
> The issue happens when the user search contains large numbers of tokens.  In 
> the example screenshots above the user search text had 20 tokens. The Solr 
> query for that thread was as below (formatting/indentation added by me, the 
> original is one long string).  This specific query contains tabs, however the 
> same behaviour happens when spaces are used as well:
> (
> +(
>   fulltext_en:(96114445009611444520   9611444530   
> 9611444540   9611414550 96121940029612194002   
> 9612194002   9612194003   9612194007 9611416470 
> 9611416470   96114164709611416480   9611416480 
> 9613484402 9613484402   9613484402   9613484402   
> 9613484402)
>   OR productdetails_tokens_en:(9611444500   9611444520   9611444530   
> 9611444540   9611414550 9612194002   9612194002   9612194002  
>  9612194003   9612194007 9611416470 9611416470
> 9611416470   9611416480   9611416480 9613484402 
> 9613484402   9613484402   96134844029613484402)
>   OR codePartial:(9611444500 9611444520   9611444530   9611444540 
>   9611414550 96121940029612194002   9612194002   
> 9612194003   9612194007 9611416470 9611416470   
> 96114164709611416480   9611416480 9613484402 
> 9613484402   9613484402   9613484402   9613484402)
> )
> )
> AND
> (
> (
>   (
>(productChannelVisibility_string_mv:ALL OR 
> productChannelVisibility_string_mv:EBUSINESS OR 
> productChannelVisibility_string_mv:INTERNET OR 
> productChannelVisibility_string_mv:INTRANET)
>AND
>!productChannelVisibility_string_mv:NOTVISIBLE
>   )
>   AND
>   (
>+(
> fulltext_en:(9611444500  9611444520   9611444530   
> 9611444540   9611414550 96121940029612194002   
> 9612194002   9612194003   9612194007 9611416470 
> 9611416470   96114164709611416480   9611416480 
> 9613484402 9613484402   9613484402   9613484402   
> 9613484402)
> OR productdetails_tokens_en:(9611444500 9611444520   9611444530   
> 9611444540   9611414550 9612194002   9612194002   9612194002  
>  

Re: Upgrading tika

2019-03-20 Thread Shawn Heisey

On 3/20/2019 8:24 AM, Tannen, Lev (USAEO) [Contractor] wrote:

I still need your advice. The program I have to fix uses class AutoDetectParser along 
with Solrj for parsing PDF files before sending the result to the solr server. To do this 
it linked two tika jar files taken from the solr distribution. Namely: tika-core and 
tika-parsers. Maybe it used some other tika related files but I have problems to identify 
them among a lot of other jar files linked. The program worked more or less OK, but it 
gave too many warnings of kind "Font not found". I had a rumor that this was 
fixed in the next tika distribution.


Solr does include a subset of Tika - just enough to make the Extracting 
Request Handler work.


Since you're writing your own program that uses Tika, the dependencies 
you need could be very different than what Solr needs for its Tika 
integration.


It is strongly recommended, as Erick mentioned, to never use Solr's Tika 
integration in production.  Tika has a tendency to crash with some input 
files, especially PDF, and if it crashes when it is running inside Solr, 
then Solr will crash too.  No more search engine.



  I switched from Solr 7.5 to solr 7.7.1 in a hope that will solve that 
problem. However when I switched I encountered an  another problem:
java.lang.NoClassDefFoundError: 
org/apache/cxf/jaxrs/ext/multipart/ContentDisposition.
Apparently I have not included some necessary jars. Those jars supposed to come 
from a different project  called cxf, but because they are related to tika I 
expected them be distributed with solr. However I did not find them in the solr 
7.7.1 (in the solr 7.5 as well).


I had never heard of CXF before.  It is not included with Solr.  The 
Extracting Request Handler must not use the part of Tika that needs CXF, 
so we don't include it.



So could you please advise what is the best way to proceed.


If you want to know how to use Tika in your program and what you need 
for your particular use case, talk to the Tika project.  There is at 
least one person from the Tika project subscribed, but questions about 
that project are off-topic on this list.


Thanks,
Shawn


Re: Upgrading tika

2019-03-20 Thread Erick Erickson
Well, I’d have to do the same thing, go spelunking in Tika. When I used it 
from SolrJ, I just linked to the Tika distro and it “just worked”, but I admit 
that was a while ago.

Your best bet would probably be the Tika user’s list.

Best,
Erick

> On Mar 20, 2019, at 7:24 AM, Tannen, Lev (USAEO) [Contractor] 
>  wrote:
> 
> Hi Erick,
> 
> I still need your advice. The program I have to fix uses class 
> AutoDetectParser along with Solrj for parsing PDF files before sending the 
> result to the solr server. To do this it linked two tika jar files taken from 
> the solr distribution. Namely: tika-core and tika-parsers. Maybe it used some 
> other tika related files but I have problems to identify them among a lot of 
> other jar files linked. The program worked more or less OK, but it gave too 
> many warnings of kind "Font not found". I had a rumor that this was fixed in 
> the next tika distribution.
> I switched from Solr 7.5 to solr 7.7.1 in a hope that will solve that 
> problem. However when I switched I encountered an  another problem: 
> java.lang.NoClassDefFoundError: 
> org/apache/cxf/jaxrs/ext/multipart/ContentDisposition.
> Apparently I have not included some necessary jars. Those jars supposed to 
> come from a different project  called cxf, but because they are related to 
> tika I expected them be distributed with solr. However I did not find them in 
> the solr 7.7.1 (in the solr 7.5 as well). 
> I have found the necessary file in the cxf distribution and included it. It 
> asked for an another file which I included as well. After this I got a 
> message that some temporary resources were not closed. Apparently something 
> is not matched. And now I am stuck. I do not want to start from scratch and 
> search the whole tika and cxf projects for the files I need and I do not want 
> to include all files from those projects especially because I was not able to 
> find a binary distribution. So could you please advise what is the best way 
> to proceed.
> 
> Thank you,
> Lev Tannen
> 
> -Original Message-
> From: Erick Erickson  
> Sent: Tuesday, March 19, 2019 2:48 PM
> To: solr-user 
> Subject: Re: Upgrading tika
> 
> Yes, Solr is distributed with Tika. Look in:
> ./solr/contrib/extraction/lib
> 
> Tika is upgraded when new versions come out, so the underlying files are 
> whatever are current at the time.
> 
> The integration is a fairly loose coupling, if you're using some external 
> program (say a SolrJ program) to parse the files, there's no requirement to 
> use the jars distributed with Solr, use whatever suits your fancy. An 
> external program just constructs a SolrDocument to send to Solr. What you use 
> to create that document is irrelevant. See:
> https://lucidworks.com/2012/02/14/indexing-with-solrj/ for some background.
> 
> If you're using the ExtractingRequestHandler, where you just send the 
> semi-structured docs to Solr (PDFs, Word or whatever), then needing to know 
> anything about individual Tika-related jar files is kind of strange.
> 
> If your predecessors wrote some custom code that runs as part of Solr, I 
> don't know what to say...
> 
> Best,
> Erick
> 
> On Tue, Mar 19, 2019 at 10:47 AM Tannen, Lev (USAEO) [Contractor] 
>  wrote:
>> 
>> Thank you Shawn.
>> I assumed that tika has been integrated with solr. I the project written 
>> before me they used two tika files taken from solr distribution. I am trying 
>> to do the same with solr 7.7.1. However this version contains a different 
>> set of tika related files. So I am confused. Does  solr does not have 
>> integrated tika anymore, or I just cannot recognize them?
>> 
>> -Original Message-
>> From: Shawn Heisey 
>> Sent: Tuesday, March 19, 2019 11:11 AM
>> To: solr-user@lucene.apache.org
>> Subject: Re: Upgrading tika
>> 
>> On 3/19/2019 9:03 AM, levtannen wrote:
>>> Could anybody suggest me what files do I need to use the latest 
>>> version of Tika and where to find them?
>> 
>> This mailing list is solr-user.  Tika is an entirely separate project from 
>> Solr within the Apache Foundation.  To get help with Tika, you'll need to 
>> ask that project.
>> 
>> https://tika.apache.org/mail-lists.html
>> 
>> Thanks,
>> Shawn



Re: Re: obfuscated password error

2019-03-20 Thread Satya Marivada
So I got a chance to do a diff of the environments solr-6.3.0 folder within
contents.

solr-6.3.0/bin/solr file has the difference highlighted in yellow. Any idea
of what is going on in that if else in solr file?

*The working configuration file contents are (ssl.properties below has the
keystore path and password repeated):*

SOLR_SSL_OPTS=""

if [ -n "$SOLR_SSL_KEY_STORE" ]; then

  SOLR_JETTY_CONFIG+=("--module=https")

  SOLR_URL_SCHEME=https

  SOLR_SSL_OPTS=" -Dsolr.jetty.keystore=$SOLR_SSL_KEY_STORE \

-Dsolr.jetty.keystore.password=$SOLR_SSL_KEY_STORE_PASSWORD \

-Dsolr.jetty.truststore=$SOLR_SSL_TRUST_STORE \

-Dsolr.jetty.truststore.password=$SOLR_SSL_TRUST_STORE_PASSWORD \

-Dsolr.jetty.ssl.needClientAuth=$SOLR_SSL_NEED_CLIENT_AUTH \

-Dsolr.jetty.ssl.wantClientAuth=$SOLR_SSL_WANT_CLIENT_AUTH"

  if [ -n "$SOLR_SSL_CLIENT_KEY_STORE" ]; then

SOLR_SSL_OPTS+=" -Djavax.net.ssl.keyStore=$SOLR_SSL_CLIENT_KEY_STORE \

  -Djavax.net.ssl.keyStorePassword=$SOLR_SSL_CLIENT_KEY_STORE_PASSWORD \

  -Djavax.net.ssl.trustStore=$SOLR_SSL_CLIENT_TRUST_STORE \


-Djavax.net.ssl.trustStorePassword=$SOLR_SSL_CLIENT_TRUST_STORE_PASSWORD"

  else

SOLR_SSL_OPTS+="
-Dcom.sun.management.jmxremote.ssl.config.file=/sanfs/mnt/vol01/solr/solr-6.3.0/server/etc/ssl.properties"

  fi

else

  SOLR_JETTY_CONFIG+=("--module=http")

Fi


*Not working one (basically overriding again and is causing the incorrect
password):*



SOLR_SSL_OPTS=""

if [ -n "$SOLR_SSL_KEY_STORE" ]; then

  SOLR_JETTY_CONFIG+=("--module=https")

  SOLR_URL_SCHEME=https

  SOLR_SSL_OPTS=" -Dsolr.jetty.keystore=$SOLR_SSL_KEY_STORE \

-Dsolr.jetty.keystore.password=$SOLR_SSL_KEY_STORE_PASSWORD \

-Dsolr.jetty.truststore=$SOLR_SSL_TRUST_STORE \

-Dsolr.jetty.truststore.password=$SOLR_SSL_TRUST_STORE_PASSWORD \

-Dsolr.jetty.ssl.needClientAuth=$SOLR_SSL_NEED_CLIENT_AUTH \

-Dsolr.jetty.ssl.wantClientAuth=$SOLR_SSL_WANT_CLIENT_AUTH"

  if [ -n "$SOLR_SSL_CLIENT_KEY_STORE" ]; then

SOLR_SSL_OPTS+=" -Djavax.net.ssl.keyStore=$SOLR_SSL_CLIENT_KEY_STORE \

  -Djavax.net.ssl.keyStorePassword=$SOLR_SSL_CLIENT_KEY_STORE_PASSWORD \

  -Djavax.net.ssl.trustStore=$SOLR_SSL_CLIENT_TRUST_STORE \


-Djavax.net.ssl.trustStorePassword=$SOLR_SSL_CLIENT_TRUST_STORE_PASSWORD"

  else

SOLR_SSL_OPTS+=" -Djavax.net.ssl.keyStore=$SOLR_SSL_KEY_STORE \

  -Djavax.net.ssl.keyStorePassword=$SOLR_SSL_KEY_STORE_PASSWORD \

  -Djavax.net.ssl.trustStore=$SOLR_SSL_TRUST_STORE \

  -Djavax.net.ssl.trustStorePassword=$SOLR_SSL_TRUST_STORE_PASSWORD"

  fi

On Tue, Mar 19, 2019 at 10:10 AM Satya Marivada 
wrote:

> Hi Jeremy,
>
> Thanks for the points. Yes, agreed that there is some conflicting property
> somewhere that is not letting it work. So I basically restored the solr-6.3.0
> directory from another environment and replaced the host name appropriately
> for this environment. And I used the original keystore that has been
> generated for this environment and it worked fine. So basically the
> keystore is good as well except that there is some conflicting property
> which is not letting it do deobfuscation right.
>
> Thanks,
> Satya
>
> On Mon, Mar 18, 2019 at 2:32 PM Branham, Jeremy (Experis) <
> jb...@allstate.com> wrote:
>
>> I’m not sure if you are sharing the trust/keystores, so I may be off-base
>> here…
>>
>> Some thoughts –
>> - Verify your VM arguments, to be sure there aren’t conflicting SSL
>> properties.
>> - Verify the environment is targeting the correct version of Java
>> - Verify the trust/key stores exist where they are expected, and you can
>> list the contents with the keytool
>> - Verify the correct CA certs are trusted
>>
>>
>> Jeremy Branham
>> jb...@allstate.com
>>
>> On 3/18/19, 1:08 PM, "Satya Marivada"  wrote:
>>
>> Any suggestions please.
>>
>> Thanks,
>> Satya
>>
>> On Mon, Mar 18, 2019 at 11:12 AM Satya Marivada <
>> satya.chaita...@gmail.com>
>> wrote:
>>
>> > Hi All,
>> >
>> > Using solr-6.3.0, to obfuscate the password, have used jetty util to
>> > generate obfuscated password
>> >
>> >
>> > java -cp jetty-util-9.3.8.v20160314.jar
>> > org.eclipse.jetty.util.security.Password mypassword
>> >
>> >
>> > The output has been used in solr.in.sh as below
>> >
>> >
>> >
>> >
>> SOLR_SSL_KEY_STORE=/sanfs/mnt/vol01/solr/solr-6.3.0/server/etc/solr-ssl.keystore.jks
>> >
>> >
>> SOLR_SSL_KEY_STORE_PASSWORD="OBF:1bcd1l161lts1ltu1uum1uvk1lq41lq61k221b9t"
>> >
>> >
>> >
>> SOLR_SSL_TRUST_STORE=/sanfs/mnt/vol01/solr/solr-6.3.0/server/etc/solr-ssl.keystore.jks
>> >
>> >
>> >
>> SOLR_SSL_TRUST_STORE_PASSWORD="OBF:1bcd1l161lts1ltu1uum1uvk1lq41lq61k221b9t"
>> >
>>

RE: Upgrading tika

2019-03-20 Thread Tannen, Lev (USAEO) [Contractor]
Hi Erick,

I still need your advice. The program I have to fix uses the class AutoDetectParser 
along with Solrj for parsing PDF files before sending the result to the solr 
server. To do this it linked two tika jar files taken from the solr 
distribution, namely tika-core and tika-parsers. Maybe it used some other tika 
related files, but I have problems identifying them among a lot of other jar 
files linked. The program worked more or less OK, but it gave too many warnings 
of the kind "Font not found". I heard a rumor that this was fixed in the next tika 
release.
 I switched from Solr 7.5 to solr 7.7.1 in the hope that it would solve that problem. 
However, when I switched I encountered another problem: 
java.lang.NoClassDefFoundError: 
org/apache/cxf/jaxrs/ext/multipart/ContentDisposition.
Apparently I have not included some necessary jars. Those jars are supposed to come 
from a different project called cxf, but because they are related to tika I 
expected them to be distributed with solr. However, I did not find them in solr 
7.7.1 (or in solr 7.5 either). 
I have found the necessary file in the cxf distribution and included it. It 
asked for another file, which I included as well. After this I got a message 
that some temporary resources were not closed. Apparently something does not 
match, and now I am stuck. I do not want to start from scratch and search the 
whole tika and cxf projects for the files I need, and I do not want to include 
all files from those projects, especially because I was not able to find a 
binary distribution. So could you please advise what is the best way to proceed?

Thank you,
Lev Tannen
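
One way to avoid hand-picking jars is to let a build tool resolve the Tika
artifacts and their transitive dependencies. A minimal Maven sketch; 1.19.1 is an
assumption about the Tika version bundled with Solr 7.7.1, so verify it against
the jar names in solr/contrib/extraction/lib:

    <dependency>
      <groupId>org.apache.tika</groupId>
      <artifactId>tika-core</artifactId>
      <version>1.19.1</version>
    </dependency>
    <dependency>
      <groupId>org.apache.tika</groupId>
      <artifactId>tika-parsers</artifactId>
      <version>1.19.1</version>
    </dependency>

Maven then pulls in matching versions of the parser dependencies instead of their
being collected by hand.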

-Original Message-
From: Erick Erickson  
Sent: Tuesday, March 19, 2019 2:48 PM
To: solr-user 
Subject: Re: Upgrading tika

Yes, Solr is distributed with Tika. Look in:
./solr/contrib/extraction/lib

Tika is upgraded when new versions come out, so the underlying files are 
whatever are current at the time.

The integration is a fairly loose coupling. If you're using some external
program (say, a SolrJ program) to parse the files, there's no requirement to use
the jars distributed with Solr; use whatever suits your fancy. An external
program just constructs a SolrDocument to send to Solr. What you use to create 
that document is irrelevant. See:
https://lucidworks.com/2012/02/14/indexing-with-solrj/ for some background.
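
For reference, a minimal sketch of that pattern: parse locally with Tika's
AutoDetectParser, build a SolrInputDocument, and send it with SolrJ. The
collection URL and field names are placeholders, and tika-core plus tika-parsers
(with their dependencies) need to be on the classpath:

    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.metadata.TikaCoreProperties;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.sax.BodyContentHandler;

    public class IndexPdf {
      public static void main(String[] args) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler(-1); // -1 = no write limit
        Metadata metadata = new Metadata();

        // Extract text and metadata from the file given on the command line
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
          parser.parse(in, handler, metadata);
        }

        // Build the document ourselves; Solr never sees the PDF
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", args[0]);
        doc.addField("title", metadata.get(TikaCoreProperties.TITLE));
        doc.addField("content", handler.toString());

        try (HttpSolrClient client =
                 new HttpSolrClient.Builder("http://localhost:8983/solr/mycollection").build()) {
          client.add(doc);
          client.commit();
        }
      }
    }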

If you're using the ExtractingRequestHandler, where you just send the 
semi-structured docs to Solr (PDFs, Word or whatever), then needing to know 
anything about individual Tika-related jar files is kind of strange.
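
For completeness, a minimal sketch of that route (collection name, document id and
file path are placeholders):

    curl 'http://localhost:8983/solr/mycollection/update/extract?literal.id=doc1&commit=true' \
      -F 'myfile=@/path/to/example.pdf'

In that setup, the Tika jars that ship under solr/contrib/extraction/lib are used
on the server side.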

If your predecessors wrote some custom code that runs as part of Solr, I don't 
know what to say...

Best,
Erick

On Tue, Mar 19, 2019 at 10:47 AM Tannen, Lev (USAEO) [Contractor] 
 wrote:
>
> Thank you Shawn.
> I assumed that Tika had been integrated with Solr. In the project written
> before me, two Tika jar files taken from the Solr distribution were used. I am
> trying to do the same with Solr 7.7.1. However, this version contains a
> different set of Tika-related files, so I am confused. Does Solr no longer ship
> with integrated Tika, or am I just not recognizing the files?
>
> -Original Message-
> From: Shawn Heisey 
> Sent: Tuesday, March 19, 2019 11:11 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Upgrading tika
>
> On 3/19/2019 9:03 AM, levtannen wrote:
> > Could anybody suggest which files I need in order to use the latest
> > version of Tika, and where to find them?
>
> This mailing list is solr-user.  Tika is an entirely separate project from 
> Solr within the Apache Foundation.  To get help with Tika, you'll need to ask 
> that project.
>
> https://tika.apache.org/mail-lists.html
>
> Thanks,
> Shawn


Suggester case (in)sensitive

2019-03-20 Thread Moritz Schmidt
Hello everyone.

I’m trying to build autocomplete functionality.
My setup works but has one problem:
When using HighFrequencyDictionaryFactory, the suggestion results I get are all
lowercased, as defined in my schema.xml:

[the field type excerpt was stripped by the mail archive; its analyzer chain
includes a LowerCaseFilterFactory]
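A minimal sketch of what such a field type typically looks like (the field type
name and tokenizer are assumptions; only the presence of the LowerCaseFilterFactory
is taken from the message):

    <fieldType name="suggestion_text" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
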
Without the LowerCaseFilterFactory I get my results as I want but the search is 
case sensitive.

When using DocumentDictionaryFactory I get my suggestions in unaltered form, but
with a lot of duplicates, since (among other fields) I use a keywords field for
suggestions and DocumentDictionaryFactory appears to store the entries per document.


Is this the intended behaviour for HighFrequencyDictionaryFactory?
If it is, how would you solve this? Remove the LowerCaseFilterFactory and
write a custom SuggestComponent that searches for the query both lowercased and
with the first letter capitalized?

Here’s the solrconfig.xml excerpt for reference:

  texts
  AnalyzingInfixLookupFactory
  HighFrequencyDictionaryFactory
  
  suggestion
  suggestion_text
  false
  true
  true
  false
  0.0
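
The archive stripped the XML tags from this excerpt as well, so only the values
remain. Mapped onto the usual SuggestComponent parameters, a sketch would look as
follows; the parameter names are an assumption, only the values come from the
excerpt above:

    <searchComponent name="suggest" class="solr.SuggestComponent">
      <lst name="suggester">
        <str name="name">texts</str>
        <str name="lookupImpl">AnalyzingInfixLookupFactory</str>
        <str name="dictionaryImpl">HighFrequencyDictionaryFactory</str>
        <str name="field">suggestion</str>
        <str name="suggestAnalyzerFieldType">suggestion_text</str>
        <str name="buildOnStartup">false</str>
        <str name="buildOnCommit">true</str>
        <str name="highlight">true</str>
        <str name="exactMatchFirst">false</str>
        <float name="threshold">0.0</float>
      </lst>
    </searchComponent>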


Thanks for your help,
Moe

Use of ShingleFilter causing very large BooleanQuery structures in Solr 7.1

2019-03-20 Thread Hubert-Price, Neil
Hello All,

We have a recently upgraded system that went from Solr 4.6 to Solr 7.1 (used as 
part of an ecommerce application).  In the upgraded version we are seeing 
frequent issues with very high Solr memory usage for certain types of query, 
but the older 4.6 version does not produce the same response.

Having taken a heap dump and investigated, we can see instances of individual 
Solr threads where the retained set is 4GB to 5GB in size.  Drilling into this 
we can see a particular subquery with over 500,000 clauses.  Screenshots below 
are from Eclipse MAT viewing a heap dump from the Solr process. Observing the 4.6
version, we see memory increments of only 100-200 MB for the same query, rather
than 4-5 GB.

In both systems the index has around 2 million documents, with average size 
around 8KB.


[Eclipse MAT screenshot 1 - image not included in the archive]

[Eclipse MAT screenshot 2 - image not included in the archive]


The subquery with a very large set of clauses relates to a particular field 
setup to use ShingleFilter (with maxShingleSize=30, and outputUnigrams=true). 
Schema.xml definitions for this field are:

[the schema.xml excerpt was stripped by the mail archive; per the description
above, it includes a ShingleFilter with maxShingleSize=30 and outputUnigrams=true]
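A minimal sketch of a field type with those settings (the field type name,
tokenizer and other filters are assumptions; only the ShingleFilter parameters
come from the message):

    <fieldType name="text_shingles" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.ShingleFilterFactory" maxShingleSize="30" outputUnigrams="true"/>
      </analyzer>
    </fieldType>

(As an aside on the numbers reported below: 524288 = 2^19, which is exactly the
number of ways a sequence of 20 tokens can be split into consecutive groups, each
group becoming one shingle; that is consistent with the query parser expanding
every path through the shingled token graph at query time.)
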
The issue happens when the user search contains a large number of tokens. In
the example screenshots above, the user search text had 20 tokens. The Solr
query for that thread was as below (formatting/indentation added by me; the
original is one long string). This specific query separates the tokens with
tabs, but the same behaviour happens when spaces are used as well:
(
  +(
    fulltext_en:(9611444500 9611444520 9611444530 9611444540 9611414550
                 9612194002 9612194002 9612194002 9612194003 9612194007
                 9611416470 9611416470 9611416470 9611416480 9611416480
                 9613484402 9613484402 9613484402 9613484402 9613484402)
    OR productdetails_tokens_en:(9611444500 9611444520 9611444530 9611444540 9611414550
                 9612194002 9612194002 9612194002 9612194003 9612194007
                 9611416470 9611416470 9611416470 9611416480 9611416480
                 9613484402 9613484402 9613484402 9613484402 9613484402)
    OR codePartial:(9611444500 9611444520 9611444530 9611444540 9611414550
                 9612194002 9612194002 9612194002 9612194003 9612194007
                 9611416470 9611416470 9611416470 9611416480 9611416480
                 9613484402 9613484402 9613484402 9613484402 9613484402)
  )
)
AND
(
  (
    (
      (productChannelVisibility_string_mv:ALL OR
       productChannelVisibility_string_mv:EBUSINESS OR
       productChannelVisibility_string_mv:INTERNET OR
       productChannelVisibility_string_mv:INTRANET)
      AND
      !productChannelVisibility_string_mv:NOTVISIBLE
    )
    AND
    (
      +(
        fulltext_en:(9611444500 9611444520 9611444530 9611444540 9611414550
                     9612194002 9612194002 9612194002 9612194003 9612194007
                     9611416470 9611416470 9611416470 9611416480 9611416480
                     9613484402 9613484402 9613484402 9613484402 9613484402)
        OR productdetails_tokens_en:(9611444500 9611444520 9611444530 9611444540 9611414550
                     9612194002 9612194002 9612194002 9612194003 9612194007
                     9611416470 9611416470 9611416470 9611416480 9611416480
                     9613484402 9613484402 9613484402 9613484402 9613484402)
        OR codePartial:(9611444500 9611444520 9611444530 9611444540 9611414550
                     9612194002 9612194002 9612194002 9612194003 9612194007
                     9611416470 9611416470 9611416470 9611416480 9611416480
                     9613484402 9613484402 9613484402 9613484402 9613484402)
      )
    )
  )
)

In the heap dump we can see the subqueries relating to fulltext_en/codePartial 
fields both have just 20 clauses. However, the two subqueries relating to
productdetails_tokens_en both have 524288 clauses, and each of those clauses is a
subquery with up to 20 clauses (each of which seems to be a different shingled
combination of the original tokens). For example, selecting an arbitrary single 
entry from the 524288 clauses, we can see a subquery with the following clauses:

Occur.MUST, productdetails_tokens_en: 9611444500
Occur.MUST, productdetails_tokens_en: 9611416470 9611416480
Occur.MUST, productdetails_tokens_en: