Re: get all tokens from TokenStream in my custom filter

2017-11-20 Thread Emir Arnautović
Hi Kumar
> Emir, I need all tokens of the query in the incrementToken() function, not
> only the current token.

That was just an example - the point was that you need to set attributes: you
can read all tokens from the previous stream, do whatever is needed with them
and, when ready, set the attributes and return true. Peeking and looking ahead
is just a convenience method that can be used to decide whether you want to
emit a token from your token filter without consuming a token from the
previous one.
If I got your case right, you want to consume all tokens, concatenate them,
return a single token and forget about the tokens from the previous stream. If
that's the case, you need something like:

StringBuilder term = new StringBuilder();
while (input.incrementToken()) {
  if (term.length() > 0) {
    term.append(' ');               // StringBuilder avoids repeated String copies
  }
  term.append(termAtt);             // CharTermAttribute implements CharSequence
}
termAtt.setEmpty().append(term);    // the actual API for setting the term text
posIncAtt.setPositionIncrement(1);  // single emitted token, standard increment

return term.length() > 0;


Take this as pseudo code.
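
If it helps, here is a fuller, compilable sketch of the same idea (the class
name ConcatenateFilter and the whitespace joining are mine, not something from
Lucene):

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

// Emits exactly one token per stream: all upstream tokens joined with spaces.
public final class ConcatenateFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final PositionIncrementAttribute posIncAtt =
      addAttribute(PositionIncrementAttribute.class);
  private boolean done;

  public ConcatenateFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (done) {
      return false;                  // the single output token was already emitted
    }
    done = true;
    StringBuilder sb = new StringBuilder();
    while (input.incrementToken()) { // consume the whole upstream stream
      if (sb.length() > 0) {
        sb.append(' ');
      }
      sb.append(termAtt);            // CharTermAttribute is a CharSequence
    }
    if (sb.length() == 0) {
      return false;                  // upstream produced no tokens at all
    }
    clearAttributes();
    termAtt.setEmpty().append(sb);   // set the concatenated term text
    posIncAtt.setPositionIncrement(1);
    return true;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    done = false;
  }
}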

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 19 Nov 2017, at 21:54, Ahmet Arslan  wrote:
> 
> 
> Hi Kumar,
> I checked the code base and I couldn't find a peek method either. However, I
> found LookaheadTokenFilter, which may be useful to you.
> I figure this is a Lucene question, and you may receive more answers on
> the Lucene user list.
> Ahmet
> 
> 
>On Sunday, November 19, 2017, 10:16:21 PM GMT+3, kumar gaurav 
>  wrote:  
> 
> Hi friends 
> Thank you very much for your replies.
> I still could not solve the problem.
> Emir, I need all tokens of the query in the incrementToken() function, not
> only the current token.
> Modassar, if I do not end or close the stream, all tokens are blank and only
> the last token is indexed.
> Ahmet, I could not find a peek or advance method :(
> 
> Please help me, guys.
> 
> On Fri, Nov 17, 2017 at 10:10 PM, Ahmet Arslan  wrote:
> 
> Hi Kumar,
> If I am not wrong, I think there is a method named something like peek(2) or
> advance(2). Some filters access tokens ahead and perform some logic.
> Ahmet
>
> On Wednesday, November 15, 2017, 10:50:55 PM GMT+3, kumar gaurav
>  wrote:
> 
> Hi
> 
> I need to get the full field value from the TokenStream in my custom filter class.
> 
> I am using this:
> 
> stream.reset();
> while (stream.incrementToken()) {
>   term += " " + charTermAttr.toString();
> }
> stream.end();
> stream.close();
> 
> This ends the stream; no tokens are produced if I use this.
> 
> I want to get the full string without hampering token creation.
> 
> Eric! Are you there? :) Anyone, please help?
> 
> 



Solr - How to Clear the baseDir folder after the DIH import

2017-11-20 Thread Karan Saini
Hi guys,

Solr Version :: 6.6.1

I am able to import the PDF files into the Solr system using the DIH, and
the indexing performs as expected. But I wish to clear the folder
C:/solr-6.6.1/server/solr/core_K2_Depot/Depot after the indexing process
finishes successfully.

Please suggest if there is a way to delete all the files from the folder
via the DIH data-config.xml.

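I came across DIH's onImportStart/onImportEnd listener hooks. Would something
like this sketch be the right direction? (The class name is mine; it would be
wired up in data-config.xml as <document onImportEnd="com.example.DepotCleanupListener">.)

package com.example;

import java.io.File;

import org.apache.solr.handler.dataimport.Context;
import org.apache.solr.handler.dataimport.EventListener;

// Deletes every file in the Depot folder once the import has finished.
public class DepotCleanupListener implements EventListener {
  @Override
  public void onEvent(Context ctx) {
    File depot = new File("C:/solr-6.6.1/server/solr/core_K2_Depot/Depot");
    File[] files = depot.listFiles();
    if (files == null) {
      return; // folder missing or unreadable - nothing to do
    }
    for (File f : files) {
      if (f.isFile() && !f.delete()) {
        System.err.println("Could not delete " + f.getAbsolutePath());
      }
    }
  }
}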

Thanks,

Karan


Re: Leading wildcard searches very slow

2017-11-20 Thread Emir Arnautović
Hi Sundeep,
The simplified explanation is that terms are indexed to be prefix-search
friendly (and that is why Amrit suggested that you index terms reversed if you
want leading wildcards). If you use a leading wildcard, there is no structure
to limit the terms that can be matched, so the engine has to check every term
to see if it matches the provided suffix. That means latency depends on the
cardinality of your field. And when the matching terms are found, the engine
has to create an OR query using all of them - the more terms matched, the
longer the query will take to execute (this also applies to regular wildcard
queries: a short prefix that matches many terms will be slow).
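
To illustrate the reversal trick, a sketch in plain Lucene (Solr's
ReversedWildcardFilterFactory does the same thing, with extra marker handling):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.KeywordTokenizer;
import org.apache.lucene.analysis.reverse.ReverseStringFilter;

// Indexes each value reversed, so a leading wildcard such as *phone can be
// answered as the prefix query enohp* against this reversed field.
public class ReversedFieldAnalyzer extends Analyzer {
  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer source = new KeywordTokenizer();
    return new TokenStreamComponents(source, new ReverseStringFilter(source));
  }
}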

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 18 Nov 2017, at 01:42, Amrit Sarkar  wrote:
> 
> Sundeep,
> 
> You would like to explore
> http://lucene.apache.org/solr/6_6_1/solr-core/org/apache/solr/analysis/ReversedWildcardFilterFactory.html
> here probably.
> 
> Thanks
> Amrit Sarkar
> 
> On 18 Nov 2017 6:06 a.m., "Sundeep T"  wrote:
> 
>> Hi,
>> 
>> We have several indexed string fields which are not tokenized and do not
>> have docValues enabled.
>> 
>> When we do leading wildcard searches on these fields, they run very
>> slowly. We were thinking that since these fields are indexed, such queries
>> should run pretty quickly. We are using Solr 6.6.1. Does anyone have ideas
>> on why these queries are running slow and whether there are any ways to
>> speed them up?
>> 
>> Thanks
>> Sundeep
>> 



Re: Solr LTR plugin - Training

2017-11-20 Thread ilayaraja
Yes, that works. Thanks.



-
--Ilay
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Solr cloud in kubernetes

2017-11-20 Thread Björn Häuser
Hi Raja,

We are using SolrCloud as a StatefulSet, and every pod has its own storage
attached to it.

Thanks
Björn

> On 20. Nov 2017, at 05:59, rajasaur  wrote:
> 
> Hi Bjorn,
> 
> I'm trying a similar approach now (to get SolrCloud working on Kubernetes).
> I have run Zookeeper as a StatefulSet, but not SolrCloud, which is causing
> an issue when my pods get destroyed and restarted.
> I will try the -h option so that the SOLR_HOST is used when connecting to
> itself (and to Zookeeper).
> 
> On another note, how do you store the indexes? I had an issue with my GCE
> node (Node NotReady) which needed its kubelet restarted; with that, since
> the SolrCloud pods were restarted, all the data got wiped out. Just
> wondering how you have set up your indexes with the SolrCloud Kubernetes
> setup.
> 
> Thanks
> Raja
> 
> 
> 
> 
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html



Re: Solr7: Very High number of threads on aggregator node

2017-11-20 Thread Nawab Zada Asad Iqbal
@Rick
I see many indexing configs, but I don't see any config related to querying
(i.e., number of threads, etc.) in solrconfig. What would be the relevant
part for this area? In Jetty, the threadpool is set to 1.

@Toke:
I have a webserver which uses Solr for querying, which I guess is pretty
typical. At times, there are 50 users sending queries in a given second.
Sometimes, the queries take a few seconds to finish (i.e., if the max across
all shards is 5 seconds due to any local reason, the aggregator query will
take 5 seconds even if the median is sub-second). This can cause some query
load to build up on the aggregator node. This is all fine and understandable.
Now, the load and test client are identical for both solr4.5 and solr7; what
could be causing the solr7 aggregator to spawn more threads? I also agree
that 4000 threads is not useful, so the solution is not to increase the
thread limit for the process; rather, it is somewhere else.



Thanks
Nawab




On Sat, Nov 18, 2017 at 10:22 AM, Rick Leir  wrote:

> Nawab
> You probably need to share the relevant config to get an answer to this.
> Cheers -- Rick
>
> On November 17, 2017 2:19:03 PM EST, Nawab Zada Asad Iqbal <
> khi...@gmail.com> wrote:
> >Hi,
> >
> >I have a sharded solr7 cluster and I am using an aggregator node (which
> >has no data/index of its own) to distribute queries and aggregate results
> >from the shards. I am puzzled that when I use solr7 on the aggregator
> >node, the number of threads shoots up to 32000 on that host and then the
> >process reaches its memory limits. However, when I use solr4 on the
> >aggregator, it all seems to work fine. The peak number of threads during
> >my testing was around 4000 or so. The test load is the same in both cases,
> >except that it doesn't finish in the case of solr7 (due to the memory /
> >thread issue). The memory settings and Jetty threadpool setting (max=1)
> >are also consistent in both servers (solr 4 and solr 7).
> >
> >
> >Has anyone else been in similar circumstances?
> >
> >
> >Thanks
> >Nawab
>
> --
> Sorry for being brief. Alternate email is rickleir at yahoo dot com


Re: Fwd: CVE-2017-3163 - SOLR-5.2.1 version

2017-11-20 Thread Rick Leir
Pad
Read the CVE. Do you have an affected version of Solr? Do you have the 
replication feature enabled in solrconfig.xml? Note that it might be enabled by 
default. Test directory traversal on your system: can you read files remotely? 
No? Then you are finished.

A better plan: upgrade to a newer version of Solr (I know, you may not be able 
to).
Cheers -- Rick

On November 20, 2017 4:01:47 AM EST, padmanabhan gonesani 
 wrote:
>Please help me here
>
>
>
>-- Forwarded message --
>From: padmanabhan gonesani 
>Date: Mon, Nov 13, 2017 at 5:12 PM
>Subject: CVE-2017-3163 - SOLR-5.2.1 version
>To: gene...@lucene.apache.org
>
>
>
>Hi Team,
>
>*Description:* Apache Solr could allow a remote attacker to traverse
>directories on the system, caused by a flaw in the Index Replication
>feature. An attacker could send a specially-crafted request to read
>arbitrary files on the system (CVE-ID: CVE-2017-3163)
>
>Security vulnerability link:
>https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2017-3163
>
>*Apache SOLR implementation:*
>
>We are using Apache Solr 5.2.1 with replication factor=1 for index
>creation.
>We are using basic, common Solr features, and we do not use the following
>features:
>
>1. Index Replication
>2. Master / slave mechanism
>
>*Considering the above not implemented features will this "CVE-ID:
>CVE-2017-3163" security vulnerability have any impact?*
>
>Any help is appreciated here.
>
>
>Best Regards,
>Paddy G
>+91-8148593020 <+91%2081485%2093020>
>
>
>
>-- 
>
>
>Best Regards,
>Paddy G
>+91-8148593020

-- 
Sorry for being brief. Alternate email is rickleir at yahoo dot com 

Issue facing with spell text field containing hyphen

2017-11-20 Thread Chirag Garg
Hi Team,

I am facing an issue with strings containing hyphens when searching in the
spell field. My Solr version is 6.6.0.

Points to reproduce:
1. My search string is "spider-man".
2. When I do a search in Solr with the query spell:*spider-*, it shows
numDocs=0 even though the content is present.
3. But it works fine when I search spell:*spider*.

My config for solr in schema.xml is:-

[schema.xml fieldType definitions stripped by the mail archive]


Fwd: CVE-2017-3163 - SOLR-5.2.1 version

2017-11-20 Thread padmanabhan gonesani
Please help me here



-- Forwarded message --
From: padmanabhan gonesani 
Date: Mon, Nov 13, 2017 at 5:12 PM
Subject: CVE-2017-3163 - SOLR-5.2.1 version
To: gene...@lucene.apache.org



Hi Team,

*Description:* Apache Solr could allow a remote attacker to traverse
directories on the system, caused by a flaw in the Index Replication
feature. An attacker could send a specially-crafted request to read
arbitrary files on the system (CVE-ID: CVE-2017-3163)

Security vulnerability link:
https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2017-3163

*Apache SOLR implementation:*

We are using Apache Solr 5.2.1 with replication factor=1 for index creation.
We are using basic, common Solr features, and we do not use the following
features:

1. Index Replication
2. Master / slave mechanism

*Considering the above not implemented features will this "CVE-ID:
CVE-2017-3163" security vulnerability have any impact?*

Any help is appreciated here.


Best Regards,
Paddy G
+91-8148593020 <+91%2081485%2093020>



-- 


Best Regards,
Paddy G
+91-8148593020


Trailing wild card searches very slow in Solr

2017-11-20 Thread Sundeep T
Hi,

We have several indexed string fields which are not tokenized and do not
have docValues enabled.

When we do trailing wildcard searches on these fields, they run very
slowly. We were thinking that since these fields are indexed, such queries
should run pretty quickly. We are using Solr 6.6.1. Does anyone have ideas
on why these queries are running slow and whether there are any ways to
speed them up?

Thanks
Sundeep


Re: Trailing wild card searches very slow in Solr

2017-11-20 Thread Erick Erickson
You already asked that question and got several answers, did you not
see them? If you did see them, what is unclear?

Best,
Erick

On Mon, Nov 20, 2017 at 9:33 AM, Sundeep T  wrote:
> Hi,
>
> We have several indexed string fields which are not tokenized and do not
> have docValues enabled.
>
> When we do trailing wildcard searches on these fields, they run very
> slowly. We were thinking that since these fields are indexed, such queries
> should run pretty quickly. We are using Solr 6.6.1. Does anyone have ideas
> on why these queries are running slow and whether there are any ways to
> speed them up?
>
> Thanks
> Sundeep


Merging of index in Solr

2017-11-20 Thread Zheng Lin Edwin Yeo
Hi,

Does anyone know how long the merging in Solr usually takes?

I am currently merging about 3.5TB of data; it has been running for more
than 28 hours and is not completed yet. The merging is running on an SSD
disk.

I am using Solr 6.5.1.

Regards,
Edwin


Deep Paging with cursorMark throws error

2017-11-20 Thread Webster Homer
I am developing an application that uses cursorMark deep paging. It's a
Java client using SolrJ.

Currently the client is built with Solr 6.2 SolrJ jars, but the test
server is a Solr 7.1 server.

I am getting this error:
Error from server at http://XX:8983/solr/sial-catalog-product: Cursor
functionality requires a sort containing a uniqueKey field tie breaker

But the sort does have the field that is marked as unique in the schema.

sort=score desc,*id_material* asc

<uniqueKey>id_material</uniqueKey>

Does the sort need to be on just the unique field?
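
For reference, the paging loop follows the standard SolrJ cursorMark pattern,
roughly this sketch (client setup and error handling omitted):

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrQuery.SortClause;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.params.CursorMarkParams;

public class CursorPager {
  // Pages through all results; the sort ends with the uniqueKey tie breaker.
  static void pageAll(SolrClient client, String collection) throws Exception {
    SolrQuery q = new SolrQuery("*:*");
    q.setRows(100);
    q.setSort(SortClause.desc("score"));
    q.addSort(SortClause.asc("id_material")); // uniqueKey tie breaker
    String cursorMark = CursorMarkParams.CURSOR_MARK_START;
    while (true) {
      q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
      QueryResponse rsp = client.query(collection, q);
      // process rsp.getResults() here
      String next = rsp.getNextCursorMark();
      if (cursorMark.equals(next)) {
        break; // cursor did not advance: no more results
      }
      cursorMark = next;
    }
  }
}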



Re: Solr7: Very High number of threads on aggregator node

2017-11-20 Thread Rick Leir
Nawab
Why it would be good to share the solrconfigs: I had a suspicion that you might 
be using the same solrconfig for version 7 and 4.5. That is unlikely to work 
well. But I could be way off base. 
Rick
-- 
Sorry for being brief. Alternate email is rickleir at yahoo dot com 

Solr regex phrase query syntax

2017-11-20 Thread Chuming Chen
Hi All,

According to
http://lucene.apache.org/core/7_1_0/core/org/apache/lucene/util/automaton/RegExp.html,
Lucene supports repeat expressions.

repeatexp   ::= repeatexp ? (zero or one occurrence)
|   repeatexp * (zero or more occurrences)  
|   repeatexp + (one or more occurrences)   
|   repeatexp {n}   (n occurrences) 
|   repeatexp {n,}  (n or more occurrences) 
|   repeatexp {n,m} (n to m occurrences, including both)


Does Solr support multiple occurrences of a term in a phrase query? For example:
name:"abc{0,3} def", which means the term "abc" repeats 0 to 3 times in the phrase.

Thanks,

Chuming



Re: Solr7: Very High number of threads on aggregator node

2017-11-20 Thread Toke Eskildsen
Nawab Zada Asad Iqbal  wrote:
> I have a webserver which uses Solr for querying, which I guess is pretty
> typical. At times, there are 50 users sending queries in a given second.
> Sometimes, the queries take a few seconds to finish (i.e., if the max across
> all shards is 5 seconds due to any local reason, the aggregator query will
> take 5 seconds even if the median is sub-second).

No alarm bells until this point.

> This can cause some query load to build up on the aggregator node.
> This is all fine and understandable.

Yes, as long as the build-up is strictly temporary and not high, where "high" 
of course is hard to quantify. 

I might be misunderstanding something. Do you test by having a fixed number of
workers issuing sequential queries (if so, how many workers?) or by firing off
new queries at intervals, regardless of whether the previous queries have
finished or not? If it is the latter, one explanation could be that your Solr 7
setup is simply slower on average to respond than your Solr 4 setup, to the
point where it cannot keep up with the influx of queries.

- Toke Eskildsen


Re: Trailing wild card searches very slow in Solr

2017-11-20 Thread Sundeep T
Hi Erick.

I initially asked this question regarding leading wildcards. That was a
typo; what I meant was that trailing wildcard queries are slow. So queries
like text:hello* are slow. We were expecting that since the string field is
already indexed, the searches should be fast, but that seems not to be the
case.

Thanks
Sundeep

On Mon, Nov 20, 2017 at 9:39 AM, Erick Erickson 
wrote:

> You already asked that question and got several answers, did you not
> see them? If you did see them, what is unclear?
>
> Best,
> Erick
>
> On Mon, Nov 20, 2017 at 9:33 AM, Sundeep T  wrote:
> > Hi,
> >
> > > We have several indexed string fields which are not tokenized and do not
> > > have docValues enabled.
> > >
> > > When we do trailing wildcard searches on these fields, they run very
> > > slowly. We were thinking that since these fields are indexed, such
> > > queries should run pretty quickly. We are using Solr 6.6.1. Does anyone
> > > have ideas on why these queries are running slow and whether there are
> > > any ways to speed them up?
> > Thanks
> > Sundeep
>


Re: Trailing wild card searches very slow in Solr

2017-11-20 Thread Erick Erickson
At first glance you have a mis-configured setup. The most glaring
issue is that you're trying to search a 150G index in 1G of memory.

bq: String field (not tokenized) is docValues=true, indexed=true and stored=true

OK, this is kind of unusual to query but if the field just contains
single tokens it's probably OK.

bq: Field is almost unique in the index, around 80 million are unique

This is a _lot_ of unique fields, but as long as your wildcard
searches don't actually match too many values (say 1,000 or so) it
should be OK.

bq: no commits on index

Huh? Then you can't search. I suspect you have autocommit settings in
your solrconfig.xml file?

bq: solr jvm heap 1GB

This is far too small. It's a miracle it works at all.

bq: index size on disk is around 150GB

I pretty much guarantee your heap is undersized for an index that size.

bq: q=myfield:abc* has QTime=17-20secs after filecache on OS is primed

How many terms does abc* match? That's the biggest question in terms
of performance.
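
If you want to measure that directly, something like this Lucene sketch counts
the terms matching a prefix (field and prefix are from your mail; error
handling omitted):

import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiFields;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.StringHelper;

public class PrefixTermCounter {
  // Counts indexed terms in `field` that start with `prefix`.
  public static long countPrefixTerms(IndexReader reader, String field,
      String prefix) throws IOException {
    Terms terms = MultiFields.getTerms(reader, field);
    if (terms == null) {
      return 0;
    }
    TermsEnum te = terms.iterator();
    BytesRef p = new BytesRef(prefix);
    long count = 0;
    if (te.seekCeil(p) != TermsEnum.SeekStatus.END) {
      do {
        if (!StringHelper.startsWith(te.term(), p)) {
          break; // walked past the prefix range
        }
        count++;
      } while (te.next() != null);
    }
    return count;
  }
}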

But really, I expect even if you created an OR clause with, say, 50
terms in it it would perform poorly. My guess is that you don't have
nearly enough memory for your Solr instance.

You didn't include the results of adding debug=query, perhaps you
can't due to corporate policy. But you _can_ scrub the
parsedQuery_toString bits of the return.

But really, don't do much until you give your Solr instance enough
memory to work  with.

Best,
Erick


On Mon, Nov 20, 2017 at 5:26 PM, Sundeep T  wrote:
> Hi Erick,
>
> Thanks for the reply. Here are more details on our setup -
>
> *Setup/schema details -*
>
> 100 million doc solr core
>
> String field (not tokenized) is docValues=true, indexed=true and stored=true
>
> Field is almost unique in the index, around 80 million are unique
>
> no commits on index
>
> all caches disabled in solrconfig.xml
>
> solr jvm heap 1GB
>
> single solr core in jvm
>
> solr core is not optimized and has about 50 segment files some up to 5GB
>
> index size on disk is around 150GB
>
> solr v6.5.0
>
>
>
> *Performance -*
>
>
> q=myfield:abc* has QTime=30secs+ first time
>
> q=myfield:abc* has QTime=17-20secs after filecache on OS is primed
>
>
> Thanks
> Sundeep
>
>
> On Mon, Nov 20, 2017 at 12:16 PM, Erick Erickson 
> wrote:
>
>> Well, define "slow". Conceptually a large OR clause is created that
>> contains all the terms that start with the indicated text. (actually a
>> PrefixQuery should be formed).
>>
>> That said, I'd expect hello* to be reasonably fast as not many terms
>> _probably_ start with 'hello'. Not the same at all for, say, h*.
>>
>> You might review: https://wiki.apache.org/solr/UsingMailingLists,
>> you're not really providing much information to go on here.
>>
>> What is the result of adding debug=query? Particularly it would be
>> useful to see the parsed query.
>>
>> Are all such queries slow? What happens if you submit hel* followed by
>> hello*, the first one will bring the underlying index structures into
>> memory, for all we know this could simply be an autowarming issue.
>>
>> Are you indexing at the same time? Do you have a short autocommit interval?
>>
>> What version of Solr?
>>
>> Details matter.
>> Best,
>> Erick
>>
>> On Mon, Nov 20, 2017 at 11:50 AM, Sundeep T  wrote:
>> > Hi Erick.
>> >
>> > I initially asked this question regarding leading wildcards. That was a
>> > typo; what I meant was that trailing wildcard queries are slow. So queries
>> > like text:hello* are slow. We were expecting that since the string field is
>> > already indexed, the searches should be fast, but that seems not to be the
>> > case.
>> >
>> > Thanks
>> > Sundeep
>> >
>> > On Mon, Nov 20, 2017 at 9:39 AM, Erick Erickson > >
>> > wrote:
>> >
>> >> You already asked that question and got several answers, did you not
>> >> see them? If you did see them, what is unclear?
>> >>
>> >> Best,
>> >> Erick
>> >>
>> >> On Mon, Nov 20, 2017 at 9:33 AM, Sundeep T 
>> wrote:
>> >> > Hi,
>> >> >
>> >> > We have several indexed string fields which are not tokenized and do
>> >> > not have docValues enabled.
>> >> >
>> >> > When we do trailing wildcard searches on these fields, they run very
>> >> > slowly. We were thinking that since these fields are indexed, such
>> >> > queries should run pretty quickly. We are using Solr 6.6.1. Does anyone
>> >> > have ideas on why these queries are running slow and whether there are
>> >> > any ways to speed them up?
>> >> >
>> >> > Thanks
>> >> > Sundeep
>> >>
>>


Re: Do i need to reindex after changing similarity setting

2017-11-20 Thread Walter Underwood
Similarity is query time.
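
At the Lucene level it is applied on the searcher, e.g. this sketch (in Solr
you would set the similarity in the schema instead):

import java.io.IOException;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.similarities.ClassicSimilarity;
import org.apache.lucene.store.Directory;

public class ClassicSearcher {
  // Scores with ClassicSimilarity (TF-IDF) instead of BM25; no reindex involved.
  public static IndexSearcher open(Directory dir) throws IOException {
    IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
    searcher.setSimilarity(new ClassicSimilarity());
    return searcher;
  }
}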

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Nov 20, 2017, at 4:57 PM, Nawab Zada Asad Iqbal  wrote:
> 
> Hi,
> 
> I want to switch to Classic similarity instead of BM25 (default in solr7).
> Do I need to reindex all cores after this? Or is it only a query time
> setting?
> 
> 
> Thanks
> Nawab



Do i need to reindex after changing similarity setting

2017-11-20 Thread Nawab Zada Asad Iqbal
Hi,

I want to switch to Classic similarity instead of BM25 (default in solr7).
Do I need to reindex all cores after this? Or is it only a query time
setting?


Thanks
Nawab


Re: Issue facing with spell text field containing hyphen

2017-11-20 Thread Rick Leir
Chirag
Some scattered clues:
StandardTokenizer splits on punctuation, so your spell field might not contain 
spider-man.

When you do a wildcard search, the analysis chain can be different from what 
you expected.
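
A quick way to check what the analyzer does to "spider-man" (a sketch using
plain Lucene; the field name is illustrative):

import java.io.IOException;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class ShowTokens {
  public static void main(String[] args) throws IOException {
    try (StandardAnalyzer analyzer = new StandardAnalyzer();
         TokenStream ts = analyzer.tokenStream("spell", "spider-man")) {
      CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
      ts.reset();
      while (ts.incrementToken()) {
        System.out.println(term.toString()); // prints "spider" then "man"
      }
      ts.end();
    }
  }
}
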
Cheers -- Rick

On November 20, 2017 9:58:54 AM EST, Chirag Garg  wrote:
>Hi Team,
>
>I am facing an issue with strings containing hyphens when searching in the
>spell field.
>My Solr version is 6.6.0.
>
>Points to reproduce:
>1. My search string is "spider-man".
>2. When I do a search in Solr with the query spell:*spider-*, it shows
>numDocs=0 even though the content is present.
>3. But it works fine when I search spell:*spider*.
>
>My config for solr in schema.xml is:-
>
>[fieldType definitions mangled by the mail archive; the surviving attributes
>show index/query analyzer chains built from MappingCharFilterFactory
>(mapping-ISOLatin1Accent.txt), SynonymFilterFactory (expand="true"),
>StopFilterFactory (stopwords.txt), WordDelimiterFilterFactory
>(preserveOriginal="1"), and a stemmer protected by protwords.txt]
>

-- 
Sorry for being brief. Alternate email is rickleir at yahoo dot com 

OutOfMemoryError in 6.5.1

2017-11-20 Thread Walter Underwood
When I ran load benchmarks with 6.3.0, an overloaded cluster would get super 
slow but keep functioning. With 6.5.1, we hit 100% CPU, then start getting 
OOMs. That is really bad, because it means we need to reboot every node in the 
cluster.

Also, the JVM OOM hook isn’t running the process killer (JVM 1.8.0_121-b13). 
Using the G1 collector with the Shawn Heisey settings in an 8G heap.

GC_TUNE=" \
-XX:+UseG1GC \
-XX:+ParallelRefProcEnabled \
-XX:G1HeapRegionSize=8m \
-XX:MaxGCPauseMillis=200 \
-XX:+UseLargePages \
-XX:+AggressiveOpts \
"

This is not good behavior in prod. The process goes to the bad place, then we 
need to wait until someone is paged and kills it manually. Luckily, it usually 
drops out of the live nodes for each collection and doesn’t take user traffic.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)




Re: Error when indexing EML files in Solr 7.1.0

2017-11-20 Thread Zheng Lin Edwin Yeo
Hi,

Any updates regarding the error?

Regards,
Edwin


On 16 November 2017 at 10:21, Zheng Lin Edwin Yeo 
wrote:

> Hi Karthik,
>
> Thanks for the update.
>
> I see from the JIRA that it is still unresolved, meaning we can't index
> EML files to Solr 7.1.0 for the time being?
>
> Also, when the patch is ready, are we able to apply the patch to the
> current Solr 7.1.0? Or do we have to wait for the next release of Solr?
>
> Regards.
> Edwin
>
>
> On 15 November 2017 at 23:35, Karthik Ramachandran 
> wrote:
>
>> JIRA already exists, https://issues.apache.org/jira/browse/SOLR-11622.
>>
>>
>> On Mon, Nov 13, 2017 at 5:55 PM, Zheng Lin Edwin Yeo <
>> edwinye...@gmail.com>
>> wrote:
>>
>> > Hi Erick,
>> >
>> > I have added the apache-mime4j-core-0.7.2.jar to the Java Build Path in
>> > Eclipse, but it is still not working.
>> >
>> > Regards,
>> > Edwin
>> >
>> > On 13 November 2017 at 23:33, Erick Erickson 
>> > wrote:
>> >
>> > > Where are you getting your mime4j file? MimeConfig is in
>> > > /extraction/lib/apache-mime4j-core-0.7.2.jar and you need to make
>> sure
>> > > you're including that at a guess.
>> > >
>> > > Best,
>> > > Erick
>> > >
>> > > On Mon, Nov 13, 2017 at 6:15 AM, Zheng Lin Edwin Yeo
>> > >  wrote:
>> > > > Hi,
>> > > >
>> > > > I am using Solr 7.1.0, and I am trying to index EML files using the
>> > > > SimplePostTools.
>> > > >
>> > > > However, I get the following error
>> > > >
>> > > > java.lang.NoClassDefFoundError:
>> > > > org/apache/james/mime4j/stream/MimeConfig$Builder
>> > > >
>> > > >
>> > > > Is there any we class or dependencies which I need to add as
>> compared
>> > to
>> > > > Solr 6?
>> > > >
>> > > > The indexing is OK for other file types like .doc and .ppt. I only
>> > > > face the error when indexing .eml files.
>> > > >
>> > > > Regards,
>> > > > Edwin
>> > >
>> >
>>
>
>


Re: Trailing wild card searches very slow in Solr

2017-11-20 Thread Sundeep T
Hi Erick,

Thanks for the reply. Here are more details on our setup -

*Setup/schema details -*

100 million doc solr core

String field (not tokenized) is docValues=true, indexed=true and stored=true

Field is almost unique in the index, around 80 million are unique

no commits on index

all caches disabled in solrconfig.xml

solr jvm heap 1GB

single solr core in jvm

solr core is not optimized and has about 50 segment files some up to 5GB

index size on disk is around 150GB

solr v6.5.0



*Performance -*


q=myfield:abc* has QTime=30secs+ first time

q=myfield:abc* has QTime=17-20secs after filecache on OS is primed


Thanks
Sundeep


On Mon, Nov 20, 2017 at 12:16 PM, Erick Erickson 
wrote:

> Well, define "slow". Conceptually a large OR clause is created that
> contains all the terms that start with the indicated text. (actually a
> PrefixQuery should be formed).
>
> That said, I'd expect hello* to be reasonably fast as not many terms
> _probably_ start with 'hello'. Not the same at all for, say, h*.
>
> You might review: https://wiki.apache.org/solr/UsingMailingLists,
> you're not really providing much information to go on here.
>
> What is the result of adding debug=query? Particularly it would be
> useful to see the parsed query.
>
> Are all such queries slow? What happens if you submit hel* followed by
> hello*, the first one will bring the underlying index structures into
> memory, for all we know this could simply be an autowarming issue.
>
> Are you indexing at the same time? Do you have a short autocommit interval?
>
> What version of Solr?
>
> Details matter.
> Best,
> Erick
>
> On Mon, Nov 20, 2017 at 11:50 AM, Sundeep T  wrote:
> > Hi Erick.
> >
> > I initially asked this question regarding leading wildcards. That was a
> > typo; what I meant was that trailing wildcard queries are slow. So queries
> > like text:hello* are slow. We were expecting that since the string field is
> > already indexed, the searches should be fast, but that seems not to be the
> > case.
> >
> > Thanks
> > Sundeep
> >
> > On Mon, Nov 20, 2017 at 9:39 AM, Erick Erickson  >
> > wrote:
> >
> >> You already asked that question and got several answers, did you not
> >> see them? If you did see them, what is unclear?
> >>
> >> Best,
> >> Erick
> >>
> >> On Mon, Nov 20, 2017 at 9:33 AM, Sundeep T 
> wrote:
> >> > Hi,
> >> >
> >> > We have several indexed string fields which are not tokenized and do
> >> > not have docValues enabled.
> >> >
> >> > When we do trailing wildcard searches on these fields, they run very
> >> > slowly. We were thinking that since these fields are indexed, such
> >> > queries should run pretty quickly. We are using Solr 6.6.1. Does anyone
> >> > have ideas on why these queries are running slow and whether there are
> >> > any ways to speed them up?
> >> >
> >> > Thanks
> >> > Sundeep
> >>
>


Re: Deep Paging with cursorMark throws error

2017-11-20 Thread Webster Homer
As I suspected this was a bug in my code. We use KIE Drools to configure
our queries, and there was a conflict between two rules.

On Mon, Nov 20, 2017 at 4:09 PM, Webster Homer 
wrote:

> I am developing an application that uses cursorMark deep paging. It's a
> Java client using SolrJ.
>
> Currently the client is built with Solr 6.2 SolrJ jars, but the test
> server is a Solr 7.1 server.
>
> I am getting this error:
> Error from server at http://XX:8983/solr/sial-catalog-product: Cursor
> functionality requires a sort containing a uniqueKey field tie breaker
>
> But the sort does have the field that is marked as unique in the schema.
>
> sort=score desc,*id_material* asc
>
> <uniqueKey>id_material</uniqueKey>
>
> Does the sort need to be on just the unique field?
>



Re: Issue facing with spell text field containing hyphen

2017-11-20 Thread Chirag garg
Hi Rick,

Actually, my spell field does contain the text with the hyphen, i.e. it
contains "spider-man", but even then I am not able to search it.

Regards,
Chirag



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: OutOfMemoryError in 6.5.1

2017-11-20 Thread Bernd Fehling
Hi Walter,

You can check whether the JVM OOM hook has been accepted and set up by the
JVM. The options are "-XX:+PrintFlagsFinal -version".

You can modify your bin/solr script and tweak the function "launch_solr" at
the end of the script: replace "-jar start.jar" with "-XX:+PrintFlagsFinal
-version".
Instead of starting Solr, this will print a huge list of all the JVM
parameters that are really used (and accepted).
Check what "ccstrlist OnOutOfMemoryError" is telling you.
Is it really pointing to your OOM script?

You can give a larger MaxGCPauseMillis to give the GC more time to clean up.

The default InitiatingHeapOccupancyPercent is at 45, try it with 75
by setting -XX:InitiatingHeapOccupancyPercent=75



By the way, do you really use UseLargePages on your system (the OS must also
support it), or is the JVM parameter just set because someone else is also
using it?
http://www.oracle.com/technetwork/java/javase/tech/largememory-jsp-137182.html


Regards,
Bernd


Am 21.11.2017 um 02:17 schrieb Walter Underwood:
> When I ran load benchmarks with 6.3.0, an overloaded cluster would get super 
> slow but keep functioning. With 6.5.1, we hit 100% CPU, then start getting 
> OOMs. That is really bad, because it means we need to reboot every node in 
> the cluster.
> 
> Also, the JVM OOM hook isn’t running the process killer (JVM 1.8.0_121-b13). 
> Using the G1 collector with the Shawn Heisey settings in an 8G heap.
> 
> GC_TUNE=" \
> -XX:+UseG1GC \
> -XX:+ParallelRefProcEnabled \
> -XX:G1HeapRegionSize=8m \
> -XX:MaxGCPauseMillis=200 \
> -XX:+UseLargePages \
> -XX:+AggressiveOpts \
> "
> 
> This is not good behavior in prod. The process goes to the bad place, then we 
> need to wait until someone is paged and kills it manually. Luckily, it 
> usually drops out of the live nodes for each collection and doesn’t take user 
> traffic.
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
> 
> 
> 


Re: Solr regex phrase query syntax

2017-11-20 Thread Mikhail Khludnev
Hello, Chuming.
It doesn't. The closest thing is to create a TermAutomatonQuery.
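
Something along these lines (a sketch; TermAutomatonQuery lives in the
lucene-sandbox module, and the automaton below encodes "abc" repeated 0 to 3
times followed by "def"):

import org.apache.lucene.search.TermAutomatonQuery;

public class RepeatPhraseQuery {
  // Builds an automaton query for: abc{0,3} def
  public static TermAutomatonQuery build() {
    TermAutomatonQuery q = new TermAutomatonQuery("name");
    int s0 = q.createState();           // initial state
    int s1 = q.createState();
    int s2 = q.createState();
    int s3 = q.createState();
    int accept = q.createState();
    q.setAccept(accept, true);
    q.addTransition(s0, s1, "abc");     // first optional "abc"
    q.addTransition(s1, s2, "abc");     // second
    q.addTransition(s2, s3, "abc");     // third
    q.addTransition(s0, accept, "def"); // zero occurrences of "abc"
    q.addTransition(s1, accept, "def");
    q.addTransition(s2, accept, "def");
    q.addTransition(s3, accept, "def");
    q.finish();                         // seal the automaton before searching
    return q;
  }
}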

On Mon, Nov 20, 2017 at 11:03 PM, Chuming Chen 
wrote:

> Hi All,
>
> According to
> http://lucene.apache.org/core/7_1_0/core/org/apache/lucene/util/automaton/RegExp.html,
> Lucene supports repeat expressions.
>
> repeatexp   ::= repeatexp ? (zero or one occurrence)
> |   repeatexp * (zero or more occurrences)
> |   repeatexp + (one or more occurrences)
> |   repeatexp {n}   (n occurrences)
> |   repeatexp {n,}  (n or more occurrences)
> |   repeatexp {n,m} (n to m occurrences, including both)
>
>
> Does Solr support multiple occurrences of a term in a phrase query? For
> example: name:"abc{0,3} def", which means the term "abc" repeats 0 to 3 times
> in the phrase.
>
> Thanks,
>
> Chuming
>
>


-- 
Sincerely yours
Mikhail Khludnev


Re: Trailing wild card searches very slow in Solr

2017-11-20 Thread Erick Erickson
Well, define "slow". Conceptually a large OR clause is created that
contains all the terms that start with the indicated text. (actually a
PrefixQuery should be formed).

That said, I'd expect hello* to be reasonably fast as not many terms
_probably_ start with 'hello'. Not the same at all for, say, h*.

You might review: https://wiki.apache.org/solr/UsingMailingLists,
you're not really providing much information to go on here.

What is the result of adding debug=query? Particularly it would be
useful to see the parsed query.

Are all such queries slow? What happens if you submit hel* followed by
hello*, the first one will bring the underlying index structures into
memory, for all we know this could simply be an autowarming issue.

Are you indexing at the same time? Do you have a short autocommit interval?

What version of Solr?

Details matter.
Best,
Erick

On Mon, Nov 20, 2017 at 11:50 AM, Sundeep T  wrote:
> Hi Erick.
>
> I initially asked this question regarding leading wildcards. That was a
> typo; what I meant was that trailing wildcard queries are slow. So queries
> like text:hello* are slow. We were expecting that since the string field is
> already indexed, the searches should be fast, but that seems not to be the
> case.
>
> Thanks
> Sundeep
>
> On Mon, Nov 20, 2017 at 9:39 AM, Erick Erickson 
> wrote:
>
>> You already asked that question and got several answers, did you not
>> see them? If you did see them, what is unclear?
>>
>> Best,
>> Erick
>>
>> On Mon, Nov 20, 2017 at 9:33 AM, Sundeep T  wrote:
>> > Hi,
>> >
>> > We have several indexed string fields which are not tokenized and do not
>> > have docValues enabled.
>> >
>> > When we do trailing wildcard searches on these fields, they run very
>> > slowly. We were thinking that since these fields are indexed, such queries
>> > should run pretty quickly. We are using Solr 6.6.1. Does anyone have ideas
>> > on why these queries are running slow and whether there are any ways to
>> > speed them up?
>> >
>> > Thanks
>> > Sundeep
>>


Re: How to get a solr core to persist

2017-11-20 Thread Amanda Shuman
Hi Shawn,

I did as you suggested and created the core by hand - I copied the files
from the existing core, including the index files (data directory) and
changed the core.properties file to the new core name (core_new) and
restarted. Now I'm having a different issue - it says it is Optimized but
that Current is not (the console shows the red prohibited sign, which I
guess means false or something?). So basically there's no content at all in
there. Re-reading your instructions here: " If you want to relocate the
data, you can add a dataDir property to core.properties.  If it has a
relative path, it is relative to the core.properties location." - Did I
miss a step to get the existing index to load?

Thanks!
Amanda

--
Dr. Amanda Shuman
Post-doc researcher, University of Freiburg, The Maoist Legacy Project

PhD, University of California, Santa Cruz
http://www.amandashuman.net/
http://www.prchistoryresources.org/
Office: +49 (0) 761 203 4925


On Wed, Nov 15, 2017 at 1:32 PM, Shawn Heisey  wrote:

> On 11/15/2017 2:28 AM, Amanda Shuman wrote:
>
>> 1) so does this mean that on the back-end I should first create my new
>> core, e.g., core1 and then within that place a conf folder with all the
>> files? Same for the data folder? If so, is it fine to just use the
>> existing
>> config files that I've previously worked on (i.e. the config for search
>> that I already modified)? I presume this won't be an issue.
>>
>> 2) does it matter if I create this core through the admin console or at
>> command line?
>>
>
> You can create your cores however you like.  I actually create all my
> cores completely by hand, including the core.properties file, and let Solr
> discover them on startup.  Mostly I just copy an existing core, change
> core.properties to correct values, make any config changes I need, and
> restart Solr.
>
> If you want to use the admin UI (or the CoreAdmin API directly, which is
> what the admin UI calls), then the instanceDir must have a conf directory
> with all the config files you require for the core, and NOT have a
> core.properties file.  If you're adding a core that already has an index,
> then you would also include the data directory in the core's instanceDir.
> If you want to relocate the data, you can add a dataDir property to
> core.properties.  If it has a relative path, it is relative to the
> core.properties location.
>
> The commandline creation works pretty well.  The way it works is by
> copying a configset (which may be in server/solr/configsets or in a custom
> location) to the "conf" directory in the core, then calling the CoreAdmin
> API to actually add the core to Solr (and create core.properties so it'll
> get picked up on restart).
>
> Thanks,
> Shawn
>


Re: A problem of tracking the commits of Lucene using SHA num

2017-11-20 Thread TOM
Dear Shawn and Chris,
Thank you very much for your replies and help.
And sorry for my mistakes in my first-time use of the mailing lists.

On 11/9/2017 5:13 PM, Shawn wrote:
> Where did this information originate?

My SHA data come from the paper On the Naturalness of Buggy Code (Baishakhi
Ray et al., ICSE '16), downloaded from
http://odd-code.github.io/Data.html.


On 11/9/2017 6:10 PM, Chris wrote:
> Also -- What exactly are you trying to do? what is your objective?

I want to analyze buggy code's statistical properties through some
learning models on Ray's experimental dataset. Because of its large size,
Ray did not put the entire dataset online. What I can acquire is a batch
of commits' SHA data and some other info. So I need to pick out
the old commits which correspond to these SHAs.


On 17/9/2017 1:47 PM, Shawn wrote:
> The commit data you're using is nearly useless, because the repository
> where it originated has been gone for nearly two years. If you can find
> out how it was generated, you can build a new version from the current
> repository -- either on github or from Apache's official servers.


Thanks for all of your suggestions and help; I am going to try other ways.
Thanks so much.
 
Best,
Xian

Re: Help with complex boolean search queries

2017-11-20 Thread Gajendra Dadheech
Hey Ankit,

Try this tool for a better view of your debug output, and then if you have
any specific questions, do let me know:

http://splainer.io/

On Sun, Oct 29, 2017 at 2:34 AM, Ankit Shah  wrote:

> Hi,
> I am new to the Solr community, and have this weird problem with the search
> results.
> Here is what's going on: I have a logfile that is indexed into Solr with the
> following config.
>
> [field and fieldType definitions mangled by the mail archive; the surviving
> attributes show a stored field with termPositions="true", termVectors="true",
> multiValued="false", required="true", and an analyzer chain with
> StandardTokenizerFactory, StopFilterFactory (stopwords.txt),
> LowerCaseFilterFactory, and SynonymFilterFactory (synonyms.txt,
> expand="true") on the query side]
>
> Here is a sample for demonstration purposes; assume the following
> logfile (text) is indexed into Solr in the field "log":
>
> AppleCare+ extends the basic warranty that covers non-accidental iPhone
> mishaps -- such as battery issues or a faulty headphone jack -- from one
> year to two. The iPhone X was unveiled to much fanfare last month. It
> boasts a radical update to the iPhone models of years past, with an
> all-glass display and an option to unlock with facial recognition. It also
> has an all-glass back, so owners run the risk of cracking either side of
> the phone. However, Apple has claimed the glass on the iPhone 8 and iPhone
> X is much stronger than earlier models, so it could be harder to break.
> Pre-orders for the phone began online Friday, and units were selling out
> quickly. The U.S. Apple Store site said it would take
>
> Now, the query that I run is as follows:
>
> q=("warranty that covers non-accidental") OR ("risking it all" AND "harder
> to break")
> hl.q=("warranty that covers non-accidental") OR ("risking it all" AND
> "harder
> to break")
> hl=true hl.fl=log hl.usePhraseHighlighter=true hl.fragsize=2000
> hl.maxAnalyzedChars=2097152
> indent=on
>
> or as a URL:
> http://localhost:8983/solr/mycore/select?hl.usePhraseHighlighter=true
> &hl.fl=log&hl=true&hl.fragsize=2000&indent=on&wt=json
> &q=(%22warranty%20that%20covers%20non-accidental%22)%20OR%20(%22risking
> %20it%20all%22%20AND%20%22harder%20to%20break%22)&hl.q=(%22warranty%20that%20
> covers%20non-accidental%22)%20OR%20(%22risking%20it%20all
> %22%20AND%20%22harder%20to%20break%22)
>
> the response is as follows:
> {
>   "responseHeader":{
> "status":0,
> "QTime":24,
> "params":{
>   "q":"(\"warranty that covers non-accidental\") OR (\"risking it all\"
> AND \"harder to break\")",
>   "hl":"true",
>   "indent":"on",
>   "hl.q":"(\"warranty that covers non-accidental\") OR (\"risking it
> all\" AND \"harder to break\")",
>   "hl.usePhraseHighlighter":"true",
>   "hl.fragsize":"2000",
>   "hl.fl":"log",
>   "wt":"json"}},
>   "response":{"numFound":1,"start":0,"docs":[
>   {
> "logid":5487941,
> "log":"AppleCare+ extends the basic warranty that covers
> non-accidental iPhone mishaps -- such as battery issues or a faulty
> headphone jack -- from one year to two.\nThe iPhone X was unveiled to much
> fanfare last month. It boasts a radical update to the iPhone models of
> years past, with an all-glass display and an option to unlock with facial
> recognition.\nIt also has an all-glass back, so owners run the risk of
> cracking either side of the phone.\nHowever, Apple has claimed the glass on
> the iPhone 8 and iPhone X is much stronger than earlier models, so it could
> be harder to break.\nPre-orders for the phone began online Friday, and
> units were selling out quickly. The U.S. Apple Store site said it would
> take \n",
> "_version_":1582439847966015488}]
>   },
>   "highlighting":{
> "5487941":{
>   "log":["AppleCare+ extends the basic *warranty that
> covers non-accidental* iPhone mishaps -- such as battery
> issues or a faulty headphone jack -- from one year to two.\nThe iPhone X
> was unveiled to much fanfare last month. It boasts a radical update to the
> iPhone models of years past, with an all-glass display and an option to
> unlock with facial recognition.\nIt also has an all-glass back, so owners
> run the risk of cracking either side of the phone.\nHowever, Apple has
> claimed the glass on the iPhone 8 and iPhone X is much stronger than
> earlier models, so it could be *harder to
> break*.\nPre-orders
> for the phone began online Friday, and units were selling out quickly. The
> U.S. Apple Store site said it would take \n"]
> }
>   }
> }
>
> I get the correct document as a hit, but the highlighted text is wrong. I
> am wondering - the query is straightforward: match either condition 1 or
> condition 2,
> where condition 1 = "warranty that covers non-accidental"
> and condition 2 = "risking it all" AND "harder to break"
>
> Now the hit is correct as condition 1 matched, but why is the highlight
>