Re: Query with exact number of tokens

2018-09-21 Thread Michael Kuhlmann
Hi Sergio,

Alas, that's not possible that way. If you search for CENTURY BANCORP,
INC., then Solr will be totally happy to find all these terms in "NEW
CENTURY BANCORP, INC." and return it with a high score.

But you can prepare your data at index time. Make it a multivalued field
of type string or text without any tokenization and then permute company
names in all reasonable combinations. Since company names should seldom
have more than half a dozen words, that might be practicable.

You then search with an exact match on that field. Make sure to quote
your query term correctly; otherwise a search for "NEW CENTURY BANCORP,
INC." would still match "CENTURY BANCORP, INC.".
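
For illustration, a rough Java sketch of that index-time preparation; the field
name "company_exact" and the document id are made up, and the field is assumed
to be a multivalued, non-tokenized string field in your schema:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import org.apache.solr.common.SolrInputDocument;

// Sketch only: index every ordering of the name's tokens into a hypothetical
// multivalued string field "company_exact", then query it with an exact match.
public class CompanyNamePermutations {

    static void permute(List<String> tokens, int start, List<String> out) {
        if (start >= tokens.size() - 1) {
            out.add(String.join(" ", tokens));
            return;
        }
        for (int i = start; i < tokens.size(); i++) {
            Collections.swap(tokens, start, i);
            permute(tokens, start + 1, out);
            Collections.swap(tokens, start, i);
        }
    }

    public static void main(String[] args) {
        List<String> permutations = new ArrayList<>();
        permute(new ArrayList<>(Arrays.asList("CENTURY", "BANCORP,", "INC.")), 0, permutations);

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "example-1");
        for (String variant : permutations) {
            doc.addField("company_exact", variant);   // one value per token order
        }
        // Query time: an exact, quoted match such as
        //   q=company_exact:"BANCORP, INC. CENTURY"
        // matches this document, while "NEW CENTURY BANCORP, INC." does not.
    }
}

A leaner variant of the same idea is to sort the tokens and index a single
normalized value, which avoids the factorial growth for longer names.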

-Michael

Am 21.09.2018 um 15:00 schrieb marotosg:
> Hi,
> 
> I have to search for company names where my first requirement is to find
> only exact matches on the company name.
> 
> For instance if I search for "CENTURY BANCORP, INC." I shouldn't find "NEW
> CENTURY BANCORP, INC."
> because the result company has the extra keyword "NEW".
> 
> I can't use exact match because the sequence of tokens may differ. Basically
> I need to find results where the  tokens are the same in any order and the
> number of tokens match.
> 
> I have no idea if it's possible to include the number of tokens in the query
> so that the Solr field, which holds that info, can be matched against it.
> 
> Thanks for your help
> Sergio
> 
> 
> 
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
> 



Re: How to split index more than 2GB in size

2018-06-21 Thread Michael Kuhlmann
Hi Sushant,

while this is true in general, it won't hold here. If you split your
index, searching each split shard might be a bit faster, but you'll
add much more overhead because Solr needs to send your
search queries to all shards and then combine the results. So instead of
having one medium-fast search request, you'll have several fast requests
plus the aggregation step.

Erick is totally right, splitting an index of that size has no
performance benefit. Sharding is not a technique to improve performance,
it's a technique to be able to handle indexes of hundreds of gigabytes
in size, which won't fit onto an individual machine.

Best,
Michael


Am 20.06.2018 um 19:58 schrieb Sushant Vengurlekar:
> Thank you for the detailed response, Erick. Very much appreciated. The reason
> I am looking into splitting the index into two is because it’s much faster
> to search across a smaller index than a larger one.
> 
> On Wed, Jun 20, 2018 at 10:46 AM Erick Erickson 
> wrote:
> 
>> You still haven't answered _why_ you think splitting even a 20G index
>> is desirable. We regularly see 200G+ indexes per replica in the field,
>> so what's the point? Have you measured different setups to see if it's
>> a good idea? A 200G index admittedly needs some beefy hardware.
>>
>> If you have adequate response times with a 20G index and need to
>> increase the QPS rate, just add more replicas. Having more than one
>> shard inevitably adds overhead which may (or may not) be made up for
>> by parallelizing some of the work. It's nearly always better to use
>> only one shard if it meets your response time requirements.
>>
>> Best,
>> Erick
>>
>> On Wed, Jun 20, 2018 at 10:39 AM, Sushant Vengurlekar
>>  wrote:
>>> The index size is small because this is my local development copy.  The
>>> production index is more than 20GB. So I am working on getting the index
>>> split and replicated on different nodes. Our current instance on prod is
>>> single instance solr 6 which we are working on moving towards solrcloud 7
>>>
>>> On Wed, Jun 20, 2018 at 10:30 AM Erick Erickson >>
>>> wrote:
>>>
 Use the indexupgrader tool or optimize your index before using
>> splitshard.

 Since this is a small index (< 5G), optimizing will not create an
 overly-large segment, so that pitfall is avoided.
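
 For reference, a minimal Java sketch of the IndexUpgrader route; the index
 path is only an example, the core must not be running while this executes,
 and lucene-core plus lucene-backward-codecs are assumed to be on the classpath:

import java.nio.file.Paths;
import org.apache.lucene.index.IndexUpgrader;
import org.apache.lucene.store.FSDirectory;

// Sketch only: rewrites all older-format segments with the current codec so a
// Lucene/Solr 7 node can merge them. The path below is just an assumption.
public class UpgradeOldIndex {
    public static void main(String[] args) throws Exception {
        try (FSDirectory dir = FSDirectory.open(
                Paths.get("/var/solr/data/dev-transactions_shard1_replica1/data/index"))) {
            new IndexUpgrader(dir).upgrade();
        }
    }
}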

 You haven't yet explained why you think splitting the index would be
 beneficial. Splitting an index this small is unlikely to improve query
 performance appreciably. This feels a lot like an "XY" problem, you're
 asking how to do X thinking it will solve Y but not telling us what Y
 is.

 Best,
 Erick

 On Wed, Jun 20, 2018 at 9:40 AM, Sushant Vengurlekar
  wrote:
> How can I resolve this error?
>
> On Wed, Jun 20, 2018 at 9:11 AM, Alexandre Rafalovitch <
 arafa...@gmail.com>
> wrote:
>
>> This seems more related to an old index upgraded to the latest Solr
>> rather than the split itself.
>>
>> Regards,
>> Alex
>>
>> On Wed, Jun 20, 2018, 12:07 PM Sushant Vengurlekar, <
>> svengurle...@curvolabs.com> wrote:
>>
>>> Thanks for the reply Alessandro! Appreciate it.
>>>
>>> Below is the full request and the error received
>>>
>>> curl '
>>>
>>> http://localhost:8081/solr/admin/collections?action=SPLITSHARD&collection=dev-transactions&shard=shard1
>>> '
>>>
>>> {
>>>
>>>   "responseHeader":{
>>>
>>> "status":500,
>>>
>>> "QTime":7920},
>>>
>>>   "success":{
>>>
>>> "solr-1:8081_solr":{
>>>
>>>   "responseHeader":{
>>>
>>> "status":0,
>>>
>>> "QTime":1190},
>>>
>>>   "core":"dev-transactions_shard1_0_replica_n3"},
>>>
>>> "solr-1:8081_solr":{
>>>
>>>   "responseHeader":{
>>>
>>> "status":0,
>>>
>>> "QTime":1047},
>>>
>>>   "core":"dev-transactions_shard1_1_replica_n4"},
>>>
>>> "solr-1:8081_solr":{
>>>
>>>   "responseHeader":{
>>>
>>> "status":0,
>>>
>>> "QTime":6}},
>>>
>>> "solr-1:8081_solr":{
>>>
>>>   "responseHeader":{
>>>
>>> "status":0,
>>>
>>> "QTime":1009}}},
>>>
>>>   "failure":{
>>>
>>>
>>>
>> "solr-1:8081_solr":"org.apache.solr.client.solrj.impl.HttpSolrClient$
>> RemoteSolrException:Error
>>> from server at http://solr-1:8081/solr:
>>> java.lang.IllegalArgumentException:
>>> Cannot merge a segment that has been created with major version 6
>> into
>> this
>>> index which has been created by major version 7"},
>>>
>>>   "Operation splitshard caused
>>>
>>> exception:":"org.apache.solr.common.SolrException:org.
>> apache.solr.common.SolrException:
>>> SPLITSHARD failed to invoke SPLIT core admin command",
>>>
>>>   

Re: SolrException undefined field *

2018-01-09 Thread Michael Kuhlmann
To correct myself, querying "*" is allowed in the sense that asking for
all fields is done by assigning "*" to the fl parameter.

So the problem is possibly not that "*" is requested, but that the star
is used somewhere else, probably in the q parameter.

We can help you better if you post the full query string (if you're
able to fetch it).
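
For comparison, a small SolrJ sketch (URL and collection name are made up) of
where the star is legal and where it is not:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

// Sketch only: "*" is fine as the fl (field list) value, but a bare "*" used
// as a query term is parsed as a wildcard against a field and can fail with
// "undefined field *" depending on the default-field setup.
public class StarUsage {
    public static void main(String[] args) throws Exception {
        String url = "http://localhost:8983/solr/mycollection";   // assumption
        try (HttpSolrClient client = new HttpSolrClient.Builder(url).build()) {
            SolrQuery ok = new SolrQuery("*:*");    // match-all query
            ok.setFields("*");                      // return all stored fields
            client.query(ok);

            SolrQuery suspect = new SolrQuery("*"); // bare star as a query term
            client.query(suspect);                  // may fail: undefined field *
        }
    }
}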

-Michael


Am 09.01.2018 um 16:38 schrieb Michael Kuhlmann:
> First, you might want to index, but what Solr is executing here is a
> search request.
> 
> Second, you're querying for a dynamic field "*" which is not defined in
> your schema. This is quite obvious, the exception says right this.
> 
> So whatever is sending the query (some client, it seems) is doing the
> wrong thing. Or your schema definition is not matching what the client
> expects.
> 
> Since we don't know what client code you're using, we can't tell more.
> 
> -Michael
> 
> 
> Am 09.01.2018 um 16:31 schrieb padmanabhan:
>> I get the below error whenever an indexing is executed. I didn't find enough
>> clues as to where this field is coming from or how I could debug it.
>> Any help would be appreciated.
>>
>> 2018-01-09 16:03:11.705 INFO 
>> (searcherExecutor-51-thread-1-processing-x:master_backoffice_backoffice_product_default)
>> [   x:master_backoffice_backoffice_product_default]
>> o.a.s.c.QuerySenderListener QuerySenderListener sending requests to
>> Searcher@232ae42b[master_backoffice_backoffice_product_default]
>> main{ExitableDirectoryReader(UninvertingDirectoryReader(Uninverting(_1p(6.4.1):C56)))}
>> 2018-01-09 16:03:11.705 ERROR
>> (searcherExecutor-51-thread-1-processing-x:master_backoffice_backoffice_product_default)
>> [   x:master_backoffice_backoffice_product_default]
>> o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: undefined
>> field *
>>  at
>> org.apache.solr.schema.IndexSchema.getDynamicFieldType(IndexSchema.java:1308)
>>  at 
>> org.apache.solr.schema.IndexSchema.getFieldType(IndexSchema.java:1260)
>>  at
>> org.apache.solr.parser.SolrQueryParserBase.getWildcardQuery(SolrQueryParserBase.java:932)
>>  at
>> org.apache.solr.parser.SolrQueryParserBase.handleBareTokenQuery(SolrQueryParserBase.java:616)
>>  at org.apache.solr.parser.QueryParser.Term(QueryParser.java:312)
>>  at org.apache.solr.parser.QueryParser.Clause(QueryParser.java:182)
>>  at org.apache.solr.parser.QueryParser.Query(QueryParser.java:102)
>>  at org.apache.solr.parser.QueryParser.TopLevelQuery(QueryParser.java:91)
>>  at
>> org.apache.solr.parser.SolrQueryParserBase.parse(SolrQueryParserBase.java:194)
>>  at org.apache.solr.search.LuceneQParser.parse(LuceneQParser.java:50)
>>  at org.apache.solr.search.QParser.getQuery(QParser.java:168)
>>  at
>> org.apache.solr.handler.component.QueryComponent.prepare(QueryComponent.java:160)
>>  at
>> org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:269)
>>  at
>> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:166)
>>  at org.apache.solr.core.SolrCore.execute(SolrCore.java:2306)
>>  at
>> org.apache.solr.core.QuerySenderListener.newSearcher(QuerySenderListener.java:72)
>>  at 
>> org.apache.solr.core.SolrCore.lambda$getSearcher$4(SolrCore.java:2094)
>>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>  at
>> org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
>>  at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>  at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>  at java.lang.Thread.run(Thread.java:748)
>>
>> 2018-01-09 16:03:11.705 INFO 
>> (searcherExecutor-51-thread-1-processing-x:master_backoffice_backoffice_product_default)
>> [   x:master_backoffice_backoffice_product_default] o.a.s.c.S.Request
>> [master_backoffice_backoffice_product_default]  webapp=null path=null
>> params={q=*:*%26facet%3Dtrue%26facet.field%3DcatalogVersion%26facet.field%3DcatalogId%26facet.field%3DapprovalStatus_string%26facet.field%3Dcategory_string_mv=false=newSearcher}
>> status=400 QTime=0
>>
>>
>>
>>
>> --
>> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>>
> 



Re: SolrException undefined field *

2018-01-09 Thread Michael Kuhlmann
First, you might want to index, but what Solr is executing here is a
search request.

Second, you're querying for a dynamic field "*" which is not defined in
your schema. This is quite obvious; the exception says exactly this.

So whatever is sending the query (some client, it seems) is doing the
wrong thing. Or your schema definition is not matching what the client
expects.

Since we don't know what client code you're using, we can't tell more.

-Michael


Am 09.01.2018 um 16:31 schrieb padmanabhan:
> I get the below error whenever an indexing is executed. I didn't find enough
> clues as to where this field is coming from or how I could debug it.
> Any help would be appreciated.
> 
> 2018-01-09 16:03:11.705 INFO 
> (searcherExecutor-51-thread-1-processing-x:master_backoffice_backoffice_product_default)
> [   x:master_backoffice_backoffice_product_default]
> o.a.s.c.QuerySenderListener QuerySenderListener sending requests to
> Searcher@232ae42b[master_backoffice_backoffice_product_default]
> main{ExitableDirectoryReader(UninvertingDirectoryReader(Uninverting(_1p(6.4.1):C56)))}
> 2018-01-09 16:03:11.705 ERROR
> (searcherExecutor-51-thread-1-processing-x:master_backoffice_backoffice_product_default)
> [   x:master_backoffice_backoffice_product_default]
> o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: undefined
> field *
>   at
> org.apache.solr.schema.IndexSchema.getDynamicFieldType(IndexSchema.java:1308)
>   at 
> org.apache.solr.schema.IndexSchema.getFieldType(IndexSchema.java:1260)
>   at
> org.apache.solr.parser.SolrQueryParserBase.getWildcardQuery(SolrQueryParserBase.java:932)
>   at
> org.apache.solr.parser.SolrQueryParserBase.handleBareTokenQuery(SolrQueryParserBase.java:616)
>   at org.apache.solr.parser.QueryParser.Term(QueryParser.java:312)
>   at org.apache.solr.parser.QueryParser.Clause(QueryParser.java:182)
>   at org.apache.solr.parser.QueryParser.Query(QueryParser.java:102)
>   at org.apache.solr.parser.QueryParser.TopLevelQuery(QueryParser.java:91)
>   at
> org.apache.solr.parser.SolrQueryParserBase.parse(SolrQueryParserBase.java:194)
>   at org.apache.solr.search.LuceneQParser.parse(LuceneQParser.java:50)
>   at org.apache.solr.search.QParser.getQuery(QParser.java:168)
>   at
> org.apache.solr.handler.component.QueryComponent.prepare(QueryComponent.java:160)
>   at
> org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:269)
>   at
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:166)
>   at org.apache.solr.core.SolrCore.execute(SolrCore.java:2306)
>   at
> org.apache.solr.core.QuerySenderListener.newSearcher(QuerySenderListener.java:72)
>   at 
> org.apache.solr.core.SolrCore.lambda$getSearcher$4(SolrCore.java:2094)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at
> org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
>   at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:748)
> 
> 2018-01-09 16:03:11.705 INFO 
> (searcherExecutor-51-thread-1-processing-x:master_backoffice_backoffice_product_default)
> [   x:master_backoffice_backoffice_product_default] o.a.s.c.S.Request
> [master_backoffice_backoffice_product_default]  webapp=null path=null
> params={q=*:*%26facet%3Dtrue%26facet.field%3DcatalogVersion%26facet.field%3DcatalogId%26facet.field%3DapprovalStatus_string%26facet.field%3Dcategory_string_mv=false=newSearcher}
> status=400 QTime=0
> 
> 
> 
> 
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
> 



Re: Edismax leading wildcard search

2017-12-22 Thread Michael Kuhlmann
Am 22.12.2017 um 11:57 schrieb Selvam Raman:
> 1) how can i disable leading wildcard search

Do it on the client side. Just don't allow leading asterisks or question
marks in your query term.
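
A tiny Java sketch of such a client-side guard, just as an illustration:

public class WildcardSanitizer {
    // Sketch only: drop wildcard characters from the start of a user-entered term
    // before the query string is built.
    static String stripLeadingWildcards(String term) {
        return term.replaceFirst("^[*?]+", "");
    }

    public static void main(String[] args) {
        System.out.println(stripLeadingWildcards("*search"));  // -> "search"
        System.out.println(stripLeadingWildcards("sea*ch"));   // -> "sea*ch" (kept)
    }
}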

> 2) why leading wildcard search takes so much of time to give the response.
> 

Because with a leading wildcard, Lucene can't seek directly to the matching
terms in the index; it needs to scan all terms instead. Basically, indexed
terms are stored in alphabetical order, which helps with trailing wildcards
but not with leading ones.

There's a ReversedWildcardFilterFactory in Solr to address this issue.

-Michael


Re: How to sort on dates?

2017-12-18 Thread Michael Kuhlmann
Am 16.12.2017 um 19:39 schrieb Georgios Petasis:
> Even if the DateRangeField field can store a range of dates, doesn't
> Solr understand that I have used single timestamps?

No. It could theoretically, but sorting just isn't implemented in
DateRangeField.

> I have even stored the dates.
> My problem is that I need to use the query formatting stated in the
> documentation:
> https://lucene.apache.org/solr/guide/7_1/working-with-dates.html#date-range-formatting
> 
> For example, if "financialYear" is a date range, I can do
> q=financialYear:2014 and it will return everything that has a date
> within 2014. If the field is date point, will it work?

Yes, just query with the plain old range syntax:
q=financialYear:[2014-01-01T00:00:00.000Z TO 2015-01-01T00:00:00.000Z}

DateRangeField might be slightly faster for such queries, but that
doesn't really matter much. I've only used normal date fields so far, and
usually they're fast enough.

As a rule of thumb, only use DateRangeField if you really need to index
date ranges.

-Michael


Re: Wildcard searches with special character gives zero result

2017-12-15 Thread Michael Kuhlmann
Solr does not analyze query terms that contain wildcards. So, with ch*p-seq,
it will search for terms that start with ch and end with p-seq. Since
your indexer has analyzed all tokens before, only chip and seq are in
the index.

See
https://solr.pl/en/2010/12/20/wildcard-queries-and-how-solr-handles-them/
for example.

If you really need results for such queries, I suggest to have a
copyField which is unstemmed and only tokenized on whitespaces. If you
then detect a wildcard character in your query string, search on that
field instead of the others.
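
A rough Java sketch of that routing; the field names "title_ws" (the
whitespace-only copyField) and "title_en" are assumptions:

import org.apache.solr.client.solrj.SolrQuery;

// Sketch only: send wildcard input to a whitespace-tokenized copyField, where
// "Chip-seq" stays a single token, and keep normal input on the analyzed field.
public class WildcardFieldRouting {
    static SolrQuery buildQuery(String input) {
        boolean hasWildcard = input.indexOf('*') >= 0 || input.indexOf('?') >= 0;
        String field = hasWildcard ? "title_ws" : "title_en";
        return new SolrQuery(field + ":" + input);
    }

    public static void main(String[] args) {
        System.out.println(buildQuery("ch*p-seq").getQuery());  // title_ws:ch*p-seq
        System.out.println(buildQuery("chip-seq").getQuery());  // title_en:chip-seq
    }
}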

-Michael

Am 15.12.2017 um 11:59 schrieb Selvam Raman:
> I am using edismax query parser.
> 
> On Fri, Dec 15, 2017 at 10:37 AM, Selvam Raman  wrote:
> 
>> Solr version - 6.4.0
>>
>> "title_en":["Chip-seq"]
>>
>> When i fired query like below
>>
>> 1) chip-seq
>> 2) chi*
>>
>> it is giving expected result, for this case one result.
>>
>> But when i am searching with wildcard it produce zero result.
>> 1) ch*p-seq
>>
>>
>> If I use an escape character for '-', it creates two terms rather than a
>> single term.
>>
>> --
>> Selvam Raman
>> "லஞ்சம் தவிர்த்து நெஞ்சம் நிமிர்த்து"
>>
> 
> 
> 



Re: How to sort on dates?

2017-12-15 Thread Michael Kuhlmann
Hi Georgios,

DateRangeField is a kind of SpatialField which is not sortable at all.

For sorting, use a DatePointField instead. It's not deprecated; the
deprecated class is TrieDateField.
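
Once the field is a DatePointField, sorting is a plain sort parameter; a
minimal SolrJ sketch with a made-up field name:

import org.apache.solr.client.solrj.SolrQuery;

// Sketch only: sort on a single-valued date field (here hypothetically named
// "publishDate") like on any other field.
public class DateSortExample {
    public static void main(String[] args) {
        SolrQuery query = new SolrQuery("*:*");
        query.addSort("publishDate", SolrQuery.ORDER.desc);
        System.out.println(query);   // roughly: q=*:*&sort=publishDate desc
    }
}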

Best,
Michael


Am 15.12.2017 um 10:53 schrieb Georgios Petasis:
> Hi all,
> 
> I have a field of type "date_range" defined as:
> 
>  multiValued="false" indexed="true" stored="true"/>
> 
> The problem is that sorting on this field does not work (despite the
> fact that I put dates in there). Instead I get an error prompting to
> perform sorting through a query.
> 
> How can I do that? There is no documentation that I could find, that
> shows an alternative.
> 
> Also, I think that I saw a warning somewhere, that DateRangeField is
> deprecated. But no alternative is suggested:
> 
> https://lucene.apache.org/solr/guide/7_1/working-with-dates.html
> 
> I am using solr 7.1.
> 
> George
> 



Re: Newbie question about why represent timestamps as "float" values

2017-10-10 Thread Michael Kuhlmann
While you're generally right, in this case it might make sense to stick
to a primitive type.

I see "unixtime" as a technical information, probably from
System.currentTimeMillis(). As long as it's not used as a "real world"
date but only for sorting based on latest updates, or chosing which
document is more recent, it's totally okay to index it as a long value.

But definitely not as a float.
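
A tiny Java demonstration of why a float is a bad fit for millisecond
timestamps:

// Sketch only: a float's 24-bit mantissa cannot represent current epoch
// milliseconds exactly, so round-tripping loses precision.
public class FloatTimestampPrecision {
    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        float asFloat = now;                       // implicit lossy conversion
        System.out.println("original : " + now);
        System.out.println("as float : " + (long) asFloat);
        System.out.println("error ms : " + Math.abs(now - (long) asFloat));  // tens of seconds
    }
}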

-Michael

Am 10.10.2017 um 10:55 schrieb alessandro.benedetti:
> Some time ago there was a Solr installation which had the same problem, and the
> author explained to me that the choice was made for performance reasons.
> Apparently he was sure that handling everything as primitive types would
> give a boost to the Solr searching/faceting performance.
> I never agreed ( and one of the reasons is that you need to transform back
> from float to dates to actually render them in a readable format).
> 
> Furthermore I tend to rely on standing on the shoulders of giants, so if a
> community (not just a single developer) spent time implementing a date type
> (with the different available implementations) to manage specifically date
> information, I tend to trust them and believe that the best approach to
> manage dates is to use that ad hoc date type (in its variants, depending on
> the use cases).
> 
> As a plus, using the right data type gives you immense power in debugging
> and understanding better your data.
> For proper maintenance , it is another good reason to stick with standards.
> 
> 
> 
> -
> ---
> Alessandro Benedetti
> Search Consultant, R&D Software Engineer, Director
> Sease Ltd. - www.sease.io
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
> 



Re: Where the uploaded configset from SOLR into zookeeper ensemble resides?

2017-09-28 Thread Michael Kuhlmann
Do you find your configs in the Solr admin panel, in the Cloud --> Tree
folder?

-Michael

Am 28.09.2017 um 04:50 schrieb Gunalan V:
> Hello,
> 
> Could you please let me know where I can find the configset uploaded from
> SOLR into the ZooKeeper ensemble?
> 
> In the docs it says they will be under "/configs/" but I'm not able to see
> the configs directory in ZooKeeper. Please let me know if I need to check
> somewhere else.
> 
> 
> Thanks!
> 



Re: Modifing create_core's instanceDir attribute

2017-09-28 Thread Michael Kuhlmann
I'd rather say you didn't quote the URL when sending it using curl.

Bash interprets the ampersand as a request to execute curl, with only the
URL up to CREATE, in the background - that's why the error is included within
the next output, followed by "Exit" - and it then tries to execute the
following parts of the URL as additional commands, which of course fails.

Just put the URL in quotes, and it will work much better.

-Michael

Am 27.09.2017 um 23:14 schrieb Miller, William K - Norman, OK - Contractor:
> I understand that this has to be done on the command line, but I don't know 
> where to put this structure or what it should look like.  Can you please be 
> more specific in this answer?  I have only been working with Solr for about 
> six months.
> 
> 
> 
> 
> ~~~
> William Kevin Miller
> 
> ECS Federal, Inc.
> USPS/MTSC
> (405) 573-2158
> 
> 
> -Original Message-
> From: Erick Erickson [mailto:erickerick...@gmail.com] 
> Sent: Wednesday, September 27, 2017 3:57 PM
> To: solr-user
> Subject: Re: Modifing create_core's instanceDir attribute
> 
> Standard command-line. You're doing this on the box itself, not through a 
> REST API.
> 
> Erick
> 
> On Wed, Sep 27, 2017 at 10:26 AM, Miller, William K - Norman, OK - Contractor 
>  wrote:
>> This is my first time to try using the core admin API.  How do I go about 
>> creating the directory structure?
>>
>>
>>
>>
>> ~~~
>> William Kevin Miller
>>
>> ECS Federal, Inc.
>> USPS/MTSC
>> (405) 573-2158
>>
>>
>> -Original Message-
>> From: Erick Erickson [mailto:erickerick...@gmail.com]
>> Sent: Wednesday, September 27, 2017 11:45 AM
>> To: solr-user
>> Subject: Re: Modifing create_core's instanceDir attribute
>>
>> Right, the core admin API is pretty low-level, it expects the base directory 
>> exists, you have to create the directory structure by hand.
>>
>> Best,
>> Erick
>>
>> On Wed, Sep 27, 2017 at 9:24 AM, Miller, William K - Norman, OK - Contractor 
>>  wrote:
>>> Thanks Erick for pointing me in this direction.  Unfortunately when I try 
>>> to us this I get an error.  Here is the command that I am using and the 
>>> response I get:
>>>
>>> https://solrserver:8983/solr/admin/cores?action=CREATE&name=mycore&instanceDir=/var/solr/data/mycore&dataDir=data&configSet=custom_configs
>>>
>>>
>>> [1] 32023
>>> [2] 32024
>>> [3] 32025
>>> -bash: https://solrserver:8983/solr/admin/cores?action=CREATE: No 
>>> such file or directory [4] 32026
>>> [1] Exit 127
>>> https://solrserver:8983/solr/adkmin/cores?action=CREATE
>>> [2] Donename=mycore
>>> [3]-DoneinstanceDir=/var/solr/data/mycore
>>> [4]+DonedataDir=data
>>>
>>>
>>> I even tried to use the UNLOAD action to remove a core and got the same 
>>> type of error as the -bash line above.
>>>
>>> I have tried searching online for an answer and have found nothing so far.  
>>> Any ideas why this error is occuring.
>>>
>>>
>>>
>>> ~~~
>>> William Kevin Miller
>>>
>>> ECS Federal, Inc.
>>> USPS/MTSC
>>> (405) 573-2158
>>>
>>> -Original Message-
>>> From: Erick Erickson [mailto:erickerick...@gmail.com]
>>> Sent: Tuesday, September 26, 2017 3:33 PM
>>> To: solr-user
>>> Subject: Re: Modifing create_core's instanceDir attribute
>>>
>>> I don't think you can. You can, however, use the core admin API to do 
>>> that,
>>> see:
>>> https://lucene.apache.org/solr/guide/6_6/coreadmin-api.html#coreadmin
>>> -
>>> api
>>>
>>> Best,
>>> Erick
>>>
>>> On Tue, Sep 26, 2017 at 1:14 PM, Miller, William K - Norman, OK - 
>>> Contractor  wrote:
>>>
 I know that when the create_core command is used that it sets the 
 core to the name of the parameter supplied with the “-c” option and 
 the instanceDir attribute in the http is also set to the name of the core.
 What I want is to tell the create_core to use a different 
 instanceDir parameter.  How can I go about doing this?





 I am using Solr 6.5.1 and it is running on a linux server using the 
 apache tomcat webserver.











 ~~~

 William Kevin Miller


 ECS Federal, Inc.

 USPS/MTSC

 (405) 573-2158






Re: Moving to Point, trouble with IntPoint.newRangeQuery()

2017-09-26 Thread Michael Kuhlmann
Arrgh, forget my question. I just see that newExactQuery() simply
triggers newRangeQuery() like you already do.

-Michael

Am 26.09.2017 um 13:29 schrieb Michael Kuhlmann:
> Hi Markus,
> 
> I don't know why there aren't any results. But just out of curiosity,
> why don't you use the better choice IntPoint.newExactQuery(String,int)?
> 
> What happens if you use that?
> 
> -Michael
> 
> Am 26.09.2017 um 13:22 schrieb Markus Jelsma:
>> Hello,
>>
>> I have a QParser impl. that transforms text input to one or more integers, 
>> it makes a BooleanQuery on a field with all integers in OR mode. It used to 
>> work by transforming the integer using LegacyNumericUtils.intToPrefixCoded, 
>> getting a BytesRef.
>>
>> I have now moved it to use IntPoint.newRangeQuery(field, integer, integer), 
>> i read (think javadocs) this is the way to go, but i get no matches!
>>
>> Iterator<Integer> i = digests.iterator();
>> while (i.hasNext()) {
>>   Integer digest = i.next();
>>   queryBuilder.add(IntPoint.newRangeQuery(field, digest, digest), 
>> Occur.SHOULD);
>> }
>> return queryBuilder.build();
>>
>> To be sure i didn't mess up elsewhere i also tried building a string for 
>> LuceneQParser and cheat:
>>
>> Iterator<Integer> i = digests.iterator();
>> while (i.hasNext()) {
>>   Integer digest = i.next();
>>   str.append(ClientUtils.escapeQueryChars(digest.toString()));
>>   if (i.hasNext()) {
>> str.append(" OR ");
>>   }
>> }
>> QParser luceneQParser = new LuceneQParser(str.append(")").toString(), 
>> localParams, params, req);
>> return luceneQParser.parse();
>>
>> Well, this works! This is their respective debug output:
>>
>> Using the IntPoint range query:
>>
>> 
>> 
>> 
>>   {!q  f=d1}value
>>   {!q  f=d1}value
>>   (d1:[-1820898630 TO -1820898630])
>>   d1:[-1820898630 TO -1820898630]
>>
>> LuceneQParser cheat, it does find!
>>
>> 
>>   
>> 1
>> -1820898630
>> 
>> 
>>   {!qd f=d1}value
>>   {!qd f=d1}value
>>   d1:-1820898630
>>
>> There is not much difference in output, it looks fine, using LuceneQParser 
>> you can also match using a range query, so what am i doing wrong?
>>
>> Many thanks!
>> Markus
>>
> 



Re: Moving to Point, trouble with IntPoint.newRangeQuery()

2017-09-26 Thread Michael Kuhlmann
Hi Markus,

I don't know why there aren't any results. But just out of curiosity,
why don't you use the better choice IntPoint.newExactQuery(String,int)?

What happens if you use that?

-Michael

Am 26.09.2017 um 13:22 schrieb Markus Jelsma:
> Hello,
> 
> I have a QParser impl. that transforms text input to one or more integers, it 
> makes a BooleanQuery on a field with all integers in OR mode. It used to 
> work by transforming the integer using LegacyNumericUtils.intToPrefixCoded, 
> getting a BytesRef.
> 
> I have now moved it to use IntPoint.newRangeQuery(field, integer, integer), i 
> read (think javadocs) this is the way to go, but i get no matches!
> 
> Iterator<Integer> i = digests.iterator();
> while (i.hasNext()) {
>   Integer digest = i.next();
>   queryBuilder.add(IntPoint.newRangeQuery(field, digest, digest), 
> Occur.SHOULD);
> }
> return queryBuilder.build();
> 
> To be sure i didn't mess up elsewhere i also tried building a string for 
> LuceneQParser and cheat:
> 
> Iterator<Integer> i = digests.iterator();
> while (i.hasNext()) {
>   Integer digest = i.next();
>   str.append(ClientUtils.escapeQueryChars(digest.toString()));
>   if (i.hasNext()) {
> str.append(" OR ");
>   }
> }
> QParser luceneQParser = new LuceneQParser(str.append(")").toString(), 
> localParams, params, req);
> return luceneQParser.parse();
> 
> Well, this works! This is their respective debug output:
> 
> Using the IntPoint range query:
> 
> 
> 
> 
>   {!q  f=d1}value
>   {!q  f=d1}value
>   (d1:[-1820898630 TO -1820898630])
>   d1:[-1820898630 TO -1820898630]
> 
> LuceneQParser cheat, it does find!
> 
> 
>   
> 1
> -1820898630
> 
> 
>   {!qd f=d1}value
>   {!qd f=d1}value
>   d1:-1820898630
> 
> There is not much difference in output, it looks fine, using LuceneQParser 
> you can also match using a range query, so what am i doing wrong?
> 
> Many thanks!
> Markus
> 



Re: Solr nodes crashing (OOM) after 6.6 upgrade

2017-09-22 Thread Michael Kuhlmann
Hi Shamik,

Funnily enough, we had a similar issue with our old legacy application
that still used plain Lucene code in a JBoss container.

Same here: there were no specific queries or updates causing this; the
performance just broke completely without unusual usage. GC was rising
up to 99% or so. Sometimes it came back after a while, but most often
we had to completely restart JBoss for that.

I never figured out what the root cause was, but my suspicion still is
that Lucene was innocent. I rather suspect Rackspace's hypervisor to be
the culprit.

So maybe you can give it a try and have a look at the Amazon cloud settings?

Best,
Michael

Am 22.09.2017 um 12:00 schrieb shamik:
> All the tuning and scaling down of memory seemed to be stable for a couple of
> days but then came down due to a huge spike in CPU usage, contributed by G1
> Old Generation GC. I'm really puzzled why the instances are suddenly
> behaving like this. It's not that a sudden surge of load contributed to
> this, query and indexing load seemed to be comparable with the previous time
> frame. Just wondering if the hardware itself is not adequate enough for 6.6.
> The instances are all running on 8 CPU / 30gb m3.2xlarge EC2 instances.
> 
> Does anyone ever face issues similar to this?
> 
> 
> 
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
> 



Re: solr Facet.contains

2017-09-15 Thread Michael Kuhlmann
What is the field type? Which Analyzers are configured?
How do you split at "~"? (You have to do it by yourself, or configure
some tokenizer for that.)
What do you get when you don't filter your facets?
What do you mean by "it is not working"? What is your result now?

-Michael


 Am 15.09.2017 um 13:43 schrieb vobium:
> Hello,
>
> I want to limit my facet data by using a substring (only values that contain
> the specified substring). My Solr version is 4.8.0.
>
> e.g. if docs contain such strings (the field with this type of data is
> multivalued and split with "~")
>
>  India/maha/mumbai~India/gujarat/badoda
>  India/goa/xyz
>  India/raj/jaypur
>  1236/maha/890~India/maha/kolhapur
>  India/maha/mumbai
>  India/maha/nashik
>  Uk/Abc/Cde
>
>
> Expected  facet Data that contain maha as  substring
> o/p
> India/maha/mumbai (2)
>  India/maha/kolhapur(1)
>  India/maha/nashik(1)
> 1236/maha/890(1)
>
> I tried it by using facet.contains but it is not working,
> so please give a solution for this issue.
>
>
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html




Re: ways to check if document is in a huge search result set

2017-09-13 Thread Michael Kuhlmann
Am 13.09.2017 um 04:04 schrieb Derek Poh:
> Hi Michael
>
> "Then continue using binary search depending on the returned score
> values."
>
> May I know what do you mean by using binary search?

An example algorithm is in Java method java.util.Arrays::binarySearch.

Or more detailed: https://en.wikipedia.org/wiki/Binary_search_algorithm

Best,
Michael



Re: ways to check if document is in a huge search result set

2017-09-12 Thread Michael Kuhlmann
So you're looking for a solution to validate the result output.

You have two ways:
1. Assuming you're sorting by the default "score" sort option:
Find the result you're looking for by setting the fq filter clause
accordingly, and add "score" to the fl field list.
Then do the normal unfiltered search, still including "score", and start
with a page far into the results, let's say 50,000.
Then continue using binary search depending on the returned score values.

2. Set fl to return only the supplier id, then you'll probably be able
to return several tens of thousands of results at once.


But be warned: the result position of these elements can vary with every
single commit, especially when there are lots of documents with the same score
value.
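
A rough SolrJ sketch of the first approach; the URL, the query and the supplier
field are assumptions, and score ties make the found position approximate:

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

// Sketch only: find roughly at which offset (and hence page) a supplier's best
// document shows up in a score-sorted result, without paging through everything.
public class FindResultPosition {
    public static void main(String[] args) throws Exception {
        String url = "http://localhost:8983/solr/products";     // assumption
        String userQuery = "some search terms";                 // assumption
        try (SolrClient client = new HttpSolrClient.Builder(url).build()) {
            // 1. Get the score of the supplier's best document for this query.
            SolrQuery filtered = new SolrQuery(userQuery);
            filtered.addFilterQuery("supplierId:12345");         // assumption
            filtered.setFields("id", "score");
            filtered.setRows(1);
            float targetScore = (Float) client.query(filtered).getResults().get(0).get("score");

            // 2. Binary search over the unfiltered, score-sorted result offsets.
            long lo = 0;
            long hi = client.query(new SolrQuery(userQuery).setRows(0)).getResults().getNumFound();
            while (lo < hi) {
                long mid = (lo + hi) / 2;
                SolrQuery probe = new SolrQuery(userQuery);
                probe.setFields("score");
                probe.setStart((int) mid);
                probe.setRows(1);
                float scoreAtMid = (Float) client.query(probe).getResults().get(0).get("score");
                if (scoreAtMid > targetScore) {
                    lo = mid + 1;      // target document is further down the list
                } else {
                    hi = mid;          // target is at this offset or earlier
                }
            }
            System.out.println("Approximate offset: " + lo + ", page: " + (lo / 20 + 1));
        }
    }
}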

-Michael


Am 12.09.2017 um 03:21 schrieb Derek Poh:
> Some additional information.
>
> I have a query from a user that a supplier's product(s) is not in the
> search result.
> I debugged by adding a fq on the supplier id to the query to verify
> the supplier's product is in the search result. The products do exist in
> the search result.
> I want to tell the user in which page of the search result the supplier's
> products appear. To do this I go through each page of the search
> result to find the supplier's product.
> It is still fine if the search result has a few hundred products, but
> it will be a chore if the result has thousands. In this case there
> are more than 100,000 products in the result.
>
> Any advice on easier ways to check in which page the supplier's product
> or document appears in a search result?
>
> On 9/11/2017 2:44 PM, Mikhail Khludnev wrote:
>> You can request facet field, query facet, filter or even explainOther.
>>
>> On Mon, Sep 11, 2017 at 5:12 AM, Derek Poh 
>> wrote:
>>
>>> Hi
>>>
>>> I have a collection of productdocument.
>>> Each productdocument has supplier information in it.
>>>
>>> I need to check if a supplier's products are returned in a search
>>> result containing over 100,000 products and in which page (assuming
>>> pagination is 20 products per page).
>>> It is time-consuming and "labour-intensive" to go through each page
>>> to look for the product of the supplier.
>>>
>>> Would like to know if you guys have any better and easier ways to do
>>> this?
>>>
>>> Derek
>>>
>>> --
>>> CONFIDENTIALITY NOTICE
>>> This e-mail (including any attachments) may contain confidential and/or
>>> privileged information. If you are not the intended recipient or have
>>> received this e-mail in error, please inform the sender immediately and
>>> delete this e-mail (including any attachments) from your computer,
>>> and you
>>> must not use, disclose to anyone else or copy this e-mail (including
>>> any
>>> attachments), whether in whole or in part.
>>> This e-mail and any reply to it may be monitored for security, legal,
>>> regulatory compliance and/or other appropriate reasons.
>>
>>
>>
>
>
> --
> CONFIDENTIALITY NOTICE
> This e-mail (including any attachments) may contain confidential
> and/or privileged information. If you are not the intended recipient
> or have received this e-mail in error, please inform the sender
> immediately and delete this e-mail (including any attachments) from
> your computer, and you must not use, disclose to anyone else or copy
> this e-mail (including any attachments), whether in whole or in part.
> This e-mail and any reply to it may be monitored for security, legal,
> regulatory compliance and/or other appropriate reasons.




Re: ways to check if document is in a huge search result set

2017-09-11 Thread Michael Kuhlmann
Maybe I don't understand your problem, but why don't you just filter by
"supplier information"?

-Michael

Am 11.09.2017 um 04:12 schrieb Derek Poh:
> Hi
>
> I have a collection of productdocument.
> Each productdocument has supplier information in it.
>
> I need to check if a supplier's products are returned in a search
> result containing over 100,000 products and in which page (assuming
> pagination is 20 products per page).
> It is time-consuming and "labour-intensive" to go through each page to
> look for the product of the supplier.
>
> Would like to know if you guys have any better and easier ways to do this?
>
> Derek
>
> --
> CONFIDENTIALITY NOTICE
> This e-mail (including any attachments) may contain confidential
> and/or privileged information. If you are not the intended recipient
> or have received this e-mail in error, please inform the sender
> immediately and delete this e-mail (including any attachments) from
> your computer, and you must not use, disclose to anyone else or copy
> this e-mail (including any attachments), whether in whole or in part.
> This e-mail and any reply to it may be monitored for security, legal,
> regulatory compliance and/or other appropriate reasons.




Re: Solr Issue

2017-09-07 Thread Michael Kuhlmann
Hi Patrick,

can you attach the query you're sending to Solr and one example result?
Or more specific, what are your hl.* parameters?

-Michael

Am 07.09.2017 um 09:36 schrieb Patrick Fallert:
>
> Hey Guys, 
> I've got a problem with my Solr highlighter.
> When I search for a word, I get some results. For every result I want
> to display the highlighted text, and here is my problem: some of the
> returned documents have a highlighted text, the other ones don't. I
> don't know why that is, but I need to fix this problem. Below is the
> configuration of my managed-schema. The configuration of the
> highlighter in solrconfig.xml is default.
> I hope someone can help me. If you need more details you can ask me
> for sure.
>
> managed-schema:
>
> 
> 
> 
> id
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
>  sortMissingLast="true" multiValued="true"/>
>  currencyConfig="currency.xml" defaultCurrency="USD" precisionStep="8"/>
>  positionIncrementGap="0" docValues="true" precisionStep="0"/>
>  positionIncrementGap="0" docValues="true" multiValued="true"
> precisionStep="0"/>
>  indexed="true" stored="false">
> 
> 
> 
> 
> 
>  indexed="true" stored="false">
> 
> 
>  encoder="integer"/>
> 
> 
>  indexed="true" stored="false">
> 
> 
>  encoder="identity"/>
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
>  positionIncrementGap="0" docValues="true" precisionStep="0"/>
>  positionIncrementGap="0" docValues="true" multiValued="true"
> precisionStep="0"/>
>  positionIncrementGap="0" docValues="true" precisionStep="0"/>
>  positionIncrementGap="0" docValues="true" multiValued="true"
> precisionStep="0"/>
>  stored="false" docValues="false" multiValued="true"/>
>  positionIncrementGap="0" docValues="true" precisionStep="0"/>
>  positionIncrementGap="0" docValues="true" multiValued="true"
> precisionStep="0"/>
>  docValues="true"/>
>  class="solr.SpatialRecursivePrefixTreeFieldType" geo="true"
> maxDistErr="0.001" distErrPct="0.025" distanceUnits="kilometers"/>
>  positionIncrementGap="0" docValues="true" precisionStep="0"/>
>  positionIncrementGap="0" docValues="true" multiValued="true"
> precisionStep="0"/>
>  positionIncrementGap="100">
> 
> 
> 
> 
> 
> 
>  multiValued="true"/>
> 
>  docValues="true" multiValued="true"/>
> 
>  docValues="true" multiValued="true"/>
>  stored="false">
> 
> 
> 
> 
> 
> 
>  multiValued="true"/>
> 
>  multiValued="true"/>
>  dimension="2"/>
> 
>  docValues="true"/>
>  docValues="true" multiValued="true"/>
>  positionIncrementGap="0" docValues="true" precisionStep="6"/>
>  positionIncrementGap="0" docValues="true" multiValued="true"
> precisionStep="6"/>
>  positionIncrementGap="0" docValues="true" precisionStep="8"/>
>  positionIncrementGap="0" docValues="true" multiValued="true"
> precisionStep="8"/>
>  positionIncrementGap="100">
> 
> 
> 
>  ignoreCase="true"/>
> 
> 
> 
> 
>  positionIncrementGap="100">
> 
> 
> 
>  ignoreCase="true"/>
> 
> 
> 
>  positionIncrementGap="100">
> 
> 
>  articles="lang/contractions_ca.txt" ignoreCase="true"/>
> 
>  ignoreCase="true"/>
> 
> 
> 
>  positionIncrementGap="100">
> 
> 
> 
> 
> 
> 
> 
>  positionIncrementGap="100">
> 
> 
> 
>  ignoreCase="true"/>
> 
> 
> 
>  positionIncrementGap="100">
> 
> 
> 
>  words="lang/stopwords_da.txt" ignoreCase="true"/>
> 
> 
> 
>  positionIncrementGap="100">
> 
> 
> 
>  words="lang/stopwords_de.txt" ignoreCase="true"/>
> 
> 
> 
> 
>  positionIncrementGap="100">
> 
> 
> 
>  ignoreCase="false"/>
> 
> 
> 
>  positionIncrementGap="100">
> 
> 
>  ignoreCase="true"/>
> 
> 
>  protected="protwords.txt"/>
> 
> 
> 
> 
>  ignoreCase="true" synonyms="synonyms.txt"/>
>  ignoreCase="true"/>
> 
> 
>  protected="protwords.txt"/>
> 
> 
> 
>  autoGeneratePhraseQueries="true" positionIncrementGap="100">
> 
> 
>  ignoreCase="true"/>
>  catenateNumbers="1" generateNumberParts="1" splitOnCaseChange="1"
> generateWordParts="1" catenateAll="0" catenateWords="1"/>
> 
>  protected="protwords.txt"/>
> 
> 
> 
> 
> 
>  ignoreCase="true" synonyms="synonyms.txt"/>
>  ignoreCase="true"/>
>  catenateNumbers="0" generateNumberParts="1" splitOnCaseChange="1"
> generateWordParts="1" catenateAll="0" catenateWords="0"/>
> 
>  protected="protwords.txt"/>
> 
> 
> 
>  autoGeneratePhraseQueries="true" positionIncrementGap="100">
> 
> 
>  ignoreCase="true" synonyms="synonyms.txt"/>
>  ignoreCase="true"/>
>  catenateNumbers="1" generateNumberParts="0" generateWordParts="0"
> catenateAll="0" catenateWords="1"/>
> 
>  protected="protwords.txt"/>
> 
> 
> 
> 
> 
> 
>  ignoreCase="true" synonyms="synonyms.txt"/>
>  ignoreCase="true"/>
>  catenateNumbers="1" generateNumberParts="0" generateWordParts="0"
> catenateAll="0" catenateWords="1"/>
> 
>  protected="protwords.txt"/>
> 
> 
> 
> 
>  positionIncrementGap="100">
> 
> 
> 
>  words="lang/stopwords_es.txt" ignoreCase="true"/>
> 
> 
> 
>  positionIncrementGap="100">
> 
> 
> 
>  ignoreCase="true"/>
> 
> 
> 
>  positionIncrementGap="100">
> 
> 
> 
> 
> 
> 
>  ignoreCase="true"/>
> 
> 
>  positionIncrementGap="100">
> 
> 
> 
>  words="lang/stopwords_fi.txt" 

Re: Solr6.6 Issue/Bug

2017-09-06 Thread Michael Kuhlmann
Why would you need to start Solr as root? You should definitely not do
this, there's no reason for that.

And even if you *really* want this: What's so bad about the -force option?

-Michael

Am 06.09.2017 um 07:26 schrieb Kasim Jinwala:
> Dear team,
>   I have been using Solr 5.0 for the last year; now we are planning to upgrade
> to Solr 6.6.
>  While trying to start Solr as the root user, we need to pass the -force
> parameter to start Solr forcefully;
> please help us start Solr as the root user without the -force option.
>
> Regards
> Kasim J.
>



Re: Error after moving index

2017-06-22 Thread Michael Kuhlmann
Hi Moritz,

did you stop your local Solr sever before? Copying data from a running
instance may cause headaches.

If yes, what happens if you copy everything again? It seems that your
copy operation wasn't successful.

Best,
Michael

Am 22.06.2017 um 14:37 schrieb Moritz Munte:
> Hello,
>
>  
>
> I created an index on my local machine (Windows 10) and it works fine there.
>
> After uploading the index to the production server (Linux), the server shows
> an error:
.


Re: Solr NLS custom query parser

2017-06-15 Thread Michael Kuhlmann
Hi Arun,

your question is too generic. What do you mean with nlp search? What do
you expect to happen?

The short answer is: No, there is no such parser because the individual
requirements will vary a lot.

-Michael

Am 14.06.2017 um 16:32 schrieb aruninfo100:
> Hi,
>
> I am trying to configure NLP search with Solr. I am using OpenNLP for the
> same.I am able to index the documents and extract named entities and POS
> using OpenNLP-UIMA support and also by using a UIMA Update request processor
> chain.But I am not able to write a query parser for the same.Is there a
> query parser already written to satisfy the above features(nlp search).
>
> Thanks and Regards,
> Arun
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Solr-NLS-custom-query-parser-tp4340511.html
> Sent from the Solr - User mailing list archive at Nabble.com.




Re: Mixing AND OR conditions with query parameters

2017-04-24 Thread Michael Kuhlmann
Make sure to have whitespace around the OR operator.

The parentheses should go around the OR query only, not including the "fq="
-- that part belongs outside the parentheses (which are not necessary at
all).

What exactly are you expecting?
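
For example, as a SolrJ sketch (field names taken from your query):

import org.apache.solr.client.solrj.SolrQuery;

// Sketch only: two separate fq parameters are ANDed by Solr; the OR belongs
// inside a single fq, with spaces around the operator.
public class FilterQueryExample {
    public static void main(String[] args) {
        SolrQuery query = new SolrQuery("*:*");
        query.addFilterQuery("cioname:\"XYZ\"");
        query.addFilterQuery("attr1:true OR attr2:true");
        System.out.println(query);
        // roughly: q=*:*&fq=cioname:"XYZ"&fq=attr1:true OR attr2:true
    }
}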

-Michael

Am 24.04.2017 um 12:59 schrieb VJ:
> Hi All,
>
> I am facing issues with OR/AND conditions with query parameters:
>
> fq=cioname:"XYZ" & (fq=attr1:trueORattr2:true)
>
> The queries are not returning expected results.
>
> I have tried various permutations and combinations but couldn't get it
> working. Any pointers on this?
>
>
>
> Regards,
> VJ
>



Re: fq performance

2017-03-17 Thread Michael Kuhlmann

Hi Ganesh,

you might want to use something like this:

fq=access_control:(g1 g2 g5 g99 ...)

Then it's only one fq filter per request. Internally it's like an OR condition,
but in a more condensed form. I have already used this with up to 500 values
without major performance degradation (but in that case it was the unique id
field).

You should think a minute about your filter cache here. Since you only have one
fq filter per request, you won't blow your cache that fast. But it depends on
your use case whether you should cache these filters at all. When it's common
that a single user will send several requests within one commit interval, or
when it's likely that several users will be in the same groups, then just use
it like that. But when it's more likely that each request belongs to a
different user with different security settings, then you should consider
disabling the cache for this fq filter so that your filter cache (for other
filters you probably have) won't be polluted:
fq={!cache=false}access_control:(g1 g2 g5 g99 ...). See
http://yonik.com/advanced-filter-caching-in-solr/ for information on that.
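
As a small SolrJ sketch, with the field and group names from this thread; the
decision whether to cache is up to your traffic pattern:

import java.util.Arrays;
import java.util.List;
import org.apache.solr.client.solrj.SolrQuery;

// Sketch only: collapse all of the user's groups into one fq, and optionally
// mark it as non-cacheable so it does not pollute the filter cache.
public class GroupFilterExample {
    public static void main(String[] args) {
        List<String> groups = Arrays.asList("g1", "g2", "g5", "g99");  // the user's groups
        SolrQuery query = new SolrQuery("somefield:value");
        query.addFilterQuery("{!cache=false}access_control:(" + String.join(" ", groups) + ")");
        System.out.println(query);
        // roughly: q=somefield:value&fq={!cache=false}access_control:(g1 g2 g5 g99)
    }
}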

-Michael



Am 17.03.2017 um 07:46 schrieb Ganesh M:

Hi Shawn / Michael,

Thanks for your replies and I guess you have got my scenarios exactly right.

Initially my document contains information about who has access to the
document, with a field like (U1_s:true). If 100 users can access a document,
we will have 100 such fields, one for each user.
So when U1 wants to see all these documents, I will query like: get all
documents where U1_s:true.

If user U5 is added to group G1, then I have to take all the documents of
group G1 and set the information of user U5 in the document, like
U5_s:true. For this, I have to re-index all the documents in
that group.

To avoid this, I was trying to keep group information instead of user
information, like G1_s:true, G2_s:true, in the document. And for querying a
user's documents, I will first get all the groups of user U1, and then query:
get all documents where G1_s:true OR G2_s:true OR G3_s:true. By this we
don't need to re-index all the documents. But while querying I need to
query with an OR of all the groups the user belongs to.

For how many ORs can Solr give the results in less than one second? Can I
pass hundreds of OR conditions in the Solr query? Will that affect the
performance?

Pls share your valuable inputs.

On Thu, Mar 16, 2017 at 6:04 PM Shawn Heisey  wrote:


On 3/16/2017 6:02 AM, Ganesh M wrote:

We have 1 million of documents and would like to query with multiple fq

values.

We have kept the access_control ( multi value field ) which holds

information about for which group that document is accessible.

Now to get the list of all the documents of an user, we would like to

pass multiple fq values ( one for each group user belongs to )



q:somefiled:value:access_control:g1:access_control:g2:access_control:g3:access_control:g4:access_control:g5...

Like this, there could be 100 groups for an user.

The correct syntax is fq=field:value -- what you have there is not going
to work.

This might not do what you expect.  Filter queries are ANDed together --
*every* filter must match, which means that if a document that you want
has only one of those values in access_control, or has 98 of them but
not all 100, then the query isn't going to match that document.  The
solution is one filter query that can match ANY of them, which also
might run faster.  I can't say whether this is a problem for you or
not.  Your data might be completely correct for matching 100 filters.

Also keep in mind that there is a limit to the size of a URL that you
can send into any webserver, including the container that runs Solr.
That default limit is 8192 bytes, and includes the "GET " or "POST " at
the beginning and the " HTTP/1.1" at the end (note the spaces).  The
filter query information for 100 of the filters you mentioned is going
to be over 2K, which will fit in the default, but if your query has more
complexity than you have mentioned here, the total URL might not fit.
There's a workaround to this -- use a POST request and put the
parameters in the request body.


If we fire query with 100 values in the fq, whats the penalty on the

performance ? Can we get the result in less than one second for 1 million
of documents.

With one million documents, each internal filter query result is 125,000
bytes -- the number of documents divided by eight.  That's 12.5 megabytes
for 100 of them.  In addition, every time a filter is run, it must
examine every document in the index to create that 125,000 byte
structure, which means that filters which *aren't* found in the
filterCache are relatively slow.  If they are found in the cache,
they're lightning fast, because the cache will contain the entire 125,000
byte bitset.

If you make your filterCache large enough, it's going to consume a LOT
of java heap memory, particularly if the index gets bigger.  The 

Re: fq performance

2017-03-16 Thread Michael Kuhlmann
First of all, from what I can see, this won't do what you're expecting. 
Multiple fq conditions are always combined using AND, so if a user is 
member of 100 groups, but the document is accessible to only 99 of them, 
then the user won't find it.


Or in other words, if you add a user to some group, then she would get 
*fewer* results than before.


But coming back to your performance question: Just try it. Having 100 fq 
conditions will of course slow down your query a bit, but not that much. 
I rather see the problem with the filter cache: It will only be fast 
enough if all of your fq filters fit into the cache. Each possible fq 
filter will take 1 million/8 == 125k bytes, so having hundreds of 
possible access groups conditions might blow up your query cache (which 
must fit into RAM).


-Michael


Am 16.03.2017 um 13:02 schrieb Ganesh M:

Hi,

We have 1 million of documents and would like to query with multiple fq values.

We have kept the access_control ( multi value field ) which holds information 
about for which group that document is accessible.

Now to get the list of all the documents of an user, we would like to pass 
multiple fq values ( one for each group user belongs to )

q:somefiled:value&
fq:access_control:g1:access_control:g2:access_control:g3:access_control:g4:access_control:g5...

Like this, there could be 100 groups for an user.

If we fire query with 100 values in the fq, whats the penalty on the 
performance ? Can we get the result in less than one second for 1 million of 
documents.

Let us know your valuable inputs on this.

Regards,





Re: Sorl 6 with jetty issues

2017-02-20 Thread Michael Kuhlmann
This may be related to SOLR-10130.

Am 20.02.2017 um 14:06 schrieb ~$alpha`:
> Issues with solr settings while migrating from solr 4.0 to solr6.0.
>
> Issue Faced: My CPU consumption goes to unacceptable levels, i.e. the load on
> solr4.0 is between 6 and 10 while the load on solr 6 reaches 100, and since
> it's production I rolled back quickly.
>
> My Solr4 setting
>
>  - Running on tomcat
>  - JVM Memory : 16GB
>  - 24 core cpu
>  - JVM settings :
>- JVM Runtime Java HotSpot(TM) 64-Bit Server VM (24.45-b08) 
>- Processors   24 
>- Args : Paths mentioned here
>
>
> **My Solr6 setting**
>
>  - Running on jetty
>  - JVM Memory : 20GB
>  - 32 core cpu
>  - JVM settings :
>- Oracle Corporation Java HotSpot(TM) 64-Bit Server VM 1.8.0_45 25.45-b02
>- Processors   32
>- Args
>   - DSTOP.KEY=solrrocks
>   - DSTOP.PORT=7983
>   - Djetty.home=/usr/local/solr-6.4.1/server-Djetty.port=8983
>   -
> Dlog4j.configuration=file:/usr/local/solr-6.4.1/example/resources/log4j.properties
>   -
> Dsolr.install.dir=/usr/local/solr-6.4.1-Dsolr.log.dir=/usr/local/solr-6.4.1/example/techproducts/solr/../logs
>   - Dsolr.log.muteconsole
>   -
> Dsolr.solr.home=/usr/local/solr-6.4.1/example/techproducts/solr-Duser.timezone=US/Eastern
>   - XX:+AggressiveOpts
>   - XX:+CMSParallelRemarkEnabled
>   - XX:+CMSScavengeBeforeRemark
>   - XX:+ParallelRefProcEnabled
>   - XX:+PrintGCApplicationStoppedTime
>   - XX:+PrintGCDateStamps
>   - XX:+PrintGCDetails
>   - XX:+PrintGCTimeStamps
>   - XX:+PrintHeapAtGC
>   - XX:+PrintTenuringDistribution
>   - XX:+UseCMSInitiatingOccupancyOnly
>   - XX:+UseConcMarkSweepGC
>   - XX:+UseGCLogFileRotation
>   - XX:-UseSuperWord
>   - XX:CMSFullGCsBeforeCompaction=1
>   - XX:CMSInitiatingOccupancyFraction=70
>   - XX:CMSMaxAbortablePrecleanTime=6000
>   - XX:CMSTriggerPermRatio=80
>   - XX:GCLogFileSize=20M
>   - XX:MaxTenuringThreshold=8
>   - XX:NewRatio=2
>   - XX:NumberOfGCLogFiles=9
>   - XX:OnOutOfMemoryError=/usr/local/solr-6.4.1/bin/oom_solr.sh 8983
> /usr/local/solr-6.4.1/example/techproducts/solr/../logs
>   - XX:PretenureSizeThreshold=64m
>   - XX:SurvivorRatio=15
>   -
> XX:TargetSurvivorRatio=90-Xloggc:/usr/local/solr-6.4.1/example/techproducts/solr/../logs/solr_gc.log-Xms21g-Xmx21g-Xss256k-verbose:gc
> What I am looking for
>
> My guess is it's related to the GC settings of Jetty, as I am not an expert in
> Jetty (Java 8). Please help with how to tune these settings. Also, how should I
> choose these values, or how do I debug this issue?
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Sorl-6-with-jetty-issues-tp4321291.html
> Sent from the Solr - User mailing list archive at Nabble.com.




Re: Select TOP 10 items from Solr Query

2017-02-17 Thread Michael Kuhlmann
It's not possible to do such a thing in one request with faceting only.
The problem is that you need a fixed filter in place while the facet
algorithm is iterating over the documents; you can't look ahead to future
elements to find out which ones the top 10 will be.

So either you stick with two queries (which may be fast enough anyway
when you only have ca. 100 items in your collection), or you fetch the
data for the top 10 items and do the calculation on your own.
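
A rough SolrJ sketch of the two-query variant; the collection URL and the
"amount" field are assumptions:

import java.util.stream.Collectors;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.FacetField;

// Sketch only: query 1 finds the ten most frequent itemNo values, query 2 runs
// the JSON facet (sum/avg over a hypothetical "amount" field) on just those.
public class TopTenThenFacet {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient client =
                 new HttpSolrClient.Builder("http://localhost:8983/solr/transactions").build()) {
            SolrQuery top = new SolrQuery("*:*");
            top.setRows(0);
            top.addFacetField("itemNo");
            top.setFacetLimit(10);
            top.setFacetMinCount(1);
            FacetField itemFacet = client.query(top).getFacetField("itemNo");
            String itemList = itemFacet.getValues().stream()
                    .map(FacetField.Count::getName)
                    .collect(Collectors.joining(" "));

            SolrQuery stats = new SolrQuery("itemNo:(" + itemList + ")");
            stats.setRows(0);
            stats.setParam("json.facet",
                    "{totalAmount:\"sum(amount)\", avgAmount:\"avg(amount)\"}");
            System.out.println(client.query(stats).getResponse().get("facets"));
        }
    }
}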

-Michael

Am 17.02.2017 um 11:35 schrieb Zheng Lin Edwin Yeo:
> I'm looking at JSON facet for both of type:terms and type:range.
>
> For example, I may have 100 Items in my collections, and each item can have
> many transactions. But I'm only interested to look at the top 10 items
> which has the highest transaction rate (ie the highest count)
>
> I'm doing a calculation of the total amount and average amount. However, I
> will only want the total amount and average amount to be calculated based
> on the top 10 items which has the highest transaction rate, and not all the
> 100 items.
>
> For now, I need the additional query to get the top 10 items first, before
> I run the JSON Facet to get the total amount and average amount for that 10
> items.
>
> Regards,
> Edwin
>
>
> On 17 February 2017 at 18:02, alessandro.benedetti 
> wrote:
>
>> I think we are missing something here ...
>> You want to fetch the top 10 results for your query, and allow the user to
>> navigate only those 10 results through facets ?
>>
>> Which facets are you interested in ?
>> Field facets ?
>> Whatever facet you want, calculating it in your client, on 10 results
>> shouldn't be that problematic.
>> Are we missing something ? Why you would need an additional query ?
>>
>> Cheers
>>
>>
>>
>> -
>> ---
>> Alessandro Benedetti
>> Search Consultant, R&D Software Engineer, Director
>> Sease Ltd. - www.sease.io
>> --
>> View this message in context: http://lucene.472066.n3.
>> nabble.com/Select-TOP-10-items-from-Solr-Query-tp4320863p4320910.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>>



Re: Select TOP 10 items from Solr Query

2017-02-17 Thread Michael Kuhlmann
Since you already have the top x items then, wouldn't it be much easier
to collect the "facet" data from the result list on your own?

Am 17.02.2017 um 10:18 schrieb Zheng Lin Edwin Yeo:
> Hi Michael,
>
> Yes, I only want the JSON Facet to query based on the returned result set
> of the itemNo from the 1st query.
>
> There's definitely more than the 10, but we just need the top 10 in this
> case. As the top 10 itemNo may change, so we have to get the returned
> result set of the itemNo each time we want to do the JSON Facet.
>
> Regards,
> Edwin
>
>
> On 17 February 2017 at 15:42, Michael Kuhlmann <k...@solr.info> wrote:
>
>> So basically you want faceting only on the returned result set?
>>
>> I doubt that this is possible without additional queries. The issue is
>> that faceting and result collecting is done within one iteration, so
>> when some document (actually the document's internal id) is fetched as a
>> possible result item, you can't determine whether this will make it into
>> the top x elements or not since there will come more.
>>
>> -Michael
>>
>> Am 17.02.2017 um 05:00 schrieb Zheng Lin Edwin Yeo:
>>> Hi,
>>>
>>> Would like to check, is it possible to do a select of say TOP 10 items
>> from
>>> Solr query, and use the list of the items to do another query (Eg: JSON
>>> Facet)?
>>>
>>> Currently, I'm using a normal facet to retrieve the list of the TOP 10
>> item
>>> from the normal faceting.
>>> After which, I have to list out all the 10 items as a filter when I do
>> the
>>> JSON Facet like this
>>> q=itemNo:(001 002 003 004 005 006 007 008 009 010)
>>>
>>> It will help if I can combine both of this into a single query.
>>>
>>> I'm using Solr 6.4.1
>>>
>>> Regards,
>>> Edwin
>>>
>>



Re: Select TOP 10 items from Solr Query

2017-02-16 Thread Michael Kuhlmann
So basically you want faceting only on the returned result set?

I doubt that this is possible without additional queries. The issue is
that faceting and result collecting is done within one iteration, so
when some document (actually the document's internal id) is fetched as a
possible result item, you can't determine whether this will make it into
the top x elements or not, since more will come.

-Michael

Am 17.02.2017 um 05:00 schrieb Zheng Lin Edwin Yeo:
> Hi,
>
> Would like to check, is it possible to do a select of say TOP 10 items from
> Solr query, and use the list of the items to do another query (Eg: JSON
> Facet)?
>
> Currently, I'm using a normal facet to retrieve the list of the TOP 10 item
> from the normal faceting.
> After which, I have to list out all the 10 items as a filter when I do the
> JSON Facet like this
> q=itemNo:(001 002 003 004 005 006 007 008 009 010)
>
> It will help if I can combine both of this into a single query.
>
> I'm using Solr 6.4.1
>
> Regards,
> Edwin
>



Re: Continual garbage collection loop

2017-02-15 Thread Michael Kuhlmann
The number of cores is not *that* important compared to the index
size, but each core has its memory overhead. For instance, caches are
per core, so you have 36 individual caches of each type.

Best,
Michael

Am 14.02.2017 um 16:39 schrieb Leon STRINGER:
>> On 14 February 2017 at 14:44 Michael Kuhlmann <k...@solr.info> wrote:
>>
>>
>> Wow, running 36 cores with only half a gigabyte of heap memory is
>> *really* optimistic!
>>
>> I'd raise the heap size to some gigabytes at least and see how it's
>> working then.
>>
> I'll try increasing the heap size and see if I get the problem again.
>
> Is core quantity a big issue? As opposed to the size of the cores? Yes, 
> there's
> 36 but some relate to largely inactive web sites so the average size (assuming
> my "Master (Searching)" way of calculating this is correct) is less than 4 
> MB. I
> naively assumed a heap size-related issue would result from larger data sets.
>
> Thanks for your recommendation,
>
> Leon Stringer
>



Re: Continual garbage collection loop

2017-02-14 Thread Michael Kuhlmann
Wow, running 36 cores with only half a gigabyte of heap memory is
*really* optimistic!

I'd raise the heap size to some gigabytes at least and see how it's
working then.

-Michael

Am 14.02.2017 um 15:23 schrieb Leon STRINGER:
> Further background on the environment:
>
> There are 36 cores, with a total size of 131 MB (based on the size reported by
> "Master (Searching)" in the web console).
>
> The Java memory parameters in use are: -Xms512m -Xmx512m.
>
>> On 14 February 2017 at 05:45 Erick Erickson 
>> wrote:
>>
>> GCViewer is a nifty tool for visualizing the GC activity BTW.
>>
> I don't know what I'm looking for but for a log covering a 3-hour period today
> the "Summary" tab says (typed manually, apologies for any mistakes):
>
> Total heap (usage / alloc. max): 490.7M (100.0%) / 490.7M
>
> Max heap after conc GC: 488.7M (99.6%)
>
> Max tenured after conc GC: 382M (99.5% / 77.9%)
>
> Max heap after full GC: 490M (99.9%)
>
> Freed Memory: 141,811.4M
>
> Freed Mem/Min: 748.554M/min
>
> Total Time: 3h9m26s
>
> Accumulated pauses: 883.6s
>
> Throughput: 92.23%
>
> Number of full gc pauses: 476
>
> Full GC Performance: 101.4M/s
>
> Number of gc pauses: 15153
>
> GC Performance: 245.5M/s
>
>
> "Memory" tab:
>
> Total heap (usage / alloc. max): 490.7M (100.0%) / 490.7M
>
> Tenured heap (usage / alloc. max): 384M (100.0%) / 384M
>
> Young heap (usage / alloc. max): 106.7M (100.0%) / 106.7M
>
> Perm heap (usage / alloc. max): 205.6M (17.0%) / 1,212M
>
> Max tenured after conc GC: 382M (99.5% / 77.9%)
>
> Avg tenured after conc GC: 247.5M (delta=17.612M)
>
> Max heap after conc GC: 488.7M (99.6%)
>
> Avg heap after conc GC: 252.6M (delta=35.751M)
>
> Max heap after full GC: 490M (99.9%)
>
> Avg heap after full GC: 379M (delta=72.917M)
>
> Avg after GC: 359.9M  (delta=40.965M)
>
> Freed by full GC: 47,692.8M (33.6%)
>
> Freed by GC: 94,118.7M (66.4%)
>
> Avg freed full GC: 100.2M/coll (delta=68.015M) [greyed]
>
> Avg freed GC: 6,360.3K/coll (delta=19.963M) [greyed]
>
> Avg rel inc after FGC: -199,298B/coll
>
> Avg rel inc after GC: 6,360.3K/coll (delta=19.963M)
>
> Slope full GC: -126,380B/s
>
> Slope GC: 14.317M/s
>
> InitiatingOccFraction (avg / max): 65.9% / 100.0%
>
> Avg promotion: 2,215.324K/coll (delta=6,904.174K) [greyed]
>
> Total promotion: 12,504.467M
>
>
> Can anyone shed any light on this? Is it a problem or is this all normal?
>
> Thanks,
>
> Leon Stringer




Re: FacetField-Result on String-Field contains value with count 0?

2017-01-13 Thread Michael Kuhlmann
Then I don't understand your problem. Solr already does exactly what you
want.

Maybe the problem is different: I assume that there never was a value of
"1" in the index, leading to your confusion.

Solr returns every value in the facet result that existed at some point,
as long as the corresponding documents are still somewhere in the index, even
when they're marked as deleted. So there must have been a document with
m_mediaType_s=1. Even if all these documents are deleted already, their
values still appear in the facet result.

This holds true until segments get merged so that all deleted documents
are pruned. So if you send a forceMerge request, chances are good that
"1" won't come up any more.

-Michael

Am 13.01.2017 um 15:36 schrieb Sebastian Riemer:
> Hi Bill,
>
> Thanks, that's actually where I come from. But I don't want to exclude values 
> leading to a count of zero.
>
> Background to this: A user searched for mediaType "book" which gave him 10 
> results. Now some other task/routine whatever changes all those 10 books to 
> be say 10 ebooks, because the type has been incorrect. The user makes a 
> refresh, still looking for "book" gets 0 results (which is expected) and 
> because we rule out facet.fields having count 0, I don't get back the 
> selected mediaType "book" and thus I cannot select this value in the 
> select-dropdown-filter for the mediaType. This leads to confusion for the 
> user, since he has no results, but doesn't see that it's because of he still 
> has that mediaType-filter set to a value "books" which now actually leads to 
> 0 results.
>
> -Ursprüngliche Nachricht-
> Von: billnb...@gmail.com [mailto:billnb...@gmail.com] 
> Gesendet: Freitag, 13. Januar 2017 15:23
> An: solr-user@lucene.apache.org
> Betreff: Re: AW: FacetField-Result on String-Field contains value with count 
> 0?
>
> Set mincount to 1
>
> Bill Bell
> Sent from mobile
>
>
>> On Jan 13, 2017, at 7:19 AM, Sebastian Riemer  wrote:
>>
>> Pardon me,
>> the second search should have been this: 
>> http://localhost:8983/solr/wemi/select?fq=m_mediaType_s:%221%22
>> =on=*:*=0=0=json (or in other words, give me all 
>> documents having value "1" for field "m_mediaType_s")
>>
>> Since this search gives zero results, why is it included in the facet.fields 
>> result-count list?
>>
>> 
>>
>> Hi,
>>
>> Please help me understand: 
>> http://localhost:8983/solr/wemi/select?facet.field=m_mediaType_s=on=on=*:*=json
>>  returns:
>>
>> "facet_counts":{
>>"facet_queries":{},
>>"facet_fields":{
>>  "m_mediaType_s":[
>>"2",25561,
>>"3",19027,
>>"10",1966,
>>"11",1705,
>>"12",1067,
>>"4",1056,
>>"5",291,
>>"8",68,
>>"13",2,
>>"6",2,
>>"7",1,
>>"9",1,
>>"1",0]},
>>"facet_ranges":{},
>>"facet_intervals":{},
>>"facet_heatmaps":{}}}
>>
>> http://localhost:8983/solr/wemi/select?fq=m_mediaType_s:%222%22
>> =on=*:*=0=0=json
>>
>>
>> ?  "response":{"numFound":25561,"start":0,"docs":[]
>>
>> http://localhost:8983/solr/wemi/select?fq=m_mediaType_s:%220%22
>> =on=*:*=0=0=json
>>
>>
>> ?  "response":{"numFound":0,"start":0,"docs":[]
>>
>> So why does the search for facet.field even contain the value "1", if it 
>> does not exist?
>>
>> And why does it e.g. not contain 
>> "SomeReallyCrazyOtherValueWhichLikeValue"1"DoesNotExistButLetsIncludeI
>> tInTheFacetFieldsResultListAnywaysWithCountZero" : 0
>>
>> Best regards,
>> Sebastian
>>
>> Additional info, field m_mediaType_s is a string;
>> <field name="m_mediaType_s" type="string" indexed="true" stored="true" />
>>



Re: Solr Suggester

2016-12-22 Thread Michael Kuhlmann
For the suggester, the field must be indexed. It's not necessary to have
it stored.

Best,
Michael

Am 22.12.2016 um 11:24 schrieb Furkan KAMACI:
> Hi Emir,
>
> As far as I know, it should be enough to be stored=true for a suggestion
> field? Should it be both indexed and stored?
>
> Kind Regards,
> Furkan KAMACI
>
> On Thu, Dec 22, 2016 at 11:31 AM, Emir Arnautovic <
> emir.arnauto...@sematext.com> wrote:
>
>> That is because my_field_2 is not indexed.
>>
>> Regards,
>> Emir
>>
>>
>> On 21.12.2016 18:04, Furkan KAMACI wrote:
>>
>>> Hi All,
>>>
>>> I've a field like that:
>>>
>>>  >>   multiValued="false" />
>>>
>>>  >> stored="true" multiValued="false"/>
>>>
>>> When I run a suggester on my_field_1 it returns response. However
>>> my_field_2 doesn't. I've defined suggester as:
>>>
>>>suggester
>>>FuzzyLookupFactory
>>>DocumentDictionaryFactory
>>>
>>> What can be the reason?
>>>
>>> Kind Regards,
>>> Furkan KAMACI
>>>
>>>
>> --
>> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
>> Solr & Elasticsearch Support * http://sematext.com/
>>
>>



Re: File system choices?

2016-12-15 Thread Michael Kuhlmann
Yes, and we're doing such things at my company. However we most often do
things you shouldn't do; this is one of these.

Solr needs to load data quite fast, otherwise you'll be having a
performance killer. It's often recommended to use an SSD instead of a
normal hard disk; a network share would be quite contrary to it.

It might make sense when you update very seldom, and all your index fits
into memory.

-Michael


Am 15.12.2016 um 16:37 schrieb Michael Joyner (NewsRx):
> Hello all,
>
> Can the Solr indexes be safely stored and used via mounted NFS shares?
>
> -Mike
>



Re: Again : Query formulation help

2016-11-24 Thread Michael Kuhlmann
Hi Prasanna,

there's no such filter out-of-the-box. It's similar to the mm parameter
in (e)dismax parser, but this only works for full text searches on the
same fields.

So you have to build the query on your own using all possible permutations:

fq=(code1:<value1> AND code2:<value2>) OR (code1:<value1> AND code3:<value3>) OR ...

Of course, such a query can become huge when there are more than four
constraints.
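
A small helper can generate the 2-out-of-4 combinations, though. A sketch in Java (field names from the thread, values are placeholders, java.util imports omitted):

String[] fields = {"code1", "code2", "code3", "code4"};
String[] values = {"1234", "2345", "3456", "4567"};

List<String> pairs = new ArrayList<>();
for (int i = 0; i < fields.length; i++) {
    for (int j = i + 1; j < fields.length; j++) {
        pairs.add("(" + fields[i] + ":" + values[i]
                + " AND " + fields[j] + ":" + values[j] + ")");
    }
}
// (code1:1234 AND code2:2345) OR (code1:1234 AND code3:3456) OR ...
String fq = String.join(" OR ", pairs);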

Best,
Michael

Am 24.11.2016 um 11:40 schrieb Prasanna S. Dhakephalkar:
> Hi,
>
>  
>
> Need to formulate a distinctive field values query on 4 fields with minimum
> match on 2 fields
>
>  
>
> I have 4 fields in my core
>
> Code 1 : Values between 1001 to 
>
> Code 2 : Values between 1001 to 
>
> Code 3 : Values between 1001 to 
>
> Code 4 : Values between 1001 to 
>
>  
>
> I want to formulate a query in following manner
>
>  
>
> Code 1 : 
>
> Code 2 : 
>
> Code 3 : 
>
> Code 4 : 
>
>  
>
> I want to formulate a query, given above parameters, the result should
> contain documents where at least 2 of the above match.
>
>  
>
> Thanks and Regards,
>
>  
>
> Prasanna
>
>  
>
>



Re: Multi word synonyms

2016-11-15 Thread Michael Kuhlmann
Wow, that's great news! I didn't notice that.

Am 15.11.2016 um 13:05 schrieb Vincenzo D'Amore:
> Hi Michael,
>
> an update, reading the article I double checked if at least one of the
> issues were fixed.
> The good news is that https://issues.apache.org/jira/browse/LUCENE-2605 has
> been closed and is available in 6.2.
>
> On Tue, Nov 15, 2016 at 12:32 PM, Michael Kuhlmann <k...@solr.info> wrote:
>
>> This is a nice reading though, but that solution depends on the
>> precondition that you'll already know your synonyms at index time.
>>
>> While having synonyms in the index is mostly the better solution anyway,
>> it's sometimes not feasible.
>>
>> -Michael
>>
>> Am 15.11.2016 um 12:14 schrieb Vincenzo D'Amore:
>>> Hi Midas,
>>>
>>> I suggest this interesting reading:
>>>
>>> https://lucidworks.com/blog/2014/07/12/solution-for-multi-
>> term-synonyms-in-lucenesolr-using-the-auto-phrasing-tokenfilter/
>>>
>>>
>>> On Tue, Nov 15, 2016 at 11:00 AM, Michael Kuhlmann <k...@solr.info>
>> wrote:
>>>> It's not working out of the box, sorry.
>>>>
>>>> We're using this plugin:
>>>> https://github.com/healthonnet/hon-lucene-synonyms#getting-started
>>>>
>>>> It's working nicely, but can lead to OOME when you add many synonyms
>>>> with multiple terms. And I'm not sure whether it's still working with
>>>> Solr 6.0.
>>>>
>>>> -Michael
>>>>
>>>> Am 15.11.2016 um 10:29 schrieb Midas A:
>>>>> - i have to  use multi word synonyms at query time .
>>>>>
>>>>> Please suggest how can i do it .
>>>>> and let me know it whether it would be visible in debug query or not .
>>>>>
>>
>



Re: Multi word synonyms

2016-11-15 Thread Michael Kuhlmann
This is a nice reading though, but that solution depends on the
precondition that you'll already know your synonyms at index time.

While having synonyms in the index is mostly the better solution anyway,
it's sometimes not feasible.

-Michael

Am 15.11.2016 um 12:14 schrieb Vincenzo D'Amore:
> Hi Midas,
>
> I suggest this interesting reading:
>
> https://lucidworks.com/blog/2014/07/12/solution-for-multi-term-synonyms-in-lucenesolr-using-the-auto-phrasing-tokenfilter/
>
>
>
> On Tue, Nov 15, 2016 at 11:00 AM, Michael Kuhlmann <k...@solr.info> wrote:
>
>> It's not working out of the box, sorry.
>>
>> We're using this plugin:
>> https://github.com/healthonnet/hon-lucene-synonyms#getting-started
>>
>> It's working nicely, but can lead to OOME when you add many synonyms
>> with multiple terms. And I'm not sure whether it's still working with
>> Solr 6.0.
>>
>> -Michael
>>
>> Am 15.11.2016 um 10:29 schrieb Midas A:
>>> - i have to  use multi word synonyms at query time .
>>>
>>> Please suggest how can i do it .
>>> and let me know it whether it would be visible in debug query or not .
>>>
>>
>



Re: Multi word synonyms

2016-11-15 Thread Michael Kuhlmann
It's not working out of the box, sorry.

We're using this plugin:
https://github.com/healthonnet/hon-lucene-synonyms#getting-started

It's working nicely, but can lead to OOME when you add many synonyms
with multiple terms. And I'm not sure whether it's still working with
Solr 6.0.

-Michael

Am 15.11.2016 um 10:29 schrieb Midas A:
> - i have to  use multi word synonyms at query time .
>
> Please suggest how can i do it .
> and let me know it whether it would be visible in debug query or not .
>



Re: how to sort search results by count matches

2016-08-02 Thread Michael Kuhlmann
Hi syegorius,

are you sure that there's no synonym "planet,world" defined?

-Michael

Am 02.08.2016 um 15:57 schrieb syegorius:
> I have 4 records index by Solr:
>
> 1 hello planet dear friends 
> 2 hello world dear friends 
> 3 nothing 
> 4 just friends
>
> I'm searching with this query:
>
> select?q=world+dear+friends&wt=json&indent=true
>
> The result is:
>
> 1 hello planet dear friends
> 2 hello world dear friends
> 4 just friends
>
> But as you can see first record has 2 matches, second - 3 and fourth - 1 and
> i need the sequence of the result was:
>
> 2 hello world dear friends //3 matches
> 1 hello planet dear friends //2 matches
> 4 just friends//1 match
>
> How can i do that?
>
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/how-to-sort-search-results-by-count-matches-tp4290022.html
> Sent from the Solr - User mailing list archive at Nabble.com.




Re: Does Solr support 'Value Search'?

2012-08-09 Thread Michael Kuhlmann

On 08.08.2012 20:56, Bing Hua wrote:

Not quite understand but I'd explain the problem I had. The response would
contain only fields and a list of field values that match the query.
Essentially it's querying for field values rather than documents. The
underlying use case would be, when typing in a quick search box, the drill
down menu may contain matches on authors, on doctitles, and potentially on
other fields.

Still thanks for your response and hopefully I'm making it clearer.
Bing


Hi Bing,

Hmm, I implemented an autosuggest component myself that does exactly
this. You could specify which field you wanted to query, give an 
optional weight to them, and the component returned a list of all fields 
and values beginning with the queried string. Either combined or per 
field, depending on your configuration.


However, that was with Solr 1.4.0, when there was no genuine suggest 
component available. Since then, the Suggester component has been 
implemented: http://wiki.apache.org/solr/Suggester/


This relies on the spell check dictionary and works better than a simple 
term dictionary approach. And that's the reason why I didn't bother with my
old code any more.


So maybe you're simply looking for the suggester component? If not, I 
can try to make my old-style component work with a current Solr version 
and spread it around. Just tell me.


Greetings,
Kuli


Re: Connect to SOLR over socket file

2012-08-08 Thread Michael Kuhlmann

On 07.08.2012 21:43, Jason Axelson wrote:

Hi,

Is it possible to connect to SOLR over a socket file as is possible
with mysql? I've looked around and I get the feeling that I may be
mi-understanding part of SOLR's architecture.

Any pointers are welcome.

Thanks,
Jason


Hi Jason,

not that I know of. This has nothing to do with Solr, it depends on the 
web server you are using. Tomcat, Jetty and the others are using TCP/IP 
directly through java.io or java.nio classes, and Solr is just one web 
app that is handled by them.


Java web servers typically run on a separate host, and in contrast to 
MySQL, the local deployment is rather the exception than the standard.


If you don't want the network overhead, than use an embedded Solr 
server: http://wiki.apache.org/solr/EmbeddedSolr


Greetings,
Kuli


Re: SOLR 3.4 GeoSpatial Query Returning distance

2012-08-02 Thread Michael Kuhlmann

On 02.08.2012 01:52, Anand Henry wrote:

Hi,

In SOLR 3.4, while doing a geo-spatial search, is there a way to retrieve
the distance of each document from the specified location?


Not that I know of.

What we did was to read and parse the location field on client side and 
calculate the distance on our own using this library:


http://code.google.com/p/simplelatlng/

However, it's not as nice as getting the distance from Solr, and 
sometimes the distances seem to slightly differ - e.g. when you filter 
up to a distance of 100 km, there are cases where the client library 
still computes 100.8 km or so.


But at least, it's working.
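
Roughly, the client-side part looked like this (a sketch only; the field name and the query point are invented, and the exact simplelatlng signatures should be double-checked):

// The location field comes back as a "lat,lon" string from Solr.
String[] parts = doc.getFieldValue("location").toString().split(",");
LatLng docPoint = new LatLng(Double.parseDouble(parts[0]),
                             Double.parseDouble(parts[1]));
LatLng origin = new LatLng(52.52, 13.40); // the query point, values invented

double km = LatLngTool.distance(origin, docPoint, LengthUnit.KILOMETER);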

-Kuli


Re: Urgent: Facetable but not Searchable Field

2012-08-01 Thread Michael Kuhlmann

On 01.08.2012 13:58, jayakeerthi s wrote:

We have a requirement, where we need to implement 2 fields as Facetable,
but the values of the fields should not be Searchable.


Simply don't search for it, then it's not searchable.

Or do I simply not understand your question? As long as Dismax doesn't 
have the attribute in its qf parameter, it's not getting searched.


Or, if the user has direct access to Solr, then she can search for the 
attribute. And can delete the index, or crash the server, if she likes.


So the short answer is: No. Facetable fields must be searchable. But 
usually, this is no problem.


-Kuli


Re: Urgent: Facetable but not Searchable Field

2012-08-01 Thread Michael Kuhlmann

On 01.08.2012 15:40, Jack Krupansky wrote:

The indexed and stored field attributes are independent, so you can
define a facet field as stored but not indexed (stored=true
indexed=false), so that the field can be faceted but not indexed.


?

A field must be indexed to be used for faceting.

-Kuli


Re: Starts with Query

2012-06-15 Thread Michael Kuhlmann
It's not necessary to do this. You can simply be happy about the fact 
that all digits are ordered strictly in unicode, so you can use a range 
query:


(f)q={!frange l=0 u=\: incl=true incu=false}title

This finds all documents where any token from the title field starts 
with a digit, so if you want to only find documents where the whole 
title starts with a digit, you need a second field with a string or 
untokenized text type. Use the copyField directive then, as Jack 
Krupansky already suggested in a previous reply.


Greetings,
Kuli


Am 15.06.2012 08:38, schrieb Afroz Ahmad:

If you are not searching for the specific digit and want to match all
documents that start with any digit, you could as part of the indexing
process, have another field say startsWithDigit and set it to true if
it the title begins with a digit. All you need to do at query time then
is query for startsWithDigit =true.
Thanks
Afroz


From: nutchsolruser
Sent: 6/14/2012 11:03 PM
To: solr-user@lucene.apache.org
Subject: Re: Starts with Query
Thanks Jack for valuable response,Actually i am trying to match *any* numeric
pattern at the start of each document.  I dont know documents in index i
just want documents title starting with any digit.

--
View this message in context:
http://lucene.472066.n3.nabble.com/Starts-with-Query-tp3989627p3989761.html
Sent from the Solr - User mailing list archive at Nabble.com.




Re: what's better for in memory searching?

2012-06-11 Thread Michael Kuhlmann
Set the swapiness to 0 to avoid memory pages being swapped to disk too 
early.


http://en.wikipedia.org/wiki/Swappiness

-Kuli

Am 11.06.2012 10:38, schrieb Li Li:

I have roughly read the codes of RAMDirectory. it use a list of 1024
byte arrays and many overheads.
But as far as I know, using MMapDirectory, I can't prevent the page
faults. OS will swap less frequent pages out. Even if I allocate
enough memory for JVM, I can guarantee all the files in the directory
are in memory. am I understanding right? if it is, then some less
frequent queries will be slow.  How can I let them always in memory?

On Fri, Jun 8, 2012 at 5:53 PM, Lance Norskoggoks...@gmail.com  wrote:

Yes, use MMapDirectory. It is faster and uses memory more efficiently
than RAMDirectory. This sounds wrong, but it is true. With
RAMDirectory, Java has to work harder doing garbage collection.

On Fri, Jun 8, 2012 at 1:30 AM, Li Lifancye...@gmail.com  wrote:

hi all
   I want to use lucene 3.6 providing searching service. my data is
not very large, raw data is less that 1GB and I want to use load all
indexes into memory. also I need save all indexes into disk
persistently.
   I originally want to use RAMDirectory. But when I read its javadoc.

   Warning: This class is not intended to work with huge indexes.
Everything beyond several hundred megabytes
  will waste resources (GC cycles), because it uses an internal buffer
size of 1024 bytes, producing millions of byte
  [1024] arrays. This class is optimized for small memory-resident
indexes. It also has bad concurrency on
  multithreaded environments.
It is recommended to materialize large indexes on disk and use
MMapDirectory, which is a high-performance
  directory implementation working directly on the file system cache of
the operating system, so copying data to
  Java heap space is not useful.

should I use MMapDirectory? it seems another contrib instantiated.
anyone test it with RAMDirectory?




--
Lance Norskog
goks...@gmail.com




Re: what's better for in memory searching?

2012-06-11 Thread Michael Kuhlmann
You cannot guarantee this when you're running out of RAM. You'd have a 
problem then anyway.


Why are you caring that much? Did you yet have performance issues? 1GB 
should load really fast, and both auto warming and OS cache should help 
a lot as well. With such an index, you usually don't need to fine tune 
performance that much.


Did you think about using a SSD? Since you want to persist your index, 
you'll need to live with disk IO anyway.


Greetings,
Kuli

Am 11.06.2012 11:20, schrieb Li Li:

I am sorry. I make a mistake. even use RAMDirectory, I can not
guarantee they are not swapped out.

On Mon, Jun 11, 2012 at 4:45 PM, Michael Kuhlmannk...@solarier.de  wrote:

Set the swapiness to 0 to avoid memory pages being swapped to disk too
early.

http://en.wikipedia.org/wiki/Swappiness

-Kuli

Am 11.06.2012 10:38, schrieb Li Li:


I have roughly read the codes of RAMDirectory. it use a list of 1024
byte arrays and many overheads.
But as far as I know, using MMapDirectory, I can't prevent the page
faults. OS will swap less frequent pages out. Even if I allocate
enough memory for JVM, I can guarantee all the files in the directory
are in memory. am I understanding right? if it is, then some less
frequent queries will be slow.  How can I let them always in memory?

On Fri, Jun 8, 2012 at 5:53 PM, Lance Norskoggoks...@gmail.comwrote:


Yes, use MMapDirectory. It is faster and uses memory more efficiently
than RAMDirectory. This sounds wrong, but it is true. With
RAMDirectory, Java has to work harder doing garbage collection.

On Fri, Jun 8, 2012 at 1:30 AM, Li Lifancye...@gmail.comwrote:


hi all
   I want to use lucene 3.6 providing searching service. my data is
not very large, raw data is less that 1GB and I want to use load all
indexes into memory. also I need save all indexes into disk
persistently.
   I originally want to use RAMDirectory. But when I read its javadoc.

   Warning: This class is not intended to work with huge indexes.
Everything beyond several hundred megabytes
  will waste resources (GC cycles), because it uses an internal buffer
size of 1024 bytes, producing millions of byte
  [1024] arrays. This class is optimized for small memory-resident
indexes. It also has bad concurrency on
  multithreaded environments.
It is recommended to materialize large indexes on disk and use
MMapDirectory, which is a high-performance
  directory implementation working directly on the file system cache of
the operating system, so copying data to
  Java heap space is not useful.

should I use MMapDirectory? it seems another contrib instantiated.
anyone test it with RAMDirectory?





--
Lance Norskog
goks...@gmail.com







Re: timeAllowed flag in the response

2012-06-08 Thread Michael Kuhlmann

Hi Laurent,

alas there is currently no such option. The time limit is handled by an 
internal TimeLimitingCollector, which is used inside SolrIndexSearcher. 
Since the calling method only returns the DocList and doesn't have access 
to the QueryResult, it won't be easy to return this information in a 
beautiful way.


Aborted queries don't feed the caches, so you could maybe check whether 
the cache fill rate has changed. Of course, this is not a reasonable 
approach in a production environment.


The only way you can get the information is by patching Solr with a 
dirty hack.


Greetings,
Kuli

Am 07.06.2012 22:14, schrieb Laurent Vaills:

Hi everyone,

We have some grouping queries that are quite long to execute. Some are too
long to execute and are not acceptable. We have setup timeout for the
socket but with this we get no result and the query is still running on the
Solr side.
So, we are now using the timeAllowed parameter which is a good compromise.
However, in the response, how can we know that the query was stopped
because it was too long ?

I need this information for monitoring and to tell the user that the
results are not complete.

Regards,
Laurent





Re: timeAllowed flag in the response

2012-06-08 Thread Michael Kuhlmann

Am 08.06.2012 11:55, schrieb Laurent Vaills:

Hi Michael,

Thanks for the details that helped me to take a deeper look in the source
code. I noticed that each time a TimeExceededException is caught the method
  setPartialResults(true) is called...which seems to be what I'm looking for.
I have to investigate, since this partialResults does not seem to be set
for the sharded queries.


Ah, I simply was too blind! ;) The partial results flag indeed is set in 
the response header.


Then I think this is a bug that it's not filled in a sharded response, 
or it simply is not there when sharding.


Greetings,
Kuli


Re: ERROR 400 undefined field

2012-06-07 Thread Michael Kuhlmann

Am 07.06.2012 09:55, schrieb sheethal shreedhar:

http://localhost:8983/solr/select/?q=fruit&version=2.2&start=0&rows=10&indent=on

I get

HTTP ERROR 400

Problem accessing /solr/select/. Reason:

 undefined field text


Look at your schema.xml. You'll find a line like this:

<defaultSearchField>text</defaultSearchField>

Replace "text" with a field that is defined somewhere in schema.xml.

Or change your query to something with a field name like this:

http://localhost:8983/solr/select/?q=somefield:fruit

Or use the (e)dismax handler and configure it accordingly. See 
http://wiki.apache.org/solr/DisMaxRequestHandler.


Greetings,
Kuli


Re: Query elevation / boosting or something else to guarantee document position

2012-05-31 Thread Michael Kuhlmann

Hi Wenca,

I'm a bit late. but maybe you're still interested.

There's no such functionality in standard Solr. With sorting, this is 
not possible, because sort functions only rank each single document, 
they know nothing about the position of the others. And query elevation 
is similar, you'll raise the score of independent documents.


To achive this, you'll need an own QueryComponent. This isn't too 
complicated. You can't change the SolrIndexSearcher easily, this does 
the search job. But you can subclass 
org.apache.solr.handler.component.QueryComponent and overwrite 
process(). Alas the single main line - searcher.search() - is buried 
deeply in the huge monster method process(), and you first have to check 
for shards, grouping and twentythousand other parameters until you've 
arrived the code line you may want to expand.


Before calling search(), set the GET_DOCSET flag in your QueryCommand 
object, then execute the search. To check whether there's a document of 
the particular manufacturer in the result list, you can either
a) fetch the appropriate field value from the default field cache for 
every single result document until you found one; or
b) call getDocSet() on the SolrIndexSearcher with the manufacturer query 
as the parameter, and perform and and() operation on the resulting 
DocSet with the DocSet of your main query. (That's why you set the flag 
before.) You can then check which document that matches both the 
manufacturer and the main query fits best.


If you found a matching document, but it's behind pos. 5 in the 
resulting DocList, then you simply have to re-order your list.


If there's no such document within the DocList (which is limited by your 
rows parameter), but there are some in the joined DocSet from strategy 
b), then you can simply choose one of them and ignore the fact that this 
is probably not the best matching one. Or you have to patch Solr and 
modify getDocListNC() in solrIndexSearcher (or one of the Collector 
classes), which is much more complicated.
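
Heavily condensed, and with the manufacturer query hard-coded, such a component could look roughly like this (a sketch from memory, imports omitted and not tested against a concrete Solr version):

public class ManufacturerBoostComponent extends QueryComponent {

    @Override
    public void prepare(ResponseBuilder rb) throws IOException {
        super.prepare(rb);
        rb.setNeedDocSet(true); // corresponds to the GET_DOCSET flag
    }

    @Override
    public void process(ResponseBuilder rb) throws IOException {
        super.process(rb); // the normal search, fills rb.getResults()

        SolrIndexSearcher searcher = rb.req.getSearcher();
        Query byManufacturer = new TermQuery(new Term("manufacturer", "acme"));

        // Strategy b): documents matching both the main query and the manufacturer.
        DocSet matching = searcher.getDocSet(byManufacturer)
                .intersection(rb.getResults().docSet);

        // If none of the first 5 entries of rb.getResults().docList is in
        // 'matching', pick the best one from 'matching' and re-order the list.
    }
}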


Good luck!
-Kuli

Am 29.05.2012 14:26, schrieb Wenca:

Hi all,

I have an index with thousands of products with various fields
(manufacturer, price, popularity, type, color, ...) and I want to
guarantee at least one product by a particular manufacturer to be within
the first 5 results.

The search is done mainly by using filter params and results are ordered
by function e.g.: product(price, popularity) asc or by discount desc

And I need to guarantee that if there is any product matching the given
filters made by a concrete manufacturer, then it will be on the 5th
position at worst, even if the position by the order function is worse.

It seems to me that the Query elevation component is not the right thing
for me. I don't know the query in advance (or the set of filter
criteria) and I don't know concrete product that will be the best for
the criteria within the order.

And also I don't think that I can construct a function with such
requirements to use it directly for ordering the results.

Of course I can make a second query in case there is no desired product
on the first page of results and put it there, but it requires
additional request to solr and complicates results processing and
further pagination.

Can anybody suggest any solution?

Thanks
Wenca




Re: How many doc/doc in the XML source file before indexing?

2012-05-24 Thread Michael Kuhlmann

There is no hard limit for the maximum nunmber of documents per update.

It's only memory dependent. The smaller each document, and the more 
memory Solr can acquire, the more documents can you send in one update.


However, I wouldn't pish it too jard anyway. If you can send, say, 100 
documents per update, then you won't gain much if you send 200 documents 
instead, or even 1000. The number of requests doesn't count that much.


And, if the update fails for some reason, then the whole request will be 
ignored. If you had sent 1000 documents in an update, and one of them 
had a field missing, for example, then it's hard to find out which one.
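
A simple batching loop on the client side keeps that manageable; a SolrJ sketch (client and batch size invented, imports omitted):

List<SolrInputDocument> batch = new ArrayList<>();
for (SolrInputDocument doc : allDocuments) {
    batch.add(doc);
    if (batch.size() == 100) {   // a few hundred docs per request is usually enough
        client.add(batch);
        batch.clear();
    }
}
if (!batch.isEmpty()) {
    client.add(batch);
}
client.commit();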


Greetings,
Michael

Am 24.05.2012 10:58, schrieb Bruno Mannina:

I can't find my answer concerning the max number of <doc>...</doc>?

Can someone can tell me if there is no limit?

Le 24/05/2012 09:55, Bruno Mannina a écrit :

Sorry I just found : http://wiki.apache.org/solr/UpdateXmlMessages

I will take also a look to find the max number of doc/doc.

Le 24/05/2012 09:51, Paul Libbrecht a écrit :

Bruno,
see the solrconfig.xml, you have all sorts of tweaks for this kind of
things.

paul


Le 24 mai 2012 à 09:49, Bruno Mannina a écrit :


Hi All,

Just a little question concerning the max number of

<add>
<doc>...</doc>
</add>

that I can write in the xml source file before indexing? only one,
10, 100, 1000, unlimited...?

I must index 80M docs so I can't create one xml file per doc.

thanks,
Bruno

















Re: How many doc/doc in the XML source file before indexing?

2012-05-24 Thread Michael Kuhlmann

pish it too jard - sounds funny. :)

I meant push it too hard.

Am 24.05.2012 11:46, schrieb Michael Kuhlmann:

There is no hard limit for the maximum nunmber of documents per update.

It's only memory dependent. The smaller each document, and the more
memory Solr can acquire, the more documents can you send in one update.

However, I wouldn't pish it too jard anyway. If you can send, say, 100
documents per update, the you won't gain much if you send 200 documents
instead, or even 1000. The number of requests don't count that much.

And, if the update fails for some reason, then the whole request will be
ignored. If you had sent 1000 documents in an update, and one of them
had a field missing, for example, then it's hard to find out which one.

Greetings,
Michael

Am 24.05.2012 10:58, schrieb Bruno Mannina:

I can't find my answer concerning the max number of doc/doc ?

Can someone can tell me if there is no limit?

Le 24/05/2012 09:55, Bruno Mannina a écrit :

Sorry I just found : http://wiki.apache.org/solr/UpdateXmlMessages

I will take also a look to find the max number of doc/doc.

Le 24/05/2012 09:51, Paul Libbrecht a écrit :

Bruno,
see the solrconfig.xml, you have all sorts of tweaks for this kind of
things.

paul


Le 24 mai 2012 à 09:49, Bruno Mannina a écrit :


Hi All,

Just a little question concerning the max number of

add
doc/doc
/add

that I can write in the xml source file before indexing? only one,
10, 100, 1000, unlimited...?

I must indexed 80M docs so I can't create one xml file by doc.

thanks,
Bruno



















Re: How many doc/doc in the XML source file before indexing?

2012-05-24 Thread Michael Kuhlmann

Just try it!

Maybe you're lucky, and it works with 80M docs. If each document takes 
around 100 bytes, then it only needs 8 GB of memory for indexing.


However, I doubt it. I've not been too deeply into the UpdateHandler 
yet, but I think it first needs to parse the complete XML file before it 
starts to index.


But the worst thing that can happen is an OOM exception. And if you 
need to split the XML files, then you can split into smaller chunks as well.


Just a note: In Solr, you're always updating, even in the first 
indexation. There's no difference between updates and inserts.


Greetings,
Michael

Am 24.05.2012 12:37, schrieb Bruno Mannina:

In fact it's not for an update but only for the first indexation.

I mean, I will receive the full database with around 80M docs in some
XML files (one per country in the world).
 From these 80M docs I will generate right XML format for each doc. (I
don't need all fields from the source)

And as actually for my test (12 000 docs), I generate one file per doc,
there is no problem.
But with 80M docs I can't generate one file per doc.

It's for this reason I asked the max number of doc in a file add.

For the first time, if a country file fails, no problem, I will check it
and re-generate it.

Is it bad to create a file with 5M doc ?


Le 24/05/2012 11:46, Michael Kuhlmann a écrit :

There is no hard limit for the maximum nunmber of documents per update.

It's only memory dependent. The smaller each document, and the more
memory Solr can acquire, the more documents can you send in one update.

However, I wouldn't pish it too jard anyway. If you can send, say, 100
documents per update, the you won't gain much if you send 200
documents instead, or even 1000. The number of requests don't count
that much.

And, if the update fails for some reason, then the whole request will
be ignored. If you had sent 1000 documents in an update, and one of
them had a field missing, for example, then it's hard to find out
which one.

Greetings,
Michael

Am 24.05.2012 10:58, schrieb Bruno Mannina:

I can't find my answer concerning the max number of doc/doc ?

Can someone can tell me if there is no limit?

Le 24/05/2012 09:55, Bruno Mannina a écrit :

Sorry I just found : http://wiki.apache.org/solr/UpdateXmlMessages

I will take also a look to find the max number of doc/doc.

Le 24/05/2012 09:51, Paul Libbrecht a écrit :

Bruno,
see the solrconfig.xml, you have all sorts of tweaks for this kind of
things.

paul


Le 24 mai 2012 à 09:49, Bruno Mannina a écrit :


Hi All,

Just a little question concerning the max number of

add
doc/doc
/add

that I can write in the xml source file before indexing? only one,
10, 100, 1000, unlimited...?

I must indexed 80M docs so I can't create one xml file by doc.

thanks,
Bruno























Re: org.apache.solr.common.SolrException: ERROR: [doc=null] missing required field: id

2012-05-21 Thread Michael Kuhlmann

Am 21.05.2012 12:07, schrieb Tolga:

Hi,

I am getting this error:

[doc=null] missing required field: id


[...]


I've got this entry in schema.xml: <field name="id" type="string"
stored="true" indexed="true"/>
What to do?


Simply make sure that every document you're sending to Solr contains 
this id field.


I assume it's declared as your unique id field, so it's mandatory.

Greetings,
Kuli



Re: org.apache.solr.common.SolrException: ERROR: [doc=null] missing required field: id

2012-05-21 Thread Michael Kuhlmann

Am 21.05.2012 12:40, schrieb Tolga:

How do I verify it exists? I've been crawling the same site and it
wasn't giving an error on Thursday.


It depends on what you're doing.

Are you using nutch?

-Kuli


Re: Solr Shards multi core slower then single big core

2012-05-14 Thread Michael Kuhlmann

Am 14.05.2012 05:56, schrieb arjit:

Thanks Erick for the reply.
I have 6 cores which doesn't contain duplicated data. every core has some
unique data. What I thought was when I read it would read parallel 6 cores
and join the result and return the query. And this would be efficient then
reading one big core.


No, it's not. When you request 10 documents from Solr, it can't know in 
advance which shards contain how many of those documents. It could be that 
each shard only needs to contribute one or two documents to the result, but 
it might be that a single shard contains all ten documents. Therefore, 
Solr needs to request 10 documents from each shard, take only the 
top 10 documents from those 60, and drop the rest. And it gets worse 
when you set an offset of, say, 100.


Sharding is (nearly) always slower than using one big index with 
sufficient hardware resources. Only use sharding when your index is too 
huge to fit into one single machine.


Greetings,
Kuli


Re: Solr Shards multi core slower then single big core

2012-05-14 Thread Michael Kuhlmann

Am 14.05.2012 13:22, schrieb Sami Siren:

Sharding is (nearly) always slower than using one big index with sufficient
hardware resources. Only use sharding when your index is too huge to fit
into one single machine.


If you're not constrained by CPU or IO, in other words have plenty of
CPU cores available together with for example separate hard discs for
each shard, splitting your index into smaller shards can in some cases
make a huge difference in one box too.


Do you have an example?

This is hard to believe. If you have several shards on the same machine, 
you'll need so much memory that each shard has enough for all its 
caches and such. With that much memory, a single Solr core should be 
really fast.


If dividing the index is the reason, then a software RAID 0 (striping) 
should be much better.


The only point I see is the concurrent search for one request. Maybe, 
for large requests, this might outweigh the sharding overhead, but only 
for long-running requests without disk I/O. I only see the case when 
using very complicated query functions. And, this only stays true as 
long as you don't run multiple concurrent requests.


Greetings,
Kuli


Re: Solr Shards multi core slower then single big core

2012-05-14 Thread Michael Kuhlmann

Am 14.05.2012 16:18, schrieb Otis Gospodnetic:

Hi Kuli,

In a client engagement, I did see this (N shards on 1 beefy box with lots of 
RAM and CPU cores) be faster than 1 big index.



I want to believe you, but I also want to understand. Can you explain 
why? And did this only happen for single requests, or even under heavy load?


Greetings,
Kuli


Re: Identify indexed terms of document

2012-05-11 Thread Michael Kuhlmann

Am 10.05.2012 22:27, schrieb Ahmet Arslan:




It's possible to see what terms are indexed for a field of
document that
stored=false?


One way is to use http://wiki.apache.org/solr/LukeRequestHandler


Another approach is this:

- Query for exactly this document, e.g. by using the unique field
- Add this to your URL parameters:
facet=true&facet.field=<your field>&facet.mincount=1

-Kuli


Re: Question about cache

2012-05-11 Thread Michael Kuhlmann

Am 11.05.2012 15:48, schrieb Anderson vasconcelos:

Hi

Analysing the solr server in glassfish with Jconsole, the Heap Memory Usage
don't use more than 4 GB. But, when was executed the TOP comand, the free
memory in Operating system is only 200 MB. The physical memory is only 10GB.

Why machine used so much memory? The cache fields are included in Heap
Memory usage? The other 5,8 GB is the caching of Operating System for
recent open files? Exists some way to tunning this?

Thanks

If the OS is Linux or some other Unix variant, it keeps as much disk 
content in memory as possible. Whenever new memory is needed, it 
automatically gets freed. That won't need time, and there's no need to 
tune anything.


Don't look at the free memory in top command, it's nearly useless. Have 
a look at how much memory your Glassfish process is consuming, and use 
the 'free' command (maybe together with the -m parameter for human 
readability) to find out more about your free memory. The 
"-/+ buffers/cache" line is the relevant one.

Greetings,
Kuli


Re: Field with attribut in the schema.xml ?

2012-05-10 Thread Michael Kuhlmann

Am 10.05.2012 14:33, schrieb Bruno Mannina:

like that:

<field name="inventor-country">CH</field>
<field name="inventor-country">FR</field>

but in this case I lose the link between inventor and its country?


Of course, you need to index the two inventors into two distinct documents.

Did you mark those fields as multi-valued? That won't make much sense IMHO.

Greetings,
Kuli


Re: Field with attribut in the schema.xml ?

2012-05-10 Thread Michael Kuhlmann
I don't know the details of your schema, but I would create fields like 
name, country, street etc., and a field named role, which contains 
values like inventor, applicant, etc.


How would you do it otherwise? Create only four documents, each field 
containing 80 million values?
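
In SolrJ terms that means one document per person, roughly like this (the linking field is only an illustration, its name is invented):

SolrInputDocument doc = new SolrInputDocument();
doc.addField("name", "WEBER WALTER");
doc.addField("country", "CH");
doc.addField("role", "inventor");
doc.addField("publication", "EP1234567"); // hypothetical link back to the patent
client.add(doc);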


Greetings,
Kuli

Am 10.05.2012 14:47, schrieb Bruno Mannina:

But I have more than 80 000 000 documents with many fields with this
kind of description?!

i.e:
inventor
applicant
assignee
attorney

I must create for each document 4 documents ??

Le 10/05/2012 14:41, G.Long a écrit :

When you add data into Solr, you add documents which contain fields.
In your case, you should create a document for each of your inventors
with every attribute they could have.

Here is an example in Java:

SolrInputDocument doc = new SolrInputDocument();
doc.addField("inventor", "Rossi");
doc.addField("country", "FR");
solrServer.add(doc);
...
And then you do the same for all your inventors.

This way, each doc in your index represents one inventor and you can
query them like:
q=inventor:rossi AND country:FR

Le 10/05/2012 14:33, Bruno Mannina a écrit :

like that:

field name=inventor-countryCH/field
field name=inventor-countryFR/field

but in this case Ioose the link between inventor and its country?

if I search an inventor named ROSSI with CH:
q=inventor:rossi and inventor-country=CH

the I will get this result but it's not correct because Rossi is FR.

Le 10/05/2012 14:28, G.Long a écrit :

Hi :)

You could just add a field called country and then add the
information to your document.

Regards,
Gary L.

Le 10/05/2012 14:25, Bruno Mannina a écrit :

Dear,

I can't find how can I define in my schema.xml a field with this
format?

My original format is:

<exch:inventors>

<exch:inventor>
<exch:inventor-name>
<name>WEBER WALTER</name>
</exch:inventor-name>
<residence>
<country>CH</country>
</residence>
</exch:inventor>

<exch:inventor>
<exch:inventor-name>
<name>ROSSI PASCAL</name>
</exch:inventor-name>
<residence>
<country>FR</country>
</residence>
</exch:inventor>

</exch:inventors>

I convert it to:
...
<field name="inventor">WEBER WALTER</field>
<field name="inventor">ROSSI PASCAL</field>
...

but how can I add Country code to the field without losing the link
between inventor?
Can I use an attribut ?

Any idea are welcome :)

Thanks,
Bruno Mannina
















Re: Partition Question

2012-05-09 Thread Michael Kuhlmann

Am 08.05.2012 23:23, schrieb Lance Norskog:

Lucene does not support more than 2^32 unique documents, so you need to
partition.


Just a small note:

I doubt that Solr supports more than 2^31 unique documents, as most 
other Java applications that use int values.


Greetings,
Kuli




Re: Bridge between Solr and NoSQL

2012-05-08 Thread Michael Kuhlmann

Am 08.05.2012 04:13, schrieb Jeff Schmidt:

Francois:

Check out DataStax Enterprise 2.0, Solr integrated with Cassandra: 
http://www.datastax.com/docs/datastax_enterprise2.0/search/index

And, Solbase, Solr integrated with HBase: https://github.com/Photobucket/Solbase

I'm sure there are others, but these two come to mind.


I know of Solandra, Solr integrated with Cassandra: 
https://github.com/tjake/Solandra


In contrast to the DataStax solution, this is open source, but DataStax 
should be the better solution (at least regarding the performance).


Integrating Lucene with CouchDB was discussed here: 
http://lucene.472066.n3.nabble.com/Using-Solr-with-CouchDB-td762856.html

and a project is here: https://github.com/rnewson/couchdb-lucene

Greetings,
Kuli


On May 7, 2012, at 5:29 PM, Francois Perron wrote:


Hi all,

  I would like to know if there is some projects to integrate Solr with NoSQl 
like MongoDB.

They already had a link like this between ElasticSearch and CoughDB. (Cough 
River I think)

Thank you.


Re: Boosting fields in SOLR using Solrj

2012-04-26 Thread Michael Kuhlmann

Am 26.04.2012 00:57, schrieb Joe:

Hi,

I'm using the solrj API to query my SOLR 3.6 index. I have multiple text
fields, which I would like to weight differently. From what I've read, I
should be able to do this using the dismax or edismax query types. I've
tried the following:

SolrQuery query = new SolrQuery();
query.setQuery("title:apples oranges content:apples oranges");
query.setQueryType("edismax");
query.set("qf", "title^10.0 content^1.0");
QueryResponse rsp = m_Server.query( query );


Why do you try to construct your own query, when you're using an edismax 
query with a defined qf parameter?


What you're searching is the text title:apples oranges content:apples 
oranges. Depending on your analyzer chain, it might be that title:apples 
and content:apples are kept as one token, so nothing is found because 
there's no such token in the index.


Why don't you simply query for apples oranges? That's what (e)dismax is 
made for. Have a deeper look at http://wiki.apache.org/solr/DisMax.
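
In SolrJ that boils down to something like this (a sketch, reusing the variable names from your snippet):

SolrQuery query = new SolrQuery();
query.setQuery("apples oranges");          // plain user input, no field prefixes
query.set("defType", "edismax");
query.set("qf", "title^10.0 content^1.0");
QueryResponse rsp = m_Server.query(query);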


BTW, if you used the above query in a Lucene parser, it would look for 
apples in title and content field, but look for oranges in your 
default search field. This is because you didn't quote apples oranges. 
Since you want to use Edismax, you can ignore this, it's just that your 
current query won't work as expected in both cases.


-Kuli


Re: Dynamic creation of cores for this use case.

2012-04-26 Thread Michael Kuhlmann

Am 26.04.2012 16:17, schrieb pprabhcisco123:

  The use case is to create a core for each customer as well as partner .
Since its very difficult to create cores statically in solr.xml file for all
4500 customers , is there any way to create the cores dynamically or on the
fly.


Yes there is. Have a look at: http://wiki.apache.org/solr/CoreAdmin#CREATE

I suggest to set the persistent flag in solr.xml to true.

I think all your cores will share the same configuration, so you can 
point all configuration directories to the same one, and install unique 
data dirs.
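
Issued via SolrJ, the CREATE call looks roughly like this (paths and the core name are made up, and the admin API differs a bit between Solr versions, so double-check yours):

CoreAdminRequest.Create create = new CoreAdminRequest.Create();
create.setCoreName("customer_4711");
create.setInstanceDir("/opt/solr/shared_conf");     // same config dir for all cores
create.setDataDir("/opt/solr/data/customer_4711");  // unique data dir per core
create.process(adminClient); // a client pointed at the Solr root URL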


This should be relatively simple in theory. In practice, you might run into 
performance issues with such a configuration. It should be no big 
problem if at most a few hundred users work in parallel, but as soon as 
most cores are used all together, I predict you'll have bad performance.


Solr has no hard-coded limitation in the number of cores, but each core 
has its own caches and readers. Depending on your machine configuration, 
this may be too much.


My suggestion is to try it out. It should work at first, and if you're 
hitting performance limits, then you can modify your configuration.


-Kuli


Re: DIH NoClassFoundError.

2012-04-25 Thread Michael Kuhlmann

Am 25.04.2012 15:57, schrieb stockii:

is it not fucking possible to import DIH !?!?!? WTF!


It is fucking possible, you just need to either point your goddamn 
classpath to the data import handler jar in the contrib folders, or you 
have to add the appropriate contrib folder into the lib dir entries at 
the beginning of your motherfucking solrconfig.xml.


The pissed example already contains those libs. To stay with your wording...

-Kuli


Re: RequestHandler versus SearchComponent

2012-03-23 Thread Michael Kuhlmann

Am 23.03.2012 10:29, schrieb Ahmet Arslan:

I'm looking at the following. I want
to (1) map some query fields to
some other query fields and add some things to FL, and then
(2)
rescore.

I can see how to do it as a RequestHandler that makes a
parser to get
the fields, or I could see making a SearchComponent that was
stuck
into the list just after the QueryComponent.

Anyone care to advise in the choice?


I would choose SearchComponent. I read somewhere that customizations are now 
better fit into SC rather than RH.



I would override QueryComponent and modify the normal query instead.

Adding an own SearchComponent after the regular QueryComponent (or 
better as a last-element) is goof when you simply want to modify the 
existing result. But since you want to rescore, you're likely interested 
in documents that fell already out of the original result list.


Greetings,
Kuli


Re: RequestHandler versus SearchComponent

2012-03-23 Thread Michael Kuhlmann

Am 23.03.2012 11:17, schrieb Michael Kuhlmann:

Adding an own SearchComponent after the regular QueryComponent (or
better as a last-element) is goof ...


Of course, I meant good, not goof! ;)



Greetings,
Kuli




Re: is the SolrJ call to add collection of documents a blocking function call ?

2012-03-20 Thread Michael Kuhlmann

Hi Ramdev,

add() is a blocking call. Otherwise it would have to start its own background 
thread, which is not what a library like SolrJ should do (how many 
threads at most? At which priority? Which thread group? How long to keep 
them pooled?)


And, additionally, you might want to know whether the transmission was 
successful, or whether your guinea pig has eaten the network cable just 
in the middle of the transmission.


But it's easy to write your own background task that adds your documents 
to the Solr server. Using Java's ExecutorService class, this is done 
within two minutes.
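
A minimal sketch of such a background task (single worker thread; the client and the document collection are assumed to exist, imports omitted):

ExecutorService indexer = Executors.newSingleThreadExecutor();

indexer.submit(() -> {
    try {
        client.add(documents); // the blocking call now runs off the caller's thread
        client.commit();
    } catch (Exception e) {
        // log and retry/queue the failed batch
    }
});

// on shutdown:
indexer.shutdown();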


Greetings,
Kuli

Am 19.03.2012 16:48, schrieb ramdev.wud...@thomsonreuters.com:

Hi:
I am trying to index a collection of SolrInputDocs to a Solr server. I was 
wondering if the call I make to add the documents (the 
add(Collection<SolrInputDocument>) call) is a blocking function call?

I would also like to know if the add call is a call that would take longer for 
a larger collection of documents


Thanks

Ramdev





Re: Master/Slave switch on teh fly. Replication

2012-03-16 Thread Michael Kuhlmann

Am 16.03.2012 15:05, schrieb stockii:

i have 8 cores ;-)

i thought that replication is defined in solrconfig.xml and this file is
only load on startup and i cannot change master to slave and slave to master
without restarting the servlet-container ?!?!?!


No, you can reload the whole core at any time, without interruption. 
Even with a new solrconfig.xml.


You can even add a new core at runtime, fill it with data and switch 
cores afterwards.


See http://wiki.apache.org/solr/CoreAdmin for details.
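
With SolrJ, for example, reloading a core after changing its solrconfig.xml is a one-liner (a sketch; check the helper in the SolrJ version you use):

// 'adminClient' points at the Solr root URL, not at a single core.
CoreAdminRequest.reloadCore("core1", adminClient);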

-Kuli


Re: Maybe switching to Solr Cores

2012-03-16 Thread Michael Kuhlmann

Am 16.03.2012 16:42, schrieb Mike Austin:

It seems that the biggest real-world advantage is the ability to control
core creation and replacement with no downtime.  The negative would be the
isolation however the are still somewhat isolated.  What other benefits and
common real-world situations would you use to talk me into switching to
Solr cores?


Different Solr cores already are quite isolated: They use different 
configs, different caches, different readers, different handlers...


In fact, there is not much more common between Solr cores except the 
solr.xml configuration.


One additional advantage is that cores have a smaller footprint in Tomcat 
than fully deployed Solr web applications.


I don't see a single drawback of multiple cores compared to multiple 
web apps


...except one, but that has nothing to do with Solr, only with the JVM 
itself: If you have a large hardware environment with lots of RAM, then it 
might be better to have multiple Tomcat instances running in different 
OS processes. The reason is Java's garbage collector, which works better 
with not-so-huge heaps.


Sometimes it might be even better to have two or four replicated Solr 
instances in different Tomcat processes than just one. You'll avoid 
longer stop-the-world pauses with Java's GC as well.


However, this depends on the environment and needs to be evaluated as 
well...


-Kuli


Re: Too many open files - lots of sockets

2012-03-14 Thread Michael Kuhlmann

I had the same problem, without auto-commit.

I never really found out what exactly the reason was, but I think it was 
because commits were triggered before a previous commit had the chance 
to finish.


We now commit after every minute or 1000 (quite large) documents, 
whatever comes first. And we never optimize. We haven't had these 
exceptions for months now.
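
If you'd rather let Solr handle the timing, commitWithin on the add call achieves much the same; a SolrJ sketch (client and docs assumed to exist):

// Ask Solr to commit these documents within 60 seconds on its own.
client.add(docs, 60000);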


Good luck!
-Kuli

Am 14.03.2012 11:22, schrieb Colin Howe:

Currently using 3.4.0. We have autocommit enabled but we manually do
commits every 100 documents anyway... I can turn it off if you think that
might help.


Cheers,
Colin


On Wed, Mar 14, 2012 at 10:24 AM, Markus Jelsma
markus.jel...@openindex.iowrote:


Are you running trunk and have auto-commit enabled? Then disable
auto-commit. Even if you increase ulimits it will continue to swallow all
available file descriptors.


On Wed, 14 Mar 2012 10:13:55 +, Colin Howeco...@conversocial.com
wrote:


Hello,

We keep hitting the too many open files exception. Looking at lsof we have
a lot (several thousand) of entries like this:

java  19339  root  1619u  sock  0,7  0t0  682291383 can't identify protocol


However, netstat -a doesn't show any of these.

Can anyone suggest a way to diagnose what these socket entries are? Happy
to post any more information as needed.


Cheers,
Colin



--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536600 / 06-50258350









Re: Sorting on non-stored field

2012-03-14 Thread Michael Kuhlmann

Am 14.03.2012 11:43, schrieb Finotti Simone:

I was wondering: is it possible to sort a Solr result-set on a non-stored value?


Yes, it is. It must be indexed, indeed.
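
For example, a sketch of such a field definition (the name and type are only 
placeholders; for sorting, the field should also produce at most one token per 
document):

    <field name="price" type="tlong" indexed="true" stored="false"/>

Sorting then works as usual with sort=price asc.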

-Kuli


Re: Too many open files - lots of sockets

2012-03-14 Thread Michael Kuhlmann

Ah, good to know! Thank you!

I already had Jetty under suspicion, but we had this failure quite often 
in October and November, when the bug was not yet reported.


-Kuli

Am 14.03.2012 12:08, schrieb Colin Howe:

After some more digging around I discovered that there was a bug reported
in jetty 6:  https://jira.codehaus.org/browse/JETTY-1458

This prompted me to upgrade to Jetty 7 and things look a bit more stable
now :)


Re: sort my results alphabetically on facetnames

2012-02-14 Thread Michael Kuhlmann

Hi!

On 14.02.2012 13:09, PeterKerk wrote:

I want to sort my results on the facetnames (not by their number of results).


From the example you gave, I'd assume you don't want to sort by facet 
names but by facet values.


Simply add facet.sort=index to your request; see
http://wiki.apache.org/solr/SimpleFacetParameters#facet.sort

Or simply sort the facet result on your own.
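
For example (host and field name are only placeholders):

    http://localhost:8983/solr/select?q=*:*&facet=true&facet.field=city&facet.sort=index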

Greetings,
Kuli


Re: Help:Solr can't put all pdf files into index

2012-02-09 Thread Michael Kuhlmann
I'd suggest that you check which documents *exactly* are missing in Solr 
index. Or find at least one that's missing, and try to figure out how 
this document differs from the other ones that can be found in Solr.


Maybe we can then find out what exact problem there is.

Greetings,
-Kuli

On 09.02.2012 16:37, Rong Kang wrote:


Yes, I put all file in one directory and I have tested file names using code.




At 2012-02-09 20:45:49,Jan Høydahljan@cominvent.com  wrote:

Hi,

Are you 100% sure that the filename is globally unique, since you use it as the 
uniqueKey?

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 9. feb. 2012, at 08:30, 荣康 wrote:


Hey ,
I am using Solr as my search engine to search my PDF files. I have 18219 
files (with different file names) and all the files are in the same directory. But 
when I use Solr to import the files into the index using the DataImport method, Solr 
reports that only 17233 files were imported. It's very strange. This problem has stopped our 
project for a few days. I can't handle it.


please help me!


Schema.xml


<fields>
   <field name="text" type="text" indexed="true" multiValued="true" termVectors="true"
          termPositions="true" termOffsets="true"/>
   <field name="filename" type="filenametext" indexed="true" required="true" termVectors="true"
          termPositions="true" termOffsets="true"/>
   <field name="id" type="string" stored="true"/>
</fields>
<uniqueKey>id</uniqueKey>
<copyField source="filename" dest="text"/>


and
<dataConfig>
  <dataSource type="BinFileDataSource" name="bin"/>
  <document>
    <entity name="f" processor="FileListEntityProcessor" recursive="true"
            rootEntity="false"
            dataSource="null" baseDir="H:/pdf/cls_1_16800_OCRed/1"
            fileName=".*\.(PDF)|(pdf)|(Pdf)|(pDf)|(pdF)|(PDf)|(PdF)|(pDF)" onError="skip">

      <entity name="tika-test" processor="TikaEntityProcessor"
              url="${f.fileAbsolutePath}" format="text" dataSource="bin" onError="skip">
        <field column="text" name="text"/>
      </entity>
      <field column="file" name="id"/>
      <field column="file" name="filename"/>
    </entity>
  </document>
</dataConfig>




sincerecly
Rong Kang









Re: Help:Solr can't put all pdf files into index

2012-02-09 Thread Michael Kuhlmann

I don't know much about Tika, but this seems to be a bug in PDFBox.

See: https://issues.apache.org/jira/browse/PDFBOX-797

You might also have a look at this: 
http://stackoverflow.com/questions/7489206/error-while-parsing-binary-files-mostly-pdf


At least that's what I found when I googled the NPE.

Greetings,
Kuli

On 09.02.2012 17:13, Rong Kang wrote:

I tested one file that is missing in the Solr index. And Solr responds as below

[...]


Exception in entity : 
tika-test:org.apache.solr.handler.dataimport.DataImportHandlerException: Unable 
to read content Processing Document # 1
at 
org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:72)
at 
org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:130)
at 
org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:238)
at 
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:591)
at 
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:617)
at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:267)
at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:186)
at 
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:353)
at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:411)
at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:392)
Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException 
from org.apache.tika.parser.ParserDecorator$1@190725e
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:199)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:137)
at 
org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:128)
... 8 more
Caused by: java.lang.NullPointerException
at org.apache.pdfbox.pdmodel.PDPageNode.getCount(PDPageNode.java:109)
at org.apache.pdfbox.pdmodel.PDDocument.getNumberOfPages(PDDocument.java:943)
at org.apache.tika.parser.pdf.PDFParser.extractMetadata(PDFParser.java:108)
at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:89)
at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
... 10 more


I think this is because Tika can't read the PDF file, or this PDF file's format 
has some error. But I can read this PDF file in Adobe Reader.
Regards,

Rong Kang


Re: Bad Request (Solr + Weblogic + Oracle DB)

2012-02-02 Thread Michael Kuhlmann

Hi rzao!

I think this is the problem:

On 02.02.2012 13:59, rzoao wrote:

UpdateRequest req = new UpdateRequest();

req.setAction(AbstractUpdateRequest.ACTION.COMMIT, false,
false);
req.add(documento);



You create a commit request, but send a document with it - that won't 
work. Either you add documents, or you perform a commit, but you can't 
do both.


Remove the line with setAction(), send the document, and after that, 
call commit() directly on the SolrServer.
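
A sketch of the corrected flow (assuming "server" is your SolrServer instance):

    UpdateRequest req = new UpdateRequest();
    req.add(documento);
    req.process(server);   // send the document, without a commit action
    server.commit();       // commit in a separate request afterwards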


If this doesn't help, then have a look into Weblogic's log files. You 
should find an exception there that helps you more.


-Kuli


Re: java.net.SocketException: Too many open files

2012-01-24 Thread Michael Kuhlmann

Hi Jonty,

no, not really. When we first had such problems, we really thought that 
the number of open files is the problem, so we implemented an algorithm 
that performed an optimize from time to time to force a segment merge. 
Due to some misconfiguration, this ran too often, with the result that 
an optimize was issued before the previous optimization had finished, 
which is a really bad idea.


We removed the optimization calls, and since then we didn't have any 
more problems.


However, I never found out the initial reason for the exception. Maybe 
there was some bug in Solr's 3.1 version - we're using 3.5 right now -, 
but I couldn't find a hint in the changelog.


At least we didn't have this exception for more than two months now, so 
I'm hoping that the cause for this has disappeared somehow.


Sorry that I can't help you more.

Greetings,
Kuli

On 24.01.2012 07:48, Jonty Rhods wrote:

Hi Kuli,

Did you get the solution of this problem? I am still facing this problem.
Please help me to overcome this problem.

regards


On Wed, Oct 26, 2011 at 1:16 PM, Michael Kuhlmannk...@solarier.de  wrote:


Hi;

we have a similar problem here. We already raised the file ulimit on the
server to 4096, but this only defered the problem. We get a
TooManyOpenFilesException every few months.

The problem has nothing to do with real files. When we had the last
TooManyOpenFilesException, we investigated with netstat -a and saw that
there were about 3900 open sockets in Jetty.

Curiously, we only have one SolrServer instance per Solr client, and we
only have three clients (our running web servers).

We have set defaultMaxConnectionsPerHost to 20 and maxTotalConnections
to 100. There should be room enough.

Sorry that I can't help you, we still have not solved the problem on
our own.

Greetings,
Kuli

Am 25.10.2011 22:03, schrieb Jonty Rhods:

Hi,

I am using solrj and for connection to server I am using instance of the
solr server:

SolrServer server = new CommonsHttpSolrServer(
    "http://localhost:8080/solr/core0");

I noticed that after a few minutes it starts throwing the exception
java.net.SocketException: Too many open files.
It seems that it is related to the instances of HttpClient. How can I limit
the instances to a certain number, like a connection pool in dbcp etc.?

I am not experienced in Java, so please help me to resolve this problem.

  solr version: 3.4

regards
Jonty










Re: Relevancy and random sorting

2012-01-12 Thread Michael Kuhlmann

Does the random sort function help you here?

http://lucene.apache.org/solr/api/org/apache/solr/schema/RandomSortField.html

However, you will get some very old listings then, if it's okay for you.
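
For reference, the example schema typically declares a random field type and a 
dynamic field for it (a sketch; the names may differ in your schema):

    <fieldType name="random" class="solr.RandomSortField" indexed="true"/>
    <dynamicField name="random_*" type="random"/>

Sorting with e.g. sort=random_1234 desc then gives a repeatable pseudo-random 
order; a different number in the field name yields a different ordering.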

-Kuli

Am 12.01.2012 14:38, schrieb Alexandre Rocco:

Erick,

This document already has a field that indicates the source (site).
The issue we are trying to solve is when we list all documents without any
specific criteria. Since we bring the most recent ones and the ones that
contains images, we end up having a lot of listings from a single site,
since the documents are indexed in batches from the same site. At some
point we have several documents from the same site in the same date/time
and having images. I'm trying to give some random aspect to this search so
other documents can also appear in between that big dataset from the same
source.
Does the grouping help to achieve this?

Alexandre

On Thu, Jan 12, 2012 at 12:31 AM, Erick Ericksonerickerick...@gmail.comwrote:


Alexandre:

Have you thought about grouping? If you can analyze the incoming
documents and include a field such that similar documents map
to the same value, than group on that value you'll get output that
isn't dominated by repeated copies of the similar documents. It
depends, though, on being able to do a suitable mapping.

In your case, could the mapping just be the site from which you
got the data?

Best
Erick

On Wed, Jan 11, 2012 at 1:58 PM, Alexandre Roccoalel...@gmail.com
wrote:

Erick,

Probably I really wrote something silly. You are right on either sorting
by field or ranking.
I just need to change the ranking to shift things around as you said.

To clarify the use case:
We have a listing aggregator that gets product listings from a lot of
different sites and since they are added in batches, sometimes you see a
lot of pages from the same source (site). We are working on some changes to
shift things around and reduce this blocking effect, so we can present
mixed sources on the result pages.

I guess I will start with the document random field and later try to
develop a custom plugin to make things better.

Thanks for the pointers.

Regards,
Alexandre

On Wed, Jan 11, 2012 at 1:58 PM, Erick Ericksonerickerick...@gmail.com
wrote:


I really don't understand what this means:
random sorting for the records but also preserving the ranking

Either you're sorting on rank or you're not. If you mean you're
trying to shift things around just a little bit, *mostly* respecting
relevance then I guess you can do what you're thinking.

You could create your own function query to do the boosting, see:
http://wiki.apache.org/solr/SolrPlugins#ValueSourceParser

which would keep you from having to re-index your data to get
a different randomness.

You could also consider external file fields, but I think your
own function query would be cleaner. I don't think math.random
is a supported function OOB

Best
Erick


On Wed, Jan 11, 2012 at 8:29 AM, Alexandre Roccoalel...@gmail.com
wrote:

Hello all,

Recently I've been trying to tweak some aspects of relevancy in one listing
project.
I need to give a higher score to newer documents and also boost the
document based on a boolean field that indicates the listing has pictures.

On top of that, in some situations we need a random sorting for the records
but also preserving the ranking.

I tried to combine some techniques described in the Solr Relevancy FAQ
wiki, but when I add the random sorting, the ranking gets messy (as
expected).

This works well:




http://localhost:18979/solr/select/?start=0&rows=15&q={!boost%20b=recip(ms(NOW/HOUR,date_updated),3.16e-11,1,1)}active%3a%22true%22+AND+featured%3a%22false%22+_val_:%22haspicture%22&fl=*,score


This does not work, gives a random order on what is already ranked




http://localhost:18979/solr/select/?start=0&rows=15&q={!boost%20b=recip(ms(NOW/HOUR,date_updated),3.16e-11,1,1)}active%3a%22true%22+AND+featured%3a%22false%22+_val_:%22haspicture%22&fl=*,score&sort=random_1+desc


The only way I see is to create another field on the schema containing a
random value and use it to boost the document the same way that was done on
the boolean field.
Anyone tried something like this before and knows some way to get it
working?

Thanks,
Alexandre










Re: Solr response writer

2011-12-07 Thread Michael Kuhlmann

Am 07.12.2011 14:26, schrieb Finotti Simone:

That's the scenario:
I have an XML that maps words W to URLs; when a search request is issued by my 
web client, a query will be issued to my Solr application. If, after stemming, 
the query matches any in W, the client must be redirected to the associated URL.

I agree that it should be handled outside, but we are currently on progress of 
migrating from Endeca, and it has a feature that allow this scenario. For this 
reason, my boss asked if it was somehow possible to leave that functionality in 
the search engine.


Of course, your customers will never directly connect to your Solr 
server. They instead connect to your web application, which is itself a 
client to Solr.


Therefore, it's useless to return redirect response codes directly from 
Solr, since your customers' browsers will never get them.


Instead, you should handle Solr responses in your web application 
individually, and redirect your customers then.


-Kuli


Re: R: Solr response writer

2011-12-07 Thread Michael Kuhlmann

Am 07.12.2011 15:09, schrieb Finotti Simone:

I got your and Michael's point. Indeed, I'm not very skilled in web development 
so there may be something that I'm missing. Anyway, Endeca does something like 
this:

1. accept a query
2. does the stemming;
3. check if the result of step 2 matches one of the redirectable words. If 
so, returns a URL, otherwise returns the regular matching documents (our 
products' description).

Do you think that in Solr I will be able to replicate this behaviour without 
writing a custom plugin (request handler, response writer, etc)? Maybe I'm a 
little dense, but I fail to see how it would be possible...


Endeca is not only a search engine, it's part of a web application. You 
can send a query to the Endeca engine and send the response directly to 
the user; it's already fully rendered. (At least when you have configured it 
this way.)


Solr can't do this in any way. Solr responses are always pure technical 
data, not meant to be delivered to an end user. An exception to this is 
the VelocityResponseWriter which can fill a web template.


Anything beyond the possibilities of the VelocityResponseWriter must be 
handled by some web application that analyzes Solr's responses.


How do you want to display your product descriptions, the default case? 
I don't think you want to show some XML data.


Solr is a great search engine, but not more. It's just a small subset of 
commercial search frameworks like Endeca. Therefore, you can't simply 
replace it, you'll need some web application.


However, you don't need a custom response writer in this case, nor do 
you have to extend Solr in any way. At least not for this requirement.


-Kuli


Re: SolR for time-series data

2011-12-05 Thread Michael Kuhlmann

Hi Alan,

Solr can do this fast and easily, but I wonder if a simple key-value store 
wouldn't suit your needs better.


Do you really only need to query by chart_id, or do you also need to 
query by time range?


In either case, as long as your data fits into an in-memory database, I 
would suggest Redis to you. It's easy to install and use, and it's fast 
as hell.


If you want to query by time ranges, you can use lists and query them by 
range using lrange (http://www.redis.io/commands/lrange), at least when 
you know the first timestamp and the steps are even. Or use a sorted 
set, and make sure that the values differ.


In my opinion, Solr has too many features that you don't need.

-Kuli

Am 03.12.2011 18:10, schrieb Alan Miller:

Hi,

I have a webapp that plots a bunch of time series data which
is just a series of doubles coupled with a timestamp.

Every chart in my webapp has a chart_id in my db and i am wondering if it
would be
effective to usr solr to serve the data to my app instead of keeping the
data in my rdbms.

Currently I'm using Hadoop to calc and generate the report data and then
sticking it in my
rdbms but I could use solrj client to upload the data to a solr index
directly.

I know Solr is for indexing text documents but would it be effective to use
solr in this way?

I want to query by chart_id and get back a series of timestamp:double pairs.

Regards
Alan





Re: Replication not done for real on commit?

2011-12-05 Thread Michael Kuhlmann

Am 05.12.2011 14:28, schrieb Per Steffensen:

Hi

Reading http://wiki.apache.org/solr/SolrReplication I notice the
pollInterval (guess it should have been pullInterval) on the slaves.
That indicate to me that indexed information is not really pushed from
master to slave(s) on events defined by replicateAfter (e.g. commit),
but that it only will be made available for pulling by the slaves at
those events. So even though I run with a master with
replicateAfter=commit, I am not sure that I will be able to query a
document that I have just indexed from one of the slaves immediately
after having done the indexing on the master - I will have to wait
pollInterval (+ time for replication). Can anyone confirm that this is
a correct interpretation, or explain how to understand pollInterval if
it is not?


This is totally correct.



I want to acheive this always-in-sync property between master and slaves
(primary and replica if you like). What is the easiest way? Will I just
have to make sure myself that indexing goes on directly on all replica
of a shard, and then drop using the replication explained on
http://wiki.apache.org/solr/SolrReplication?


When committing, Solr will need some time (at least some microseconds, 
maybe much more) to update your changes into its index. In the 
meantime, the existing index readers will still work on the old, 
uncommitted index state. Therefore you'll surely fail when you rely on a 
committed index state immediately after your commit command, even 
without any replication on a single machine.


Why do you need such a feature? I don't think that there's a way to make 
Solr behave like this.


-Kuli


Re: Best practise to automatically change a field value for a specific period of time

2011-12-02 Thread Michael Kuhlmann

Hi Mark,

I'm sure you can manage this using function queries somehow, but this is 
rather complicated, esp. if you both want to return the price and sort 
on it.


I'd rather update the index as soon as a campaign starts or ends. At 
least that's how we did it when I worked for online shops. Normally this 
isn't a matter of seconds, and you would need to update Solr anyway when 
you create such a campaign.


As a benefit, you're not limited in the number of running campaigns (at 
least not on the Solr side). Maybe you want to plan a campaign when the 
current one hasn't ended yet, which would be (nearly) impossible when 
you calculate the price at query time.


Greetings,
Kuli

Am 02.12.2011 12:21, schrieb Mark Schoy:

Hi,

I have an solr index for an online shop with a field price which
contains the standard price of a product.
But in the database, the shop owner can specify a period of time with
an alternative price.

For example: standard price is $20.00, but 12/24/11 08:00am to
12/26/11 11:59pm = $12.59

Of course I could use a cronjob to update the documents. But I
think this is too unstable.
I also could save all price campaigns in a field and then extract
the correct price. But then I could not sort by price or only by the
standard price.

What I need is a field where I can put a condition like that: if
[current_time between one of the price campaigns] then [return price of
price campaign]. But (unfortunately) this is not possible.

Thanks for advice.




Re: PatternTokenizer failure

2011-11-29 Thread Michael Kuhlmann

Am 29.11.2011 15:20, schrieb Erick Erickson:

Hmmm, I tried this in straight Java, no Solr/Lucene involved and the
behavior I'm seeing is that no example works if it has more than
one whitespace character after the hyphen, including your failure
example.

I haven't lived inside regexes for long enough, so I don't know what
the right regex should be, but it doesn't appear to be a Solr problem


Jay,
I think the problem is this:

You're checking whether the character preceding the array of at least 
one whitespace is not a hyphen.


However, when you've more than one whitespace, like this:
foo- \n bar
then there's another array of whitespaces ("\n ") which is preceded by 
the first whitespace (" ").


Therefore, you'll need to not only check for preceding hyphens, but also 
for preceding whitespaces.


I'll leave this as an exercise for you. ;)

-Kuli


Re: Aggregated indexing of updating RSS feeds

2011-11-17 Thread Michael Kuhlmann

Am 17.11.2011 11:53, schrieb sbarriba:

The 'params' logging pointer was what I needed. So for reference it's not a
good idea to use a 'wget' command directly in a crontab.
I was using:

wget http://localhost/solr/myfeed?command=full-import&rows=5000&clean=false


:))

I think the shell handled the '&' sign as a flag to put the wget command 
into the background.


You could put the full URL into quotes, or escape the '&' signs with a 
backslash. Then it should work as well.
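
For example:

    wget "http://localhost/solr/myfeed?command=full-import&rows=5000&clean=false"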


-Kuli


Re: Problems installing Solr PHP extension

2011-11-16 Thread Michael Kuhlmann

Am 16.11.2011 17:11, schrieb Travis Low:


If I can't solve this problem then we'll basically have to write our own
PHP Solr client, which would royally suck.


Oh, if you really can't get the library work, no problem - there are 
several PHP clients out there that don't need a PECL installation.


Personally, I have used http://code.google.com/p/solr-php-client/, it 
works well.


-Kuli


Re: Add copyTo Field without re-indexing?

2011-11-16 Thread Michael Kuhlmann

Am 17.11.2011 08:46, schrieb Kashif Khan:

Please advise how we can reindex Solr when fields have stored=false. We
cannot reindex the data from the beginning; we just want to read and write the indexes
from SolrJ only. Please advise a solution. I know we can do it using
Lucene classes (IndexReader and IndexWriter), but we want to index all
fields.


This is not possible. At least not when the field content is modified in any way 
during indexing (stemmed, lowercased, tokenized, etc.).


The original data is not saved when stored is false. You'll need your 
original source data to reindex then.


-Kuli


Re: two word phrase search using dismax

2011-11-15 Thread Michael Kuhlmann

Am 14.11.2011 21:50, schrieb alx...@aim.com:

Hello,

I use Solr 3.4 and Nutch 1.3. In the request handler we have
<str name="mm">2&lt;-1 5&lt;-2 6&lt;90%</str>

As far as I know this means that for a two-word phrase search the match must be 100%.
However, I noticed that in most cases documents with both words are ranked 
around 20th place.
In the first places are documents with only one of the words in the phrase.

Any ideas why this happening and is it possible to fix it?


Hi,

are you sure that only one of the words matched in the found documents? 
Have you checked all fields that are listed in the qf parameter? And did 
you check for stemmed versions of your search terms?


If all this is true, you maybe want to give an example.

And AFAIK the mm parameter does not affect the ranking.



Re: creating solr index from nutch segments, no errors, no results

2011-11-15 Thread Michael Kuhlmann
I don't know much about nutch, but it looks like there's simply a commit 
missing at the end.


Try to send a commit, e.g. by executing

curl http://host:port/solr/core/update -H "Content-Type: text/xml" 
--data-binary '<commit />'


-Kuli

Am 15.11.2011 09:11, schrieb Armin Schleicher:

hi there,

[...]


Re: Solr 3.3 Sorting is not working for long fields

2011-11-15 Thread Michael Kuhlmann

Hi,

Am 15.11.2011 10:25, schrieb rajini maski:

 <fieldType name="long" class="solr.TrieLongField" precisionStep="0"
omitNorms="true" positionIncrementGap="0"/>


[...]


 <fieldType name="tlong" class="solr.TrieLongField" precisionStep="8"
omitNorms="true" positionIncrementGap="0"/>


[...]


<field name="studyid" type="long" indexed="true" stored="true"/>


Hmh, why didn't you just change the field type to tlong as you 
mentioned before? Instead you changed the class of the "long" type. 
There's nothing against this, it's just a bit confusing since "long" 
fields normally are of type solr.LongField, which is not sortable on its 
own.


You specified a precisionStep of 0, which means that the field would be 
slow in range queries, but it shouldn't harm for sorting. All in all, it 
should work.


So, the only chance I see is to re-index once again (and commit after 
that). I don't really see an error in your config except the confusing 
long type. It should work after reindexing, and it can't work if it 
was indexed with a genuine long type.


-Kuli


Re: Solr 3.3 Sorting is not working for long fields

2011-11-14 Thread Michael Kuhlmann

Am 14.11.2011 09:33, schrieb rajini maski:

query :
http://localhost:8091/Group/select/?indent=on&q=studyid:120&sort=studyidasc,groupid
asc,subjectid asc&start=0&rows=10


Is it a copy-and-paste error, or did you really sort on studyidasc?

I don't think you have a field studyidasc, and Solr should've given an 
exception that either asc or desc is missing.


-Kuli


Re: representing latlontype in pojo

2011-11-09 Thread Michael Kuhlmann

Am 08.11.2011 23:38, schrieb Cam Bazz:

How can I store a 2d point and index it to a field type that is
latlontype, if I am using solrj?


Simply use a String field. The format is $latitude,$longitude.
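
A sketch of such a POJO (the field name "store" is only an example and must 
match a LatLonType field in your schema):

    import org.apache.solr.client.solrj.beans.Field;

    public class Shop {
        @Field
        String id;

        @Field("store")
        String store;   // e.g. "45.17614,-93.87341", i.e. $latitude,$longitude
    }

You can then index it with server.addBean(shop).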

-Kuli



Re: Is SQL Like operator feature available in Apache Solr query

2011-11-01 Thread Michael Kuhlmann

Hi,

this is not exactly true. In Solr, you can't have the wildcard operator 
on both sides of the term.


However, you can tokenize your fields and simply query for "Solr". This 
is what Solr is made for. :)
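
A sketch of such a field type using n-grams at index time (the gram sizes are 
only an example, and the index will grow accordingly):

    <fieldType name="text_substring" class="solr.TextField">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="15"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

A plain query for solr on such a field then also matches documents where the 
term only appears inside a longer word, quite like '%Solr%'.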


-Kuli

Am 01.11.2011 13:24, schrieb François Schiettecatte:

Arshad

Actually it is available, you need to use the ReversedWildcardFilterFactory 
which I am sure you can Google for.

Solr and SQL address different problem sets with some overlaps but there are 
significant differences between the two technologies. Actually '%Solr%' is a 
worse case for SQL but handled quite elegantly in Solr.

Hope this helps!

Cheers

François


On Nov 1, 2011, at 7:46 AM, arshad ansari wrote:


Hi,

Is SQL Like operator feature available in Apache Solr Just like we have it
in SQL.

SQL example below -

*Select * from Employee where employee_name like '%Solr%'*

If not is it a Bug with Solr. If this feature available, please tell the
examples available.

Thanks!

--
Best Regards,
Arshad






Re: Is SQL Like operator feature available in Apache Solr query

2011-11-01 Thread Michael Kuhlmann

Am 01.11.2011 16:06, schrieb Erick Erickson:

NGrams are often used in Solr for this case, but they will also add to
your index size.

It might be worthwhile to look closely at your user requirements
before going ahead
and supporting this functionality

Best
Erick


My opinion. Wildcards are good for peeking into the index, i.e. for 
checking data in the browser. I haven't yet found a real life use case 
for them.


-Kuli


Re: Always return total number of documents

2011-10-28 Thread Michael Kuhlmann
Am 28.10.2011 11:16, schrieb Robert Brown:
 Is there no way to return the total number of docs as part of a search?

No, there isn't. Usually this information is of absolutely no value to the
end user.

A workaround would be to add some field to the schema that has the same
value for every document, and use this for facetting.

Greetings,
Kuli


Re: Query/Delete performance difference between straight HTTP and SolrJ

2011-10-27 Thread Michael Kuhlmann
Am 26.10.2011 18:29, schrieb Shawn Heisey:
 For inserting, I do use a Collection of SolrInputDocuments.  The delete
 process grabs values from idx_delete, does a query like the above (the
 part that's slow in Java), then if any documents are found, issues a
 deleteByQuery with the same string.

Why do you first query for these documents? Why don't you just delete
them? Solr won't harm if no documents are affected by your delete query,
and you'll get the number of affected documents in your response anyway.
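
In SolrJ that is a single call (the query string is only a placeholder):

    server.deleteByQuery("id:(1 OR 2 OR 3)");
    server.commit();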

When deleting, Solrj nearly does nothing on its own, it just sends the
POST request and analyzes the simple response. The behaviour in a get
request is similar. We do thousands of update, delete and get requests
in a minute using Solrj without problems, your timing problems must come
from somewhere else.

-Kuli

