Re: Learning to rank

2018-07-24 Thread Zheng Lin Edwin Yeo
Hi,

Which version of Solr are you using?
And do you have the error log for your error?

Regards,
Edwin

On Mon, 16 Jul 2018 at 21:20, Akshay Patil  wrote:

>  Hi
>
> I am a student, and for my master's thesis I am working on Learning To Rank.
> As I researched it, I found the solution provided by Bloomberg, but with the
> example that you have provided it always shows a Bad Request error.
>
> Do you have a running example of it, so I can adapt it to my application?
>
> I am trying to use the example that you have provided on GitHub:
>
> core :- techproducts
> traning_and_uploading_demo.py
>
> It generates the training data, but I am getting a problem when uploading
> the model: it shows a bad request error (empty request body). Please help
> me out with this problem so I will be able to adapt it to my application.
>
> Best Regards !
>
> Any help would be appreciated 
>

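A minimal sketch of uploading the demo features and model by hand against the
techproducts core, assuming the LTR contrib is enabled; the file paths below are
placeholders, not taken from this thread. A "bad request (empty request body)"
response is often curl failing to read the file given to --data-binary, or a
missing Content-Type header:

    curl -XPUT 'http://localhost:8983/solr/techproducts/schema/feature-store' \
         --data-binary "@/path/to/myFeatures.json" \
         -H 'Content-type:application/json'

    curl -XPUT 'http://localhost:8983/solr/techproducts/schema/model-store' \
         --data-binary "@/path/to/myModel.json" \
         -H 'Content-type:application/json'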

Re: Block Join Faceting issue

2018-07-24 Thread sagandhi
Hi Mikhail,

Thank you for suggesting the JSON Facet API. I tried json.facet; it works
great and I am able to make a single query instead of two. Now I am planning
to get rid of the duplicate child fields in parent docs. However, I ran into
problems while forming negative queries with block join.

Here's what I would like to query - Get me parent docs whose children do not
have a particular field.
I tried these but none worked - 

q=*:*&fq={!parent which="doc_type:parent"}*-*child_color:*
q=*:*&fq={!parent which="doc_type:parent" v=$qq}&qq=(!child_color:*)

Currently I have duplicate entries of child fields in parent docs, so I am
able to do this - 
fq=!parent_color:*

Is there a way to form this query using block join? 

Thanks,
Soham
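
One sketch of expressing "parents whose children do not have child_color" with a
block join, without the duplicated parent fields (assuming doc_type:parent marks
the parent documents, as in the queries above; untested against this schema):

    q=*:*
    fq=doc_type:parent
    fq=-_query_:"{!parent which='doc_type:parent'}child_color:[* TO *]"

The first filter keeps only parent documents, and the second excludes every parent
that has at least one child with child_color populated; the negation happens at
the top level instead of inside the block-join query.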






Re: How to retrieve nested documents (parents and their children together) ?

2018-07-24 Thread TK Solr

Thank you. I'll try the child doc transformer.

On a related question, if I delete a parent document, will its children be 
deleted also? Or do I have to have a parent_id field in each child so that the 
child docs can be deleted?



On 7/22/18 10:05 AM, Mikhail Khludnev wrote:

Hello,
Check [child]
https://lucene.apache.org/solr/guide/7_4/transforming-result-documents.html#child-childdoctransformerfactory
or [subquery].
Although it's worth putting a reference to it somewhere in the blockjoin
qparsers documentation.
Documentation patches are welcome.
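
For reference, a rough sketch of a [child] transformer request against the sample
documents on that page (assuming content_type:parentDocument marks the parents, as
in the ref guide example):

    q={!parent which="content_type:parentDocument"}comments:SolrCloud
    fl=id,title,[child parentFilter=content_type:parentDocument]

The query selects parents via a child clause, and the [child] doc transformer
re-attaches each parent's matching children to the returned parent document.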


On Sun, Jul 22, 2018 at 10:25 AM TK Solr  wrote:


https://lucene.apache.org/solr/guide/7_4/other-parsers.html#block-join-parent-query-parser

talks about {!parent which=} , which returns parent docs only, and
{!child of=} ,
which
returns child docs only.

Is there a way to retrieve the matched documents in the original, nested
form?
Using the sample document, is there a way to get:


<doc>
  <field name="id">1</field>
  <field name="title">Solr has block join support</field>
  <field name="content_type">parentDocument</field>
  <doc>
    <field name="id">2</field>
    <field name="comments">SolrCloud supports it too!</field>
  </doc>
</doc>



rather than just the parent or the child docs?







Re: Possible to define a field so that substring-search is always used?

2018-07-24 Thread Chris Hostetter


: We are using Solr as a user index, and users have email addresses.
: 
: Our old search behavior used a SQL substring match for any search
: terms entered, and so users are used to being able to search for e.g.
: "chr" and finding my email address ("ch...@christopherschultz.net").
: 
: By default, Solr doesn't perform substring matches, and it might be
: difficult to re-train users to use *chr* to find email addresses by
: substring.

In the past, were you really doing arbitrary substring matching, or just 
prefix matching?  ie would a search for "sto" match 
"ch...@christopherschultz.net"

Personally, if you know you have an email field, I would suggest using a
custom tokenizer that splits on "@" and "." (and maybe other punctuation
characters like "-"), and then taking your raw user input and feeding it to
the prefix parser (instead of requiring your users to add the "*")...

 q={!prefix f=email v=$user_input}&user_input=chr

...which would match ch...@gmail.com, f...@chris.com, f...@bar.chr etc. 

(this wouldn't help you though if you *really* want arbitrary substring 
matching -- as erick suggested ngrams is pretty much your best bet for 
something like that)
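
A sketch of that kind of field type (the names are made up; PatternTokenizerFactory
is just one way to split on the punctuation, any tokenizer that breaks on "@", "."
and "-" would do):

    <fieldType name="email_parts" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.PatternTokenizerFactory" pattern="[@.\-]"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

With something like that in place, the {!prefix} request above matches any token of
the address that starts with the raw user input.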

Bear in mind, you can combine that "forced prefix" query against 
the (tokenized) email field with other queries that 
could parse your input in other ways...

user_input=...
q=({!prefix f=email v=$user_input} 
   OR {!dismax qf="first_name last_name" ..etc.. v=$user_input})

so if your user input is "chris" you'll get term matches on the 
first_name field, or the last_name field as well as prefix matches on the 
email field.



-Hoss
http://www.lucidworks.com/


Re: Alias field names when searching (not for results)

2018-07-24 Thread Chris Hostetter


: >  defType=edismax q=sysadmin name:Mike qf=title text last_name
: > first_name
: 
: Aside: I'm curious about the use of "qf", here. Since I didn't want my
: users to have to specify any particular field to search, I created an
: "all" field and dumped everything into it. It seems like it would be
: better to change that so that I don't have an "all" field at all and
...
: Does that sound like a better approach than packing-together an "all"
: field during indexing?

well -- you may have other reasons why an "all" field is useful, but yeah 
-- when using dismax/edismax the "qf" param is really designed to let you 
search across many diff fields, and to associate query time weights with 
those fields.  see the docs i linked to earlier, but there's also a blog 
post on the scoring implications i wrote a lifetime ago...

https://lucidworks.com/2010/05/23/whats-a-dismax/

: > ...the examples above all show the request params, so "f.last.qf"
: > is a param name, "last_name" is the corresponding param value.
: 
: Awesome. I didn't realize that "f.alias.qf" was the name of the actual
: parameter to send. I was staring at the Solr Dashboard's selection of
: edismax parameters and not seeing anything that seemed correct. That's
: because it's a new parameter! Makes sense, now.

that syntax is an example of a "per field override" where in this case the
"field" you are overriding doesn't *have* to be a "real" field in the
index -- it can be an alias, and for that alias (when used by your users)
you are defining the qf to use.  It could in fact be a "real" field name,
where you override what gets searched ("I'm not going to let them search
directly against just the last_name; when they try, I'm going to *actually*
search against last_name and full_name", etc.).


-Hoss
http://www.lucidworks.com/


Re: Alias field names when searching (not for results)

2018-07-24 Thread Christopher Schultz

Chris,

On 7/24/18 1:40 PM, Chris Hostetter wrote:
> 
> : So if I want to alias the "first_name" field to "first" and the :
> "last_name" field to "last", then I would ... do what, exactly?
> 
> se the last example here...
> 
> https://lucene.apache.org/solr/guide/7_4/the-extended-dismax-query-parser.html#examples-of-edismax-queries
>
>  defType=edismax q=sysadmin name:Mike qf=title text last_name
> first_name

Aside: I'm curious about the use of "qf", here. Since I didn't want my
users to have to specify any particular field to search, I created an
"all" field and dumped everything into it. It seems like it would be
better to change that so that I don't have an "all" field at all and
instead I mention all of the fields I would normally have packed into
the "all" field in the "qf" parameter. That would reduce my index size
and also help with another question I had today (subject: Possible to
define a field so that substring-search is always used?).

Does that sound like a better approach than packing-together an "all"
field during indexing?

> f.name.qf=last_name first_name
> 
> the "f.name.qf" has created an "alias" so that when the "q"
> contains "name:Mike" it searches for "Mike" in both the last_name
> and first_name fields.  if it were "f.name.qf=last_name
> first_name^2" then there would be a boost on matches in the
> first_name field.
> 
> For your usecase you want something like...
> 
> defType=edismax q=sysadmin first:Mike last:Smith qf=title text
> last_name first_name f.first.qf=first_name f.last.qf=last_name
> 
> : I'm using SolrJ as the client.
> 
> ...the examples above all show the request params, so "f.last.qf"
> is a param name, "last_name" is the corrisponding param value.

Awesome. I didn't realize that "f.alias.qf" was the name of the actual
parameter to send. I was staring at the Solr Dashboard's selection of
edismax parameters and not seeing anything that seemed correct. That's
because it's a new parameter! Makes sense, now.

Thanks a bunch,
- -chris


Re: [EXTERNAL] Re: Facet Sorting

2018-07-24 Thread Chris Hostetter

: Chris, I was trying the below method for sorting the faceted buckets but 
: am seeing that the function query query($q) applies only to the score 
: from “q” parameter. My solr request has a combination of q, “bq” and 
: “bf” and it looks like the function query query($q) is calculating the 
: scores only on q and not on the aggregate score of q, bq and bf

right.  ok -- yeah, that makes sense.

The thing to understand is that when you use request params as "variables" 
in functions like that, the function doesn't know the context of your 
request -- "query($q)" doesn't know/treat the "q" param special, it could 
just as easily be "query($make_up_a_param_name_thats_in_your_request)"

when the query() function goes and evaluates the param you specify, 
it's not going to know that you have a defType of e/dismax that affects 
"q" param when the main query is executed -- it just parses it as a lucene 
query.

so what you need is something like "query({!dismax bf=$bf bq=$bq v=$q})" 
... i think that should work, or if not then use "query($facet_sort)" 
where facet_sort is a new param you add that contains "{!dismax bf=$bf 
bq=$bq v=$q}"

alternatively, you could change your "q" param to be very explicit about 
the query you want, w/o depending on defType, and use a custom param name 
for the original query string provided by the user -- that's what i 
frequently do...

   ie: q={!dismax bf=$bf bq=$bq v=$qq}&qq=dogs and cats

...and then the "query($q)" i suggested before should work as is.
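
Putting the two suggestions together, a sketch of the whole request (the qf, qq,
bq and bf values are only illustrative):

    q={!dismax qf=$qf bq=$bq bf=$bf v=$qq}
    qq=dogs and cats
    qf=title description
    bq=popularity:[10 TO *]^2
    bf=recip(ms(NOW,date_dt),3.16e-11,1,1)
    rows=0
    json.facet={
      categories:{
        type  : terms,
        field : cat,
        sort  : { x : desc },
        facet : { x : "sum(query($q))" }
      }
    }

Because "q" now spells out the full dismax query (including bq and bf), query($q)
aggregates the same score the main query produced for each document in the bucket.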

does that make sense?


-Hoss
http://www.lucidworks.com/

Re: Alias field names when searching (not for results)

2018-07-24 Thread Chris Hostetter


: So if I want to alias the "first_name" field to "first" and the
: "last_name" field to "last", then I would ... do what, exactly?

see the last example here...

https://lucene.apache.org/solr/guide/7_4/the-extended-dismax-query-parser.html#examples-of-edismax-queries

defType=edismax
q=sysadmin name:Mike
qf=title text last_name first_name
f.name.qf=last_name first_name

the "f.name.qf" has created an "alias" so that when the "q" contains 
"name:Mike" it searches for "Mike" in both the last_name and first_name 
fields.  if it were "f.name.qf=last_name first_name^2" then there would be 
a boost on matches in the first_name field.

For your usecase you want something like...

defType=edismax
q=sysadmin first:Mike last:Smith
qf=title text last_name first_name
f.first.qf=first_name
f.last.qf=last_name

: I'm using SolrJ as the client.

...the examples above all show the request params, so "f.last.qf" is a 
param name, "last_name" is the corresponding param value.
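
In SolrJ those params can be set directly on the query object; a sketch (assumes a
SolrClient named solrClient is already built, and "first"/"last" are the aliases,
not real fields):

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.response.QueryResponse;

    // inside a method; throws SolrServerException, IOException
    SolrQuery query = new SolrQuery("sysadmin first:Mike last:Smith");
    query.set("defType", "edismax");
    query.set("qf", "title text last_name first_name");
    query.set("f.first.qf", "first_name");
    query.set("f.last.qf", "last_name");
    QueryResponse rsp = solrClient.query(query);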



-Hoss
http://www.lucidworks.com/


Re: Good practices on indexing larger amount of documents at once using SolrJ

2018-07-24 Thread Arunan Sugunakumar
Dear Erick,

Unfortunately I deleted the original Solr logs, so I couldn't post it here.
But removing the hard commit from the loop solved my problem and made
indexing faster. Now there are no errors thrown from the client side.

Thanks
Arunan
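
For reference, the solrconfig.xml settings Erick describes below might look roughly
like this (the intervals are only examples, not recommendations for any particular
setup):

    <autoCommit>
      <maxTime>60000</maxTime>          <!-- hard commit every minute -->
      <openSearcher>false</openSearcher>
    </autoCommit>
    <autoSoftCommit>
      <maxTime>300000</maxTime>         <!-- new searcher every 5 minutes -->
    </autoSoftCommit>

On the client side the per-request alternative is SolrJ's commitWithin overload,
e.g. solrClient.add(solrDocumentList, 60000), instead of calling commit() inside
the indexing loop.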


On 22 July 2018 at 04:45, Erick Erickson  wrote:

> commitWithin parameter.
>
> Well, what I usually do is set my autocommit interval in my
> solrconfig.xml file and forget about it.
> For searching, set your autosoftcommit in solrconfig.xml and forget
> about _that_.
>
> Here's more than you want to know about the topic.
> https://lucidworks.com/2013/08/23/understanding-
> transaction-logs-softcommit-and-commit-in-sorlcloud/
>
> As for what to set them to? soft commit "as long as you can stand".
>
> For hard commit (openSearcher value doesn't really matter) I like a
> minute or so. Especially if openSearcher=false,
> then that defines the limit of how much data you'd have to replay from
> the tlog if your process terminates
> abnormally.
>
> But for your original problem, what do the solr logs say? The error
> you posted doesn't really shed any light on the root cause.
>
> Best,
> Erick
>
> On Fri, Jul 20, 2018 at 9:39 PM, Arunan Sugunakumar
>  wrote:
> > Dear Erick,
> >
> > Thank you for your reply. I initialize the arraylist variable with a new
> > Array List after I add and commit the solrDocumentList into the
> solrClient.
> > So I don't think I have the problem of an ever-increasing ArrayList. (I hope
> > the add method in solrClient flushes the previous documents added). But
> as
> > you said I do a hard commit during the loop. I can change it by adding
> > commitWithin. What is the value you would recommend for this type of
> > scenario.
> >
> > Thank you,
> > Arunan
> >
> > *Sugunakumar Arunan*
> > Undergraduate - CSE | UOM
> >
> > Email : aruna ns...@cse.mrt.ac.lk
> > Mobile : 0094 766016272
> > LinkedIn : https://www.linkedin.com/in/arunans23/
> >
> > On 20 July 2018 at 23:21, Erick Erickson 
> wrote:
> >
> >> I do this all the time with batches of 1,000 and don't see this problem.
> >>
> >> one thing that sometimes bites people is to fail to clear the doclist
> >> after every call to add. So you send ever-increasing batches to Solr.
> >> Assuming when you talk about batch size meaning the size of the
> >> solrDocunentList, increasing it would make  the broken pipe problem
> >> worse if anything...
> >>
> >> Also, it's generally bad practice to commit after every batch. That's
> not
> >> your problem here, just something to note. Let your autocommit
> >> settings in solrconfig handle it or specify commitWithin in your
> >> add call.
> >>
> >> I'd also look in your Solr logs and see if there's a problem there.
> >>
> >> Net-net is this is a perfectly reasonable pattern, I suspect some
> >> innocent-seeming problem with your indexing code.
> >>
> >> Best,
> >> Erick
> >>
> >>
> >>
> >> On Fri, Jul 20, 2018 at 9:32 AM, Arunan Sugunakumar
> >>  wrote:
> >> > Hi,
> >> >
> >> > I have around 12 millions objects in my PostgreSQL database to be
> >> indexed.
> >> > I'm running a thread to fetch the rows from the database. The thread
> will
> >> > also create the documents and put it in an indexing queue. While this
> is
> >> > happening my main process will retrieve the documents from the queue
> and
> >> > will index it in the size of 1000. For some time the process is
> running
> >> as
> >> > expected, but after some time, I get an exception.
> >> >
> >> > *[corePostProcess] org.apache.solr.client.solrj.SolrServerException:
> >> > IOException occured when talking to server at:
> >> > http://localhost:8983/solr/mine-search
> >> > ……….…
> >> …….[corePostProcess]
> >> > Caused by: java.net.SocketException: Broken pipe (Write
> >> > failed)[corePostProcess]at
> >> > java.net.SocketOutputStream.socketWrite0(Native Method)*
> >> >
> >> >
> >> > I tried increasing the batch size upto 3. Then I got a different
> >> > exception.
> >> >
> >> > *[corePostProcess] org.apache.solr.client.solrj.SolrServerException:
> >> > IOException occured when talking to server at:
> >> > http://localhost:8983/solr/mine-search
> >> > ……
> >> .….[corePostProcess]
> >> > Caused by: org.apache.http.NoHttpResponseException: localhost:8983
> >> failed
> >> > to respond*
> >> >
> >> >
> >> > I would like to know whether there are any good practices on handling
> >> such
> >> > situation, such as max no of documents to index in one attempt etc.
> >> >
> >> > My environement :
> >> >
> >> > Version : solr 7.2, solrj 7.2
> >> > Ubuntu 16.04
> >> > RAM 20GB
> >> > I started Solr in standalone mode.
> >> > Number of replicas and shards : 1
> >> >
> >> > The method I used :
> >> > UpdateResponse response = solrClient.add(
> >> solrDocumentList);
> >> > solrClient.commit();
> >> >
> >> >
> >> > Thanks in advance.
> 

Re: Possible to define a field so that substring-search is always used?

2018-07-24 Thread Erick Erickson
1. the standard way to do this is to use ngrams. The index is larger,
but it gives you much quicker searches than trying to to
pre-and-postfix wildcards

2. use a fieldType with KeywordTokenizerFactory + (probably)
LowerCaseFilterFactory + TrimFilterFactory. And, in your case,
NGramTokenizerFactory (I'd start with bigrams, i.e. min=2 and max=2)

3. no. The destination field has its own field type and that's how
the input stream is analyzed. There's no good way to say "don't
analyze input from field X when copied to field Y". Probably best not
to copy it there at all.

Best,
Erick
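
A sketch of an ngram field type along those lines (names are made up; this variant
applies the NGramFilter only at index time and leaves the query side whole, so
maxGramSize has to be at least as long as the longest substring users will type):

    <fieldType name="text_substring" class="solr.TextField">
      <analyzer type="index">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.TrimFilterFactory"/>
        <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="20"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.TrimFilterFactory"/>
      </analyzer>
    </fieldType>

A query for "chr" then matches the indexed 3-gram "chr" anywhere inside the
address, with no wildcards needed.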

On Tue, Jul 24, 2018 at 9:05 AM, Christopher Schultz
 wrote:
>
> All,
>
> We are using Solr as a user index, and users have email addresses.
>
> Our old search behavior used a SQL substring match for any search
> terms entered, and so users are used to being able to search for e.g.
> "chr" and finding my email address ("ch...@christopherschultz.net").
>
> By default, Solr doesn't perform substring matches, and it might be
> difficult to re-train users to use *chr* to find email addresses by
> substring.
>
> Is there a way to define the field such that searches are always done
> as a substring? While we are at it, I'd like to define the field to
> avoid tokenization because it's never useful to search for
> "m...@gmail.com" and find a few million search results because many
> users use @gmail.com email addresses.
>
> Here is the current field definition from our create-schema script:
>
>   "add-field":{
>  "name":"email_address",
>  "type":"text_general",
>  "multiValued" : false,
>  "stored":true },
>
> Later, we add the email address to the "all" field (which aggregates
> everything from all useful fields into the field used as the
> default-field):
>
>   "add-copy-field":{
>  "source":"email_address",
>  "dest":"all" },
>
> Is there a way to define these fields such that:
>
> 1. The email_address field is always searched using a substring
> 2. The email_address field is not tokenized
> 3. The copied-email-address is not tokenized in the "all" field
>
> Thanks,
> - -chris


Re: Solr Optimization Failure due to Timeout Waiting for Server - How to Address

2018-07-24 Thread THADC
Thanks, we feel confident we will not need the optimization for our
circumstances and will just remove the code. Appreciate the response!





Re: Alias field names when searching (not for results)

2018-07-24 Thread Christopher Schultz

Emir,

On 3/6/18 2:42 AM, Emir Arnautović wrote:
> I did not try it, but the first thing that came to my mind is to
> use edismax’s ability to define field aliases, something like 
> f.f1.fq=field_1. Note that it is not recommended to have field
> name starting with number so not sure if it will work with “1”.

So if I want to alias the "first_name" field to "first" and the
"last_name" field to "last", then I would ... do what, exactly?

I'm using SolrJ as the client.

   queryParamMap.put("defType", "edismax");
   queryParamMap.put([??], "f.first.fq=first_name f.last.fq=last_name");

??

Thanks,
- -chris

>> On 5 Mar 2018, at 17:51, Christopher Schultz
>>  wrote:
>> 
> All,
> 
> I'd like for users to be able to search a field by multiple names 
> without performing a "copy-field" when analyzing a document. Is
> that possible? Whenever I search for "solr alias field" I get
> results about how to re-name fields in the results.
> 
> Here's what I'd like to do. Let's say I have a document:
> 
> { id: 1234, field_1: valueA, field_2: valueB, field_3: valueC }
> 
> I'd like users to be able to find this document using any of the 
> following queries:
> 
> field_1:valueA f1:valueA 1:valueA
> 
> I just want the query parser to say "oh, 'f1' is an alias for 
> 'field_1'" and substitute that when performing the search. Is that 
> possible?
> 
> -chris
> 
> 


Re: Alias field names when searching (not for results)

2018-07-24 Thread Christopher Schultz

Rick,

On 3/6/18 6:39 PM, Rick Leir wrote:
> The first thing that came to mind is that you are planning not to 
> have an app in front of Solr. Without a web app, you will need to 
> trust whoever can get access to Solr. Maybe you are on an
> intranet.
Nope, we have a web application between the user and Solr. But I would
rather not parse the user's query string and re-write it so that the
search field-names are canonicalized.

Thanks,
- -chris

> On March 6, 2018 2:42:26 AM EST, "Emir Arnautović"
>  wrote:
>> Hi, I did not try it, but the first thing that came to my mind is
>> to use edismax’s ability to define field aliases, something like 
>> f.f1.fq=field_1. Note that it is not recommended to have field
>> name starting with number so not sure if it will work with “1”.
>> 
>> HTH, Emir -- Monitoring - Log Management - Alerting - Anomaly
>> Detection Solr & Elasticsearch Consulting Support Training -
>> http://sematext.com/
>> 
>> 
>> 
>>> On 5 Mar 2018, at 17:51, Christopher Schultz
>>  wrote:
>>> 
> All,
> 
> I'd like for users to be able to search a field by multiple names 
> without performing a "copy-field" when analyzing a document. Is
> that possible? Whenever I search for "solr alias field" I get
> results
>>> about
> how to re-name fields in the results.
> 
> Here's what I'd like to do. Let's say I have a document:
> 
> { id: 1234, field_1: valueA, field_2: valueB, field_3: valueC }
> 
> I'd like users to be able to find this document using any of the 
> following queries:
> 
> field_1:valueA f1:valueA 1:valueA
> 
> I just want the query parser to say "oh, 'f1' is an alias for 
> 'field_1'" and substitute that when performing the search. Is that 
> possible?
> 
> -chris
> 
> 


Possible to define a field so that substring-search is always used?

2018-07-24 Thread Christopher Schultz

All,

We are using Solr as a user index, and users have email addresses.

Our old search behavior used a SQL substring match for any search
terms entered, and so users are used to being able to search for e.g.
"chr" and finding my email address ("ch...@christopherschultz.net").

By default, Solr doesn't perform substring matches, and it might be
difficult to re-train users to use *chr* to find email addresses by
substring.

Is there a way to define the field such that searches are always done
as a substring? While we are at it, I'd like to define the field to
avoid tokenization because it's never useful to search for
"m...@gmail.com" and find a few million search results because many
users use @gmail.com email addresses.

Here is the current field definition from our create-schema script:

  "add-field":{
 "name":"email_address",
 "type":"text_general",
 "multiValued" : false,
 "stored":true },

Later, we add the email address to the "all" field (which aggregates
everything from all useful fields into the field used as the
default-field):

  "add-copy-field":{
 "source":"email_address",
 "dest":"all" },

Is there a way to define these fields such that:

1. The email_address field is always searched using a substring
2. The email_address field is not tokenized
3. The copied-email-address is not tokenized in the "all" field

Thanks,
- -chris


Re: ConcurrentUpdateSolrClient threads

2018-07-24 Thread TerjeAndersen
Hi. I'm wondering the same. Some "updateBean" calls take a very long time, up to
130 000 ms; typically one takes around 100 ms. I'm using 2 threads and a
queue size of 30. I haven't figured out what the default thread count is -- is it 0?
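
For reference, in SolrJ 7.x both knobs are set on the builder; a sketch with the
values mentioned above (the URL is a placeholder):

    ConcurrentUpdateSolrClient client =
        new ConcurrentUpdateSolrClient.Builder("http://localhost:8983/solr/mycollection")
            .withQueueSize(30)     // capacity of the internal update queue
            .withThreadCount(2)    // background threads draining the queue
            .build();

Setting withThreadCount() explicitly is the simplest way to be sure how many runner
threads are draining the queue, rather than relying on the builder's default.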





Re: Solr Optimization Failure due to Timeout Waiting for Server - How to Address

2018-07-24 Thread Erick Erickson
Does the optimize actually fail or just take a long time? That is, if
you wait does the index eventually get down to one segment?
For long-running operations, the _request_ can time out even though
the action is still continuing.

But that brings up whether you should optimize in the first place.
Optimize will reclaim resources from
deleted (or replaced) documents, but in terms of query speed, the
effects may be minimal, and it is
very expensive. See:

https://lucidworks.com/2017/10/13/segment-merging-deleted-documents-optimize-may-bad/

So what I'd do is if at the end of your indexing, the percentage of
deleted docs was less than, say,
20% I wouldn't optimize.

If you have some tests that show enough increased query speed to be
worth the bother, then sure. But
optimize isn't usually necessary.

Best,
Erick
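
One way to check the deleted-document percentage Erick mentions before deciding to
optimize is the Luke request handler (the collection name is a placeholder):

    curl 'http://localhost:8983/solr/mycollection/admin/luke?numTerms=0&wt=json'

The "index" section of the response reports numDocs, maxDoc and deletedDocs;
deletedDocs / maxDoc is the fraction of the index an optimize (or ordinary segment
merging) would eventually reclaim.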

On Tue, Jul 24, 2018 at 4:52 AM, THADC
 wrote:
> Hi,
>
> We have recently been performing a bulk reindexing against a large database
> of ours. At the end of reindexing all documents we successfully perform a
> CloudSolrClient.commit(). The entire reindexing process takes around 9
> hours. This is solr 7.3, by the way..
>
> Anyway, immediately after the commit, we execute a
> CloudSolrClient.optimize(), but immediately receive a "SolrServerException:
> Timeout occurred while waiting response from server at" (followed by URL of
> this collection).
>
> We have never had an issue with this against bulk reindexes of smaller
> databases (most are much smaller and reindexing takes only 10-15 minutes
> with those). The other difference with this environment is that the
> reindexing is performed across multiple threads (calls to solrCloud server
> from a multi-threaded setup) for performance reasons, rather than a single
> thread. However, the reindexing process itself is completely successful; it's
> just the subsequent optimization that fails.
>
> Is there a simple way to avoid this timeout failure issue? Could it be a
> matter of retrying until the optimize() request is successful (that is,
> after a reasonable number of attempts) rather than just trying once and
> quitting? Any and all ideas are greatly appreciated. Thanks!
>
>
>


Re: Question regarding searching Chinese characters

2018-07-24 Thread Tomoko Uchida
Hi Amanda,

> do all I need to do is modify the settings from smartChinese to the ones
you posted here

Yes, the settings I posted should work for you, at least partially.
If you are happy with the results, it's OK!
But please take this as a starting point because it's not perfect.

> Or do I need to still do something with the SmartChineseAnalyzer?

Try the settings, then if you notice something strange and want to know why
and how to solve it, that may be the time to dive deep into. ;)

I cannot explain how analyzers work here... but you should start off with
the Solr documentation.
https://lucene.apache.org/solr/guide/7_0/understanding-analyzers-tokenizers-and-filters.html

Regards,
Tomoko
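
Spelled out as a complete field type, the settings discussed in this thread look
roughly like the sketch below (the fieldType name is made up, and the ICU factories
need the analysis-extras contrib jars on the classpath):

    <fieldType name="text_zh" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <charFilter class="solr.ICUNormalizer2CharFilterFactory"/>
        <tokenizer class="solr.HMMChineseTokenizerFactory"/>
        <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
      </analyzer>
    </fieldType>

As noted elsewhere in the thread, HMMChineseTokenizer is built for simplified
Chinese, so applying the Traditional-Simplified transform after tokenization is a
workaround rather than a complete solution.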



On Tue, Jul 24, 2018 at 21:08 Amanda Shuman  wrote:

> Hi Tomoko,
>
> Thanks so much for this explanation - I did not even know this was
> possible! I will try it out but I have one question: do all I need to do is
> modify the settings from smartChinese to the ones you posted here:
>
> <analyzer>
>   <charFilter class="solr.ICUNormalizer2CharFilterFactory"/>
>   <tokenizer class="solr.HMMChineseTokenizerFactory"/>
>   <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
> </analyzer>
>
> Or do I need to still do something with the SmartChineseAnalyzer? I did not
> quite understand this in your first message:
>
> " I think you need two steps if you want to use HMMChineseTokenizer
> correctly.
>
> 1. transform all traditional characters to simplified ones and save to
> temporary files.
> I do not have clear idea for doing this, but you can create a Java
> program that calls Lucene's ICUTransformFilter
> 2. then, index to Solr using SmartChineseAnalyzer."
>
> My understanding is that with the new settings you posted, I don't need to
> do these steps. Is that correct? Otherwise, I don't really know how to do
> step 1 with the java program
>
> Thanks!
> Amanda
>
>
> --
> Dr. Amanda Shuman
> Post-doc researcher, University of Freiburg, The Maoist Legacy Project
> 
> PhD, University of California, Santa Cruz
> http://www.amandashuman.net/
> http://www.prchistoryresources.org/
> Office: +49 (0) 761 203 4925
>
>
> On Fri, Jul 20, 2018 at 8:03 PM, Tomoko Uchida <
> tomoko.uchida.1...@gmail.com
> > wrote:
>
> > Yes, while traditional - simplified transformation would be out of the
> > scope of Unicode normalization,
> > you would like to add ICUNormalizer2CharFilterFactory anyway :)
> >
> > Let me refine my example settings:
> >
> > <analyzer>
> >   <charFilter class="solr.ICUNormalizer2CharFilterFactory"/>
> >   <tokenizer class="solr.HMMChineseTokenizerFactory"/>
> >   <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
> > </analyzer>
> >
> > Regards,
> > Tomoko
> >
> >
> > On Sat, Jul 21, 2018 at 2:54 Alexandre Rafalovitch  wrote:
> >
> > > Would  ICUNormalizer2CharFilterFactory do? Or at least serve as a
> > > template of what needs to be done.
> > >
> > > Regards,
> > >Alex.
> > >
> > > On 20 July 2018 at 12:40, Walter Underwood 
> > wrote:
> > > > Looks like we need a charfilter version of the ICU transforms. That
> > > could run before the tokenizer.
> > > >
> > > > I’ve never built a charfilter, but it seems like this would be a good
> > > first project for someone who wants to contribute.
> > > >
> > > > wunder
> > > > Walter Underwood
> > > > wun...@wunderwood.org
> > > > http://observer.wunderwood.org/  (my blog)
> > > >
> > > >> On Jul 20, 2018, at 8:24 AM, Tomoko Uchida <
> > > tomoko.uchida.1...@gmail.com> wrote:
> > > >>
> > > >> Exactly. More concretely, the starting point is: replacing your
> > analyzer
> > > >>
> > > >>  > > class="org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer"/>
> > > >>
> > > >> to
> > > >>
> > > >> 
> > > >>  
> > > >>   > > >> id="Traditional-Simplified"/>
> > > >> 
> > > >>
> > > >> and see if the results are as expected. Then research another
> filters
> > if
> > > >> your requirements is not met.
> > > >>
> > > >> Just a reminder: HMMChineseTokenizerFactory do not handle
> traditional
> > > >> characters as I noted previous in post, so ICUTransformFilterFactory
> > is
> > > an
> > > >> incomplete workaround.
> > > >>
> > > >> 2018年7月21日(土) 0:05 Walter Underwood :
> > > >>
> > > >>> I expect that this is the line that does the transformation:
> > > >>>
> > > >>>> > >>> id="Traditional-Simplified"/>
> > > >>>
> > > >>> This mapping is a standard feature of ICU. More info on ICU
> > transforms
> > > is
> > > >>> in this doc, though not much detail on this particular transform.
> > > >>>
> > > >>> http://userguide.icu-project.org/transforms/general
> > > >>>
> > > >>> wunder
> > > >>> Walter Underwood
> > > >>> wun...@wunderwood.org
> > > >>> http://observer.wunderwood.org/  (my blog)
> > > >>>
> > >  On Jul 20, 2018, at 7:43 AM, Susheel Kumar  >
> > > >>> wrote:
> > > 
> > >  I think so.  I used the exact as in github
> > > 
> > >   > >  positionIncrementGap="1" autoGeneratePhraseQueries="false">
> > >  
> > >    
> > >    
> > > > > class="edu.stanford.lucene.analysis.CJKFoldingFilterFactory"/>
> > > > > >>> id="Traditional-Simplified"/>
> > > > > >>> id="Katakana-Hiragana"/>
> > >    
> > > > >  hiragana="true" katakana="true" hangul="true"
> 

Re: Solr fails even ZK quorum has majority

2018-07-24 Thread Susheel Kumar
Thank you, Shalin.

Here is the Jira  https://issues.apache.org/jira/browse/SOLR-12585

On Mon, Jul 23, 2018 at 11:21 PM, Shalin Shekhar Mangar <
shalinman...@gmail.com> wrote:

> Can you please open a Jira issue? I don't think we handle DNS problems very
> well during startup. Thanks.
>
> On Tue, Jul 24, 2018 at 2:31 AM Susheel Kumar 
> wrote:
>
> > Something messed up with DNS, which resulted in an unknown host exception
> > for one of the machines in our env and caused Solr to throw the above
> > exception.
> >
> >  Eric,  I have the Solr configured using service installation script and
> > the ZK_HOST entry in
> > solr.in.sh="server1:2181,server2:2181,server3:2181/collection"
> > and after removing the server1 from above, was able to start Solr
> otherwise
> > it was throwing above exception.
> >
> > Thnx
> >
> >
> > On Mon, Jul 23, 2018 at 4:20 PM, Erick Erickson  >
> > wrote:
> >
> > > And how do you start Solr? Do you use the entire 3-node ensemble
> address?
> > >
> > > On Mon, Jul 23, 2018 at 12:55 PM, Michael Braun 
> > wrote:
> > > > Per the exception, this looks like a network / DNS resolution issue,
> > > > independent of Solr and Zookeeper code:
> > > >
> > > > Caused by: org.apache.solr.common.SolrException:
> > > > java.net.UnknownHostException: ditsearch001.es.com: Name or service
> > not
> > > > known
> > > >
> > > > Is this address actually resolvable at the time?
> > > >
> > > > On Mon, Jul 23, 2018 at 3:46 PM, Susheel Kumar <
> susheel2...@gmail.com>
> > > > wrote:
> > > >
> > > >> In usual circumstances when one Zookeeper goes down while others 2
> are
> > > up,
> > > >> Solr continues to operate but when one of the ZK machine was not
> > > reachable
> > > >> with ping returning below results, Solr count't starts.  See stack
> > trace
> > > >> below
> > > >>
> > > >> ping: cannot resolve ditsearch001.es.com: Unknown host
> > > >>
> > > >>
> > > >> Setup: Solr 6.6.2 and Zookeeper 3.4.10
> > > >>
> > > >> I had to remove this server name from the ZK_HOST list (solr.in.sh)
> > in
> > > >> order to get Solr started. Ideally whatever issue is there as far as
> > > >> majority is there, Solr should get started.
> > > >>
> > > >> Has any one noticed this issue?
> > > >>
> > > >> Thnx
> > > >>
> > > >> 2018-07-23 15:30:47.218 INFO  (main) [   ] o.e.j.s.Server
> > > >> jetty-9.3.14.v20161028
> > > >>
> > > >> 2018-07-23 15:30:47.817 INFO  (main) [   ]
> o.a.s.s.SolrDispatchFilter
> > > ___
> > > >> _   Welcome to Apache Solr™ version 6.6.2
> > > >>
> > > >> 2018-07-23 15:30:47.829 INFO  (main) [   ]
> o.a.s.s.SolrDispatchFilter
> > /
> > > __|
> > > >> ___| |_ _   Starting in cloud mode on port 8080
> > > >>
> > > >> 2018-07-23 15:30:47.830 INFO  (main) [   ]
> o.a.s.s.SolrDispatchFilter
> > > \__
> > > >> \/ _ \ | '_|  Install dir: /opt/solr
> > > >>
> > > >> 2018-07-23 15:30:47.861 INFO  (main) [   ]
> o.a.s.s.SolrDispatchFilter
> > > >> |___/\___/_|_|Start time: 2018-07-23T15:30:47.832Z
> > > >>
> > > >> 2018-07-23 15:30:47.863 INFO  (main) [   ]
> o.a.s.s.StartupLoggingUtils
> > > >> Property solr.log.muteconsole given. Muting ConsoleAppender named
> > > CONSOLE
> > > >>
> > > >> 2018-07-23 15:30:47.929 INFO  (main) [   ]
> o.a.s.c.SolrResourceLoader
> > > Using
> > > >> system property solr.solr.home: /app/solr/data
> > > >>
> > > >> 2018-07-23 15:30:48.037 ERROR (main) [   ]
> o.a.s.s.SolrDispatchFilter
> > > Could
> > > >> not start Solr. Check solr/home property and the logs
> > > >>
> > > >> 2018-07-23 15:30:48.235 ERROR (main) [   ] o.a.s.c.SolrCore
> > > >> null:org.apache.solr.common.SolrException: Error occurred while
> > loading
> > > >> solr.xml from zookeeper
> > > >>
> > > >> at
> > > >> org.apache.solr.servlet.SolrDispatchFilter.loadNodeConfig(
> > > >> SolrDispatchFilter.java:270)
> > > >>
> > > >> at
> > > >> org.apache.solr.servlet.SolrDispatchFilter.createCoreContainer(
> > > >> SolrDispatchFilter.java:242)
> > > >>
> > > >> at
> > > >> org.apache.solr.servlet.SolrDispatchFilter.init(
> > > >> SolrDispatchFilter.java:173)
> > > >>
> > > >> at
> > > >> org.eclipse.jetty.servlet.FilterHolder.initialize(
> > > FilterHolder.java:137)
> > > >>
> > > >> at
> > > >> org.eclipse.jetty.servlet.ServletHandler.initialize(
> > > >> ServletHandler.java:873)
> > > >>
> > > >> at
> > > >> org.eclipse.jetty.servlet.ServletContextHandler.startContext(
> > > >> ServletContextHandler.java:349)
> > > >>
> > > >> at
> > > >> org.eclipse.jetty.webapp.WebAppContext.startWebapp(
> > > >> WebAppContext.java:1404)
> > > >>
> > > >> at
> > > >> org.eclipse.jetty.webapp.WebAppContext.startContext(
> > > >> WebAppContext.java:1366)
> > > >>
> > > >> at
> > > >> org.eclipse.jetty.server.handler.ContextHandler.
> > > >> doStart(ContextHandler.java:778)
> > > >>
> > > >> at
> > > >> org.eclipse.jetty.servlet.ServletContextHandler.doStart(
> > > >> ServletContextHandler.java:262)
> > > >>
> > > >> at
> > > >> 

Re: Question regarding searching Chinese characters

2018-07-24 Thread Amanda Shuman
Hi Tomoko,

Thanks so much for this explanation - I did not even know this was
possible! I will try it out but I have one question: do all I need to do is
modify the settings from smartChinese to the ones you posted here:

<analyzer>
  <charFilter class="solr.ICUNormalizer2CharFilterFactory"/>
  <tokenizer class="solr.HMMChineseTokenizerFactory"/>
  <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
</analyzer>

Or do I need to still do something with the SmartChineseAnalyzer? I did not
quite understand this in your first message:

" I think you need two steps if you want to use HMMChineseTokenizer
correctly.

1. transform all traditional characters to simplified ones and save to
temporary files.
I do not have clear idea for doing this, but you can create a Java
program that calls Lucene's ICUTransformFilter
2. then, index to Solr using SmartChineseAnalyzer."

My understanding is that with the new settings you posted, I don't need to
do these steps. Is that correct? Otherwise, I don't really know how to do
step 1 with the java program

Thanks!
Amanda


--
Dr. Amanda Shuman
Post-doc researcher, University of Freiburg, The Maoist Legacy Project

PhD, University of California, Santa Cruz
http://www.amandashuman.net/
http://www.prchistoryresources.org/
Office: +49 (0) 761 203 4925


On Fri, Jul 20, 2018 at 8:03 PM, Tomoko Uchida  wrote:

> Yes, while traditional - simplified transformation would be out of the
> scope of Unicode normalization,
> you would like to add ICUNormalizer2CharFilterFactory anyway :)
>
> Let me refine my example settings:
>
> <analyzer>
>   <charFilter class="solr.ICUNormalizer2CharFilterFactory"/>
>   <tokenizer class="solr.HMMChineseTokenizerFactory"/>
>   <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
> </analyzer>
>
> Regards,
> Tomoko
>
>
> 2018年7月21日(土) 2:54 Alexandre Rafalovitch :
>
> > Would  ICUNormalizer2CharFilterFactory do? Or at least serve as a
> > template of what needs to be done.
> >
> > Regards,
> >Alex.
> >
> > On 20 July 2018 at 12:40, Walter Underwood 
> wrote:
> > > Looks like we need a charfilter version of the ICU transforms. That
> > could run before the tokenizer.
> > >
> > > I’ve never built a charfilter, but it seems like this would be a good
> > first project for someone who wants to contribute.
> > >
> > > wunder
> > > Walter Underwood
> > > wun...@wunderwood.org
> > > http://observer.wunderwood.org/  (my blog)
> > >
> > >> On Jul 20, 2018, at 8:24 AM, Tomoko Uchida <
> > tomoko.uchida.1...@gmail.com> wrote:
> > >>
> > >> Exactly. More concretely, the starting point is: replacing your
> analyzer
> > >>
> > >>  > class="org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer"/>
> > >>
> > >> to
> > >>
> > >> 
> > >>  
> > >>   > >> id="Traditional-Simplified"/>
> > >> 
> > >>
> > >> and see if the results are as expected. Then research another filters
> if
> > >> your requirements is not met.
> > >>
> > >> Just a reminder: HMMChineseTokenizerFactory do not handle traditional
> > >> characters as I noted previous in post, so ICUTransformFilterFactory
> is
> > an
> > >> incomplete workaround.
> > >>
> > >> 2018年7月21日(土) 0:05 Walter Underwood :
> > >>
> > >>> I expect that this is the line that does the transformation:
> > >>>
> > >>>> >>> id="Traditional-Simplified"/>
> > >>>
> > >>> This mapping is a standard feature of ICU. More info on ICU
> transforms
> > is
> > >>> in this doc, though not much detail on this particular transform.
> > >>>
> > >>> http://userguide.icu-project.org/transforms/general
> > >>>
> > >>> wunder
> > >>> Walter Underwood
> > >>> wun...@wunderwood.org
> > >>> http://observer.wunderwood.org/  (my blog)
> > >>>
> >  On Jul 20, 2018, at 7:43 AM, Susheel Kumar 
> > >>> wrote:
> > 
> >  I think so.  I used the exact as in github
> > 
> >   >  positionIncrementGap="1" autoGeneratePhraseQueries="false">
> >  
> >    
> >    
> > > class="edu.stanford.lucene.analysis.CJKFoldingFilterFactory"/>
> > > >>> id="Traditional-Simplified"/>
> > > >>> id="Katakana-Hiragana"/>
> >    
> > >  hiragana="true" katakana="true" hangul="true" outputUnigrams="true"
> />
> >  
> >  
> > 
> > 
> > 
> >  On Fri, Jul 20, 2018 at 10:12 AM, Amanda Shuman <
> > amanda.shu...@gmail.com
> > 
> >  wrote:
> > 
> > > Thanks! That does indeed look promising... This can be added on top
> > of
> > > Smart Chinese, right? Or is it an alternative?
> > >
> > >
> > > --
> > > Dr. Amanda Shuman
> > > Post-doc researcher, University of Freiburg, The Maoist Legacy
> > Project
> > > 
> > > PhD, University of California, Santa Cruz
> > > http://www.amandashuman.net/
> > > http://www.prchistoryresources.org/
> > > Office: +49 (0) 761 203 4925
> > >
> > >
> > > On Fri, Jul 20, 2018 at 3:11 PM, Susheel Kumar <
> > susheel2...@gmail.com>
> > > wrote:
> > >
> > >> I think CJKFoldingFilter will work for you.  I put 舊小說 in index
> and
> > >>> then
> > >> each of A, B or C or D in query and they seems to be matching and
> > CJKFF
> > > is
> > >> transforming the 舊 to 旧
> > >>
> > >> On Fri, Jul 20, 

Solr Optimization Failure due to Timeout Waiting for Server - How to Address

2018-07-24 Thread THADC
Hi,

We have recently been performing a bulk reindexing against a large database
of ours. At the end of reindexing all documents we successfully perform a
CloudSolrClient.commit(). The entire reindexing process takes around 9
hours. This is solr 7.3, by the way..

Anyway, immediately after the commit, we execute a
CloudSolrClient.optimize(), but immediately receive a "SolrServerException:
Timeout occurred while waiting response from server at" (followed by URL of
this collection).

We have never had an issue with this against bulk reindexes of smaller
databases (most are much smaller and reindexing takes only 10-15 minutes
with those). The other difference with this environment is that the
reindexing is performed across multiple threads (calls to solrCloud server
from a multi-threaded setup) for performance reasons, rather than a single
thread. However, the reindexing process itself is completely successful; it's
just the subsequent optimization that fails.

Is there a simple way to avoid this timeout failure issue? Could it be a
matter of retrying until the optimize() request is successful (that is,
after a reasonable number of attempts) rather than just trying once and
quitting? Any and all ideas are greatly appreciated. Thanks!





Re: Exception when processing streaming expression

2018-07-24 Thread Christian Spitzlay
Hi,


> On 15.06.2018 at 14:54, Christian Spitzlay
> wrote:
> 
> 
>> On 15.06.2018 at 01:23, Joel Bernstein wrote:
>> 
>> We have to check the behavior of the innerJoin. I suspect that it's closing
>> the second stream when the first stream is finished. This would cause a
>> broken pipe with the second stream. The export handler has specific code
>> that eats the broken pipe exception so it doesn't end up in the logs. The
>> select hander does not have this code.
> 
> Ah, I see.  The stack trace in my original mail has the "broken pipe" message:
> 
> [...]
> Caused by: java.io.IOException: Broken pipe
>   at java.base/sun.nio.ch.FileDispatcherImpl.writev0(Native Method)
> [...]



Should I open a Jira ticket about the innerJoin issue?




>> In general you never want to use the select handler and set the rows to
>> such a big number. If you have that many rows you'll want to use the export
>> handler, which is designed to export the entire result set.
> 
> 
> We started out with the export handler but we are updating documents using 
> streaming expressions and we had fields that had types 
> that do not support docValues, according to the documentation at
> https://lucene.apache.org/solr/guide/7_3/docvalues.html#enabling-docvalues
> 
> We switched to the select handler in some places and it worked. 
> We set the rows parameter to a large value:
> "If you want to tell Solr to return all possible results from the query 
> without an 
> upper bound, specify rows to be 1000 or some other ridiculously 
> large value that is higher than the possible number of rows that are 
> expected."
> From:
> https://wiki.apache.org/solr/CommonQueryParameters#rows



Since we have trouble switching back to the export handler,
do you have any ideas how we could temporarily keep this exception from 
filling the solr log file when I run my code?


Christian



--  

Christian Spitzlay
Diplom-Physiker (graduate physicist),
Senior Software Developer

Tel: +49 69 / 348739116
E-Mail: christian.spitz...@biologis.com

bio.logis Genetic Information Management GmbH
Altenhöferallee 3
60438 Frankfurt am Main

Management: Prof. Dr. med. Daniela Steinberger, Dipl.-Betriebswirt Enrico Just
Registered office Frankfurt am Main, Commercial Register Frankfurt am Main, HRB 97945
VAT identification number DE293587677

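For reference, an innerJoin over two /export streams looks roughly like this (the
collection, field and sort names are invented for the sketch; both sides must be
sorted on the join key, and every exported field needs docValues):

    innerJoin(
      search(people, q="*:*",      fl="personId,name",    sort="personId asc", qt="/export"),
      search(pets,   q="type:cat", fl="personId,petName", sort="personId asc", qt="/export"),
      on="personId"
    )

Switching a side back to the default /select handler (with a large rows value)
also works, but as discussed above the join may close that stream early, and the
resulting broken-pipe exception is only swallowed by the export handler, not by
select.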



SOLR in Openshift with indexing from Hadoop

2018-07-24 Thread SOLR4189
Hi all,

We try to use SOLR cloud in openshift. We manage our Solr by StatefulSet.
All SOLR functionalities work good except indexing.
We index our docs from HADOOP by SolrJ jar that try to index to specific
Pod, but openshift blocks access to internal Pods.

In my case, separate service for external traffic to solr doesn't help
because SolrJ jar looks for pod names in zookeeper.

Does somebody encounter this problem? What can I do in this case?





Re: [EXTERNAL] Re: Facet Sorting

2018-07-24 Thread Satheesh . Akkinepally
Chris, I was trying the below method for sorting the faceted buckets but am 
seeing that the function query query($q) applies only to the score from “q” 
parameter. My solr request has a combination of q, “bq” and “bf” and it looks 
like the function query query($q) is calculating the scores only on q and not 
on the aggregate score of q, bq and bf

My solr query is something like this

q===

when I apply json facet with the below sort by score, only the scores
calculated from q seem to be applying and not the aggregate score of q, bf and bq

Am I missing anything here?

On 7/18/18, 3:45 PM, "Chris Hostetter"  wrote:

: If I want to plug in my own sorting for facets, what would be the best
: approach. I know, out of the box, solr supports sort by facet count and
: sort by alpha. I want to plug in my own sorting (say by relevancy). Is
: there a way to do that? Where should I start with if I need to write a
: Custom Facet Component?

it sounds like you're talking about the "classic" facets (using
"facet.field") where facet.sort only supports "count" (desc) and "index"
(asc)

Adding a custom sort option there would be close to impossible.

The newer "json.facets" API supports a much more robust set of options,
that includes the ability to sort on an "aggregate" function across all
documents in the bucket...

https://lucene.apache.org/solr/guide/7_4/json-facet-api.html

some of the existing sort options there might solve your need, but it's
also possible using that API to write your own ValueSourceParser plugin
that can be used to sort facets as long as it returns ValueSources that
extend "AggValueSource"

: Basically I want to plug the scores calculated in earlier steps for the
: documents matched, do some kind of aggregation of the scores of the
: documents that fall under a facet and use this aggregate score to rank

IIUC what you want is possibly something like...

curl http://localhost:8983/solr/techproducts/query -d 'q=features:lcd&rows=0&
  json.facet={
    categories:{
      type : terms,
      field : cat,
      sort : { x : desc },
      facet:{
        x : "sum(query($q))",
      }
    }
  }
'

...which will sort the buckets by the sum of the scores of every document
in that bucket (using the original query .. but you could alternatively
sort by any aggregation of the scores from any arbitrary query / document
based function)


-Hoss
http://www.lucidworks.com/