Re: Performance & CPU Usage of 6.2.1 vs 6.5.1 & above

2018-04-18 Thread mganeshs
Hello Deepak,

We are not querying when indexing is going on. The CPU graphs I shared
for 6.2.1 and 6.5.1 were captured only while we were doing batch indexing. During
that time we don't query, and no queries are being executed.

We index in batches at a rate of around 100 documents/sec, which is not
especially high. But with the same piece of code and the same config, CPU usage
is normal on 6.2.1, while on 6.5.1 it always stays above 90% or 95%.

@Solr Experts, 

From one of the threads by "Yasoob", it's mentioned:

I compared the source code for the two versions and found that different
merge functions were being used to merge the postings. In 5.4, the default
merge method of FieldsConsumer class was being used. While in 6.6, the
PerFieldPostingsFormat's merge method is being used. I checked and it looks
like this change went in Solr 6.3. So I replaced the 6.6 instance with 6.2.1
and re-indexed all the data, and it is working very well, even with the
settings I had initially used.

Is anyone else facing this issue, or has any fix for this been released in a
later build?

Keep us posted


Deepak Goel wrote
> Please post the exact results. Many a times the high cpu utilisation may
> be
> a boon as it improves query response times
> 
> On Tue, 17 Apr 2018, 13:55 mganeshs, mganeshs@ wrote:
> 
>> Regarding query times, we couldn't see big improvements. Both are more or
>> less same.
>>
>> Our main worry is that, why CPU usage is so high in 6.5.1 and above ?
>> What's
>> going wrong ?
>>
>> Is any one else facing this sort of issue ? If yes, how to bring down the
>> CPU usage? Is there any settings which we need to set ( not default one )
>> in
>> 6.5.1 ?
>>
>>
>>
>> --
>> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>>



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: How to protect middle initials during search

2018-04-18 Thread Walter Underwood
Or even better, don’t remove stopwords.

Stopwords are a technique invented for 16-bit machines, where common words made 
posting lists too long to handle.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Apr 18, 2018, at 2:20 PM, Jay Potharaju  wrote:
> 
> A is part of stopwords ...that is why it got dropped. Protected words will
> only stop it from stemming
> 
> https://lucene.apache.org/solr/guide/6_6/language-analysis.html
> 
> Thanks
> Jay Potharaju
> 
> 
> On Wed, Apr 18, 2018 at 11:35 AM, Wendy2  wrote:
> 
>> Hi fellow Users,
>> 
>> Why did Solr return "Ellington, W.R." when I did a name search for
>> "Ellington, A."?
>> I even added "A." in the protwords.txt file. The debugQuery shows that the
>> middle initial got dropped in the parsedquery.
>> How can I make Solr NOT to drop the middle initial?  Thanks for your
>> help!!
>> 
>> ==Search results
>> Ellington, A.D.
>> Ellington, R.W..
>> 
>> ===debugQuery=
>> {
>>  "responseHeader":{
>>"status":0,
>>"QTime":51,
>>"params":{
>>  "q":"\"Ellington, A.\"",
>>  "indent":"on",
>>  "fl":"audit_author.name",
>>  "wt":"json",
>>  "debugQuery":"true"}},
>>  "response":{"numFound":2,"start":0,"docs":[
>>  {
>>"audit_author.name":"Azzi, A., Clark, S.A., Ellington, R.W.,
>> Chapman, M.S."},
>>  {
>>"audit_author.name":"Ye, X., Gorin, A., Ellington, A.D., Patel,
>> D.J."}]
>>  },
>>  "debug":{
>>"rawquerystring":"\"Ellington, A.\"",
>>"querystring":"\"Ellington, A.\"",
>> 
>> "parsedquery":"(+DisjunctionMaxQuery(((entity_name_com.name:
>> ellington)^20.0)))/no_coord",
>>"parsedquery_toString":"+((entity_name_com.name:ellington)^20.0)",
>>   "QParser":"ExtendedDismaxQParser",
>> 
>> 
>> 
>> 
>> --
>> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>> 



Re: need help on search on last name + middle initial

2018-04-18 Thread Shawn Heisey
On 4/18/2018 1:12 PM, Wendy2 wrote:
>   "debug":{
> "debugQuery mode indicates that Solr dropped the ""A."" when parsing the
> query:
>   ""debug"":{
> ""rawquerystring"":""\""Ellington, A.\,
> ""querystring"":""\""Ellington, A.\,
>
> ""parsedquery"":""(+DisjunctionMaxQuery(((entity_name_com.name:ellington)^20.0)))/no_coord"",
> ""parsedquery_toString"":""+((entity_name_com.name:ellington)^20.0)"",
>""QParser"":""ExtendedDismaxQParser"", "

Very likely the period was removed by the tokenizer or a filter like
WordDelimiterFilter.  Then I would guess that the "a" was removed by a
StopFilter.

Open your admin UI, choose your index from the dropdown, and click
"Analysis."  Then choose the entity_name_com.name field from the
dropdown, and type "Ellington, A." (without the quotes) in the "Query"
side.  When the analysis completes, you will be able to tell exactly 
what each step in your analysis chain is doing to the input.
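
If you want to script that same check instead of using the UI, a quick sketch
against the field analysis handler looks like the following (the core URL is a
placeholder, and the /analysis/field parameter names should be verified against
your Solr version):

import requests

CORE_URL = "http://localhost:8983/solr/mycore"   # placeholder core

resp = requests.get(CORE_URL + "/analysis/field",
                    params={"analysis.fieldname": "entity_name_com.name",
                            "analysis.query": "Ellington, A.",
                            "wt": "json"})
resp.raise_for_status()
# The response lists the token stream after every tokenizer/filter stage, so a
# StopFilter dropping "a" (or a filter stripping the period) shows up directly.
print(resp.json())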

I would recommend NOT using a stopword filter.  Modern server hardware
usually has plenty of resources to handle indexes with stopwords still
included, and removing stopwords can cause certain search problems.

Thanks,
Shawn



Re: How to protect middle initials during search

2018-04-18 Thread Jay Potharaju
"A" is part of the stopwords list ... that is why it got dropped. Protected
words will only prevent stemming.

https://lucene.apache.org/solr/guide/6_6/language-analysis.html

Thanks
Jay Potharaju


On Wed, Apr 18, 2018 at 11:35 AM, Wendy2  wrote:

> Hi fellow Users,
>
> Why did Solr return "Ellington, W.R." when I did a name search for
> "Ellington, A."?
> I even added "A." in the protwords.txt file. The debugQuery shows that the
> middle initial got dropped in the parsedquery.
> How can I make Solr NOT to drop the middle initial?  Thanks for your
> help!!
>
> ==Search results
> Ellington, A.D.
> Ellington, R.W..
>
> ===debugQuery=
> {
>   "responseHeader":{
> "status":0,
> "QTime":51,
> "params":{
>   "q":"\"Ellington, A.\"",
>   "indent":"on",
>   "fl":"audit_author.name",
>   "wt":"json",
>   "debugQuery":"true"}},
>   "response":{"numFound":2,"start":0,"docs":[
>   {
> "audit_author.name":"Azzi, A., Clark, S.A., Ellington, R.W.,
> Chapman, M.S."},
>   {
> "audit_author.name":"Ye, X., Gorin, A., Ellington, A.D., Patel,
> D.J."}]
>   },
>   "debug":{
> "rawquerystring":"\"Ellington, A.\"",
> "querystring":"\"Ellington, A.\"",
>
> "parsedquery":"(+DisjunctionMaxQuery(((entity_name_com.name:
> ellington)^20.0)))/no_coord",
> "parsedquery_toString":"+((entity_name_com.name:ellington)^20.0)",
>"QParser":"ExtendedDismaxQParser",
>
>
>
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>


How to protect middle initials during search

2018-04-18 Thread Wendy2
Hi fellow Users,

Why did Solr return "Ellington, W.R." when I did a name search for
"Ellington, A."?  
I even added "A." in the protwords.txt file. The debugQuery shows that the
middle initial got dropped in the parsedquery.
How can I make Solr NOT drop the middle initial?  Thanks for your help!! 
 
==Search results
Ellington, A.D.
Ellington, R.W..

===debugQuery=
{
  "responseHeader":{
"status":0,
"QTime":51,
"params":{
  "q":"\"Ellington, A.\"",
  "indent":"on",
  "fl":"audit_author.name",
  "wt":"json",
  "debugQuery":"true"}},
  "response":{"numFound":2,"start":0,"docs":[
  {
"audit_author.name":"Azzi, A., Clark, S.A., Ellington, R.W.,
Chapman, M.S."},
  {
"audit_author.name":"Ye, X., Gorin, A., Ellington, A.D., Patel,
D.J."}]
  },
  "debug":{
"rawquerystring":"\"Ellington, A.\"",
"querystring":"\"Ellington, A.\"",
   
"parsedquery":"(+DisjunctionMaxQuery(((entity_name_com.name:ellington)^20.0)))/no_coord",
"parsedquery_toString":"+((entity_name_com.name:ellington)^20.0)",
   "QParser":"ExtendedDismaxQParser",




--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: CdcrReplicator Forwarder not working on some shards

2018-04-18 Thread Susheel Kumar
I was able to resolve this issue by starting/stopping the cdcr process a couple
of times until all shard leaders started forwarding updates...
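
In case it helps anyone else, the same workaround can be scripted roughly like
this (a sketch only: the collection name and host are placeholders, and
START/STOP/ERRORS are the same cdcr actions mentioned in this thread):

import time
import requests

CDCR_URL = "http://localhost:8983/solr/COLL/cdcr"   # placeholder collection

def cdcr(action):
    resp = requests.get(CDCR_URL, params={"action": action, "wt": "json"})
    resp.raise_for_status()
    return resp.json()

for attempt in range(1, 6):
    cdcr("STOP")
    time.sleep(5)
    cdcr("START")
    time.sleep(30)   # give the cdcr-replicator thread pools time to spawn
    print("attempt %d, ERRORS: %s" % (attempt, cdcr("ERRORS")))
    # Stop retrying once ERRORS no longer reports consecutiveErrors on any
    # shard (inspect the response layout for your version before automating).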

Thnx

On Tue, Apr 17, 2018 at 3:20 PM, Susheel Kumar 
wrote:

> Hi Amrit,
>
> The cdcr?action=ERRORS is returning consecutiveErrors=1 on the shards
> which are not forwarding updates.  Any clue does that gives?
>
> 
> 1
> 1
> 0
> 
> bad_request
> 
> 
>
>
>
>
> On Tue, Apr 17, 2018 at 1:22 PM, Amrit Sarkar 
> wrote:
>
>> Susheel,
>>
>> At the time of core reload, logs must be complaining or atleast pointing
>> to
>> some direction. Each leader of shard is responsible to spawn a threadpool
>> for cdcr replicator to get the data over.
>>
>> Amrit Sarkar
>> Search Engineer
>> Lucidworks, Inc.
>> 415-589-9269
>> www.lucidworks.com
>> Twitter http://twitter.com/lucidworks
>> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
>> Medium: https://medium.com/@sarkaramrit2
>>
>> On Tue, Apr 17, 2018 at 9:04 PM, Susheel Kumar 
>> wrote:
>>
>> > Hi,
>> >
>> > Has anyone gone thru this issue where few shard leaders are forwarding
>> > updates to their counterpart leaders in target cluster while some of the
>> > shards leaders are not forwarding the updates.
>> >
>> > on Solr 6.6,  4 of the shards logs I see below entries and their
>> > counterpart in target are getting updated but for other 4 shards I don't
>> > below entries and neither being replicated to target.
>> >
>> > Any suggestion on how / what can be done to start cdcr-replicator
>> threads
>> > on other shards?
>> >
>> > 2018-04-17 15:26:38.394 INFO
>> > (cdcr-replicator-24-thread-6-processing-n:dc2prsrcvap0049.
>> > whc.dc02.us.adp:8080_solr)
>> > [   ] o.a.s.h.CdcrReplicator Forwarded 0 updates to target COLL
>> > 2018-04-17 15:26:39.394 INFO
>> > (cdcr-replicator-24-thread-7-processing-n:dc2prsrcvap0049.
>> > whc.dc02.us.adp:8080_solr)
>> > [   ] o.a.s.h.CdcrReplicator Forwarded 0 updates to target COLL
>> >
>> > Thanks
>> > Susheel
>> >
>>
>
>


PF, PF2, PF3 clauses missing in solr7 with query-time synonyms?

2018-04-18 Thread Elizabeth Haubert
I'm seeing pf and pf3 clauses fail to generate in long queries containing
synonyms.  Wondering if anyone else has run into this, or if it needs to be
submitted as a bug in Jira.   It is a showstopper problem for the current
project, as the pf and pf3 were pretty heavily tuned.

Using Solr 7.1; all fields are using the following type:

With query-time synonyms:

(fieldType definition not preserved in the archive)

Without query-time synonyms:

(fieldType definition not preserved in the archive)

Synonyms file is pretty long, so I'll just include the relevant bits for an
example:

allergic, hypersensitive
aspirin, acetylsalicylic acid
dog, canine, canis familiris, k 9
rat, rattus


The problem seems to occur when part of the query has a synonym, but the
whole phrase is not.  Whitespace added to piece out what is going on;
believe any parentheses errors are due to my tinkering around.  Beyond that
though, this is as from Solr.  Slop has been tinkered with to identify
PF/PF2/PF3 clauses where PF fields have a slop ending in 0, pf2 ending in
1, pf3 ending in 2 eg ~10, ~11, ~12, etc.

=
Example 1:  "aspirin dose in rats"
==

With query-time synonyms:
===
/// Q terms generate as expected ///
+kw1:\"acetylsalicylic acid\" kw1:aspirin)^100.0 |
(species:\"acetylsalicylic acid\" species:aspirin) |
(keywords_bm25_no_norms:\"acetylsalicylic acid\"
keywords_bm25_no_norms:aspirin)^50.0 | (description:\"acetylsalicylic
acid\" description:aspirin) | (kw1ranked:\"acetylsalicylic acid\"
kw1ranked:aspirin)^100.0 | (text:\"acetylsalicylic acid\" text:aspirin) |
(title:\"acetylsalicylic acid\" title:aspirin)^100.0 |
(keywordsranked_bm25_no_norms:\"acetylsalicylic acid\"
keywordsranked_bm25_no_norms:aspirin)^50.0 | (authors:\"acetylsalicylic
acid\" authors:aspirin))~0.4 ((Synonym(kw1:dosage kw1:dose kw1:dose
kw1:dose))^100.0 | Synonym(species:dosage species:dose species:dose
species:dose) | (Synonym(keywords_bm25_no_norms:dosage
keywords_bm25_no_norms:dose keywords_bm25_no_norms:dose
keywords_bm25_no_norms:dose))^50.0 | Synonym(description:dosage
description:dose description:dose description:dose) |
(Synonym(kw1ranked:dosage kw1ranked:dose kw1ranked:dose
kw1ranked:dose))^100.0 | Synonym(text:dosage text:dose text:dose text:dose)
| (Synonym(title:dosage title:dose title:dose title:dose))^100.0 |
(Synonym(keywordsranked_bm25_no_norms:dosage
keywordsranked_bm25_no_norms:dose keywordsranked_bm25_no_norms:dose
keywordsranked_bm25_no_norms:dose))^50.0 | Synonym(authors:dosage
authors:dose authors:dose authors:dose))~0.4 ((Synonym(kw1:rat
kw1:rattu))^100.0 | Synonym(species:rat species:rattu) |
(Synonym(keywords_bm25_no_norms:rat keywords_bm25_no_norms:rattu))^50.0 |
Synonym(description:rat description:rattu) | (Synonym(kw1ranked:rat
kw1ranked:rattu))^100.0 | Synonym(text:rat text:rattu) | (Synonym(title:rat
title:rattu))^100.0 | (Synonym(keywordsranked_bm25_no_norms:rat
keywordsranked_bm25_no_norms:rattu))^50.0 | Synonym(authors:rat
authors:rattu))~0.4)~3)

/// PF and PF2 are missing. ///
 () () () () ()

/// This is actually PF3 with a missing ? where the stopword 'in' belonged.
///
 ((title:\"(dosage dose dose dose) (rattu rat)\"~22)^1000.0 |
(keywordsranked_bm25_no_norms:\"(dosage dose dose dose) (rattu
rat)\"~22)^1000.0 | (text:\"(dosage dose dose dose) (rattu
rat)\"~22)^100.0)~0.4 ((keywords_bm25_no_norms:\"(dosage dose dose dose)
(rattu rat)\"~12)^500.0 | (kw1ranked:\"(dosage dose dose dose) (rattu
rat)\"~12)^100.0 | (kw1:\"(dosage dose dose dose) (rattu
rat)\"~12)^100.0)~0.4,product(max(10.0/(3.16E-11*float(ms(const(14560),date(dateint)))+6.0),int(documentdatefix)),scale(map(int(rank),-1.0,-1.0,const(0.5),null),0.5,2.0)))",

With index-time synonyms:
===

/// Q ///
 "boost(+kw1:aspirin)^100.0 | species:aspirin |
(keywords_bm25_no_norms:aspirin)^50.0 | description:aspirin |
(kw1ranked:aspirin)^100.0 | text:aspirin | (title:aspirin)^100.0 |
(keywordsranked_bm25_no_norms:aspirin)^50.0 | authors:aspirin)~0.4
((kw1:dose)^100.0 | species:dose | (keywords_bm25_no_norms:dose)^50.0 |
description:dose | (kw1ranked:dose)^100.0 | text:dose | (title:dose)^100.0
| (keywordsranked_bm25_no_norms:dose)^50.0 | authors:dose)~0.4
((kw1:rats)^100.0 | species:rats | (keywords_bm25_no_norms:rats)^50.0 |
description:rats | (kw1ranked:rats)^100.0 | text:rats | (title:rats)^100.0
| (keywordsranked_bm25_no_norms:rats)^50.0 | authors:rats)~0.4)~3)
/// PF  ///
  ((title:\"aspirin dose ? rats\"~20)^5000.0 |
(keywordsranked_bm25_no_norms:\"aspirin dose ? rats\"~20)^5000.0 |
(keywords_bm25_no_norms:\"aspirin dose ? rats\"~20)^1500.0 |
(text:\"aspirin dose ? rats\"~20)^1000.0)~0.4 ((kw1ranked:\"aspirin dose ?
rats\"~10)^5000.0 | (kw1:\"aspirin dose ? rats\"~10)^500.0)~0.4
((authors:\"aspirin dose ? rats\")^250.0 | description:\"aspirin dose ?
rats\")~0.4

/// PF2 ///
  ((text:\"aspirin dose ? rats\"~100)^500.0)~0.4 (authors:\"aspirin
dose\"~11 | species:\"aspirin dose\"~11)~0.4

/// PF3 ///
(((title:\"aspirin dose\"~22)^1000.0 |

need help on search on last name + middle initial

2018-04-18 Thread Wendy2
Hi Solr experts:

How can I make sure Solr doesn't drop the middle initial when I do a name
search?
I did a search with double quotes for "Ellington, A.", but the Solr parser
dropped the middle initial, so I got both of these back:
I even tried keeping A. in the protwords.txt file, but it didn't work.
Any workaround or suggestions?  Thanks!!!

*RESULTS:*
 "response":{"numFound":2,"start":0,"docs":[
  {
"audit_author.name":"Azzi, A., Clark, S.A., Ellington, R.W.,
Chapman, M.S."},
  {
"audit_author.name":"Ye, X., Gorin, A., Ellington, A.D., Patel,
D.J."}]
  },
debugQuery mode indicates that Solr dropped the "A." when parsing the query:

  "debug":{
    "rawquerystring":"\"Ellington, A.\"",
    "querystring":"\"Ellington, A.\"",
    "parsedquery":"(+DisjunctionMaxQuery(((entity_name_com.name:ellington)^20.0)))/no_coord",
    "parsedquery_toString":"+((entity_name_com.name:ellington)^20.0)",
    "QParser":"ExtendedDismaxQParser",



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


RE: Specialized Solr Application

2018-04-18 Thread Allison, Timothy B.
To be Waldorf to Erick's Statler (if I may), lots of things can go wrong during 
content extraction.[1]  I had two big concerns when I heard of your task:



1) image-only PDFs, which can parse without problem, but which might yield 0 
content.

2) emails (see, e.g. SOLR-12048)



It sounds like you're taking care of 1), and 2) doesn't apply because you're 
using Tika (although note that we've made some major changes to our RFC822 
parsing in the upcoming Tika 1.18).  So, no need to read further! 



In general, surprising things can happen during the content extraction phase, 
and unless you are monitoring/measuring/evaluating what's extracted, your 
search system can yield results that are downright dangerous if you assume that 
the full stack is actually working.



I worked with one batch of documents where HALF of the Excel files weren't 
being parsed.  They all had the same quirk which caused an exception in POI, 
and because they were inside zip files, and Tika's legacy/default behavior is 
to silently ignore embedded exceptions -- the owners of the search system had 
_no idea_ that they'd never be able to find those documents.  At one point, 
Tika wasn't extracting sdt form fields in docx or form fields in pdf...at 
all...imagine if your document set was a bunch of docx with sdts or pdfs with form 
fields...  We just fixed a bug to pull text from joined shapes in ppt...we've 
been missing that text for years!



Those are a few horror stories, I have many, and there are countless more yet 
to be discovered!



The goal of tika-eval[2] is to allow you to see if things don't look right 
based on your expectations.[3]  It doesn't help with indexing at all per se, 
but it can allow you to see odd things and 1) change your processing pipeline 
(add OCR where necessary or use an alternate parser for some file formats) or 
2) raise an issue to fix bugs in the content extraction libraries, or at least 
3) recognize that you aren't getting reliable content out of ~x% of your 
documents.  If manually checking PDFs to determine whether or not to run OCR is 
a hassle, run tika-eval and identify those docs that have a low word count/page 
ratio.
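
If you want a back-of-the-envelope version of that words-per-page check before
wiring up tika-eval, a sketch like this over already-extracted text files will
flag the suspects.  The pages.csv sidecar (filename,num_pages per line) and the
threshold are assumptions -- use whatever page-count source and cutoff your
pipeline actually has.

import csv

def words_per_page(txt_path, num_pages):
    with open(txt_path, encoding="utf-8", errors="replace") as f:
        return len(f.read().split()) / max(num_pages, 1)

with open("pages.csv", newline="") as f:
    for name, pages in csv.reader(f):
        ratio = words_per_page(name, int(pages))
        if ratio < 10:   # suspiciously little text per page
            print("%s: %.1f words/page -- candidate for OCR" % (name, ratio))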



Couple of handfuls of Welsh documents; I thought we only had English...what?!  
No, that's just bad content extraction (character mapping failure in the PDF or 
other mojibake).  Average token length in this document is 1, and it is 
supposed to be English...what?  No, that's the spacing problem that Erick 
Mentioned.  Average words per page in some pdfs = 2?  No, that's an image-only 
pdf...that needs to go through OCR.  Ratio of out of vocabulary words = 
90%...no that's character encoding mojibake.





> I was recently indexing a set of about

13,000 documents and at one point, a document caused solr to crash.  I had to 
restart it.  I removed the offending document, and restarted the indexing.  It 
then eventually happened again, so I did the same thing.



Crash, crash like OOM?  If you're able to share that with Tika or PDFBox, we 
can _try_ to fix the underlying bug if there is one.  Sometimes, though, our 
parsers require far more memory than is ideal.



If you have questions about tika-eval, please ask over on the Tika list.  
Apologies for too many words.  Thank you, all, for this discussion!



Cheers,



   Tim





P.S. On metadata author vs. creator, for a good while, we've been trying to 
standardize to Dublin core -- dc:creator.  If you see areas for improvement, 
let us know.



[1] https://www.slideshare.net/TimAllison6/haystack-2018-apachetikaevaltallison

[2] https://wiki.apache.org/tika/TikaEval

[3] Obviously, without ground truth, there is no automated way to detect the 
sdt/form field/grouped text box problems, but tika-eval does what it can to 
identify and count:

a) catastrophic problems (oom, permanent hang)

b) catchable exceptions

c) corrupted text

d) nearly entirely missing text






Re: NER question

2018-04-18 Thread Steve Rowe
Hi Alexey,

Did you see my response to your “Solr OpenNLP named entity extraction” thread?  
I think I’ve answered your questions.

--
Steve
www.lucidworks.com

> On Apr 18, 2018, at 4:28 AM, Alexey Ponomarenko  
> wrote:
> 
> Hi, I have a question regarding NER
> 
> https://stackoverflow.com/questions/49894727/using-named-entity-extraction-in-solr-7-3
> 
> can you help me?



Re: Specialized Solr Application

2018-04-18 Thread Erick Erickson
Terry:

If your process works, then it works and there's no real reason to change.

I was commingling the structure of the content with the metadata. You're
right that the content doesn't really have any useful structure. Sometimes
you can get some useful information out of the metadata, particularly
metadata that doesn't require a user action (last_modified and the like,
sometimes).

Whether that effort is worth it in your use-case is, of course, a valid
question.

bq: On OCRs, I presume you're referring to PDFs that are images?

No, I was referring to scanned images. I once had to try to index
a document (I wouldn't lie to you) that was a scanned image of
a "family tree" where the most remote ancestor was written
vertically on the trunk, and each branch had a descendant
written at various angles. The resulting scanned image
was run through an OCR program that produces...well, let's
just say little of value ;)..

Best,
Erick

On Wed, Apr 18, 2018 at 8:10 AM, Terry Steichen  wrote:
> Thanks, Erick.  What I don't understand that "rich text documents" (aka,
> PDF and DOC) lack any internal structure (unlike JSON, XML, etc.), so
> there's not much potential in trying to get really precise in parsing
> them.  Or am I overlooking something here?
>
> And, as you say, the metadata of such documents is not somewhat variable
> (some PDFs have a field and others don't), which suggests that you may
> not want the parser to be rigid.
>
> Moreover, as I noted earlier, most of the metadata fields of such
> documents seem to be of little value (since many document authors are
> not consistent in creating that information).
>
> I take your point about non-optimum Tika workload distribution - but I
> am only occasionally doing indexing so I don't think that would be a
> significant factor (for me, at least).
>
> A point of possible interest: I was recently indexing a set of about
> 13,000 documents and at one point, a document caused solr to crash.  I
> had to restart it.  I removed the offending document, and restarted the
> indexing.  It then eventually happened again, so I did the same thing.
> It then completed indexing successfully.  IOW, out of 13,000 documents
> there were two that caused a crash, but once they were removed, the
> other 12,998 were parsed/indexed fine.
>
> On OCRs, I presume you're referring to PDFs that are images?  Part of
> our team uses Acrobat Pro to screen and convert such documents (which
> are very common in legal circles) so they can be searched.  Or did you
> mean something else?
>
> Thanks for the insights.  And the long answers (from you, Tim and
> Charlie).  These are helping me (and I hope others on the list) to
> better understand some of the nuances of effectively implementing
> (small-scale) solr.
>
>
> On 04/17/2018 10:35 PM, Erick Erickson wrote:
>> Terry:
>>
>> Tika has a horrible problem to deal with and it's approaching a
>> miracle that it does so well ;)
>>
>> Let's take a PDF file. Which vendor's version? From what _decade_? Did
>> that vendor adhere
>> to the spec? Every spec has gray areas so even good-faith efforts can
>> result in some version/vendor
>> behaving slightly differently from the other.
>>
>> And what about Word .vs. PDF? One might have "last_modified" and the
>> other might have
>> "last_edited" to mean the same thing. You mentioned that you're aware
>> of this, you can make
>> it more useful if you have finer-grained control over the ETL process.
>>
>> You say "As I understand it, Tika is integrated with Solr"  which is
>> correct, you're talking about
>> the "Extracting Request Handler". However that has a couple of
>> important caveats:
>>
>> 1> It does the best it can. But Tika has a _lot_ of tuning options
>> that allow you to get down-and-dirty
>> with the data you're indexing. You mentioned that precision is
>> important. You can do some interesting
>> things with extracting specific fields from specific kinds of
>> documents and making use of them. The
>> "last_modified" and "last_edited" fields above are an example.
>>
>> 2> It loads the work on a single Solr node. So the very expensive
>> process of extracting data from the
>> semi-structure document is all on the Solr node. If you use Tika in a
>> client-side program you can
>> parallelize the extraction and get through your indexing much more quickly.
>>
>> 3> Tika can occasionally get its knickers in a knot over some
>> particular document. That'll also bring
>> down the Solr instance.
>>
>> Here's a blog that can get you started doing client-side parsing,
>> ignore the RDBMS bits.
>> https://lucidworks.com/2012/02/14/indexing-with-solrj/
>>
>> I'll leave Tim to talk about tika-eval ;) But the general problem is
>> that the extraction process can
>> result in garbage, lots of garbage. OCR is particularly prone to
>> nonsense. PDFs can be tricky,
>> there's this spacing parameter that, depending on it's setting can
>> render e r i c k as 5 separate
>> letters or my name.

Re: Writing config directly to zookeeper

2018-04-18 Thread Walter Underwood
I didn’t want to install Solr just so Jenkins could use one script. The Python 
is standalone.

I was using the zkCli tools, which were just not all that well documented. I 
never could find a description of exactly which files were copied where. The 
solr.xml directory structure had /conf/, but it wasn’t clear what 
was expected for the bootstrap commands.

Fetching the zk information from a running cluster is also less error prone. 
Don’t need to keep Jenkins configured the same as the cluster.

Oh, I skipped a step, the Python script also uploads solr.xml.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Apr 18, 2018, at 9:14 AM, Erick Erickson  wrote:
> 
> There are some perhaps easier ways to manipulate ZK in the "bin/solr"
> script if you haven't seen it
> 
> bin/solr zk -help
> 
> Best,
> Erick
> 
> On Wed, Apr 18, 2018 at 8:30 AM, Arturas Mazeika  wrote:
>> Hi Walter,
>> 
>> Thanks for the message. Would you care to share the tool with us? I would
>> be interested.. Or have you shared it already?
>> 
>> Cheers,
>> Arturas
>> 
>> On Wed, Apr 18, 2018 at 5:09 PM, Walter Underwood 
>> wrote:
>> 
>>> I wrote a Python tool to do this. I use the kazoo package to talk to
>>> Zookeeper. It starts with the load balancer URL to Solr.
>>> 
>>> 1. Get cluster status.
>>> 2. Parse out the Zookeeper config string including chroot.
>>> 3. Connect to Zookeeper.
>>> 4. Copy the config to the location described in Shawn’s message.
>>> 5. Send linkconfig command to the cluster, just to be sure.
>>> 6. Reload the collection with an async command.
>>> 7. Ping the cluster until the reload is successful on every node.
>>> 8. Optionally, rebuild the suggester on each node.
>>> 
>>> The actual location of the config in Zookeeper is undocumented, as far as
>>> I could tell. I used the Solr ZK CLI, then reverse engineered where it put
>>> stuff.
>>> 
>>> The docs need a “Zookeeper file organization” chapter with this info.
>>> 
>>> Also, it would be nice if the ZKHOST info was available pre-parsed in
>>> cluster status.
>>> 
>>> wunder
>>> Walter Underwood
>>> wun...@wunderwood.org
>>> http://observer.wunderwood.org/  (my blog)
>>> 
 On Apr 17, 2018, at 8:20 PM, Shawn Heisey  wrote:
 
 On 4/17/2018 8:54 PM, Aristedes Maniatis wrote:
> Is there any difference between using the tools supplied with Solr to
>>> write configuration to Zookeeper or just writing directly to our Zookeeper
>>> cluster?
> 
> We have tooling that makes it much easier to write directly to ZK
>>> rather than having to use yet another tool to do it.
 
 As long as it ends up in the correct path in the ZK structure, it
>>> doesn't matter how it gets there.
 
 The /configs/ location (where  is the config name) should have
>>> the same contents that would normally be found in a conf directory if it
>>> were standalone Solr and not using the standalone configsets feature.
 
 Thanks,
 Shawn
 
>>> 
>>> 



Re: Writing config directly to zookeeper

2018-04-18 Thread Erick Erickson
There are some perhaps easier ways to manipulate ZK in the "bin/solr"
script if you haven't seen it

bin/solr zk -help

Best,
Erick

On Wed, Apr 18, 2018 at 8:30 AM, Arturas Mazeika  wrote:
> Hi Walter,
>
> Thanks for the message. Would you care to share the tool with us? I would
> be interested.. Or have you shared it already?
>
> Cheers,
> Arturas
>
> On Wed, Apr 18, 2018 at 5:09 PM, Walter Underwood 
> wrote:
>
>> I wrote a Python tool to do this. I use the kazoo package to talk to
>> Zookeeper. It starts with the load balancer URL to Solr.
>>
>> 1. Get cluster status.
>> 2. Parse out the Zookeeper config string including chroot.
>> 3. Connect to Zookeeper.
>> 4. Copy the config to the location described in Shawn’s message.
>> 5. Send linkconfig command to the cluster, just to be sure.
>> 6. Reload the collection with an async command.
>> 7. Ping the cluster until the reload is successful on every node.
>> 8. Optionally, rebuild the suggester on each node.
>>
>> The actual location of the config in Zookeeper is undocumented, as far as
>> I could tell. I used the Solr ZK CLI, then reverse engineered where it put
>> stuff.
>>
>> The docs need a “Zookeeper file organization” chapter with this info.
>>
>> Also, it would be nice if the ZKHOST info was available pre-parsed in
>> cluster status.
>>
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>>
>> > On Apr 17, 2018, at 8:20 PM, Shawn Heisey  wrote:
>> >
>> > On 4/17/2018 8:54 PM, Aristedes Maniatis wrote:
>> >> Is there any difference between using the tools supplied with Solr to
>> write configuration to Zookeeper or just writing directly to our Zookeeper
>> cluster?
>> >>
>> >> We have tooling that makes it much easier to write directly to ZK
>> rather than having to use yet another tool to do it.
>> >
>> > As long as it ends up in the correct path in the ZK structure, it
>> doesn't matter how it gets there.
>> >
>> > The /configs/ location (where  is the config name) should have
>> the same contents that would normally be found in a conf directory if it
>> were standalone Solr and not using the standalone configsets feature.
>> >
>> > Thanks,
>> > Shawn
>> >
>>
>>


NER question

2018-04-18 Thread Alexey Ponomarenko
Hi, I have a question regarding NER

https://stackoverflow.com/questions/49894727/using-named-entity-extraction-in-solr-7-3

 can you help me?


Run solr server using Java program

2018-04-18 Thread rameshkjes
Hi guys, 

I am able to run the Solr instance, add the core and import the data
manually. But I want to do everything from a Java program; I searched a lot
but did not find any relevant answer.

In order to run the Solr server, I execute the following command inside the
directory D:\software\solr-7.2.0\solr-7.2.0\bin:

solr.cmd -s "C:\Users\lucky\github\myproject\solr-config"

After that I access http://localhost:8983/solr/

and select the name of the core, which is "demo",

and then I select the dataimport tab and "execute" to import documents.

The first thing I tried is to run the Solr server from a Java program, which
I am unable to do. Could anyone please help with that?

I am using Solr 7.2.0

Thanks 



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: SolrCloud [subquery] with join on multiple terms

2018-04-18 Thread gallex2000
Thanks, it's working.

Regards, 
Alex G



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: solr 5.4.1 - updates/inserts suddenly very slow. Search still fast

2018-04-18 Thread Shalin Shekhar Mangar
You can get a thread dump by calling
http://localhost:8983/solr/admin/threads or by using the Admin UI.
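
If the updates finish before you can grab a dump by hand, a small script can
poll that endpoint and save snapshots for later inspection (host/port below are
the defaults; adjust to your installation):

import time
import requests

URL = "http://localhost:8983/solr/admin/threads"

for i in range(5):
    resp = requests.get(URL)
    resp.raise_for_status()
    with open("threads-%d.txt" % i, "w") as out:
        out.write(resp.text)
    time.sleep(2)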

On Wed, Apr 18, 2018 at 9:11 PM, Felix XY  wrote:

> Thank you Emir, but I'm not able to make a thread dump while doing updates
> because the updates are very fast again:
>
>
> While I wrote this email my colleague was googling around.
>
> He found this
> http://lucene.472066.n3.nabble.com/HttpSolrServer-
> commit-is-taking-more-time-td4330954.html
>
> and my colleague changed some values
>
> from:
>
> 
>
> to:
>  size="512"
>  initialSize="512"
>  autowarmCount="0"/>
> from:
>
> 
>
> to:
>size="512"
>  initialSize="512"
>  autowarmCount="0"/>
>
>
> and it seems, that our problems are gone completely. Updates fast.
> Search seems not to be much slower.
>
>
> But I'm still curious why our problems started so suddenly and what
> negative side effects these changes could have.
>
> Cheers
>
> Felix
>
>
>
> 2018-04-18 17:11 GMT+02:00 Emir Arnautović :
>
> > Hi Felix,
> > Did you try to do thread dump while doing update. Did it show anything?
> >
> > Emir
> > --
> > Monitoring - Log Management - Alerting - Anomaly Detection
> > Solr & Elasticsearch Consulting Support Training - http://sematext.com/
> >
> >
> >
> > > On 18 Apr 2018, at 17:06, Felix XY  wrote:
> > >
> > > Hello group,
> > >
> > > since two days we have huge problems with our solr 5.4.1 installation.
> > >
> > > ( yes, we have to update it. But this will not be a solution right now
> )
> > >
> > > All path=/select requests are still very fast. But all /update Requests
> > > take >30sec up to 3 minutes.
> > >
> > > The index is not very big (1.000.000 documents) and its size on disk is
> > > about 1GB
> > >
> > > The virtual server (ESX) has 8GB RAM and 8 cores. IO is good.
> > >
> > > solr was started with -Xms4096M -Xmx4096M
> > > ( but we changed it to higher and lower values during our tests )
> > >
> > > We have a lot of /select requests at the moment (10.000/Minute) but
> this
> > is
> > > not unusual for this installation and we didn't have this update
> problems
> > > before.
> > >
> > > On another identical sleeping core on the same server, we are able to
> > make
> > > fast updates. We experience slow updates only on the core with high
> > > selecting traffic. So it seems not to be a general problem with java,
> GC,
> > > 
> > >
> > > We disabled all other insert/updates and we are able to reproduce this
> > slow
> > > update behaviour in the Solr Admin console with a single update of one
> > > document.
> > >
> > > We are lost.
> > >
> > > We didn't change the Solr configuration.
> > > The load seems to be not higher then during previous peaks
> > > The developers didn't change anything (so they say)
> > > Search is still fast.
> > >
> > > But single simple updates takes >30sec
> > >
> > > Any ideas about this? We tried quite a lot the last two days
> > >
> > > Cheers
> > > Felix
> >
> >
>



-- 
Regards,
Shalin Shekhar Mangar.


Re: solr 5.4.1 - updates/inserts suddenly very slow. Search still fast

2018-04-18 Thread Felix XY
Thank you Emir, but I'm not able to make a thread dump while doing updates
because the updates are very fast again:


While I wrote this email my colleague was googling around.

He found this
http://lucene.472066.n3.nabble.com/HttpSolrServer-commit-is-taking-more-time-td4330954.html

and my colleague changed some values

from:



to:
   
from:



to:
 


and it seems that our problems are gone completely. Updates are fast.
Search does not seem to be much slower.


But I'm still curious why our problems started so suddenly and what
negative side effects these changes could have.

Cheers

Felix



2018-04-18 17:11 GMT+02:00 Emir Arnautović :

> Hi Felix,
> Did you try to do thread dump while doing update. Did it show anything?
>
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>
>
>
> > On 18 Apr 2018, at 17:06, Felix XY  wrote:
> >
> > Hello group,
> >
> > since two days we have huge problems with our solr 5.4.1 installation.
> >
> > ( yes, we have to update it. But this will not be a solution right now )
> >
> > All path=/select requests are still very fast. But all /update Requests
> > take >30sec up to 3 minutes.
> >
> > The index is not very big (1.000.000 documents) and its size on disk is
> > about 1GB
> >
> > The virtual server (ESX) has 8GB RAM and 8 cores. IO is good.
> >
> > solr was started with -Xms4096M -Xmx4096M
> > ( but we changed it to higher and lower values during our tests )
> >
> > We have a lot of /select requests at the moment (10.000/Minute) but this
> is
> > not unusual for this installation and we didn't have this update problems
> > before.
> >
> > On another identical sleeping core on the same server, we are able to
> make
> > fast updates. We experience slow updates only on the core with high
> > selecting traffic. So it seems not to be a general problem with java, GC,
> > 
> >
> > We disabled all other insert/updates and we are able to reproduce this
> slow
> > update behaviour in the Solr Admin console with a single update of one
> > document.
> >
> > We are lost.
> >
> > We didn't change the Solr configuration.
> > The load seems to be not higher then during previous peaks
> > The developers didn't change anything (so they say)
> > Search is still fast.
> >
> > But single simple updates takes >30sec
> >
> > Any ideas about this? We tried quite a lot the last two days
> >
> > Cheers
> > Felix
>
>


Re: Writing config directly to zookeeper

2018-04-18 Thread Arturas Mazeika
Hi Walter,

Thanks for the message. Would you care to share the tool with us? I would
be interested.. Or have you shared it already?

Cheers,
Arturas

On Wed, Apr 18, 2018 at 5:09 PM, Walter Underwood 
wrote:

> I wrote a Python tool to do this. I use the kazoo package to talk to
> Zookeeper. It starts with the load balancer URL to Solr.
>
> 1. Get cluster status.
> 2. Parse out the Zookeeper config string including chroot.
> 3. Connect to Zookeeper.
> 4. Copy the config to the location described in Shawn’s message.
> 5. Send linkconfig command to the cluster, just to be sure.
> 6. Reload the collection with an async command.
> 7. Ping the cluster until the reload is successful on every node.
> 8. Optionally, rebuild the suggester on each node.
>
> The actual location of the config in Zookeeper is undocumented, as far as
> I could tell. I used the Solr ZK CLI, then reverse engineered where it put
> stuff.
>
> The docs need a “Zookeeper file organization” chapter with this info.
>
> Also, it would be nice if the ZKHOST info was available pre-parsed in
> cluster status.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
> > On Apr 17, 2018, at 8:20 PM, Shawn Heisey  wrote:
> >
> > On 4/17/2018 8:54 PM, Aristedes Maniatis wrote:
> >> Is there any difference between using the tools supplied with Solr to
> write configuration to Zookeeper or just writing directly to our Zookeeper
> cluster?
> >>
> >> We have tooling that makes it much easier to write directly to ZK
> rather than having to use yet another tool to do it.
> >
> > As long as it ends up in the correct path in the ZK structure, it
> doesn't matter how it gets there.
> >
> > The /configs/ location (where  is the config name) should have
> the same contents that would normally be found in a conf directory if it
> were standalone Solr and not using the standalone configsets feature.
> >
> > Thanks,
> > Shawn
> >
>
>


Re: Howto change log level with Solr Admin UI ?

2018-04-18 Thread Shalin Shekhar Mangar
The changes made using the admin logging UI are local to the node. They will
not change logging settings on other nodes, and they do not persist
between restarts.
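
If you need the same level on every node, it has to be applied per node and
re-applied after restarts.  A rough sketch: node URLs and the logger name are
placeholders, and the set=<logger>:<LEVEL> parameter is the one the admin UI's
Logging screen sends, so verify it against your Solr version.

import requests

NODES = ["http://solr1:8983/solr", "http://solr2:8983/solr"]   # placeholders

def set_level(logger, level):
    for node in NODES:
        resp = requests.get(node + "/admin/info/logging",
                            params={"set": "%s:%s" % (logger, level), "wt": "json"})
        resp.raise_for_status()
        print("%s -> %s:%s" % (node, logger, level))

set_level("org.apache.solr", "DEBUG")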

On Wed, Apr 18, 2018 at 7:33 PM, Bernd Fehling <
bernd.fehl...@uni-bielefeld.de> wrote:

> I just tried to change the log level with Solr Admin UI but it
> does not change any logging on my running SolrCloud.
> It just shows the changes in the Admin UI and the commands in the
> request log, but no changes in the level of logging.
>
> Do I have to RELOAD the collection after changing log level?
>
> I tried all setting from ALL, TRACE, DEBUG, ...
>
> Also the Reference Guide 6.6 shows the Admin UI as I see it, but
> the table below the image has levels FINEST, FINE, CONFIG, ...
> https://lucene.apache.org/solr/guide/6_6/configuring-logging.html
> This is confusing.
>
>
> Regards,
> Bernd
>



-- 
Regards,
Shalin Shekhar Mangar.


Re: schema-api: modifying schema in xml format

2018-04-18 Thread Arturas Mazeika
Hi Steve,

It is reasonable that the schema API understands commands only in JSON. Great
that you'll update the ref guide. Thanks for taking care of it. Nice of you
:-)

Cheers,
Arturas


On Wed, Apr 18, 2018 at 3:27 PM, Steve Rowe  wrote:

> Hi Arturas,
>
> The Schema API only understands commands in JSON.  I looked through the
> ref guide page, and I’m surprised that this isn’t stated directly; I’ll try
> to fix that.
>
> --
> Steve
> www.lucidworks.com
>
> > On Apr 18, 2018, at 4:12 AM, Arturas Mazeika  wrote:
> >
> > Hi solr-users,
> >
> > is it possible to modify the managed schema using schema api and submit
> the
> > commands in XML format? I am able to add a data type using:
> >
> > curl -X POST -H 'Content-type:application/json' --data-binary '{
> >  "add-field-type": {
> >  "name":"text_de_ph",
> >  "class":"solr.TextField",
> >  "positionIncrementGap":"100",
> >  "analyzer": {
> >"tokenizer": {"class":"solr.StandardTokenizerFactory"},
> >"filters": [
> >  {"class":"solr.LowerCaseFilterFactory"},
> >  {"class":"solr.StopFilterFactory", "format":"snowball",
> > "words":"lang/stopwords_de.txt", "ignoreCase":true},
> >  {"class":"solr.GermanNormalizationFilterFactory"},
> >  {"class":"solr.GermanLightStemFilterFactory"},
> >  {"class":"solr.PhoneticFilterFactory", "encoder":"DoubleMetaphone"}
> >  ]}}
> > }' http://localhost:8983/solr/tph/schema
> >
> > so I thought I could submit something like:
> >
> > curl -X POST -H 'Content-Type: text/xml' --data-binary '
> >  > positionIncrementGap="100">
> >   
> >  
> >  
> >   > words="lang/stopwords_de.txt" ignoreCase="true"/>
> >  
> >  
> >   encoder="DoubleMetaphone"/>
> >
> > 
> > ' http://localhost:8983/solr/tph/schema
> >
> > This however failed with the error:
> >
> > {
> >  "responseHeader":{
> >"status":500,
> >"QTime":1},
> >  "error":{
> >"msg":"JSON Parse Error: char=<,position=1 AFTER=' ...
> >
> > The examples in the documentation (I am using solr 7.2) are all in JSON
> > format, but does not say explicitly, that one needs to send the updates
> in
> > json format only..
> >
> > https://lucene.apache.org/solr/guide/7_2/schema-api.html#schema-api
> >
> > Comments?
> >
> > Cheers,
> > Arturas
>
>


Re: Howto change log level with Solr Admin UI ?

2018-04-18 Thread Emir Arnautović
Hi,
It is not exposed in the admin console (it would be nice if it were!), but there is 
a way to set the threshold for the admin UI logs. You can simply execute the following:
http://localhost:8983/solr/admin/info/logging?since=0&threshold=INFO
and INFO logs will start appearing in the admin UI.

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 18 Apr 2018, at 16:30, Shawn Heisey  wrote:
> 
> On 4/18/2018 8:03 AM, Bernd Fehling wrote:
>> I just tried to change the log level with Solr Admin UI but it
>> does not change any logging on my running SolrCloud.
>> It just shows the changes in the Admin UI and the commands in the
>> request log, but no changes in the level of logging.
>> 
>> Do I have to RELOAD the collection after changing log level?
>> 
>> I tried all setting from ALL, TRACE, DEBUG, ...
>> 
>> Also the Reference Guide 6.6 shows the Admin UI as I see it, but
>> the table below the image has levels FINEST, FINE, CONFIG, ...
>> https://lucene.apache.org/solr/guide/6_6/configuring-logging.html
>> This is confusing.
> 
> What exact setting in the logging tab did you change, and what did you expect 
> to happen that didn't happen?
> 
> The logging events that show up in the admin UI will never include anything 
> with a severity lower than WARN.  Anything lower would be far too much 
> information for the admin UI to handle.  Changing the level shown in the 
> admin UI is likely possible, but probably requires a code change.  If 
> changed, I think it would result in a UI page that's unusable because it 
> contains far too many events.
> 
> Assuming that log4j.properties hasn't been altered, you will find lower 
> severity events in solr.log, a file on disk.  The default logging level that 
> Solr uses is INFO, but INFO logs never show up in the admin UI.
> 
> Also, changes made to logging levels in the admin UI only last as long as 
> Solr is running.  When Solr is restarted, those changes are gone.  Only 
> changes made in log4j.properties will survive a restart.
> 
> Thanks,
> Shawn
> 



Re: solr 5.4.1 - updates/inserts suddenly very slow. Search still fast

2018-04-18 Thread Emir Arnautović
Hi Felix,
Did you try to do thread dump while doing update. Did it show anything?

Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 18 Apr 2018, at 17:06, Felix XY  wrote:
> 
> Hello group,
> 
> since two days we have huge problems with our solr 5.4.1 installation.
> 
> ( yes, we have to update it. But this will not be a solution right now )
> 
> All path=/select requests are still very fast. But all /update Requests
> take >30sec up to 3 minutes.
> 
> The index is not very big (1.000.000 documents) and its size on disk is
> about 1GB
> 
> The virtual server (ESX) has 8GB RAM and 8 cores. IO is good.
> 
> solr was started with -Xms4096M -Xmx4096M
> ( but we changed it to higher and lower values during our tests )
> 
> We have a lot of /select requests at the moment (10.000/Minute) but this is
> not unusual for this installation and we didn't have this update problems
> before.
> 
> On another identical sleeping core on the same server, we are able to make
> fast updates. We experience slow updates only on the core with high
> selecting traffic. So it seems not to be a general problem with java, GC,
> 
> 
> We disabled all other insert/updates and we are able to reproduce this slow
> update behaviour in the Solr Admin console with a single update of one
> document.
> 
> We are lost.
> 
> We didn't change the Solr configuration.
> The load seems to be not higher then during previous peaks
> The developers didn't change anything (so they say)
> Search is still fast.
> 
> But single simple updates takes >30sec
> 
> Any ideas about this? We tried quite a lot the last two days
> 
> Cheers
> Felix



Re: Specialized Solr Application

2018-04-18 Thread Terry Steichen
Thanks, Erick.  What I don't understand is that "rich text documents" (aka,
PDF and DOC) lack any internal structure (unlike JSON, XML, etc.), so
there's not much potential in trying to get really precise in parsing
them.  Or am I overlooking something here?

And, as you say, the metadata of such documents is somewhat variable
(some PDFs have a field and others don't), which suggests that you may
not want the parser to be rigid.

Moreover, as I noted earlier, most of the metadata fields of such
documents seem to be of little value (since many document authors are
not consistent in creating that information). 

I take your point about non-optimum Tika workload distribution - but I
am only occasionally doing indexing so I don't think that would be a
significant factor (for me, at least).

A point of possible interest: I was recently indexing a set of about
13,000 documents and at one point, a document caused solr to crash.  I
had to restart it.  I removed the offending document, and restarted the
indexing.  It then eventually happened again, so I did the same thing. 
It then completed indexing successfully.  IOW, out of 13,000 documents
there were two that caused a crash, but once they were removed, the
other 12,998 were parsed/indexed fine.

On OCRs, I presume you're referring to PDFs that are images?  Part of
our team uses Acrobat Pro to screen and convert such documents (which
are very common in legal circles) so they can be searched.  Or did you
mean something else?

Thanks for the insights.  And the long answers (from you, Tim and
Charlie).  These are helping me (and I hope others on the list) to
better understand some of the nuances of effectively implementing
(small-scale) solr.


On 04/17/2018 10:35 PM, Erick Erickson wrote:
> Terry:
>
> Tika has a horrible problem to deal with and it's approaching a
> miracle that it does so well ;)
>
> Let's take a PDF file. Which vendor's version? From what _decade_? Did
> that vendor adhere
> to the spec? Every spec has gray areas so even good-faith efforts can
> result in some version/vendor
> behaving slightly differently from the other.
>
> And what about Word .vs. PDF? One might have "last_modified" and the
> other might have
> "last_edited" to mean the same thing. You mentioned that you're aware
> of this, you can make
> it more useful if you have finer-grained control over the ETL process.
>
> You say "As I understand it, Tika is integrated with Solr"  which is
> correct, you're talking about
> the "Extracting Request Handler". However that has a couple of
> important caveats:
>
> 1> It does the best it can. But Tika has a _lot_ of tuning options
> that allow you to get down-and-dirty
> with the data you're indexing. You mentioned that precision is
> important. You can do some interesting
> things with extracting specific fields from specific kinds of
> documents and making use of them. The
> "last_modified" and "last_edited" fields above are an example.
>
> 2> It loads the work on a single Solr node. So the very expensive
> process of extracting data from the
> semi-structure document is all on the Solr node. If you use Tika in a
> client-side program you can
> parallelize the extraction and get through your indexing much more quickly.
>
> 3> Tika can occasionally get its knickers in a knot over some
> particular document. That'll also bring
> down the Solr instance.
>
> Here's a blog that can get you started doing client-side parsing,
> ignore the RDBMS bits.
> https://lucidworks.com/2012/02/14/indexing-with-solrj/
>
> I'll leave Tim to talk about tika-eval ;) But the general problem is
> that the extraction process can
> result in garbage, lots of garbage. OCR is particularly prone to
> nonsense. PDFs can be tricky,
> there's this spacing parameter that, depending on it's setting can
> render e r i c k as 5 separate
> letters or my name.
>
> Hey, you asked! Don't complain about long answers ;)
>
> Best,
> Erick
>
> On Tue, Apr 17, 2018 at 1:50 PM, Terry Steichen  wrote:
>> Hi Timothy,
>>
>> As I understand it, Tika is integrated with Solr.  All my indexed
>> documents declare that they've been parsed by tika.  For the eml files
>> it's: |org.apache.tika.parser.mail.RFC822Parser   Word docs show they
>> were parsed by ||org.apache.tika.parser.microsoft.ooxml.OOXMLParser  PDF
>> files show: ||org.apache.tika.parser.pdf.PDFParser|
>>
>> ||
>>
>> ||
>>
>> What do you mean by improving the output with "tika-eval?"  I confess I
>> don't completely understand how documents should be prepared for
>> indexing.  But with the eml docs, solr/tika seems to properly pull out
>> things like date, subject, to and from.  Other (so-called 'rich text')
>> documents (like pdfs and Word-type), the metadata is not so useful, but
>> on the other hand, there's not much consistent structure to the
>> documents I have to deal with.
>>
>> I may be missing something - am I?
>>
>> Regards,
>>
>> Terry
>>
>>
>> On 04/17/2018 09:38 AM, Allison, Timothy B. wrote:

Re: Writing config directly to zookeeper

2018-04-18 Thread Walter Underwood
I wrote a Python tool to do this. I use the kazoo package to talk to Zookeeper. 
It starts with the load balancer URL to Solr.

1. Get cluster status.
2. Parse out the Zookeeper config string including chroot.
3. Connect to Zookeeper.
4. Copy the config to the location described in Shawn’s message.
5. Send linkconfig command to the cluster, just to be sure.
6. Reload the collection with an async command.
7. Ping the cluster until the reload is successful on every node.
8. Optionally, rebuild the suggester on each node.
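
A minimal sketch of steps 3 through 6, assuming the kazoo package and the
/configs/<name> layout Shawn described; hosts, config name and collection name
are placeholders, and error handling plus the suggester rebuild are left out.

import os
import requests
from kazoo.client import KazooClient

def upload_config(zk_hosts, conf_dir, config_name):
    # Copy every file under conf_dir to /configs/<config_name>/... in ZK.
    zk = KazooClient(hosts=zk_hosts)
    zk.start()
    try:
        for root, _dirs, files in os.walk(conf_dir):
            for fname in files:
                local = os.path.join(root, fname)
                rel = os.path.relpath(local, conf_dir).replace(os.sep, "/")
                znode = "/configs/%s/%s" % (config_name, rel)
                with open(local, "rb") as f:
                    data = f.read()
                if zk.exists(znode):
                    zk.set(znode, data)
                else:
                    zk.create(znode, data, makepath=True)
    finally:
        zk.stop()

def reload_collection(solr_url, collection):
    # Async reload via the Collections API; poll REQUESTSTATUS afterwards.
    resp = requests.get(solr_url + "/admin/collections",
                        params={"action": "RELOAD", "name": collection,
                                "async": "reload-" + collection})
    resp.raise_for_status()

# upload_config("zk1:2181,zk2:2181/solr", "./myconf", "myconfig")
# reload_collection("http://loadbalancer:8983/solr", "mycollection")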

The actual location of the config in Zookeeper is undocumented, as far as I 
could tell. I used the Solr ZK CLI, then reverse engineered where it put stuff.

The docs need a “Zookeeper file organization” chapter with this info.

Also, it would be nice if the ZKHOST info was available pre-parsed in cluster 
status. 

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Apr 17, 2018, at 8:20 PM, Shawn Heisey  wrote:
> 
> On 4/17/2018 8:54 PM, Aristedes Maniatis wrote:
>> Is there any difference between using the tools supplied with Solr to write 
>> configuration to Zookeeper or just writing directly to our Zookeeper cluster?
>> 
>> We have tooling that makes it much easier to write directly to ZK rather 
>> than having to use yet another tool to do it. 
> 
> As long as it ends up in the correct path in the ZK structure, it doesn't 
> matter how it gets there.
> 
> The /configs/ location (where  is the config name) should have the 
> same contents that would normally be found in a conf directory if it were 
> standalone Solr and not using the standalone configsets feature.
> 
> Thanks,
> Shawn
> 



solr 5.4.1 - updates/inserts suddenly very slow. Search still fast

2018-04-18 Thread Felix XY
Hello group,

for the last two days we have had huge problems with our Solr 5.4.1 installation.

( yes, we have to update it. But this will not be a solution right now )

All path=/select requests are still very fast. But all /update Requests
take >30sec up to 3 minutes.

The index is not very big (1.000.000 documents) and its size on disk is
about 1GB

The virtual server (ESX) has 8GB RAM and 8 cores. IO is good.

solr was started with -Xms4096M -Xmx4096M
( but we changed it to higher and lower values during our tests )

We have a lot of /select requests at the moment (10.000/Minute) but this is
not unusual for this installation, and we didn't have these update problems
before.

On another identical sleeping core on the same server, we are able to make
fast updates. We experience slow updates only on the core with high
selecting traffic. So it seems not to be a general problem with java, GC,


We disabled all other insert/updates and we are able to reproduce this slow
update behaviour in the Solr Admin console with a single update of one
document.

We are lost.

We didn't change the Solr configuration.
The load seems to be no higher than during previous peaks
The developers didn't change anything (so they say)
Search is still fast.

But single simple updates take >30sec

Any ideas about this? We tried quite a lot the last two days

Cheers
Felix


Re: Howto change log level with Solr Admin UI ?

2018-04-18 Thread Shawn Heisey

On 4/18/2018 8:03 AM, Bernd Fehling wrote:

I just tried to change the log level with Solr Admin UI but it
does not change any logging on my running SolrCloud.
It just shows the changes in the Admin UI and the commands in the
request log, but no changes in the level of logging.

Do I have to RELOAD the collection after changing log level?

I tried all setting from ALL, TRACE, DEBUG, ...

Also the Reference Guide 6.6 shows the Admin UI as I see it, but
the table below the image has levels FINEST, FINE, CONFIG, ...
https://lucene.apache.org/solr/guide/6_6/configuring-logging.html
This is confusing.


What exact setting in the logging tab did you change, and what did you 
expect to happen that didn't happen?


The logging events that show up in the admin UI will never include 
anything with a severity lower than WARN.  Anything lower would be far 
too much information for the admin UI to handle.  Changing the level 
shown in the admin UI is likely possible, but probably requires a code 
change.  If changed, I think it would result in a UI page that's 
unusable because it contains far too many events.


Assuming that log4j.properties hasn't been altered, you will find lower 
severity events in solr.log, a file on disk.  The default logging level 
that Solr uses is INFO, but INFO logs never show up in the admin UI.


Also, changes made to logging levels in the admin UI only last as long 
as Solr is running.  When Solr is restarted, those changes are gone.  
Only changes made in log4j.properties will survive a restart.
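
For example (a sketch, assuming the stock log4j.properties that ships with Solr), a
line like the following raises the level for one package permanently:

  # takes effect after a restart
  log4j.logger.org.apache.solr.update=DEBUG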


Thanks,
Shawn



Howto change log level with Solr Admin UI ?

2018-04-18 Thread Bernd Fehling
I just tried to change the log level with Solr Admin UI but it
does not change any logging on my running SolrCloud.
It just shows the changes in the Admin UI and the commands in the
request log, but no changes in the level of logging.

Do I have to RELOAD the collection after changing log level?

I tried all settings: ALL, TRACE, DEBUG, ...

Also the Reference Guide 6.6 shows the Admin UI as I see it, but
the table below the image has levels FINEST, FINE, CONFIG, ...
https://lucene.apache.org/solr/guide/6_6/configuring-logging.html
This is confusing.


Regards,
Bernd


Re: Issue with Solr Case Insensitive Issue

2018-04-18 Thread Kapil Bhardwaj
Thanks Shawn,

I guess I will get in touch with my DB support team for the full
re-index. Even I was in doubt whether the re-index via Core Admin really
serves the purpose.

Regards,
Kapil Bhardwaj

On Wed, Apr 18, 2018 at 6:58 PM Shawn Heisey  wrote:

> On 4/18/2018 3:45 AM, Kapil Bhardwaj wrote:
> > After making changes i RELOADED the schema via terminal command and tried
> > to re-index the schema using solr core admin button.
>
> You can't reindex by clicking a button.  Unless it's the same button you
> used to do the indexing the first time.
>
> https://wiki.apache.org/solr/HowToReindex
>
> > But after making above changes i am not seeing case insensitive search
> > working.
>
> If you're sorting on layout_path_search, you will need to reindex.  And
> like I said above, you can't do it by just clicking a button in the
> admin UI.
>
> Thanks,
> Shawn
>
>


Re: Issue with Solr Case Insensitive Issue

2018-04-18 Thread Shawn Heisey

On 4/18/2018 3:45 AM, Kapil Bhardwaj wrote:

After making changes i RELOADED the schema via terminal command and tried
to re-index the schema using solr core admin button.


You can't reindex by clicking a button.  Unless it's the same button you 
used to do the indexing the first time.


https://wiki.apache.org/solr/HowToReindex


But after making above changes i am not seeing case insensitive search
working.


If you're sorting on layout_path_search, you will need to reindex.  And 
like I said above, you can't do it by just clicking a button in the 
admin UI.


Thanks,
Shawn



Re: schema-api: modifying schema in xml format

2018-04-18 Thread Steve Rowe
Hi Arturas,

The Schema API only understands commands in JSON.  I looked through the ref 
guide page, and I’m surprised that this isn’t stated directly; I’ll try to fix 
that.

--
Steve
www.lucidworks.com

> On Apr 18, 2018, at 4:12 AM, Arturas Mazeika  wrote:
> 
> Hi solr-users,
> 
> is it possible to modify the managed schema using schema api and submit the
> commands in XML format? I am able to add a data type using:
> 
> curl -X POST -H 'Content-type:application/json' --data-binary '{
>  "add-field-type": {
>  "name":"text_de_ph",
>  "class":"solr.TextField",
>  "positionIncrementGap":"100",
>  "analyzer": {
>"tokenizer": {"class":"solr.StandardTokenizerFactory"},
>"filters": [
>  {"class":"solr.LowerCaseFilterFactory"},
>  {"class":"solr.StopFilterFactory", "format":"snowball",
> "words":"lang/stopwords_de.txt", "ignoreCase":true},
>  {"class":"solr.GermanNormalizationFilterFactory"},
>  {"class":"solr.GermanLightStemFilterFactory"},
>  {"class":"solr.PhoneticFilterFactory", "encoder":"DoubleMetaphone"}
>  ]}}
> }' http://localhost:8983/solr/tph/schema
> 
> so I thought I could submit something like:
> 
> curl -X POST -H 'Content-Type: text/xml' --data-binary '
> <fieldType name="text_de_ph" class="solr.TextField" positionIncrementGap="100">
>   <analyzer>
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.StopFilterFactory" format="snowball" words="lang/stopwords_de.txt" ignoreCase="true"/>
>     <filter class="solr.GermanNormalizationFilterFactory"/>
>     <filter class="solr.GermanLightStemFilterFactory"/>
>     <filter class="solr.PhoneticFilterFactory" encoder="DoubleMetaphone"/>
>   </analyzer>
> </fieldType>
> ' http://localhost:8983/solr/tph/schema
> 
> This however failed with the error:
> 
> {
>  "responseHeader":{
>"status":500,
>"QTime":1},
>  "error":{
>"msg":"JSON Parse Error: char=<,position=1 AFTER=' ...
> 
> The examples in the documentation (I am using solr 7.2) are all in JSON
> format, but does not say explicitly, that one needs to send the updates in
> json format only..
> 
> https://lucene.apache.org/solr/guide/7_2/schema-api.html#schema-api
> 
> Comments?
> 
> Cheers,
> Arturas



Re: SolrCloud [subquery] with join on multiple terms

2018-04-18 Thread Mikhail Khludnev
Could it be like
article.q=+{!terms f=articleid v=$row.articleid} +{!terms f=variantid
 v=$row.variantid} +{!terms f=language v=$row.language}
?
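
Spelled out as a full request it might look like this (a sketch; the fq, fl and
article.fq values just mirror the example quoted below and may need adjusting):

  curl http://localhost:8983/solr/articleattributes/select \
    --data-urlencode 'q=*:*' \
    --data-urlencode 'fq={!parent which="content_type:1 AND language:DE"}(value:*16mm* AND attributeid:517310)' \
    --data-urlencode 'fl=*,article:[subquery]' \
    --data-urlencode 'article.q=+{!terms f=articleid v=$row.articleid} +{!terms f=variantid v=$row.variantid} +{!terms f=language v=$row.language}' \
    --data-urlencode 'article.fq=description:*Schlauchverschraubung*'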

On Wed, Apr 18, 2018 at 12:33 PM, gallex2000  wrote:

> Hi,
>
> I have two Aliases in SolrCloud,
>
> 1. *Article *with columns id, articleid, variantid, language, content_type,
> description.
> 2. *ArticleAttributes *with columns for PARENT id, articleid, variantid,
> language, description (multivalued field with all values of attributes) and
> CHILDS (detailed information about each attribute) with columns id,
> attributeid, content_type, value (value for each attribute).
>
> My problem is to search in alias ArticleAttributes all Parent records where
> some attributeid="someid" and value="text1" and language="DE" joined in the
> same time with alias Article where description="text2"
> on Article.articleid=ArticleAttributes.articleid AND
> Article.variantid=ArticleAttributes.variantid AND
> Article.language=ArticleAttributes.language
>
> For this i have a query like this:
> http://localhost:8983/solr/articleattributes/select?fq={!parent
> which="content_type:1 AND language:DE"}((value:*16mm*) AND
> attributeid:517310)=*,article:[subquery]={!terms f=articleid
> v=$row.articleid}=*Schlauchverschraubung*
>
> In my case with [subquery] I need to join 3 fileds, but [subquery] logic
> support only one field to joining.
> Now question, exist some syntacs or another possibility to make this join
> on
> multiple fields (idee to concatenate all this fields in one field is not
> accepted from the point that this is only one case, but in some case join
> must be on others fields and for each case to create on field is a utopia).
>
>
>
>
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>



-- 
Sincerely yours
Mikhail Khludnev


Re: custom response writer which extends RawResponseWriter fails when shards > 1

2018-04-18 Thread Mikhail Khludnev
Injecting headers might require deeper customisation, up to establishing your
own servlet filter or so.
As for your own WT, there might be some issues, although usually it is not a
big deal to use one wt for the user-facing response (e.g. wt=csv) while
wt=javabin is used for the internal communication between the aggregator and
the slaves, which is what happens with a wt=csv query.

On Wed, Apr 18, 2018 at 2:19 PM, Lee Carroll 
wrote:

> Inventive. I need to control content-type of the response from the document
> field value. I have the actual content field and the content-type field to
> use configured in the response writer. I've just noticed that the xslt
> transformer allows you to do this but not controlled by document values. I
> may also need to set some headers based on content-type and perhaps content
> size, accept ranges comes to mind. Although I might be getting ahead of
> myself.
>
>
>
> On 18 April 2018 at 12:05, Mikhail Khludnev  wrote:
>
> > well ..
> > what if
> > http://localhost:8983/solr/images/select?fl=content=id:
> 1=1=csv&
> > csv.separator==null
> > ?
> >
> > On Wed, Apr 18, 2018 at 1:18 PM, Lee Carroll <
> lee.a.carr...@googlemail.com
> > >
> > wrote:
> >
> > > sorry cut n paste error i'd get
> > >
> > > {
> > >   "responseHeader":{
> > > "zkConnected":true,
> > > "status":0,
> > > "QTime":0,
> > > "params":{
> > >   "q":"*:*",
> > >   "fl":"content",
> > >   "rows":"1"}},
> > >   "response":{"numFound":1,"start":0,"docs":[
> > >   {
> > > "content":"my-content-value"}]
> > >   }}
> > >
> > >
> > > but you get my point
> > >
> > >
> > >
> > > On 18 April 2018 at 11:13, Lee Carroll 
> > > wrote:
> > >
> > > > for http://localhost:8983/solr/images/select?fl=content=id:
> 1=1
> > > >
> > > > I'd get
> > > >
> > > > {
> > > >   "responseHeader":{
> > > > "zkConnected":true,
> > > > "status":0,
> > > > "QTime":1,
> > > > "params":{
> > > >   "q":"*:*",
> > > >   "_":"1524046333220"}},
> > > >   "response":{"numFound":1,"start":0,"docs":[
> > > >   {
> > > > "id":"1",
> > > > "content":"my-content-value",
> > > > "*content-type*":"text/plain"}]
> > > >   }}
> > > >
> > > > when i want
> > > >
> > > > my-content-value
> > > >
> > > >
> > > >
> > > > On 18 April 2018 at 10:55, Mikhail Khludnev  wrote:
> > > >
> > > >> Lee, from this description I don see why it can't be addressed by
> > > fl,rows
> > > >> params. What makes it different form the typical Solr usage?
> > > >>
> > > >>
> > > >> On Wed, Apr 18, 2018 at 12:31 PM, Lee Carroll <
> > > >> lee.a.carr...@googlemail.com>
> > > >> wrote:
> > > >>
> > > >> > Sure, we want to return a single field's value for the top
> matching
> > > >> > document for a given query. Bare content rather than a full search
> > > >> result
> > > >> > listing.
> > > >> >
> > > >> > To be concrete:
> > > >> >
> > > >> > For a schema of fields id [unique key],
> > content[stored],content-type[
> > > >> > stored]
> > > >> > For a request:
> > > >> >
> > > >> >1. Request URL:
> > > >> >https://localhost/solr/content?q=id:1
> > > >> >2. Request Method:
> > > >> >GET
> > > >> >
> > > >> > We get a response
> > > >> > HTTP/1.1 200 OK Content-Length: 16261 Content-Type: [content-type
> > > value]
> > > >> >
> > > >> > and the body to be the raw value of content
> > > >> >
> > > >> > In short clients consume directly the most relevant "content"
> > returned
> > > >> from
> > > >> > solr queries they construct.
> > > >> >
> > > >> > Naively I've implemented a subclass of RawResponseWriter which
> takes
> > > the
> > > >> > first docs values and adds them to the appended "content" stream.
> > > >> Should I
> > > >> > selectively add the content stream depending on if this is the
> final
> > > >> > aggregation of cloud results (and provide a base class writer to
> act
> > > if
> > > >> > not), if so how do I know its the final aggregation. Or is adding
> > the
> > > >> > content stream within the response writer a bad idea. Should that
> be
> > > >> being
> > > >> > added to the response somewhere else?
> > > >> >
> > > >> > Failing all of the above is asking about response writer an X / Y
> > > >> problem.
> > > >> > Is their a better way to achieve the above. I'd looked at
> > transforming
> > > >> > response xml but that seemed not to offer a complete bare slate.
> > > >> >
> > > >> > Cheers Lee C
> > > >> >
> > > >> >
> > > >> > On 17 April 2018 at 21:36, Mikhail Khludnev 
> > wrote:
> > > >> >
> > > >> > > In distributed search response writer is used twice
> > > >> > > https://lucene.apache.org/solr/guide/7_1/distributed-
> > requests.html
> > > >> > > once slave node that's where response writer yields "json"
> content
> > > >> and it
> > > >> > > upset aggregator node which is expect only javabin.
> > > >> > > I hardly can comment on rrw, it's probably used for responding
> > > >> 

SolrCloud [subquery] with join on multiple terms

2018-04-18 Thread gallex2000
Hi,

I have two Aliases in SolrCloud,

1. *Article *with columns id, articleid, variantid, language, content_type,
description.
2. *ArticleAttributes *with columns for PARENT id, articleid, variantid,
language, description (multivalued field with all values of attributes) and
CHILD records (detailed information about each attribute) with columns id,
attributeid, content_type, value (value for each attribute).

My problem is to search in alias ArticleAttributes all Parent records where
some attributeid="someid" and value="text1" and language="DE" joined in the
same time with alias Article where description="text2"
on Article.articleid=ArticleAttributes.articleid AND
Article.variantid=ArticleAttributes.variantid AND
Article.language=ArticleAttributes.language

For this i have a query like this:
http://localhost:8983/solr/articleattributes/select?fq={!parent
which="content_type:1 AND language:DE"}((value:*16mm*) AND
attributeid:517310)=*,article:[subquery]={!terms f=articleid
v=$row.articleid}=*Schlauchverschraubung*

In my case with [subquery] I need to join on 3 fields, but the [subquery] logic
supports only one field for joining.
Now the question: is there some syntax or another possibility to make this join
on multiple fields? (The idea of concatenating all these fields into one field
is not acceptable, because this is only one case; in other cases the join must
be on other fields, and creating one field per case is not realistic.)





--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: custom response writer which extends RawResponseWriter fails when shards > 1

2018-04-18 Thread Lee Carroll
Inventive. I need to control the content-type of the response from a document
field value. I have the actual content field and the content-type field to
use configured in the response writer. I've just noticed that the XSLT
transformer allows you to set the content type, but not controlled by document
values. I may also need to set some headers based on content-type and perhaps
content size; accept-ranges comes to mind. Although I might be getting ahead of
myself.



On 18 April 2018 at 12:05, Mikhail Khludnev  wrote:

> well ..
> what if
> http://localhost:8983/solr/images/select?fl=content=id:1=1=csv;
> csv.separator==null
> ?
>
> On Wed, Apr 18, 2018 at 1:18 PM, Lee Carroll  >
> wrote:
>
> > sorry cut n paste error i'd get
> >
> > {
> >   "responseHeader":{
> > "zkConnected":true,
> > "status":0,
> > "QTime":0,
> > "params":{
> >   "q":"*:*",
> >   "fl":"content",
> >   "rows":"1"}},
> >   "response":{"numFound":1,"start":0,"docs":[
> >   {
> > "content":"my-content-value"}]
> >   }}
> >
> >
> > but you get my point
> >
> >
> >
> > On 18 April 2018 at 11:13, Lee Carroll 
> > wrote:
> >
> > > for http://localhost:8983/solr/images/select?fl=content=id:1=1
> > >
> > > I'd get
> > >
> > > {
> > >   "responseHeader":{
> > > "zkConnected":true,
> > > "status":0,
> > > "QTime":1,
> > > "params":{
> > >   "q":"*:*",
> > >   "_":"1524046333220"}},
> > >   "response":{"numFound":1,"start":0,"docs":[
> > >   {
> > > "id":"1",
> > > "content":"my-content-value",
> > > "*content-type*":"text/plain"}]
> > >   }}
> > >
> > > when i want
> > >
> > > my-content-value
> > >
> > >
> > >
> > > On 18 April 2018 at 10:55, Mikhail Khludnev  wrote:
> > >
> > >> Lee, from this description I don see why it can't be addressed by
> > fl,rows
> > >> params. What makes it different form the typical Solr usage?
> > >>
> > >>
> > >> On Wed, Apr 18, 2018 at 12:31 PM, Lee Carroll <
> > >> lee.a.carr...@googlemail.com>
> > >> wrote:
> > >>
> > >> > Sure, we want to return a single field's value for the top matching
> > >> > document for a given query. Bare content rather than a full search
> > >> result
> > >> > listing.
> > >> >
> > >> > To be concrete:
> > >> >
> > >> > For a schema of fields id [unique key],
> content[stored],content-type[
> > >> > stored]
> > >> > For a request:
> > >> >
> > >> >1. Request URL:
> > >> >https://localhost/solr/content?q=id:1
> > >> >2. Request Method:
> > >> >GET
> > >> >
> > >> > We get a response
> > >> > HTTP/1.1 200 OK Content-Length: 16261 Content-Type: [content-type
> > value]
> > >> >
> > >> > and the body to be the raw value of content
> > >> >
> > >> > In short clients consume directly the most relevant "content"
> returned
> > >> from
> > >> > solr queries they construct.
> > >> >
> > >> > Naively I've implemented a subclass of RawResponseWriter which takes
> > the
> > >> > first docs values and adds them to the appended "content" stream.
> > >> Should I
> > >> > selectively add the content stream depending on if this is the final
> > >> > aggregation of cloud results (and provide a base class writer to act
> > if
> > >> > not), if so how do I know its the final aggregation. Or is adding
> the
> > >> > content stream within the response writer a bad idea. Should that be
> > >> being
> > >> > added to the response somewhere else?
> > >> >
> > >> > Failing all of the above is asking about response writer an X / Y
> > >> problem.
> > >> > Is their a better way to achieve the above. I'd looked at
> transforming
> > >> > response xml but that seemed not to offer a complete bare slate.
> > >> >
> > >> > Cheers Lee C
> > >> >
> > >> >
> > >> > On 17 April 2018 at 21:36, Mikhail Khludnev 
> wrote:
> > >> >
> > >> > > In distributed search response writer is used twice
> > >> > > https://lucene.apache.org/solr/guide/7_1/distributed-
> requests.html
> > >> > > once slave node that's where response writer yields "json" content
> > >> and it
> > >> > > upset aggregator node which is expect only javabin.
> > >> > > I hardly can comment on rrw, it's probably used for responding
> > >> separate
> > >> > > files in distrib=false mode.
> > >> > > You can start from describing why you need to create own response
> > >> writer.
> > >> > >
> > >> > > On Tue, Apr 17, 2018 at 7:02 PM, Lee Carroll <
> > >> > lee.a.carr...@googlemail.com
> > >> > > >
> > >> > > wrote:
> > >> > >
> > >> > > > Ok. My expectation was the response writer would not be used
> until
> > >> the
> > >> > > > final serialization of the result. If my response writer breaks
> > the
> > >> > > > response writer contract, exactly the way rawResponseWriter does
> > and
> > >> > just
> > >> > > > out puts a filed value how does that work? Does
> rawResponseWriter
> > >> > support
> > >> > > > cloud mode?
> > >> > > >
> > >> > > >
> > >> > > >
> > 

Re: custom response writer which extends RawResponseWriter fails when shards > 1

2018-04-18 Thread Mikhail Khludnev
well ..
what if
http://localhost:8983/solr/images/select?fl=content=id:1=1=csv;
csv.separator==null
?
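
With the parameters written out, that idea is roughly (a sketch, assuming content is
a stored field; csv.header=false suppresses the header row so only the raw field
value comes back):

  curl 'http://localhost:8983/solr/images/select?q=id:1&rows=1&fl=content&wt=csv&csv.header=false'

Note the CSV writer still applies CSV escaping to values containing commas or quotes,
so this is only an approximation of a truly raw response.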

On Wed, Apr 18, 2018 at 1:18 PM, Lee Carroll 
wrote:

> sorry cut n paste error i'd get
>
> {
>   "responseHeader":{
> "zkConnected":true,
> "status":0,
> "QTime":0,
> "params":{
>   "q":"*:*",
>   "fl":"content",
>   "rows":"1"}},
>   "response":{"numFound":1,"start":0,"docs":[
>   {
> "content":"my-content-value"}]
>   }}
>
>
> but you get my point
>
>
>
> On 18 April 2018 at 11:13, Lee Carroll 
> wrote:
>
> > for http://localhost:8983/solr/images/select?fl=content=id:1=1
> >
> > I'd get
> >
> > {
> >   "responseHeader":{
> > "zkConnected":true,
> > "status":0,
> > "QTime":1,
> > "params":{
> >   "q":"*:*",
> >   "_":"1524046333220"}},
> >   "response":{"numFound":1,"start":0,"docs":[
> >   {
> > "id":"1",
> > "content":"my-content-value",
> > "*content-type*":"text/plain"}]
> >   }}
> >
> > when i want
> >
> > my-content-value
> >
> >
> >
> > On 18 April 2018 at 10:55, Mikhail Khludnev  wrote:
> >
> >> Lee, from this description I don see why it can't be addressed by
> fl,rows
> >> params. What makes it different form the typical Solr usage?
> >>
> >>
> >> On Wed, Apr 18, 2018 at 12:31 PM, Lee Carroll <
> >> lee.a.carr...@googlemail.com>
> >> wrote:
> >>
> >> > Sure, we want to return a single field's value for the top matching
> >> > document for a given query. Bare content rather than a full search
> >> result
> >> > listing.
> >> >
> >> > To be concrete:
> >> >
> >> > For a schema of fields id [unique key], content[stored],content-type[
> >> > stored]
> >> > For a request:
> >> >
> >> >1. Request URL:
> >> >https://localhost/solr/content?q=id:1
> >> >2. Request Method:
> >> >GET
> >> >
> >> > We get a response
> >> > HTTP/1.1 200 OK Content-Length: 16261 Content-Type: [content-type
> value]
> >> >
> >> > and the body to be the raw value of content
> >> >
> >> > In short clients consume directly the most relevant "content" returned
> >> from
> >> > solr queries they construct.
> >> >
> >> > Naively I've implemented a subclass of RawResponseWriter which takes
> the
> >> > first docs values and adds them to the appended "content" stream.
> >> Should I
> >> > selectively add the content stream depending on if this is the final
> >> > aggregation of cloud results (and provide a base class writer to act
> if
> >> > not), if so how do I know its the final aggregation. Or is adding the
> >> > content stream within the response writer a bad idea. Should that be
> >> being
> >> > added to the response somewhere else?
> >> >
> >> > Failing all of the above is asking about response writer an X / Y
> >> problem.
> >> > Is their a better way to achieve the above. I'd looked at transforming
> >> > response xml but that seemed not to offer a complete bare slate.
> >> >
> >> > Cheers Lee C
> >> >
> >> >
> >> > On 17 April 2018 at 21:36, Mikhail Khludnev  wrote:
> >> >
> >> > > In distributed search response writer is used twice
> >> > > https://lucene.apache.org/solr/guide/7_1/distributed-requests.html
> >> > > once slave node that's where response writer yields "json" content
> >> and it
> >> > > upset aggregator node which is expect only javabin.
> >> > > I hardly can comment on rrw, it's probably used for responding
> >> separate
> >> > > files in distrib=false mode.
> >> > > You can start from describing why you need to create own response
> >> writer.
> >> > >
> >> > > On Tue, Apr 17, 2018 at 7:02 PM, Lee Carroll <
> >> > lee.a.carr...@googlemail.com
> >> > > >
> >> > > wrote:
> >> > >
> >> > > > Ok. My expectation was the response writer would not be used until
> >> the
> >> > > > final serialization of the result. If my response writer breaks
> the
> >> > > > response writer contract, exactly the way rawResponseWriter does
> and
> >> > just
> >> > > > out puts a filed value how does that work? Does rawResponseWriter
> >> > support
> >> > > > cloud mode?
> >> > > >
> >> > > >
> >> > > >
> >> > > > On 17 April 2018 at 15:55, Mikhail Khludnev 
> >> wrote:
> >> > > >
> >> > > > > That's what should happen.
> >> > > > >
> >> > > > > Expected mime type application/octet-stream but got
> >> application/json.
> >> > > > >
> >> > > > > Distributed search coordinator expect to merge slave responses
> in
> >> > > javabin
> >> > > > > format. But slave's wt indicated json.
> >> > > > > As far as I know only javabin might be used to distributed
> search
> >> > > > > underneath. Coordinator itself might yield json.
> >> > > > >
> >> > > > > On Tue, Apr 17, 2018 at 4:23 PM, Lee Carroll <
> >> > > > lee.a.carr...@googlemail.com
> >> > > > > >
> >> > > > > wrote:
> >> > > > >
> >> > > > > > Sure
> >> > > > > >
> >> > > > > > with 1 shard 1 replica this request works fine
> >> > > > > >
> >> > > > > >1. 

Re: custom response writer which extends RawResponseWriter fails when shards > 1

2018-04-18 Thread Lee Carroll
Sorry, cut-and-paste error. I'd get:

{
  "responseHeader":{
"zkConnected":true,
"status":0,
"QTime":0,
"params":{
  "q":"*:*",
  "fl":"content",
  "rows":"1"}},
  "response":{"numFound":1,"start":0,"docs":[
  {
"content":"my-content-value"}]
  }}


But you get my point.



On 18 April 2018 at 11:13, Lee Carroll  wrote:

> for http://localhost:8983/solr/images/select?fl=content=id:1=1
>
> I'd get
>
> {
>   "responseHeader":{
> "zkConnected":true,
> "status":0,
> "QTime":1,
> "params":{
>   "q":"*:*",
>   "_":"1524046333220"}},
>   "response":{"numFound":1,"start":0,"docs":[
>   {
> "id":"1",
> "content":"my-content-value",
> "*content-type*":"text/plain"}]
>   }}
>
> when i want
>
> my-content-value
>
>
>
> On 18 April 2018 at 10:55, Mikhail Khludnev  wrote:
>
>> Lee, from this description I don see why it can't be addressed by fl,rows
>> params. What makes it different form the typical Solr usage?
>>
>>
>> On Wed, Apr 18, 2018 at 12:31 PM, Lee Carroll <
>> lee.a.carr...@googlemail.com>
>> wrote:
>>
>> > Sure, we want to return a single field's value for the top matching
>> > document for a given query. Bare content rather than a full search
>> result
>> > listing.
>> >
>> > To be concrete:
>> >
>> > For a schema of fields id [unique key], content[stored],content-type[
>> > stored]
>> > For a request:
>> >
>> >1. Request URL:
>> >https://localhost/solr/content?q=id:1
>> >2. Request Method:
>> >GET
>> >
>> > We get a response
>> > HTTP/1.1 200 OK Content-Length: 16261 Content-Type: [content-type value]
>> >
>> > and the body to be the raw value of content
>> >
>> > In short clients consume directly the most relevant "content" returned
>> from
>> > solr queries they construct.
>> >
>> > Naively I've implemented a subclass of RawResponseWriter which takes the
>> > first docs values and adds them to the appended "content" stream.
>> Should I
>> > selectively add the content stream depending on if this is the final
>> > aggregation of cloud results (and provide a base class writer to act if
>> > not), if so how do I know its the final aggregation. Or is adding the
>> > content stream within the response writer a bad idea. Should that be
>> being
>> > added to the response somewhere else?
>> >
>> > Failing all of the above is asking about response writer an X / Y
>> problem.
>> > Is their a better way to achieve the above. I'd looked at transforming
>> > response xml but that seemed not to offer a complete bare slate.
>> >
>> > Cheers Lee C
>> >
>> >
>> > On 17 April 2018 at 21:36, Mikhail Khludnev  wrote:
>> >
>> > > In distributed search response writer is used twice
>> > > https://lucene.apache.org/solr/guide/7_1/distributed-requests.html
>> > > once slave node that's where response writer yields "json" content
>> and it
>> > > upset aggregator node which is expect only javabin.
>> > > I hardly can comment on rrw, it's probably used for responding
>> separate
>> > > files in distrib=false mode.
>> > > You can start from describing why you need to create own response
>> writer.
>> > >
>> > > On Tue, Apr 17, 2018 at 7:02 PM, Lee Carroll <
>> > lee.a.carr...@googlemail.com
>> > > >
>> > > wrote:
>> > >
>> > > > Ok. My expectation was the response writer would not be used until
>> the
>> > > > final serialization of the result. If my response writer breaks the
>> > > > response writer contract, exactly the way rawResponseWriter does and
>> > just
>> > > > out puts a filed value how does that work? Does rawResponseWriter
>> > support
>> > > > cloud mode?
>> > > >
>> > > >
>> > > >
>> > > > On 17 April 2018 at 15:55, Mikhail Khludnev 
>> wrote:
>> > > >
>> > > > > That's what should happen.
>> > > > >
>> > > > > Expected mime type application/octet-stream but got
>> application/json.
>> > > > >
>> > > > > Distributed search coordinator expect to merge slave responses in
>> > > javabin
>> > > > > format. But slave's wt indicated json.
>> > > > > As far as I know only javabin might be used to distributed search
>> > > > > underneath. Coordinator itself might yield json.
>> > > > >
>> > > > > On Tue, Apr 17, 2018 at 4:23 PM, Lee Carroll <
>> > > > lee.a.carr...@googlemail.com
>> > > > > >
>> > > > > wrote:
>> > > > >
>> > > > > > Sure
>> > > > > >
>> > > > > > with 1 shard 1 replica this request works fine
>> > > > > >
>> > > > > >1. Request URL:
>> > > > > >http://localhost:8983/solr/images/image?q=id:1
>> > > > > >2. Request Method:
>> > > > > >GET
>> > > > > >3. Status Code:
>> > > > > >200 OK
>> > > > > >
>> > > > > > logs are clean
>> > > > > >
>> > > > > > with 2 shards 2 replicas the same request fails and in the logs
>> > > > > >
>> > > > > >
>> > > > > > INFO  - 2018-04-17 13:20:32.052; [c:images s:shard2 r:core_node7
>> > > > > > x:images_shard2_replica_n4] org.apache.solr.core.SolrCore;
>> > > > > > 

Re: Learning to Rank (LTR) with grouping

2018-04-18 Thread Diego Ceccarelli (BLOOMBERG/ LONDON)
I just updated the upstream PR - I still have to fix some things in
distributed mode, but the unit tests in non-distributed mode work.

Hope this helps, 
Diego 

From: solr-user@lucene.apache.org At: 04/15/18 03:37:54 To: solr-user@lucene.apache.org
Subject: Re: Learning to Rank (LTR) with grouping

People sometimes fill in the Fix/Version field when they're creating
the JIRA, since anyone can open a JIRA it's hard to control. I took
that out just now.

Basically if the "Resolution" field doesn't indicate it's fixed, you
should assume that it hasn't been addressed.

Patches welcome.

Best,
Erick

On Tue, Apr 3, 2018 at 9:11 AM, ilayaraja  wrote:
> Thanks Roopa.
>
> I was expecting that the issue has been fixed in solr 7.0 as per here
> https://issues.apache.org/jira/browse/SOLR-8776.
>
> Let me see why it is still not working on solr-ltr-7.2.1
>
>
>
> -
> --Ilay
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html




Re: custom response writer which extends RawResponseWriter fails when shards > 1

2018-04-18 Thread Lee Carroll
for http://localhost:8983/solr/images/select?fl=content&q=id:1&rows=1

I'd get

{
  "responseHeader":{
"zkConnected":true,
"status":0,
"QTime":1,
"params":{
  "q":"*:*",
  "_":"1524046333220"}},
  "response":{"numFound":1,"start":0,"docs":[
  {
"id":"1",
"content":"my-content-value",
"*content-type*":"text/plain"}]
  }}

when I want

my-content-value



On 18 April 2018 at 10:55, Mikhail Khludnev  wrote:

> Lee, from this description I don see why it can't be addressed by fl,rows
> params. What makes it different form the typical Solr usage?
>
>
> On Wed, Apr 18, 2018 at 12:31 PM, Lee Carroll <
> lee.a.carr...@googlemail.com>
> wrote:
>
> > Sure, we want to return a single field's value for the top matching
> > document for a given query. Bare content rather than a full search result
> > listing.
> >
> > To be concrete:
> >
> > For a schema of fields id [unique key], content[stored],content-type[
> > stored]
> > For a request:
> >
> >1. Request URL:
> >https://localhost/solr/content?q=id:1
> >2. Request Method:
> >GET
> >
> > We get a response
> > HTTP/1.1 200 OK Content-Length: 16261 Content-Type: [content-type value]
> >
> > and the body to be the raw value of content
> >
> > In short clients consume directly the most relevant "content" returned
> from
> > solr queries they construct.
> >
> > Naively I've implemented a subclass of RawResponseWriter which takes the
> > first docs values and adds them to the appended "content" stream. Should
> I
> > selectively add the content stream depending on if this is the final
> > aggregation of cloud results (and provide a base class writer to act if
> > not), if so how do I know its the final aggregation. Or is adding the
> > content stream within the response writer a bad idea. Should that be
> being
> > added to the response somewhere else?
> >
> > Failing all of the above is asking about response writer an X / Y
> problem.
> > Is their a better way to achieve the above. I'd looked at transforming
> > response xml but that seemed not to offer a complete bare slate.
> >
> > Cheers Lee C
> >
> >
> > On 17 April 2018 at 21:36, Mikhail Khludnev  wrote:
> >
> > > In distributed search response writer is used twice
> > > https://lucene.apache.org/solr/guide/7_1/distributed-requests.html
> > > once slave node that's where response writer yields "json" content and
> it
> > > upset aggregator node which is expect only javabin.
> > > I hardly can comment on rrw, it's probably used for responding separate
> > > files in distrib=false mode.
> > > You can start from describing why you need to create own response
> writer.
> > >
> > > On Tue, Apr 17, 2018 at 7:02 PM, Lee Carroll <
> > lee.a.carr...@googlemail.com
> > > >
> > > wrote:
> > >
> > > > Ok. My expectation was the response writer would not be used until
> the
> > > > final serialization of the result. If my response writer breaks the
> > > > response writer contract, exactly the way rawResponseWriter does and
> > just
> > > > out puts a filed value how does that work? Does rawResponseWriter
> > support
> > > > cloud mode?
> > > >
> > > >
> > > >
> > > > On 17 April 2018 at 15:55, Mikhail Khludnev  wrote:
> > > >
> > > > > That's what should happen.
> > > > >
> > > > > Expected mime type application/octet-stream but got
> application/json.
> > > > >
> > > > > Distributed search coordinator expect to merge slave responses in
> > > javabin
> > > > > format. But slave's wt indicated json.
> > > > > As far as I know only javabin might be used to distributed search
> > > > > underneath. Coordinator itself might yield json.
> > > > >
> > > > > On Tue, Apr 17, 2018 at 4:23 PM, Lee Carroll <
> > > > lee.a.carr...@googlemail.com
> > > > > >
> > > > > wrote:
> > > > >
> > > > > > Sure
> > > > > >
> > > > > > with 1 shard 1 replica this request works fine
> > > > > >
> > > > > >1. Request URL:
> > > > > >http://localhost:8983/solr/images/image?q=id:1
> > > > > >2. Request Method:
> > > > > >GET
> > > > > >3. Status Code:
> > > > > >200 OK
> > > > > >
> > > > > > logs are clean
> > > > > >
> > > > > > with 2 shards 2 replicas the same request fails and in the logs
> > > > > >
> > > > > >
> > > > > > INFO  - 2018-04-17 13:20:32.052; [c:images s:shard2 r:core_node7
> > > > > > x:images_shard2_replica_n4] org.apache.solr.core.SolrCore;
> > > > > > [images_shard2_replica_n4]  webapp=/solr path=/image
> > > > > > params={df=text=false=/image=id=score&
> > > > > > shards.purpose=4=0=true=
> > > > > > http://10.224.30.207:8983/solr/images_shard2_replica_n4/
> > > > > > |http://10.224.30.207:7574/solr/images_shard2_replica_n6/
> > > > > > =10=2=id:1=1523971232039=true=
> > javabin}
> > > > > > hits=0 status=0 QTime=0
> > > > > > ERROR - 2018-04-17 13:20:32.055; [c:images s:shard1 r:core_node3
> > > > > > x:images_shard1_replica_n1] org.apache.solr.common.
> SolrException;
> > > > > > 

Re: custom response writer which extends RawResponseWriter fails when shards > 1

2018-04-18 Thread Mikhail Khludnev
Lee, from this description I don't see why it can't be addressed by the fl and
rows params. What makes it different from typical Solr usage?


On Wed, Apr 18, 2018 at 12:31 PM, Lee Carroll 
wrote:

> Sure, we want to return a single field's value for the top matching
> document for a given query. Bare content rather than a full search result
> listing.
>
> To be concrete:
>
> For a schema of fields id [unique key], content[stored],content-type[
> stored]
> For a request:
>
>1. Request URL:
>https://localhost/solr/content?q=id:1
>2. Request Method:
>GET
>
> We get a response
> HTTP/1.1 200 OK Content-Length: 16261 Content-Type: [content-type value]
>
> and the body to be the raw value of content
>
> In short clients consume directly the most relevant "content" returned from
> solr queries they construct.
>
> Naively I've implemented a subclass of RawResponseWriter which takes the
> first docs values and adds them to the appended "content" stream. Should I
> selectively add the content stream depending on if this is the final
> aggregation of cloud results (and provide a base class writer to act if
> not), if so how do I know its the final aggregation. Or is adding the
> content stream within the response writer a bad idea. Should that be being
> added to the response somewhere else?
>
> Failing all of the above is asking about response writer an X / Y problem.
> Is their a better way to achieve the above. I'd looked at transforming
> response xml but that seemed not to offer a complete bare slate.
>
> Cheers Lee C
>
>
> On 17 April 2018 at 21:36, Mikhail Khludnev  wrote:
>
> > In distributed search response writer is used twice
> > https://lucene.apache.org/solr/guide/7_1/distributed-requests.html
> > once slave node that's where response writer yields "json" content and it
> > upset aggregator node which is expect only javabin.
> > I hardly can comment on rrw, it's probably used for responding separate
> > files in distrib=false mode.
> > You can start from describing why you need to create own response writer.
> >
> > On Tue, Apr 17, 2018 at 7:02 PM, Lee Carroll <
> lee.a.carr...@googlemail.com
> > >
> > wrote:
> >
> > > Ok. My expectation was the response writer would not be used until the
> > > final serialization of the result. If my response writer breaks the
> > > response writer contract, exactly the way rawResponseWriter does and
> just
> > > out puts a filed value how does that work? Does rawResponseWriter
> support
> > > cloud mode?
> > >
> > >
> > >
> > > On 17 April 2018 at 15:55, Mikhail Khludnev  wrote:
> > >
> > > > That's what should happen.
> > > >
> > > > Expected mime type application/octet-stream but got application/json.
> > > >
> > > > Distributed search coordinator expect to merge slave responses in
> > javabin
> > > > format. But slave's wt indicated json.
> > > > As far as I know only javabin might be used to distributed search
> > > > underneath. Coordinator itself might yield json.
> > > >
> > > > On Tue, Apr 17, 2018 at 4:23 PM, Lee Carroll <
> > > lee.a.carr...@googlemail.com
> > > > >
> > > > wrote:
> > > >
> > > > > Sure
> > > > >
> > > > > with 1 shard 1 replica this request works fine
> > > > >
> > > > >1. Request URL:
> > > > >http://localhost:8983/solr/images/image?q=id:1
> > > > >2. Request Method:
> > > > >GET
> > > > >3. Status Code:
> > > > >200 OK
> > > > >
> > > > > logs are clean
> > > > >
> > > > > with 2 shards 2 replicas the same request fails and in the logs
> > > > >
> > > > >
> > > > > INFO  - 2018-04-17 13:20:32.052; [c:images s:shard2 r:core_node7
> > > > > x:images_shard2_replica_n4] org.apache.solr.core.SolrCore;
> > > > > [images_shard2_replica_n4]  webapp=/solr path=/image
> > > > > params={df=text=false=/image=id=score&
> > > > > shards.purpose=4=0=true=
> > > > > http://10.224.30.207:8983/solr/images_shard2_replica_n4/
> > > > > |http://10.224.30.207:7574/solr/images_shard2_replica_n6/
> > > > > =10=2=id:1=1523971232039=true=
> javabin}
> > > > > hits=0 status=0 QTime=0
> > > > > ERROR - 2018-04-17 13:20:32.055; [c:images s:shard1 r:core_node3
> > > > > x:images_shard1_replica_n1] org.apache.solr.common.SolrException;
> > > > > org.apache.solr.client.solrj.impl.HttpSolrClient$
> > RemoteSolrException:
> > > > > Error
> > > > > from server at http://10.224.30.207:8983/
> > solr/images_shard2_replica_n4
> > > :
> > > > > Expected mime type application/octet-stream but got
> application/json.
> > > > > at
> > > > > org.apache.solr.client.solrj.impl.HttpSolrClient.
> > > > > executeMethod(HttpSolrClient.java:607)
> > > > > at
> > > > > org.apache.solr.client.solrj.impl.HttpSolrClient.request(
> > > > > HttpSolrClient.java:255)
> > > > > at
> > > > > org.apache.solr.client.solrj.impl.HttpSolrClient.request(
> > > > > HttpSolrClient.java:244)
> > > > > at
> > > > > org.apache.solr.client.solrj.impl.LBHttpSolrClient.
> > > > > 

Issue with Solr Case Insensitive Issue

2018-04-18 Thread Kapil Bhardwaj
Hi Team,

Warm Greeting to all,

I have started using Solr lately. Currently I am facing an issue with
case-insensitive sorting for a field.

We are using Solr on top of Cassandra v5 for index-based searching. We have
a field layoutpath which we want to make case insensitive, because
currently it sorts numbers first, then lowercase and then uppercase
characters. So we want to make the values uniform by converting them to
lowercase and applying the sort to that.

To achieve this we have created schema like below:-


  
 
   
 
 

  
 
  

 

We have to create a copy field because layout_path is part of the unique key.

layout_path
(version, shardkey, layout_path)

Now we apply above created lower_case_search on the derived field.
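
For reference, a minimal sketch of such a field type, derived field and copy field
(the names follow this thread; the analyzer and attribute details are assumptions,
not the exact schema used here):

  <fieldType name="lower_case_search" class="solr.TextField" sortMissingLast="true">
    <analyzer>
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>
  <field name="layout_path_search" type="lower_case_search" indexed="true" stored="false"/>
  <copyField source="layout_path" dest="layout_path_search"/>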



After making the changes I RELOADED the schema via a terminal command and tried
to re-index using the Solr Core Admin button.

But after making the above changes I am not seeing case-insensitive search
working.

It's urgent. Any help would be highly appreciated.

Regards,
Kapil Bhardwaj


Re: custom response writer which extends RawResponseWriter fails when shards > 1

2018-04-18 Thread Lee Carroll
Sure, we want to return a single field's value for the top matching
document for a given query. Bare content rather than a full search result
listing.

To be concrete:

For a schema of fields id [unique key], content[stored],content-type[stored]
For a request:

   1. Request URL:
   https://localhost/solr/content?q=id:1
   2. Request Method:
   GET

We get a response
HTTP/1.1 200 OK Content-Length: 16261 Content-Type: [content-type value]

and the body to be the raw value of content

In short, clients directly consume the most relevant "content" returned from
Solr queries they construct.

Naively, I've implemented a subclass of RawResponseWriter which takes the
first doc's values and adds them to the appended "content" stream. Should I
selectively add the content stream depending on whether this is the final
aggregation of cloud results (and provide a base-class writer to act if
not)? If so, how do I know it's the final aggregation? Or is adding the
content stream within the response writer a bad idea, and should it be added
to the response somewhere else?

Failing all of the above: is asking about a response writer an X/Y problem?
Is there a better way to achieve the above? I'd looked at transforming the
response XML, but that seemed not to offer a completely bare slate.

Cheers Lee C


On 17 April 2018 at 21:36, Mikhail Khludnev  wrote:

> In distributed search response writer is used twice
> https://lucene.apache.org/solr/guide/7_1/distributed-requests.html
> once slave node that's where response writer yields "json" content and it
> upset aggregator node which is expect only javabin.
> I hardly can comment on rrw, it's probably used for responding separate
> files in distrib=false mode.
> You can start from describing why you need to create own response writer.
>
> On Tue, Apr 17, 2018 at 7:02 PM, Lee Carroll  >
> wrote:
>
> > Ok. My expectation was the response writer would not be used until the
> > final serialization of the result. If my response writer breaks the
> > response writer contract, exactly the way rawResponseWriter does and just
> > out puts a filed value how does that work? Does rawResponseWriter support
> > cloud mode?
> >
> >
> >
> > On 17 April 2018 at 15:55, Mikhail Khludnev  wrote:
> >
> > > That's what should happen.
> > >
> > > Expected mime type application/octet-stream but got application/json.
> > >
> > > Distributed search coordinator expect to merge slave responses in
> javabin
> > > format. But slave's wt indicated json.
> > > As far as I know only javabin might be used to distributed search
> > > underneath. Coordinator itself might yield json.
> > >
> > > On Tue, Apr 17, 2018 at 4:23 PM, Lee Carroll <
> > lee.a.carr...@googlemail.com
> > > >
> > > wrote:
> > >
> > > > Sure
> > > >
> > > > with 1 shard 1 replica this request works fine
> > > >
> > > >1. Request URL:
> > > >http://localhost:8983/solr/images/image?q=id:1
> > > >2. Request Method:
> > > >GET
> > > >3. Status Code:
> > > >200 OK
> > > >
> > > > logs are clean
> > > >
> > > > with 2 shards 2 replicas the same request fails and in the logs
> > > >
> > > >
> > > > INFO  - 2018-04-17 13:20:32.052; [c:images s:shard2 r:core_node7
> > > > x:images_shard2_replica_n4] org.apache.solr.core.SolrCore;
> > > > [images_shard2_replica_n4]  webapp=/solr path=/image
> > > > params={df=text=false=/image=id=score&
> > > > shards.purpose=4=0=true=
> > > > http://10.224.30.207:8983/solr/images_shard2_replica_n4/
> > > > |http://10.224.30.207:7574/solr/images_shard2_replica_n6/
> > > > =10=2=id:1=1523971232039=true=javabin}
> > > > hits=0 status=0 QTime=0
> > > > ERROR - 2018-04-17 13:20:32.055; [c:images s:shard1 r:core_node3
> > > > x:images_shard1_replica_n1] org.apache.solr.common.SolrException;
> > > > org.apache.solr.client.solrj.impl.HttpSolrClient$
> RemoteSolrException:
> > > > Error
> > > > from server at http://10.224.30.207:8983/
> solr/images_shard2_replica_n4
> > :
> > > > Expected mime type application/octet-stream but got application/json.
> > > > at
> > > > org.apache.solr.client.solrj.impl.HttpSolrClient.
> > > > executeMethod(HttpSolrClient.java:607)
> > > > at
> > > > org.apache.solr.client.solrj.impl.HttpSolrClient.request(
> > > > HttpSolrClient.java:255)
> > > > at
> > > > org.apache.solr.client.solrj.impl.HttpSolrClient.request(
> > > > HttpSolrClient.java:244)
> > > > at
> > > > org.apache.solr.client.solrj.impl.LBHttpSolrClient.
> > > > doRequest(LBHttpSolrClient.java:483)
> > > > at
> > > > org.apache.solr.client.solrj.impl.LBHttpSolrClient.request(
> > > > LBHttpSolrClient.java:413)
> > > > at
> > > > org.apache.solr.handler.component.HttpShardHandlerFactory.
> > > > makeLoadBalancedRequest(HttpShardHandlerFactory.java:273)
> > > > at
> > > > org.apache.solr.handler.component.HttpShardHandler.lambda$submit$0(
> > > > HttpShardHandler.java:175)
> > > > at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> > > > at 

schema-api: modifying schema in xml format

2018-04-18 Thread Arturas Mazeika
Hi solr-users,

is it possible to modify the managed schema using schema api and submit the
commands in XML format? I am able to add a data type using:

curl -X POST -H 'Content-type:application/json' --data-binary '{
  "add-field-type": {
  "name":"text_de_ph",
  "class":"solr.TextField",
  "positionIncrementGap":"100",
  "analyzer": {
"tokenizer": {"class":"solr.StandardTokenizerFactory"},
"filters": [
  {"class":"solr.LowerCaseFilterFactory"},
  {"class":"solr.StopFilterFactory", "format":"snowball",
"words":"lang/stopwords_de.txt", "ignoreCase":true},
  {"class":"solr.GermanNormalizationFilterFactory"},
  {"class":"solr.GermanLightStemFilterFactory"},
  {"class":"solr.PhoneticFilterFactory", "encoder":"DoubleMetaphone"}
  ]}}
}' http://localhost:8983/solr/tph/schema

so I thought I could submit something like:

curl -X POST -H 'Content-Type: text/xml' --data-binary '
<fieldType name="text_de_ph" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" format="snowball" words="lang/stopwords_de.txt" ignoreCase="true"/>
    <filter class="solr.GermanNormalizationFilterFactory"/>
    <filter class="solr.GermanLightStemFilterFactory"/>
    <filter class="solr.PhoneticFilterFactory" encoder="DoubleMetaphone"/>
  </analyzer>
</fieldType>
' http://localhost:8983/solr/tph/schema

This however failed with the error:

{
  "responseHeader":{
"status":500,
"QTime":1},
  "error":{
"msg":"JSON Parse Error: char=<,position=1 AFTER=' ...

The examples in the documentation (I am using Solr 7.2) are all in JSON
format, but it does not say explicitly that one needs to send the updates in
JSON format only.

https://lucene.apache.org/solr/guide/7_2/schema-api.html#schema-api

Comments?

Cheers,
Arturas


Re: solr 6.6.3 intermittent group faceting errors(Lucene54DocValuesProducer)

2018-04-18 Thread Jay Potharaju
Thanks Erick & Shawn for chiming in! In my solrconfig the Lucene version is
set to 6.6.3. I do see that the index has Lucene54 files.

With respect to the group faceting error, it is similar to what is being
reported in SOLR-7867.

Thanks
Jay Potharaju


On Tue, Apr 17, 2018 at 8:17 PM, Shawn Heisey  wrote:

> On 4/17/2018 8:44 PM, Erick Erickson wrote:
>
>> The other possibility is that you have LuceneMatchVersion set to
>> 5-something in solrconfig.xml.
>>
>
> It's my understanding that luceneMatchVersion does NOT affect index format
> in any way, that about the only things that pay attention to this value are
> a subset of analysis components. Do I have an incorrect understanding?
>
> Does a Solr user even have the ability to influence the index format used
> without writing custom code?
>
> Thanks,
> Shawn
>
>


Re: Infostream question

2018-04-18 Thread Bernd Fehling
You have to check your log4j.properties, usually located at
server/resources/log4j.properties.
There is a line about infostream logging; change it from OFF to INFO to enable it.

# set to INFO to enable infostream log messages
log4j.logger.org.apache.solr.update.LoggingInfoStream=OFF
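
For reference, the matching solrconfig.xml side is just a flag (a sketch; in recent
versions the element sits inside the indexConfig section):

  <indexConfig>
    <infoStream>true</infoStream>
  </indexConfig>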

Regards
Bernd


On 17.04.2018 at 20:56, Yunee Lee wrote:
> Hi,
> Current solr server is 5.2 and I want to enable infoStream and updated the 
> solrconfig.xml.
> Reload the config. But it doesn’t create any logs. Do I need to configure 
> anything else?
> Thanks.
> <infoStream>true</infoStream>
>