Re: Integrating Stanford NLP or any other NLP for Natural Language Query

2016-07-07 Thread Yangrui Guo
My solution lets users retrieve data entities using queries like "find me a
job that only requires a high school degree" and "I want a car from
America with alloy wheels". It can also be expanded to perform other
database queries, like date-time or price-range searches. I use Stanford
NLP to identify the main entity and its related attributes in a user
sentence.
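
For reference, a minimal sketch of that kind of query analysis with Stanford
CoreNLP; the annotator choice and the console output are assumptions for
illustration, not the actual squery code:

import java.util.Properties;

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

public class QueryAnalysisSketch {
    public static void main(String[] args) {
        // Tokenize, POS-tag and NER-tag the user sentence.
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        Annotation query = new Annotation("I want a car from America with alloy wheels");
        pipeline.annotate(query);

        for (CoreMap sentence : query.get(CoreAnnotations.SentencesAnnotation.class)) {
            for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
                String pos = token.get(CoreAnnotations.PartOfSpeechAnnotation.class);
                String ner = token.get(CoreAnnotations.NamedEntityTagAnnotation.class);
                // Nouns ("car", "wheels") are candidate entities/attributes; NER marks
                // values such as "America" (LOCATION) that can become Solr filters.
                System.out.println(token.word() + "\t" + pos + "\t" + ner);
            }
        }
    }
}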

Yangrui

On Thursday, July 7, 2016, Puneet Pawaia  wrote:

> Hi  Yangrui
> I would like users to be able to write queries in natural language rather
> than keyword based search.
> A link to your solution would be worth looking at.
> Regards
> Puneet
>
> On 8 Jul 2016 03:02, "Yangrui Guo"  wrote:
>
> What is your NLP search like? I have a NLP solution for Solr and just open
> sourced it. Not sure if it fits your need
>
> Yangrui
>
> On Thursday, July 7, 2016, Puneet Pawaia  > wrote:
>
> > Hi
> >
> > I am currently using Solr 5.5.x to test but can upgrade to Solr 6.x if
> > required.
> > I am working on a POC for natural language query using Solr. Should I use
> > the Stanford libraries or are there any other libraries having integration
> > with Solr already available.
> > Any direction in how to do this would be most appreciated. How should I
> > process the query to give relevant results.
> >
> > Regards
> > Puneet
> >
>


Re: Integrating Stanford NLP or any other NLP for Natural Language Query

2016-07-07 Thread Puneet Pawaia
Thanks for the link.
I'll take a look at it later in the day once I am at office.
Puneet

On 8 Jul 2016 08:19, "Yangrui Guo"  wrote:

https://github.com/guoyangrui/squery

It's not well documented yet, but the idea is simple. Users should first
format their database tables into triples by creating views; then Solr and
Stanford NLP handle the data retrieval part. I hope someone can continue
contributing to its development.
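
A hedged SolrJ sketch of the indexing half of that idea -- flattening a view
row into (subject, predicate, object) documents; the core name "triples" and
the dynamic field names are assumptions:

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class TripleIndexingSketch {
    public static void main(String[] args) throws Exception {
        // Solr 5.x-era constructor, matching the versions discussed in this thread.
        SolrClient solr = new HttpSolrClient("http://localhost:8983/solr/triples");

        // One row of a relational view flattened into subject/predicate/object triples.
        String[][] triples = {
                {"car_42", "make", "Ford"},
                {"car_42", "feature", "alloy wheels"}
        };
        for (String[] t : triples) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", t[0] + "/" + t[1] + "/" + t[2]);
            doc.addField("subject_s", t[0]);
            doc.addField("predicate_s", t[1]);
            doc.addField("object_s", t[2]);
            solr.add(doc);
        }
        solr.commit();
        solr.close();
    }
}

The NLP side then only has to map the parsed entity and attributes onto
subject/predicate/object queries.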

Yangrui

On Thursday, July 7, 2016, John Blythe  wrote:

> can you share a link, i'd be interested in checking it out.
>
> thanks-
>
> --
> *John Blythe*
> Product Manager & Lead Developer
>
> 251.605.3071 | j...@curvolabs.com 
> www.curvolabs.com
>
> 58 Adams Ave
> Evansville, IN 47713
>
> On Thu, Jul 7, 2016 at 4:32 PM, Yangrui Guo  > wrote:
>
> > What is your NLP search like? I have a NLP solution for Solr and just open
> > sourced it. Not sure if it fits your need
> >
> > Yangrui
> >
> > On Thursday, July 7, 2016, Puneet Pawaia  > wrote:
> >
> > > Hi
> > >
> > > I am currently using Solr 5.5.x to test but can upgrade to Solr 6.x if
> > > required.
> > I am working on a POC for natural language query using Solr. Should I use
> > the Stanford libraries or are there any other libraries having integration
> > with Solr already available.
> > Any direction in how to do this would be most appreciated. How should I
> > process the query to give relevant results.
> > >
> > > Regards
> > > Puneet
> > >
> >
>


Re: Integrating Stanford NLP or any other NLP for Natural Language Query

2016-07-07 Thread Puneet Pawaia
Hi Jay
Any place I can learn more on this method of integration?
Thanks
Puneet

On 8 Jul 2016 02:58, "Jay Urbain"  wrote:

> I use Stanford NLP and cTakes (based on OpenNLP) while indexing with a
> SOLRJ application.
>
> Best,
> Jay
>
> On Thu, Jul 7, 2016 at 12:09 PM, Puneet Pawaia 
> wrote:
>
> > Hi
> >
> > I am currently using Solr 5.5.x to test but can upgrade to Solr 6.x if
> > required.
> > I am working on a POC for natural language query using Solr. Should I use
> > the Stanford libraries or are there any other libraries having integration
> > with Solr already available.
> > Any direction in how to do this would be most appreciated. How should I
> > process the query to give relevant results.
> >
> > Regards
> > Puneet
> >
>


Re: Integrating Stanford NLP or any other NLP for Natural Language Query

2016-07-07 Thread Puneet Pawaia
Hi  Yangrui
I would like users to be able to write queries in natural language rather
than keyword based search.
A link to your solution would be worth looking at.
Regards
Puneet

On 8 Jul 2016 03:02, "Yangrui Guo"  wrote:

What is your NLP search like? I have a NLP solution for Solr and just open
sourced it. Not sure if it fits your need

Yangrui

On Thursday, July 7, 2016, Puneet Pawaia  wrote:

> Hi
>
> I am currently using Solr 5.5.x to test but can upgrade to Solr 6.x if
> required.
> I am working on a POC for natural language query using Solr. Should I use
> the Stanford libraries or are there any other libraries having integration
> with Solr already available.
> Any direction in how to do this would be most appreciated. How should I
> process the query to give relevant results.
>
> Regards
> Puneet
>


Re: Integrating Stanford NLP or any other NLP for Natural Language Query

2016-07-07 Thread Yangrui Guo
https://github.com/guoyangrui/squery

It's not well documented yet, but the idea is simple. Users should first
format their database tables into triples by creating views; then Solr and
Stanford NLP handle the data retrieval part. I hope someone can continue
contributing to its development.

Yangrui

On Thursday, July 7, 2016, John Blythe  wrote:

> can you share a link, i'd be interested in checking it out.
>
> thanks-
>
> --
> *John Blythe*
> Product Manager & Lead Developer
>
> 251.605.3071 | j...@curvolabs.com 
> www.curvolabs.com
>
> 58 Adams Ave
> Evansville, IN 47713
>
> On Thu, Jul 7, 2016 at 4:32 PM, Yangrui Guo  > wrote:
>
> > What is your NLP search like? I have a NLP solution for Solr and just open
> > sourced it. Not sure if it fits your need
> >
> > Yangrui
> >
> > On Thursday, July 7, 2016, Puneet Pawaia  > wrote:
> >
> > > Hi
> > >
> > > I am currently using Solr 5.5.x to test but can upgrade to Solr 6.x if
> > > required.
> > > I am working on a POC for natural language query using Solr. Should I use
> > > the Stanford libraries or are there any other libraries having integration
> > > with Solr already available.
> > > Any direction in how to do this would be most appreciated. How should I
> > > process the query to give relevant results.
> > >
> > > Regards
> > > Puneet
> > >
> >
>


Re: Integrating Stanford NLP or any other NLP for Natural Language Query

2016-07-07 Thread John Blythe
can you share a link, i'd be interested in checking it out.

thanks-

-- 
*John Blythe*
Product Manager & Lead Developer

251.605.3071 | j...@curvolabs.com
www.curvolabs.com

58 Adams Ave
Evansville, IN 47713

On Thu, Jul 7, 2016 at 4:32 PM, Yangrui Guo  wrote:

> What is your NLP search like? I have a NLP solution for Solr and just open
> sourced it. Not sure if it fits your need
>
> Yangrui
>
> On Thursday, July 7, 2016, Puneet Pawaia  wrote:
>
> > Hi
> >
> > I am currently using Solr 5.5.x to test but can upgrade to Solr 6.x if
> > required.
> > I am working on a POC for natural language query using Solr. Should I use
> > the Stanford libraries or are there any other libraries having integration
> > with Solr already available.
> > Any direction in how to do this would be most appreciated. How should I
> > process the query to give relevant results.
> >
> > Regards
> > Puneet
> >
>


Re: Boosting query results

2016-07-07 Thread Walter Underwood
I think it works to join against the other collection to get scores. But I’m 
not sure. I think that was suggested for a fairly static collection of 
documents with rapidly changing scoring inputs.

Personally, I would try a straight popularity boost to see if it got you 80% of 
the way there.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Jul 7, 2016, at 2:46 PM, Mark T. Trembley  
> wrote:
> 
> Yes, the spam issue is something I'm aware of. I plan on having some sanity 
> checks in place to make sure that the boosts are in line with expectations 
> either at query time or while indexing the scores into Solr.
> 
> I just read through that document along with some of the more recent posts 
> about signals, and it appears that I'm going down the same path as 
> Lucidworks. I'm storing the aggregated search term and product id in an 
> alternate index.  It seems that the piece that I'm missing is getting the 
> boost per document. In the following post, it appears to me that Fusion is 
> applying a boost to the main query by obtaining the scores from a set number 
> of documents from the aggregate collection. I'm going to assume that part of 
> its query processing pipeline is to run a query on the aggregation 
> collection to obtain the scores from that query and return them for use on 
> the main query.
> 
> https://lucidworks.com/blog/2015/09/01/better-search-fusion-signals/
> 
> I think I could possibly hack something together on my side that mimics what 
> I think is happening in Fusion, but with my tinkering, it seems to me that 
> using a !join query (with scoring) like I've been trying could handle the job 
> if I could only understand how the query executes on the joined collection 
> and how I can pass a calculated score back to the main query for use in 
> calculating a final score on the main collection.
> 
> 
> On 7/7/2016 1:34 PM, Walter Underwood wrote:
>> If it is running in an environment protected from spammers, you might want 
>> to start with the work that LucidWorks did on click scoring.
>> 
>> https://lucidworks.com/blog/2015/03/23/mixed-signals-using-lucidworks-fusions-signals-api/
>>  
>> 
>> 
>> Of course, there are no environments free of spammers. I’ve seen them in 
>> enterprise search, too. But they are easier to deal with there. Call them up 
>> and tell them they need to stop immediately or their pages disappear from 
>> the search engine.
>> 
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
>> 
>>> On Jul 7, 2016, at 11:29 AM, Walter Underwood  wrote:
>>> 
>>> You understand that you are making your site extremely easy to spam, right? 
>>> This is how Microsoft became the top hit for “evil empire” on Google.
>>> 
>>> wunder
>>> Walter Underwood
>>> wun...@wunderwood.org
>>> http://observer.wunderwood.org/  (my blog)
>>> 
>>> 
 On Jul 7, 2016, at 11:25 AM, Mark T. Trembley  
 wrote:
 
 I've found that it is definitely complicated!
 
 Essentially what I am attempting to do is boost products based on the 
 number of times that particular product has been selected via historical 
 searches using the same search term or phrase.
 
 
 On 7/7/2016 11:55 AM, Walter Underwood wrote:
> That is a very complicated design. What are you trying to achieve? Maybe 
> there is a different approach that is simpler.
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
> 
> 
>> On Jul 7, 2016, at 9:26 AM, Mark T. Trembley 
>>  wrote:
>> 
>> That works with static boosts based on documents matching the query 
>> "Boost2". I want to apply a different boost to documents based on the 
>> value assigned to Boost2 within the document.
>> 
>> From my sample documents, when running a query with "Boost2," I want 
>> Document2 boosted by 20.0 and Document6 boosted by 15.0:
>> 
>> {
>>  "id" : "Document2_Boost2",
>>  "B1_s" : "Boost2",
>>  "B1_f" : 20
>> }
>> {
>>  "id" : "Document6_Boost2",
>>  "B1_s" : "Boost2",
>>  "B1_f" : 15
>> }
>> 
>> 
>> On 7/7/2016 10:21 AM, Walter Underwood wrote:
>>> This looks like a job for “bq”, the boost query parameter. I used this 
>>> to boost textbooks which were used at the student’s school. bq does not 
>>> force documents to be included in the result set. It does affect the 
>>> ranking of the included documents.
>>> 
>>> bq=B1_ss:Boost2 will boost documents that match that. You can use 
>>> weights, like bq=B1_ss:Boost2^10
>>> 
>>> Here is the relationship between fq, q, and bq:
>>> 
>>> fq: selection, does not 

Re: Boosting query results

2016-07-07 Thread Mark T. Trembley
Yes, the spam issue is something I'm aware of. I plan on having some 
sanity checks in place to make sure that the boosts are in line with 
expectations either at query time or while indexing the scores into Solr.


I just read through that document along with some of the more recent 
posts about signals, and it appears that I'm going down the same path as 
Lucidworks. I'm storing the aggregated search term and product id in an 
alternate index.  It seems that the piece that I'm missing is getting 
the boost per document. In the following post, it appears to me that 
Fusion is applying a boost to the main query by obtaining the scores 
from a set number of documents from the aggregate collection. I'm going 
to assume that part of its query processing pipeline is to run a query 
on the aggregation collection to obtain the scores from that query and 
return them for use on the main query.


https://lucidworks.com/blog/2015/09/01/better-search-fusion-signals/

I think I could possibly hack something together on my side that mimics 
what I think is happening in Fusion, but with my tinkering, it seems to 
me that using a !join query (with scoring) like I've been trying could 
handle the job if I could only understand how the query executes on the 
joined collection and how I can pass a calculated score back to the main 
query for use in calculating a final score on the main collection.
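
A hedged SolrJ sketch of that two-query flow; the collection names ("signals",
"products") and field names are assumptions, and this is only an approximation
of what Fusion's pipeline does:

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class SignalsBoostSketch {
    public static void main(String[] args) throws Exception {
        SolrClient signals = new HttpSolrClient("http://localhost:8983/solr/signals");
        SolrClient products = new HttpSolrClient("http://localhost:8983/solr/products");

        // 1) Look up aggregated (product id, score) pairs for the search term.
        QueryResponse sig = signals.query(
                new SolrQuery("term_s:\"alloy wheels\"").setRows(20));

        // 2) Turn them into a boost query so only ranking, not selection, is affected.
        StringBuilder bq = new StringBuilder();
        for (SolrDocument d : sig.getResults()) {
            bq.append("id:").append(d.getFieldValue("product_id_s"))
              .append('^').append(d.getFieldValue("score_f")).append(' ');
        }

        SolrQuery q = new SolrQuery("alloy wheels");
        q.set("defType", "edismax");       // bq is an (e)dismax parameter
        q.set("bq", bq.toString().trim());
        QueryResponse results = products.query(q);
        System.out.println(results.getResults().getNumFound() + " hits");

        signals.close();
        products.close();
    }
}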



On 7/7/2016 1:34 PM, Walter Underwood wrote:

If it is running in an environment protected from spammers, you might want to 
start with the work that LucidWorks did on click scoring.

https://lucidworks.com/blog/2015/03/23/mixed-signals-using-lucidworks-fusions-signals-api/
 


Of course, there are no environments free of spammers. I’ve seen them in 
enterprise search, too. But they are easier to deal with there. Call them up 
and tell them they need to stop immediately or their pages disappear from the 
search engine.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)



On Jul 7, 2016, at 11:29 AM, Walter Underwood  wrote:

You understand that you are making your site extremely easy to spam, right? 
This is how Microsoft became the top hit for “evil empire” on Google.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)



On Jul 7, 2016, at 11:25 AM, Mark T. Trembley  
wrote:

I've found that it is definitely complicated!

Essentially what I am attempting to do is boost products based on the number of 
times that particular product has been selected via historical searches using 
the same search term or phrase.


On 7/7/2016 11:55 AM, Walter Underwood wrote:

That is a very complicated design. What are you trying to achieve? Maybe there 
is a different approach that is simpler.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)



On Jul 7, 2016, at 9:26 AM, Mark T. Trembley  wrote:

That works with static boosts based on documents matching the query "Boost2". I 
want to apply a different boost to documents based on the value assigned to Boost2 within 
the document.

 From my sample documents, when running a query with "Boost2," I want Document2 
boosted by 20.0 and Document6 boosted by 15.0:

{
  "id" : "Document2_Boost2",
  "B1_s" : "Boost2",
  "B1_f" : 20
}
{
  "id" : "Document6_Boost2",
  "B1_s" : "Boost2",
  "B1_f" : 15
}


On 7/7/2016 10:21 AM, Walter Underwood wrote:

This looks like a job for “bq”, the boost query parameter. I used this to boost 
textbooks which were used at the student’s school. bq does not force documents 
to be included in the result set. It does affect the ranking of the included 
documents.

bq=B1_ss:Boost2 will boost documents that match that. You can use weights, like 
bq=B1_ss:Boost2^10

Here is the relationship between fq, q, and bq:

fq: selection, does not affect ranking
q: selection and ranking
bq: does not affect selection, affects ranking

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)



On Jul 7, 2016, at 7:30 AM, Mark T. Trembley  wrote:

I have a question about the best way to rank my results based on a score field 
that can have different values per document and where each document can have 
different scores based on which term is queried.

Essentially what I'm wanting to have happen is provide a list of terms that when matched via a 
query it returns a corresponding score to help boost the original document. So if I had a document 
with a multi-valued field named B1_ss with terms [Boost1|10], [Boost2|20], [Boost3|100] and my 
search query is "Boost2", I want that document's result to be boosted by 20. Also note 
that "Boost2" can boost different documents at different levels. The query to select the 
actual documents will select 

Re: Integrating Stanford NLP or any other NLP for Natural Language Query

2016-07-07 Thread Yangrui Guo
What is your NLP search like? I have a NLP solution for Solr and just open
sourced it. Not sure if it fits your need

Yangrui

On Thursday, July 7, 2016, Puneet Pawaia  wrote:

> Hi
>
> I am currently using Solr 5.5.x to test but can upgrade to Solr 6.x if
> required.
> I am working on a POC for natural language query using Solr. Should I use
> the Stanford libraries or are there any other libraries having integration
> with Solr already available.
> Any direction in how to do this would be most appreciated. How should I
> process the query to give relevant results.
>
> Regards
> Puneet
>


Re: Integrating Stanford NLP or any other NLP for Natural Language Query

2016-07-07 Thread Jay Urbain
I use Stanford NLP and cTakes (based on OpenNLP) while indexing with a
SOLRJ application.
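
For reference, a minimal sketch of that kind of index-time enrichment with
SolrJ and Stanford CoreNLP; the cTakes step is omitted and the core URL and
field names are assumptions, not Jay's actual code:

import java.util.Properties;

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class NlpEnrichedIndexingSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner");
        StanfordCoreNLP nlp = new StanfordCoreNLP(props);
        SolrClient solr = new HttpSolrClient("http://localhost:8983/solr/docs");

        String text = "The patient was seen in Boston on Tuesday.";
        Annotation ann = new Annotation(text);
        nlp.annotate(ann);

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "note-1");
        doc.addField("text_t", text);
        for (CoreMap sentence : ann.get(CoreAnnotations.SentencesAnnotation.class)) {
            for (CoreLabel tok : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
                String ner = tok.get(CoreAnnotations.NamedEntityTagAnnotation.class);
                if (!"O".equals(ner)) {
                    // Store recognized entities in a multi-valued field for filtering/faceting.
                    doc.addField("entities_ss", tok.word() + "/" + ner);
                }
            }
        }
        solr.add(doc);
        solr.commit();
        solr.close();
    }
}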

Best,
Jay

On Thu, Jul 7, 2016 at 12:09 PM, Puneet Pawaia 
wrote:

> Hi
>
> I am currently using Solr 5.5.x to test but can upgrade to Solr 6.x if
> required.
> I am working on a POC for natural language query using Solr. Should I use
> the Stanford libraries or are there any other libraries having integration
> with Solr already available.
> Any direction in how to do this would be most appreciated. How should I
> process the query to give relevant results.
>
> Regards
> Puneet
>


Re: Solr Merge Index

2016-07-07 Thread Kalpana
This did the trick (the shards parameter pointing at both cores):

localhost:8983/solr/sitecore_web_index,localhost:8983/solr/SharePoint_All
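
For anyone doing the same thing per-request from SolrJ rather than as a config
default, a hedged equivalent using the standard shards parameter (core names
taken from the snippet above):

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class ShardsQuerySketch {
    public static void main(String[] args) throws Exception {
        SolrClient solr = new HttpSolrClient("http://localhost:8983/solr/sitecore_web_index");
        // Query one core but fan the request out across both indexes.
        SolrQuery q = new SolrQuery("*:*");
        q.set("shards",
              "localhost:8983/solr/sitecore_web_index,localhost:8983/solr/SharePoint_All");
        System.out.println(solr.query(q).getResults().getNumFound());
        solr.close();
    }
}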

Thanks



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Merge-Index-tp4286081p4286272.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: Some questions

2016-07-07 Thread Jamal, Sarfaraz
Of course, yes -=)

Sas



Sarfaraz Jamal (Sas)
Revenue Assurance Tech Ops
614-560-8556
sarfaraz.ja...@verizonwireless.com

-Original Message-
From: Siwei Lv [mailto:si...@microsoft.com] 
Sent: Thursday, July 7, 2016 4:40 AM
To: solr-user@lucene.apache.org
Subject: Some questions

Hi all,

I have some questions about Solr. Can I send them to this mailing list?

Thanks,
Siwei


Re: Facet in SOLR Cloud vs Core

2016-07-07 Thread Pablo Anzorena
Sorry for introducing bad information.
Because it happens in the JSON facet API, I thought it would also happen in
regular faceting. Sorry again for the misunderstanding.

2016-07-07 16:08 GMT-03:00 Chris Hostetter :

>
> : The problem with the shards appears in the following scenario (note that
> : the problem below also applies in a solr standalone environment with
> : distributed search):
> :
> : Shard1: DATA_SOURCE1 (3 docs), DATA_SOURCE2 (2 docs), DATA_SOURCE3 (2 docs).
> : Shard2: DATA_SOURCE3 (2 docs), DATA_SOURCE2 (1 doc).
> :
> : If you make a distributed search across these two shards, faceting
> : dataSourceName with a limit of 1, it will ask for the top 1 in the first
> : shard (DATA_SOURCE1 (3 docs)) and for the top 1 in the second shard
> : (DATA_SOURCE3 (2 docs)). After that it will merge the results and return
> : DATA_SOURCE1 (3 docs), when it should have returned DATA_SOURCE3 (4 docs).
>
> That's completely false.
>
> a) in the first pass, even if you ask for "top 1" (ie: facet.limit=1) solr
> will overrequest when communicating with each shard (the amount of
> overrequest is a function of your facet.limit, so as facet.limit increases
> so does the overrequest amount)
>
> b) if *any* (but not *all*) shards returns DATA_SOURCE3 from the
> initial shard request, a second "refinement" step will request the count
> for DATA_SOURCE3 from all of the other shards to get an accurate count,
> and to accurately sort DATA_SOURCE3 to the top of the facet constraint
> list.
>
>
> -Hoss
> http://www.lucidworks.com/
>


Re: Facet in SOLR Cloud vs Core

2016-07-07 Thread Chris Hostetter

: The problem with the shards appears in the following scenario (note that
: the problem below also applies in a solr standalone environment with
: distributed search):
: 
: Shard1: DATA_SOURCE1 (3 docs), DATA_SOURCE2 (2 docs), DATA_SOURCE3 (2 docs).
: Shard2: DATA_SOURCE3 (2 docs), DATA_SOURCE2 (1 doc).
: 
: If you make a distributed search across these two shards, faceting
: dataSourceName with a limit of 1, it will ask for the top 1 in the first
: shard (DATA_SOURCE1 (3 docs)) and for the top 1 in the second shard
: (DATA_SOURCE3 (2 docs)). After that it will merge the results and return
: DATA_SOURCE1 (3 docs), when it should have returned DATA_SOURCE3 (4 docs).

That's completely false.

a) in the first pass, even if you ask for "top 1" (ie: facet.limit=1) solr 
will overrequest when communicating with each shard (the amount of 
overrequest is a function of your facet.limit, so as facet.limit increases 
so does the overrequest amount)

b) if *any* (but not *all*) shards returns DATA_SOURCE3 from the 
initial shard request, a second "refinement" step will request the count 
for DATA_SOURCE3 from all of the other shards to get an accurate count, 
and to accurately sort DATA_SOURCE3 to the top of the facet constraint 
list.


-Hoss
http://www.lucidworks.com/


Re: Facet in SOLR Cloud vs Core

2016-07-07 Thread Chris Hostetter

: My question specifically has to do with Facets in a SOLR 
: cloud/collection (distributed environment). The core I am working with 
...
: I am using the following facet query which works fine in my Core-based index
: 
: 
: http://localhost:8983/solr/gamra/select?q=*:*&rows=0&facet=true&facet.field=dataSourceName
: 
: It returns counts for each distinct dataSourceName as follows (which is the 
: desired behavior).
...
: I am wondering if this should work fine in the SOLR Cloud as well?  
: Will this method give me accurate counts out of the box in a SOLR Cloud 
: configuration?

Yes it will.

solr uses a two pass approach for faceting -- in pass #1 the "top" 
constraints are determined from each shard (overrequesting based on your 
original facet.limit), and then aggregated together.  pass #2 is a 
"refinement" step: any terms from the aggregated "top" constraints are 
checked to see which shards (if any) did not include them in the per-shard 
"top" constraints, and those shards are asked to compute a constraint 
count for terms as needed -- these are then added into the aggregated 
counts for each term, and the terms are resorted.

This means that in some pathological term distributions, a term may be 
excluded from the list of "top" terms if it isn't returned by *any* shard 
in pass #1, but for any term that is returned to the end client, the count 
is 100% accurate.

(NOTE: this info applies to the default solr faceting, and solr's pivot 
faceting -- but the relatively new "json faceting" does not support this 
multi-pass refinement of the facet counts.)
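
For illustration, a hedged SolrJ sketch of the request side: the client only
sets facet.field and facet.limit, and the per-shard overrequest and refinement
pass described above happen inside Solr (core name taken from this thread):

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class DistributedFacetSketch {
    public static void main(String[] args) throws Exception {
        SolrClient solr = new HttpSolrClient("http://localhost:8983/solr/gamra");
        SolrQuery q = new SolrQuery("*:*");
        q.setRows(0);
        q.setFacet(true);
        q.addFacetField("dataSourceName");
        q.setFacetLimit(10);   // Solr overrequests per shard based on this limit
        QueryResponse rsp = solr.query(q);
        rsp.getFacetField("dataSourceName").getValues()
           .forEach(c -> System.out.println(c.getName() + " -> " + c.getCount()));
        solr.close();
    }
}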

: PS: The reason I ask is because I know there is some estimating 
: performed in certain cases for the Facet "unique" function (as is 
: outlined here: http://yonik.com/solr-count-distinct/ ). So I guess I am 
: wondering why folks wouldn't just do what I have done vs going through 
: the trouble of using the unique(dataSourceName) function?

what you linked to is addressing a different problem than simple facet 
counts.  in your case you are getting the "top" terms with their 
document counts, but what that blog post is referring to is counting the 
total number of unique *terms* (ie: in your data set: what is the total 
number of all unique values in the "dataSourceName" field?)

distributed counting of unique values in high cardinality sets is a 
"hard" problem, as the only way to be 100% accurate is to aggregate all 
terms from all shards into a single node to be hashed (or sorted) ... for 
"batch" style analytics this is a trivial map-reduce style job that can 
offload to disk, but in "real time" situations, statistical sampling 
approaches like HyperLogLog (used in solr) make more sense to get 
approximations w/o exploding ram usage.
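
And a hedged sketch of the other problem -- counting distinct values with the
JSON Facet API's unique() aggregation, with hll() as the approximate
alternative the linked post discusses; the facet label is an assumption:

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class DistinctCountSketch {
    public static void main(String[] args) throws Exception {
        SolrClient solr = new HttpSolrClient("http://localhost:8983/solr/gamra");
        SolrQuery q = new SolrQuery("*:*");
        q.setRows(0);
        // unique() counts distinct values of the field; for very high-cardinality
        // fields hll() trades exactness for bounded memory.
        q.set("json.facet", "{ distinctSources : \"unique(dataSourceName)\" }");
        System.out.println(solr.query(q).getResponse().get("facets"));
        solr.close();
    }
}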



-Hoss
http://www.lucidworks.com/


Re: Facet in SOLR Cloud vs Core

2016-07-07 Thread Pablo Anzorena
As long as you don't shard your index, you will have no problem migrating
to solrcloud.

The problem with the shards appears in the following scenario (note that
the problem below also applies in a solr standalone environment with
distributed search):

Shard1: DATA_SOURCE1 (3 docs), DATA_SOURCE2 (2 docs), DATA_SOURCE3 (2 docs).
Shard2: DATA_SOURCE3 (2 docs), DATA_SOURCE2 (1 doc).

If you make a distributed search across these two shards, faceting
dataSourceName with a limit of 1, it will ask for the top 1 in the first
shard (DATA_SOURCE1 (3 docs)) and for the top 1 in the second shard
(DATA_SOURCE3 (2 docs)). After that it will merge the results and return
DATA_SOURCE1 (3 docs), when it should have returned DATA_SOURCE3 (4 docs).

Summarizing: if you make a distributed search with a facet.limit, there is
a chance that the count is not correct (it also applies to stats).

2016-07-07 15:28 GMT-03:00 Whelan, Andy :

> Hello,
>
> I am somewhat of a novice when it comes to using SOLR in a
> distributed SolrCloud environment. My team and I are doing development work
> with a SOLR core. We will shortly be transitioning over to a SolrCloud
> environment.
>
> My question specifically has to do with Facets in a SOLR cloud/collection
> (distributed environment). The core I am working with has a field
> "dataSourceName" defined as following in its schema.xml file.
>
> <field name="dataSourceName" ... required="true"/>
>
> I am using the following facet query which works fine in my Core-based
> index
>
>
> http://localhost:8983/solr/gamra/select?q=*:*&rows=0&facet=true&facet.field=dataSourceName
>
> It returns counts for each distinct dataSourceName as follows (which is
> the desired behavior).
>
> <lst name="facet_fields">
>   <lst name="dataSourceName">
>     <int name="...">169</int>
>     <int name="...">121</int>
>     <int name="...">68</int>
>   </lst>
> </lst>
>
> I am wondering if this should work fine in the SOLR Cloud as well?  Will
> this method give me accurate counts out of the box in a SOLR Cloud
> configuration?
>
> Thanks
> -Andrew
>
> PS: The reason I ask is because I know there is some estimating performed
> in certain cases for the Facet "unique" function (as is outlined here:
> http://yonik.com/solr-count-distinct/ ). So I guess I am wondering why
> folks wouldn't just do what I have done vs going through the trouble of
> using the unique(dataSourceName) function?
>
>
>


Re: Solr Merge Index

2016-07-07 Thread Kalpana
Some more info:

I am using Solrnet in my MVC project for search results:

var urlHealthInfo = ConfigurationManager.AppSettings["solrSPHealthInfo"] !=
null ?
ConfigurationManager.AppSettings["solrSitecoreSPHealthInfo"].ToString() :
"http://localhost:8983/solr/Sitecore_SharePoint_HealthInformation;;

var solrServers = new SolrServers {
new SolrServerElement {
Id = "solrHealthInfo",
Url = urlHealthInfo,
DocumentType = typeof
(SPHealthInfoSearchResultsViewModel).AssemblyQualifiedName
}
};

Will I be able to use Shards?

Thanks



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Merge-Index-tp4286081p4286251.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Boosting query results

2016-07-07 Thread Walter Underwood
If it is running in an environment protected from spammers, you might want to 
start with the work that LucidWorks did on click scoring.

https://lucidworks.com/blog/2015/03/23/mixed-signals-using-lucidworks-fusions-signals-api/
 


Of course, there are no environments free of spammers. I’ve seen them in 
enterprise search, too. But they are easier to deal with there. Call them up 
and tell them they need to stop immediately or their pages disappear from the 
search engine.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Jul 7, 2016, at 11:29 AM, Walter Underwood  wrote:
> 
> You understand that you are making your site extremely easy to spam, right? 
> This is how Microsoft became the top hit for “evil empire” on Google.
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
> 
> 
>> On Jul 7, 2016, at 11:25 AM, Mark T. Trembley  
>> wrote:
>> 
>> I've found that it is definitely complicated!
>> 
>> Essentially what I am attempting to do is boost products based on the number 
>> of times that particular product has been selected via historical searches 
>> using the same search term or phrase.
>> 
>> 
>> On 7/7/2016 11:55 AM, Walter Underwood wrote:
>>> That is a very complicated design. What are you trying to achieve? Maybe 
>>> there is a different approach that is simpler.
>>> 
>>> wunder
>>> Walter Underwood
>>> wun...@wunderwood.org
>>> http://observer.wunderwood.org/  (my blog)
>>> 
>>> 
 On Jul 7, 2016, at 9:26 AM, Mark T. Trembley  
 wrote:
 
 That works with static boosts based on documents matching the query 
 "Boost2". I want to apply a different boost to documents based on the 
 value assigned to Boost2 within the document.
 
 From my sample documents, when running a query with "Boost2," I want 
 Document2 boosted by 20.0 and Document6 boosted by 15.0:
 
 {
  "id" : "Document2_Boost2",
  "B1_s" : "Boost2",
  "B1_f" : 20
 }
 {
  "id" : "Document6_Boost2",
  "B1_s" : "Boost2",
  "B1_f" : 15
 }
 
 
 On 7/7/2016 10:21 AM, Walter Underwood wrote:
> This looks like a job for “bq”, the boost query parameter. I used this to 
> boost textbooks which were used at the student’s school. bq does not 
> force documents to be included in the result set. It does affect the 
> ranking of the included documents.
> 
> bq=B1_ss:Boost2 will boost documents that match that. You can use 
> weights, like bq=B1_ss:Boost2^10
> 
> Here is the relationship between fq, q, and bq:
> 
> fq: selection, does not affect ranking
> q: selection and ranking
> bq: does not affect selection, affects ranking
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
> 
> 
>> On Jul 7, 2016, at 7:30 AM, Mark T. Trembley 
>>  wrote:
>> 
>> I have a question about the best way to rank my results based on a score 
>> field that can have different values per document and where each 
>> document can have different scores based on which term is queried.
>> 
>> Essentially what I'm wanting to have happen is provide a list of terms 
>> that when matched via a query it returns a corresponding score to help 
>> boost the original document. So if I had a document with a multi-valued 
>> field named B1_ss with terms [Boost1|10], [Boost2|20], [Boost3|100] and 
>> my search query is "Boost2", I want that document's result to be boosted 
>> by 20. Also note that "Boost2" can boost different documents at 
>> different levels. The query to select the actual documents will select 
>> against other fields in the document and could possibly return documents 
>> with any combination of B1 terms.
>> 
>> I'm still trying to figure out how best to model this in my index, 
>> either as child documents, or in another collection, or if it would make 
>> more sense to figure out how to make it work via payloads or by boosting 
>> the terms at index time.
>> 
>> I'm running Solr 5.5.1 in cloud mode. Each server has a complete replica 
>> of all collections.
>> 
>> The document structure I've been toying with the most is to put the 
>> boosts into a separate index and join them using !join syntax and 
>> returning the scores, but I've not had any luck getting quality results 
>> from those tests. The extra "scores" index is structured like this (I'll 
>> add the json for my test collections at the end of the email):
>> id:Document1_Boost1
>> B1_s:Boost1
>> B1_f:10
>> id:Document1_Boost3
>> B1_s:Boost3
>> B1_f:100
>> 

Re: Boosting query results

2016-07-07 Thread Walter Underwood
You understand that you are making your site extremely easy to spam, right? 
This is how Microsoft became the top hit for “evil empire” on Google.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Jul 7, 2016, at 11:25 AM, Mark T. Trembley  
> wrote:
> 
> I've found that it is definitely complicated!
> 
> Essentially what I am attempting to do is boost products based on the number 
> of times that particular product has been selected via historical searches 
> using the same search term or phrase.
> 
> 
> On 7/7/2016 11:55 AM, Walter Underwood wrote:
>> That is a very complicated design. What are you trying to achieve? Maybe 
>> there is a different approach that is simpler.
>> 
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
>> 
>>> On Jul 7, 2016, at 9:26 AM, Mark T. Trembley  
>>> wrote:
>>> 
>>> That works with static boosts based on documents matching the query 
>>> "Boost2". I want to apply a different boost to documents based on the value 
>>> assigned to Boost2 within the document.
>>> 
>>> From my sample documents, when running a query with "Boost2," I want 
>>> Document2 boosted by 20.0 and Document6 boosted by 15.0:
>>> 
>>> {
>>>   "id" : "Document2_Boost2",
>>>   "B1_s" : "Boost2",
>>>   "B1_f" : 20
>>> }
>>> {
>>>   "id" : "Document6_Boost2",
>>>   "B1_s" : "Boost2",
>>>   "B1_f" : 15
>>> }
>>> 
>>> 
>>> On 7/7/2016 10:21 AM, Walter Underwood wrote:
 This looks like a job for “bq”, the boost query parameter. I used this to 
 boost textbooks which were used at the student’s school. bq does not force 
 documents to be included in the result set. It does affect the ranking of 
 the included documents.
 
 bq=B1_ss:Boost2 will boost documents that match that. You can use weights, 
 like bq=B1_ss:Boost2^10
 
 Here is the relationship between fq, q, and bq:
 
 fq: selection, does not affect ranking
 q: selection and ranking
 bq: does not affect selection, affects ranking
 
 wunder
 Walter Underwood
 wun...@wunderwood.org
 http://observer.wunderwood.org/  (my blog)
 
 
> On Jul 7, 2016, at 7:30 AM, Mark T. Trembley  
> wrote:
> 
> I have a question about the best way to rank my results based on a score 
> field that can have different values per document and where each document 
> can have different scores based on which term is queried.
> 
> Essentially what I'm wanting to have happen is provide a list of terms 
> that when matched via a query it returns a corresponding score to help 
> boost the original document. So if I had a document with a multi-valued 
> field named B1_ss with terms [Boost1|10], [Boost2|20], [Boost3|100] and 
> my search query is "Boost2", I want that document's result to be boosted 
> by 20. Also note that "Boost2" can boost different documents at different 
> levels. The query to select the actual documents will select against 
> other fields in the document and could possibly return documents with any 
> combination of B1 terms.
> 
> I'm still trying to figure out how best to model this in my index, either 
> as child documents, or in another collection, or if it would make more 
> sense to figure out how to make it work via payloads or by boosting the 
> terms at index time.
> 
> I'm running Solr 5.5.1 in cloud mode. Each server has a complete replica 
> of all collections.
> 
> The document structure I've been toying with the most is to put the 
> boosts into a separate index and join them using !join syntax and 
> returning the scores, but I've not had any luck getting quality results 
> from those tests. The extra "scores" index is structured like this (I'll 
> add the json for my test collections at the end of the email):
> id:Document1_Boost1
>  B1_s:Boost1
>  B1_f:10
> id:Document1_Boost3
>  B1_s:Boost3
>  B1_f:100
> Using this structure, I get close, but the scores are not what I'm 
> expecting. If I use the following query, the explain says it's using the 
> score from Document6_Boost2 even though my query is specifying B1_s:Boost3
> http://localhost:8983/solr/generic/select?q={!join from=id to=B1_name_ss 
> fromIndex=scores 
> score=max}B1_s:Boost3{!func}B1_f=*,score=true
> 
> 

Facet in SOLR Cloud vs Core

2016-07-07 Thread Whelan, Andy
Hello,

I am somewhat of a novice when it comes to using SOLR in a distributed 
SolrCloud environment. My team and I are doing development work with a SOLR 
core. We will shortly be transitioning over to a SolrCloud environment.

My question specifically has to do with Facets in a SOLR cloud/collection 
(distributed environment). The core I am working with has a field 
"dataSourceName" defined as following in its schema.xml file.

<field name="dataSourceName" ... required="true"/>

I am using the following facet query which works fine in my Core-based index

http://localhost:8983/solr/gamra/select?q=*:*&rows=0&facet=true&facet.field=dataSourceName

It returns counts for each distinct dataSourceName as follows (which is the 
desired behavior).

<lst name="facet_fields">
  <lst name="dataSourceName">
    <int name="...">169</int>
    <int name="...">121</int>
    <int name="...">68</int>
  </lst>
</lst>

I am wondering if this should work fine in the SOLR Cloud as well?  Will this 
method give me accurate counts out of the box in a SOLR Cloud configuration?

Thanks
-Andrew

PS: The reason I ask is because I know there is some estimating performed in 
certain cases for the Facet "unique" function (as is outlined here: 
http://yonik.com/solr-count-distinct/ ). So I guess I am wondering why folks 
wouldn't just do what I have done vs going through the trouble of using the 
unique(dataSourceName) function?




Re: Boosting query results

2016-07-07 Thread Mark T. Trembley

I've found that it is definitely complicated!

Essentially what I am attempting to do is boost products based on the 
number of times that particular product has been selected via historical 
searches using the same search term or phrase.



On 7/7/2016 11:55 AM, Walter Underwood wrote:

That is a very complicated design. What are you trying to achieve? Maybe there 
is a different approach that is simpler.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)



On Jul 7, 2016, at 9:26 AM, Mark T. Trembley  wrote:

That works with static boosts based on documents matching the query "Boost2". I 
want to apply a different boost to documents based on the value assigned to Boost2 within 
the document.

 From my sample documents, when running a query with "Boost2," I want Document2 
boosted by 20.0 and Document6 boosted by 15.0:

{
   "id" : "Document2_Boost2",
   "B1_s" : "Boost2",
   "B1_f" : 20
}
{
   "id" : "Document6_Boost2",
   "B1_s" : "Boost2",
   "B1_f" : 15
}


On 7/7/2016 10:21 AM, Walter Underwood wrote:

This looks like a job for “bq”, the boost query parameter. I used this to boost 
textbooks which were used at the student’s school. bq does not force documents 
to be included in the result set. It does affect the ranking of the included 
documents.

bq=B1_ss:Boost2 will boost documents that match that. You can use weights, like 
bq=B1_ss:Boost2^10

Here is the relationship between fq, q, and bq:

fq: selection, does not affect ranking
q: selection and ranking
bq: does not affect selection, affects ranking

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)



On Jul 7, 2016, at 7:30 AM, Mark T. Trembley  wrote:

I have a question about the best way to rank my results based on a score field 
that can have different values per document and where each document can have 
different scores based on which term is queried.

Essentially what I'm wanting to have happen is provide a list of terms that when matched via a 
query it returns a corresponding score to help boost the original document. So if I had a document 
with a multi-valued field named B1_ss with terms [Boost1|10], [Boost2|20], [Boost3|100] and my 
search query is "Boost2", I want that document's result to be boosted by 20. Also note 
that "Boost2" can boost different documents at different levels. The query to select the 
actual documents will select against other fields in the document and could possibly return 
documents with any combination of B1 terms.

I'm still trying to figure out how best to model this in my index, either as 
child documents, or in another collection, or if it would make more sense to 
figure out how to make it work via payloads or by boosting the terms at index 
time.

I'm running Solr 5.5.1 in cloud mode. Each server has a complete replica of all 
collections.

The document structure I've been toying with the most is to put the boosts into a 
separate index and join them using !join syntax and returning the scores, but I've not 
had any luck getting quality results from those tests. The extra "scores" index 
is structured like this (I'll add the json for my test collections at the end of the 
email):
id:Document1_Boost1
  B1_s:Boost1
  B1_f:10
id:Document1_Boost3
  B1_s:Boost3
  B1_f:100
Using this structure, I get close, but the scores are not what I'm expecting. 
If I use the following query, the explain says it's using the score from 
Document6_Boost2 even though my query is specifying B1_s:Boost3
http://localhost:8983/solr/generic/select?q={!join from=id to=B1_name_ss 
fromIndex=scores score=max}B1_s:Boost3{!func}B1_f=*,score=true


Re: Integrating Stanford NLP or any other NLP for Natural Language Query

2016-07-07 Thread Joel Bernstein
You may want to take a look at NLP4J. There is no integration yet with
Solr, but it seems like it would be a good fit.

Joel Bernstein
http://joelsolr.blogspot.com/

On Thu, Jul 7, 2016 at 1:09 PM, Puneet Pawaia 
wrote:

> Hi
>
> I am currently using Solr 5.5.x to test but can upgrade to Solr 6.x if
> required.
> I am working on a POC for natural language query using Solr. Should I use
> the Stanford libraries or are there any other libraries having integration
> with Solr already available.
> Any direction in how to do this would be most appreciated. How should I
> process the query to give relevant results.
>
> Regards
> Puneet
>


Re: Custom Post-Filter in Solr 3.3.0

2016-07-07 Thread Erick Erickson
Post filtering was added in Solr 3.4. What the interface is like in that code
line I have no idea though. You at least have to upgrade that far.

Best,
Erick

On Thu, Jul 7, 2016 at 10:19 AM, Vasu Y  wrote:
> Hi,
>  Thanks to Erik Hatcher's blog on Custom security filtering in Solr (
> https://lucidworks.com/blog/2012/02/22/custom-security-filtering-in-solr/ ).
> I have a similar requirement to do some post-filtering. Our environment is
> Solr 3.3.0 and when I use Erik's sample, AccessControlQuery wouldn't
> compile as the classes
> "org.apache.solr.search.DelegatingCollector,
> org.apache.solr.search.ExtendedQueryBase
> & org.apache.solr.search.PostFilter" doesn't seem to be available in Solr
> 3.3.0.
>
> Any suggestion on how Erik's post-filter can be adapted to Solr 3.3.0?
>
> Thanks,
> Vasu
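
For reference, a hedged skeleton of the post-filter pattern from Erik
Hatcher's post as it looks on Solr 3.4+/4.x; the isAllowed() check is a
placeholder, not real ACL logic:

import java.io.IOException;

import org.apache.lucene.search.IndexSearcher;
import org.apache.solr.search.DelegatingCollector;
import org.apache.solr.search.ExtendedQueryBase;
import org.apache.solr.search.PostFilter;

public class AccessControlQuerySketch extends ExtendedQueryBase implements PostFilter {

    @Override
    public boolean getCache() {
        return false;               // post filters must not be cached
    }

    @Override
    public int getCost() {
        return Math.max(super.getCost(), 100);   // cost >= 100 makes it run as a post filter
    }

    @Override
    public DelegatingCollector getFilterCollector(IndexSearcher searcher) {
        return new DelegatingCollector() {
            @Override
            public void collect(int doc) throws IOException {
                if (isAllowed(doc)) {
                    super.collect(doc);   // only pass the document on if the check succeeds
                }
            }
        };
    }

    private boolean isAllowed(int doc) {
        return true;   // placeholder for the real access-control lookup
    }
}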


Custom Post-Filter in Solr 3.3.0

2016-07-07 Thread Vasu Y
Hi,
 Thanks to Erik Hatcher's blog on Custom security filtering in Solr (
https://lucidworks.com/blog/2012/02/22/custom-security-filtering-in-solr/ ).
I have a similar requirement to do some post-filtering. Our environment is
Solr 3.3.0 and when I use Erik's sample, AccessControlQuery wouldn't
compile as the classes
"org.apache.solr.search.DelegatingCollector,
org.apache.solr.search.ExtendedQueryBase
& org.apache.solr.search.PostFilter" doesn't seem to be available in Solr
3.3.0.

Any suggestion on how Erik's post-filter can be adapted to Solr 3.3.0?

Thanks,
Vasu


Integrating Stanford NLP or any other NLP for Natural Language Query

2016-07-07 Thread Puneet Pawaia
Hi

I am currently using Solr 5.5.x to test but can upgrade to Solr 6.x if
required.
I am working on a POC for natural language query using Solr. Should I use
the Stanford libraries or are there any other libraries having integration
with Solr already available.
Any direction in how to do this would be most appreciated. How should I
process the query to give relevant results.

Regards
Puneet


Some questions

2016-07-07 Thread Siwei Lv
Hi all,

I have some questions about Solr. Can I send them to this mailing list?

Thanks,
Siwei


Re: Boosting query results

2016-07-07 Thread Walter Underwood
That is a very complicated design. What are you trying to achieve? Maybe there 
is a different approach that is simpler.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Jul 7, 2016, at 9:26 AM, Mark T. Trembley  
> wrote:
> 
> That works with static boosts based on documents matching the query "Boost2". 
> I want to apply a different boost to documents based on the value assigned to 
> Boost2 within the document.
> 
> From my sample documents, when running a query with "Boost2," I want 
> Document2 boosted by 20.0 and Document6 boosted by 15.0:
> 
> {
>   "id" : "Document2_Boost2",
>   "B1_s" : "Boost2",
>   "B1_f" : 20
> }
> {
>   "id" : "Document6_Boost2",
>   "B1_s" : "Boost2",
>   "B1_f" : 15
> }
> 
> 
> On 7/7/2016 10:21 AM, Walter Underwood wrote:
>> This looks like a job for “bq”, the boost query parameter. I used this to 
>> boost textbooks which were used at the student’s school. bq does not force 
>> documents to be included in the result set. It does affect the ranking of 
>> the included documents.
>> 
>> bq=B1_ss:Boost2 will boost documents that match that. You can use weights, 
>> like bq=B1_ss:Boost2^10
>> 
>> Here is the relationship between fq, q, and bq:
>> 
>> fq: selection, does not affect ranking
>> q: selection and ranking
>> bq: does not affect selection, affects ranking
>> 
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
>> 
>>> On Jul 7, 2016, at 7:30 AM, Mark T. Trembley  
>>> wrote:
>>> 
>>> I have a question about the best way to rank my results based on a score 
>>> field that can have different values per document and where each document 
>>> can have different scores based on which term is queried.
>>> 
>>> Essentially what I'm wanting to have happen is provide a list of terms that 
>>> when matched via a query it returns a corresponding score to help boost the 
>>> original document. So if I had a document with a multi-valued field named 
>>> B1_ss with terms [Boost1|10], [Boost2|20], [Boost3|100] and my search query 
>>> is "Boost2", I want that document's result to be boosted by 20. Also note 
>>> that "Boost2" can boost different documents at different levels. The query 
>>> to select the actual documents will select against other fields in the 
>>> document and could possibly return documents with any combination of B1 
>>> terms.
>>> 
>>> I'm still trying to figure out how best to model this in my index, either 
>>> as child documents, or in another collection, or if it would make more 
>>> sense to figure out how to make it work via payloads or by boosting the 
>>> terms at index time.
>>> 
>>> I'm running Solr 5.5.1 in cloud mode. Each server has a complete replica of 
>>> all collections.
>>> 
>>> The document structure I've been toying with the most is to put the boosts 
>>> into a separate index and join them using !join syntax and returning the 
>>> scores, but I've not had any luck getting quality results from those tests. 
>>> The extra "scores" index is structured like this (I'll add the json for my 
>>> test collections at the end of the email):
>>> id:Document1_Boost1
>>>  B1_s:Boost1
>>>  B1_f:10
>>> id:Document1_Boost3
>>>  B1_s:Boost3
>>>  B1_f:100
>>> Using this structure, I get close, but the scores are not what I'm 
>>> expecting. If I use the following query, the explain says it's using the 
>>> score from Document6_Boost2 even though my query is specifying B1_s:Boost3
>>> http://localhost:8983/solr/generic/select?q={!join from=id to=B1_name_ss 
>>> fromIndex=scores score=max}B1_s:Boost3{!func}B1_f=*,score=true
>>> 
>>> 

Solr search for PDF is case sensitive

2016-07-07 Thread rohinikrishna
Hi All, 

I have implemented Solr search in my Sitecore application. 
Scenario: 
I have a Sitecore item, and a field of the item is associated to a media
item (PDF).
I am able to index the entire content of the item, including the media item,
i.e. the PDF. I have used computed text for indexing.
When I search for any keyword that is directly in the content of the item, I
get the correct result (it is not case sensitive). If I search for a word
that is in caps in the item, I get the result.
But when I search for a keyword from the associated PDF, I get a
case-sensitive result. That is, if the word I search for is in caps in the PDF
document, I get the result only when I enter the word in caps in the search.
If the word is in lower case, it doesn't return the result.
Also, when I search for words separated by a space, I don't get the result.
This happens for the PDF content.
However, the content that is directly on the item fetches the correct result
(i.e. it is not case sensitive and it also works for words with spaces).

Anyone having an idea on this, please provide your input. 

Thanks in advance.

 




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-search-for-PDF-is-case-sensitive-tp4286158.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Boosting query results

2016-07-07 Thread Mark T. Trembley
That works with static boosts based on documents matching the query 
"Boost2". I want to apply a different boost to documents based on the 
value assigned to Boost2 within the document.


From my sample documents, when running a query with "Boost2," I want 
Document2 boosted by 20.0 and Document6 boosted by 15.0:


 {
   "id" : "Document2_Boost2",
   "B1_s" : "Boost2",
   "B1_f" : 20
 }
 {
   "id" : "Document6_Boost2",
   "B1_s" : "Boost2",
   "B1_f" : 15
 }


On 7/7/2016 10:21 AM, Walter Underwood wrote:

This looks like a job for “bq”, the boost query parameter. I used this to boost 
textbooks which were used at the student’s school. bq does not force documents 
to be included in the result set. It does affect the ranking of the included 
documents.

bq=B1_ss:Boost2 will boost documents that match that. You can use weights, like 
bq=B1_ss:Boost2^10

Here is the relationship between fq, q, and bq:

fq: selection, does not affect ranking
q: selection and ranking
bq: does not affect selection, affects ranking

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)



On Jul 7, 2016, at 7:30 AM, Mark T. Trembley  wrote:

I have a question about the best way to rank my results based on a score field 
that can have different values per document and where each document can have 
different scores based on which term is queried.

Essentially what I'm wanting to have happen is provide a list of terms that when matched via a 
query it returns a corresponding score to help boost the original document. So if I had a document 
with a multi-valued field named B1_ss with terms [Boost1|10], [Boost2|20], [Boost3|100] and my 
search query is "Boost2", I want that document's result to be boosted by 20. Also note 
that "Boost2" can boost different documents at different levels. The query to select the 
actual documents will select against other fields in the document and could possibly return 
documents with any combination of B1 terms.

I'm still trying to figure out how best to model this in my index, either as 
child documents, or in another collection, or if it would make more sense to 
figure out how to make it work via payloads or by boosting the terms at index 
time.

I'm running Solr 5.5.1 in cloud mode. Each server has a complete replica of all 
collections.

The document structure I've been toying with the most is to put the boosts into a 
separate index and join them using !join syntax and returning the scores, but I've not 
had any luck getting quality results from those tests. The extra "scores" index 
is structured like this (I'll add the json for my test collections at the end of the 
email):
id:Document1_Boost1
  B1_s:Boost1
  B1_f:10
id:Document1_Boost3
  B1_s:Boost3
  B1_f:100
Using this structure, I get close, but the scores are not what I'm expecting. 
If I use the following query, the explain says it's using the score from 
Document6_Boost2 even though my query is specifying B1_s:Boost3
http://localhost:8983/solr/generic/select?q={!join from=id to=B1_name_ss 
fromIndex=scores score=max}B1_s:Boost3{!func}B1_f=*,score=true


Re: File Descriptor/Memory Leak

2016-07-07 Thread Anshum Gupta
I've created a JIRA to track this:
https://issues.apache.org/jira/browse/SOLR-9290

On Thu, Jul 7, 2016 at 8:00 AM, Shai Erera  wrote:

> Shalin, we're seeing that issue too (and actually actively debugging it
> these days). So far I can confirm the following (on a 2-node cluster):
>
> 1) It consistently reproduces on 5.5.1, but *does not* reproduce on 5.4.1
> 2) It does not reproduce when SSL is disabled
> 3) Restarting the Solr process (sometimes both need to be restarted), the
> count drops to 0, but if indexing continues, they climb up again
>
> When it does happen, Solr seems stuck. The leader cannot talk to the
> replica, or vice versa, the replica is usually put in DOWN state and
> there's no way to fix it besides restarting the JVM.
>
> Reviewing the changes from 5.4.1 to 5.5.1 I tried reverting some that
> looked suspicious (SOLR-8451 and SOLR-8578), even though the changes look
> legit. That did not help, and honestly I've done that before we suspected
> it might be the SSL. Therefore I think those are "safe", but just FYI.
>
> When it does happen, the number of CLOSE_WAITS climb very high, to the
> order of 30K+ entries in 'netstat'.
>
> When I say it does not reproduce on 5.4.1 I really mean the numbers don't
> go as high as they do in 5.5.1. Meaning, when running without SSL, the
> number of CLOSE_WAITs is smallish, usually less than 10 (I would
> separately like to understand why we have any in that state at all). When
> running with SSL and 5.4.1, they stay low, on the order of hundreds at
> most.
>
> Unfortunately running without SSL is not an option for us. We will likely
> roll back to 5.4.1, even if the problem exists there, but to a lesser
> degree.
>
> I will post back here when/if we have more info about this.
>
> Shai
>
> On Thu, Jul 7, 2016 at 5:32 PM Shalin Shekhar Mangar <
> shalinman...@gmail.com>
> wrote:
>
> > I have myself seen this CLOSE_WAIT issue at a customer. I am running some
> > tests with different versions trying to pinpoint the cause of this leak.
> > Once I have some more information and a reproducible test, I'll open a jira
> > issue. I'll keep you posted.
> >
> > On Thu, Jul 7, 2016 at 5:13 PM, Mads Tomasgård Bjørgan 
> > wrote:
> >
> > > Hello there,
> > > Our SolrCloud is experiencing a FD leak while running with SSL. This is
> > > occurring on the one machine that our program is sending data too. We
> > have
> > > a total of three servers running as an ensemble.
> > >
> > > While running without SSL, the FD count remains quite constant at
> > > around 180 while indexing. Performing a garbage collection also clears
> > > almost the entire JVM memory.
> > >
> > > However, when indexing with SSL, the FDC grows polynomially. The count
> > > increases by a few hundred every five seconds or so, but easily reaches
> > > 50 000 within three to four minutes. Performing a GC sweeps most of the
> > > memory on the two machines our program isn't transmitting the data
> > > directly to. The last machine is unaffected by the GC, and neither memory
> > > nor FDC resets before Solr is restarted on that machine.
> > >
> > > Performing a netstat reveals that the FDC mostly consists of
> > > TCP-connections in the state of "CLOSE_WAIT".
> > >
> > >
> > >
> >
> >
> > --
> > Regards,
> > Shalin Shekhar Mangar.
> >
>



-- 
Anshum Gupta


Re: Boosting query results

2016-07-07 Thread Walter Underwood
This looks like a job for “bq”, the boost query parameter. I used this to boost 
textbooks which were used at the student’s school. bq does not force documents 
to be included in the result set. It does affect the ranking of the included 
documents.

bq=B1_ss:Boost2 will boost documents that match that. You can use weights, like 
bq=B1_ss:Boost2^10

Here is the relationship between fq, q, and bq:

fq: selection, does not affect ranking
q: selection and ranking
bq: does not affect selection, affects ranking
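
A hedged SolrJ rendering of that split; the query and filter strings here are
assumptions, and the bq field and weight are the ones from the example above:

import org.apache.solr.client.solrj.SolrQuery;

public class BoostQuerySketch {
    public static void main(String[] args) {
        // q selects and ranks, fq only selects, bq only nudges the ranking.
        SolrQuery q = new SolrQuery("title_t:textbook");   // assumed user query
        q.addFilterQuery("inStock_b:true");                // assumed filter
        q.set("defType", "edismax");                       // bq is an (e)dismax parameter
        q.set("bq", "B1_ss:Boost2^10");
        System.out.println(q);                             // prints the encoded parameters
    }
}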

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Jul 7, 2016, at 7:30 AM, Mark T. Trembley  
> wrote:
> 
> I have a question about the best way to rank my results based on a score 
> field that can have different values per document and where each document can 
> have different scores based on which term is queried.
> 
> Essentially what I'm wanting to have happen is provide a list of terms that 
> when matched via a query it returns a corresponding score to help boost the 
> original document. So if I had a document with a multi-valued field named 
> B1_ss with terms [Boost1|10], [Boost2|20], [Boost3|100] and my search query 
> is "Boost2", I want that document's result to be boosted by 20. Also note 
> that "Boost2" can boost different documents at different levels. The query to 
> select the actual documents will select against other fields in the document 
> and could possibly return documents with any combination of B1 terms.
> 
> I'm still trying to figure out how best to model this in my index, either as 
> child documents, or in another collection, or if it would make more sense to 
> figure out how to make it work via payloads or by boosting the terms at index 
> time.
> 
> I'm running Solr 5.5.1 in cloud mode. Each server has a complete replica of 
> all collections.
> 
> The document structure I've been toying with the most is to put the boosts 
> into a separate index and join them using !join syntax and returning the 
> scores, but I've not had any luck getting quality results from those tests. 
> The extra "scores" index is structured like this (I'll add the json for my 
> test collections at the end of the email):
> id:Document1_Boost1
>  B1_s:Boost1
>  B1_f:10
> id:Document1_Boost3
>  B1_s:Boost3
>  B1_f:100
> Using this structure, I get close, but the scores are not what I'm expecting. 
> If I use the following query, the explain says it's using the score from 
> Document6_Boost2 even though my query is specifying B1_s:Boost3
> http://localhost:8983/solr/generic/select?q={!join from=id to=B1_name_ss 
> fromIndex=scores score=max}B1_s:Boost3{!func}B1_f=*,score=true
> 
> 

Re: File Descriptor/Memory Leak

2016-07-07 Thread Shai Erera
Shalin, we're seeing that issue too (and actually actively debugging it
these days). So far I can confirm the following (on a 2-node cluster):

1) It consistently reproduces on 5.5.1, but *does not* reproduce on 5.4.1
2) It does not reproduce when SSL is disabled
3) After restarting the Solr process (sometimes both need to be restarted), the
count drops to 0, but if indexing continues, it climbs up again

When it does happen, Solr seems stuck. The leader cannot talk to the
replica, or vice versa, the replica is usually put in DOWN state and
there's no way to fix it besides restarting the JVM.

Reviewing the changes from 5.4.1 to 5.5.1 I tried reverting some that
looked suspicious (SOLR-8451 and SOLR-8578), even though the changes look
legit. That did not help, and honestly I've done that before we suspected
it might be the SSL. Therefore I think those are "safe", but just FYI.

When it does happen, the number of CLOSE_WAITs climbs very high, to the
order of 30K+ entries in 'netstat'.

When I say it does not reproduce on 5.4.1, I really mean the numbers don't
go as high as they do in 5.5.1. When running without SSL, the number of
CLOSE_WAITs is small, usually fewer than 10 (I would separately like to
understand why we have any in that state at all). When running with SSL and
5.4.1, they stay low, on the order of hundreds at most.

Unfortunately running without SSL is not an option for us. We will likely
roll back to 5.4.1, even if the problem exists there, but to a lesser
degree.

I will post back here when/if we have more info about this.

Shai

On Thu, Jul 7, 2016 at 5:32 PM Shalin Shekhar Mangar 
wrote:

> I have myself seen this CLOSE_WAIT issue at a customer. I am running some
> tests with different versions trying to pinpoint the cause of this leak.
> Once I have some more information and a reproducible test, I'll open a jira
> issue. I'll keep you posted.
>
> On Thu, Jul 7, 2016 at 5:13 PM, Mads Tomasgård Bjørgan 
> wrote:
>
> > Hello there,
> > Our SolrCloud is experiencing a FD leak while running with SSL. This is
> > occurring on the one machine that our program is sending data too. We
> have
> > a total of three servers running as an ensemble.
> >
> > While running without SSL does the FD Count remain quite constant at
> > around 180 while indexing. Performing a garbage collection also clears
> > almost the entire JVM-memory.
> >
> > However - when indexing with SSL does the FDC grow polynomial. The count
> > increases with a few hundred every five seconds or so, but reaches easily
> > 50 000 within three to four minutes. Performing a GC swipes most of the
> > memory on the two machines our program isn't transmitting the data
> directly
> > to. The last machine is unaffected by the GC, and both memory nor FDC
> > doesn't reset before Solr is restarted on that machine.
> >
> > Performing a netstat reveals that the FDC mostly consists of
> > TCP-connections in the state of "CLOSE_WAIT".
> >
> >
> >
>
>
> --
> Regards,
> Shalin Shekhar Mangar.
>


RE: Data import handler in techproducts example

2016-07-07 Thread Brooks Chuck (FCA)
Hello Jonas,

Did you figure this out? 

Dr. Chuck Brooks
248-838-5070


-Original Message-
From: Jonas Vasiliauskas [mailto:jonas.vasiliaus...@yahoo.com.INVALID] 
Sent: Saturday, July 02, 2016 11:37 AM
To: solr-user@lucene.apache.org
Subject: Data import handler in techproducts example

Hey,

I'm quite new to Solr and Java environments. My goal is to import some data 
from a MySQL database into the techproducts (core) example.

I have set up the data import handler (DIH) for techproducts based on the 
instructions here: https://wiki.apache.org/solr/DIHQuickStart , but it looks 
like Solr doesn't load the DIH libraries. Could someone briefly explain how to 
check whether DIH is loaded and, if not, how to load it?

Stacktrace is here: http://pastebin.ca/3654347

Thanks,


Re: File Descriptor/Memory Leak

2016-07-07 Thread Shalin Shekhar Mangar
I have myself seen this CLOSE_WAIT issue at a customer. I am running some
tests with different versions trying to pinpoint the cause of this leak.
Once I have some more information and a reproducible test, I'll open a jira
issue. I'll keep you posted.

On Thu, Jul 7, 2016 at 5:13 PM, Mads Tomasgård Bjørgan  wrote:

> Hello there,
> Our SolrCloud is experiencing a FD leak while running with SSL. This is
> occurring on the one machine that our program is sending data too. We have
> a total of three servers running as an ensemble.
>
> While running without SSL does the FD Count remain quite constant at
> around 180 while indexing. Performing a garbage collection also clears
> almost the entire JVM-memory.
>
> However - when indexing with SSL does the FDC grow polynomial. The count
> increases with a few hundred every five seconds or so, but reaches easily
> 50 000 within three to four minutes. Performing a GC swipes most of the
> memory on the two machines our program isn't transmitting the data directly
> to. The last machine is unaffected by the GC, and both memory nor FDC
> doesn't reset before Solr is restarted on that machine.
>
> Performing a netstat reveals that the FDC mostly consists of
> TCP-connections in the state of "CLOSE_WAIT".
>
>
>


-- 
Regards,
Shalin Shekhar Mangar.


Boosting query results

2016-07-07 Thread Mark T. Trembley
I have a question about the best way to rank my results based on a score 
field that can have different values per document and where each 
document can have different scores based on which term is queried.


Essentially, I want to provide a list of terms that, when matched by a 
query, return a corresponding score to help boost the original document. 
So if I had a document with a multi-valued 
field named B1_ss with terms [Boost1|10], [Boost2|20], [Boost3|100] and 
my search query is "Boost2", I want that document's result to be boosted 
by 20. Also note that "Boost2" can boost different documents at 
different levels. The query to select the actual documents will select 
against other fields in the document and could possibly return documents 
with any combination of B1 terms.


I'm still trying to figure out how best to model this in my index, 
either as child documents, or in another collection, or if it would make 
more sense to figure out how to make it work via payloads or by boosting 
the terms at index time.


I'm running Solr 5.5.1 in cloud mode. Each server has a complete replica 
of all collections.


The document structure I've been toying with the most is to put the 
boosts into a separate index and join them using !join syntax and 
returning the scores, but I've not had any luck getting quality results 
from those tests. The extra "scores" index is structured like this (I'll 
add the json for my test collections at the end of the email):

id:Document1_Boost1
  B1_s:Boost1
  B1_f:10
id:Document1_Boost3
  B1_s:Boost3
  B1_f:100
Using this structure, I get close, but the scores are not what I'm 
expecting. If I use the following query, the explain says it's using the 
score from Document6_Boost2 even though my query is specifying B1_s:Boost3
http://localhost:8983/solr/generic/select?q={!join from=id to=B1_name_ss 
fromIndex=scores score=max}B1_s:Boost3{!func}B1_f=*,score=true



Re: sorlcloud connection issue

2016-07-07 Thread Shawn Heisey
On 7/6/2016 5:26 AM, Kent Mu wrote:
> Hi friends!
> *solr version: 4.9.0*
>
> I came across a problem when use solrcloud, it becomes dead lock, we got
> the java core log, it looks like the http connection pool is exhausted and
> most threads are waiting to get a free connection..
>
> I posted the problem in JIRA, the link is
> https://issues.apache.org/jira/browse/SOLR-9253
> I have increased http connection defaults for the SolrJ client, and also
> configed the connection defaults in solr.xml for all shard servers as below.
>
>  class="HttpShardHandlerFactory">
> 6
> 3
> 1
> 500
> 

I can see JBoss classes in the thread dump that was added to SOLR-9253.

That thread dump shows 213 threads in the RUNNABLE state, and 507 in the
WAITING state.  I do not think you are running into the configured shard
handler limits.  I think your container is not allowing enough Solr
threads to run.

Just like Tomcat and Jetty, JBoss has a "maxThreads" setting that
defaults to 200.  Increasing this setting is critical for scalability
when using a third-party container.  I recommend 10000 -- which is the
setting you'll find in the Jetty that's included with Solr.

Note that if you upgrade Solr to 5.x or 6.x, running in JBoss will no
longer be a supported configuration.

https://wiki.apache.org/solr/WhyNoWar

Thanks,
Shawn



Re: solr-8258

2016-07-07 Thread Joel Bernstein
Hi,

Erick Erickson has been working on a ticket for this:

https://issues.apache.org/jira/browse/SOLR-9166

Originally this wasn't implemented because much of the streaming API in the
early days didn't properly handle nulls.



Joel Bernstein
http://joelsolr.blogspot.com/

On Thu, Jul 7, 2016 at 8:09 AM, Matteo Grolla 
wrote:

> Hi,
> the export handler returns 0 for null numeric values.
> Can someone explain me why it doesn't leave the field off the record like
> string or multivalue fields?
> thanks
>
> Matteo
>


Re: Suggester Issue

2016-07-07 Thread Rajesh Kapur
Hi,

Any update on this?

Could you please let me know whether it is possible to pass the cfq parameter
on multiple fields?

Thanks

On Tue, Jul 5, 2016 at 9:40 AM, Rajesh Kapur 
wrote:

> Hi,
>
>
>
> I tried to implement suggester using SOLR 6.0.1 with context field. PFB
> the configuration we are using to implement suggester
>
>
>
>
>
>   
>
> 
>
>   mySuggester
>
>   
>
>   AnalyzingInfixLookupFactory
>
>   suggester_infixdata_dir
>
>   DocumentDictionaryFactory
>
>   SearchSuggestions
>
> BrandName
>
>   suggest
>
>   true
>
>   
>
> 
>
>   
>
>
>
>   
>
> 
>
>   true
>
>   10
>
>   mySuggester
>
>   
>
> 
>
> 
>
>   suggest_sitesearch
>
> 
>
>   
>
>
>
> But I am not able to get the desired output using suggest.cfq parameter.
> Could you please help me in getting the correct output.
>
>
>
> -Thanks,
>
> Rajesh Kapur
>


solr-8258

2016-07-07 Thread Matteo Grolla
Hi,
the export handler returns 0 for null numeric values.
Can someone explain why it doesn't leave the field off the record, as it does
for string or multivalued fields?
thanks

Matteo


File Descriptor/Memory Leak

2016-07-07 Thread Mads Tomasgård Bjørgan
Hello there,
Our SolrCloud is experiencing an FD leak while running with SSL. It occurs on 
the one machine that our program is sending data to. We have a total of three 
servers running as an ensemble.

While running without SSL, the FD count remains quite constant at around 180 
while indexing. Performing a garbage collection also clears almost the entire 
JVM memory.

However, when indexing with SSL, the FDC grows polynomially. The count 
increases by a few hundred every five seconds or so, and easily reaches 
50 000 within three to four minutes. Performing a GC frees most of the memory 
on the two machines our program isn't transmitting the data directly to. The 
last machine is unaffected by the GC, and neither memory nor FDC resets until 
Solr is restarted on that machine.

Performing a netstat reveals that the FDC mostly consists of TCP connections 
in the "CLOSE_WAIT" state.




Re: Solr Merge Index

2016-07-07 Thread Kalpana
Thanks for your reply.
I am using SolrNet to set up the search object. Is it possible to use the same 
for sharding?





On Jul 7, 2016, at 4:11 AM, Shalin Shekhar Mangar [via Lucene]  wrote:

Why do you need the merged core? If the underlying data is changing then
obviously the merged core won't automatically update it self. And another
merge will introduce duplicate data. So this is a bad solution.

You can just keep the two cores and do a distributed search across both of
them? You can specify
shards=http://localhost:8983/solr/core1,http://localhost:8983/solr/core2
as a parameter to your search requests.

On Thu, Jul 7, 2016 at 4:32 AM, Kalpana <[hidden email]> wrote:

> Hello
>
> I have two sources - Sitecore web index (core 1) and a database table (core
> 2). I have created core 3 which is a merge of core1 and core 2.
>
>
> http://localhost:8983/solr/admin/cores?action=mergeindexes&core=core3&srcCore=sitecore_web_index&srcCore=core2
>
> But when someone publishes a page on Sitecore, the sitecore web index gets
> updated but not the merged core. How can I get the real-time data with the
> merge? Is there a way?
>
> Thanks
> Kalpana
>
>
>
>



--
Regards,
Shalin Shekhar Mangar.








Re: "Block join faceting is allowed with ToParentBlockJoinQuery only"

2016-07-07 Thread Mikhail Khludnev
Hello,

I hardly understand why you need to find the text:(Moby*) guys twice. Finding
them once under {!parent} should be enough from my POV. Are you sure that just
using fl=[child] isn't enough, and if not, why?
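
For reference, a rough SolrJ sketch of the fl=[child] approach (the parent
filter and field names are taken from the query below, the collection name is
made up, and a SolrJ 5.x/6.x client is assumed). Note that it only attaches the
matching child documents to each parent; it does not by itself produce facet
counts:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class ChildTransformerSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical collection URL; assumes parent/child docs were block-indexed.
        try (HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/mycollection")) {
            // Same block-join structure as in the query below: parents are type_s:wemi.
            SolrQuery q = new SolrQuery("{!parent which='type_s:wemi' v='-type_s:wemi AND text:(Moby*)'}");
            // Attach the child documents that matched to each returned parent.
            q.setFields("*", "[child parentFilter=type_s:wemi childFilter=text:Moby* limit=100]");
            System.out.println(client.query(q).getResults());
        }
    }
}
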
On 6 July 2016 at 13:48, "Sebastian Riemer"  wrote:

> Hi,
>
> Please consider the following three queries:
>
>
> (1)this works:
>
> {
> "responseHeader": {
> "status": 0,
> "QTime": 5,
> "params": {
>   "q": "(type_s:wemi AND {!parent which='type_s:wemi'v='-type_s:wemi
> AND (((text:(Moby*'})",
>   "facet.field": "m_mainAuthority_s",
>   "indent": "true",
>   "fq": "m_id_l:[* TO *]",
>   "wt": "json",
>   "facet": "true",
>   "child.facet.field": [
> "corporateBodyContainer_name_t_ns_fac",
> "personContainer_name_t_ns_fac"
>   ],
>   "_": "1467801413472"
> }
>   },
>
> (2)this also works:
>
> "responseHeader": {
>
> "status": 0,
>
> "QTime": 0,
>
> "params": {
>
>   "q": "(((text:(Moby*(type_s:wemi AND {!parent
> which='type_s:wemi'v='-type_s:wemi AND (((text:(Moby*'})",
>
>   "indent": "true",
>
>   "fq": "m_id_l:[* TO *]",
>
>   "wt": "json",
>
>   "_": "1467801481986"
>
> }
>
>   },
>
>
>
> (3)this does not:
>
> {
>
>   "responseHeader": {
>
> "status": 400,
>
> "QTime": 3,
>
> "params": {
>
>   "q": "(((text:(Moby*(type_s:wemi AND {!parent
> which='type_s:wemi'v='-type_s:wemi AND (((text:(Moby*'})",
>
>   "facet.field": "m_mainAuthority_s",
>
>   "indent": "true",
>
>   "fq": "m_id_l:[* TO *]",
>
>   "wt": "json",
>
>   "facet": "true",
>
>   "child.facet.field": [
>
> "corporateBodyContainer_name_t_ns_fac",
>
> "personContainer_name_t_ns_fac"
>
>   ],
>
>   "_": "1467801452826"
>
> }
>
>   },
>
>
> (1)returns me parent documents where the child document contains the
> term "Moby*" including facets on a parent doc field AND facets on child doc
> fields (Nice!)
>
> (2)returns me parent documents where either the parent document or the
> child document contains the term "Moby*" (Hell yea!)
>
> (3)Fails with the error message "Block join faceting is allowed with
> ToParentBlockJoinQuery only" (Nay :()
>
> So, I want both, the possibility to search for a term in all fields of the
> parent and the child docs AND to receive the facet counts for fields of the
> parent AND the child. Is what I long for possible, and if so could you
> please punch me in the right direction?
>
> Many thanks,
> Sebastian
>


Re: export handler date fields

2016-07-07 Thread Alexandre Rafalovitch
https://issues.apache.org/jira/browse/SOLR-9187

Should be fixed in 6.2

Regards,
   Alex.

Newsletter and resources for Solr beginners and intermediates:
http://www.solr-start.com/


On 7 July 2016 at 19:13, Matteo Grolla  wrote:
> Hi,
> is there a reason why the export handler doesn't support date fields?
> thanks
>
> Matteo Grolla


export handler date fields

2016-07-07 Thread Matteo Grolla
Hi,
is there a reason why the export handler doesn't support date fields?
thanks

Matteo Grolla


Re: deploy solr on cloud providers

2016-07-07 Thread Lorenzo Fundaró
Thank you Tomás, I will take a thorough look at the JIRA ticket you're
pointing out.
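
For reference, the min_rf mechanism discussed below looks roughly like this
from SolrJ. This is only a sketch under assumptions: a SolrJ 5.x/6.x
CloudSolrClient, a made-up ZooKeeper address and collection name, and the
achieved replication factor being reported back by the leader in the response
(check the exact response layout for your version):

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.UpdateRequest;
import org.apache.solr.client.solrj.response.UpdateResponse;
import org.apache.solr.common.SolrInputDocument;

public class MinRfSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical ZooKeeper host and collection name.
        try (CloudSolrClient client = new CloudSolrClient("localhost:2181")) {
            client.setDefaultCollection("mycollection");

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-1");

            UpdateRequest req = new UpdateRequest();
            req.add(doc);
            // Ask Solr to report whether at least 2 replicas acknowledged the update.
            req.setParam("min_rf", "2");

            UpdateResponse rsp = req.process(client);
            // The achieved factor ("rf") is reported back by the leader; if it is
            // below min_rf, the client should log and/or retry the update.
            Object achievedRf = rsp.getResponseHeader().get("rf");
            System.out.println("achieved rf = " + achievedRf);
        }
    }
}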

On 6 July 2016 at 20:49, Tomás Fernández Löbbe 
wrote:

> On Wed, Jul 6, 2016 at 2:30 AM, Lorenzo Fundaró <
> lorenzo.fund...@dawandamail.com> wrote:
>
> > On 6 July 2016 at 00:00, Tomás Fernández Löbbe 
> > wrote:
> >
> > > The leader will do the replication before responding to the client, so
> > lets
> > > say the leader gets to update it's local copy, but it's terminated
> before
> > > sending the request to the replicas, the client should get either an
> HTTP
> > > 500 or no http response. From the client code you can take action (log,
> > > retry, etc).
> > >
> >
> > If this true then whenever I ask for min_rf having three nodes (1 leader
> +
> > 2 replicas)
> > I should get rf = 3, but in reality i don't.
> >
> >
> > > The "min_rf" is useful for the case where replicas may be down or not
> > > accessible. Again, you can use this for retrying or take any necessary
> > > action on the client side if the desired rf is not achieved.
> > >
> >
> >
> > I think both paragraphs are contradictory. If the leader does the
> > replication before responding to the client, then
> > why is there a need to use the min_rf ? I don;t think is true that you
> get
> > a 200 when the update has been passed to all replicas.
> >
>
> The reason why "min_rf" is there is because:
> * If there are no replicas at the time of the request (e.g. if replicas are
> unreachable and disconnected from ZK)
> * Replicas could fail to ACK the update request from the leader, in that
> case the leader will mark them as unhealthy but would HTTP 200 to the
> client.
>
> So, it could happen that you think your data is being replicated to 3
> replicas, but 2 of them are currently out of service, this means that your
> doc is in a single host, and if that one dies, then you lose that data. In
> order to prevent this, you can ask Solr to tell you how many replicas
> succeeded that update request. You can read more about this in
> https://issues.apache.org/jira/browse/SOLR-5468
>
>
> >
> > The thing is that, when you have persistent storage yo shouldn't worry
> > about this because you know when the node comes back
> > the rest of the index will be sync, the problem is when you don't have
> > persistent storage. For my particular case I have to be extra careful and
> > always
> > make sure that all my replicas have all the data I sent.
> >
> > In any case you should assume that storage on a host can be completely
> lost, no mater if you are deploying on premises or on the cloud. Consider
> that once that host comes back (could be hours later) it could be already
> out of date, and will replicate from the current leader, possibly dropping
> parts or all it's current index.
>
> Tomás
>
>
> >
> > > Tomás
> > >
> > > On Tue, Jul 5, 2016 at 11:39 AM, Lorenzo Fundaró <
> > > lorenzo.fund...@dawandamail.com> wrote:
> > >
> > > > @Tomas and @Steven
> > > >
> > > > I am a bit skeptical about this two statements:
> > > >
> > > > If a node just disappears you should be fine in terms of data
> > > > > availability, since Solr in "SolrCloud" replicates the data as it
> > comes
> > > > it
> > > > > (before sending the http response)
> > > >
> > > >
> > > > and
> > > >
> > > > >
> > > > > You shouldn't "need" to move the storage as SolrCloud will
> replicate
> > > all
> > > > > data to the new node and anything in the transaction log will
> already
> > > be
> > > > > distributed through the rest of the machines..
> > > >
> > > >
> > > > because according to the official documentation here
> > > > <
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/solr/Read+and+Write+Side+Fault+Tolerance
> > > > >:
> > > > (Write side fault tolerant -> recovery)
> > > >
> > > > If a leader goes down, it may have sent requests to some replicas and
> > not
> > > > > others. So when a new potential leader is identified, it runs a
> synch
> > > > > process against the other replicas. If this is successful,
> everything
> > > > > should be consistent, the leader registers as active, and normal
> > > actions
> > > > > proceed
> > > >
> > > >
> > > > I think there is a possibility that an update is not sent by the
> leader
> > > but
> > > > is kept in the local disk and after it comes up again it can sync the
> > > > non-sent data.
> > > >
> > > > Furthermore:
> > > >
> > > > Achieved Replication Factor
> > > > > When using a replication factor greater than one, an update request
> > may
> > > > > succeed on the shard leader but fail on one or more of the
> replicas.
> > > For
> > > > > instance, consider a collection with one shard and replication
> factor
> > > of
> > > > > three. In this case, you have a shard leader and two additional
> > > replicas.
> > > > > If an update request succeeds on the leader but fails on both
> > replicas,
> > > > for
> > > > > whatever reason, the update request is still considered successful
> > from
> > > 

Re: Solr Merge Index

2016-07-07 Thread Shalin Shekhar Mangar
Why do you need the merged core? If the underlying data is changing then
obviously the merged core won't automatically update itself. And another
merge will introduce duplicate data. So this is a bad solution.

You can just keep the two cores and do a distributed search across both of
them? You can specify
shards=http://localhost:8983/solr/core1,http://localhost:8983/solr/core2
as a parameter to your search requests.
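
The shards value is an ordinary request parameter, so any client that lets you
add arbitrary query parameters can send it. A rough SolrJ sketch (host and core
names as in the example above; SolrJ 5.x/6.x client assumed, and both cores are
expected to share a compatible schema and uniqueKey):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class ShardsParamSketch {
    public static void main(String[] args) throws Exception {
        // Send the request to either core; "shards" fans the query out to both
        // and the receiving core merges the results.
        try (HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/core1")) {
            SolrQuery q = new SolrQuery("*:*");
            q.set("shards", "localhost:8983/solr/core1,localhost:8983/solr/core2");
            System.out.println("total hits: " + client.query(q).getResults().getNumFound());
        }
    }
}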

On Thu, Jul 7, 2016 at 4:32 AM, Kalpana 
wrote:

> Hello
>
> I have two sources - Sitecore web index (core 1) and a database table (core
> 2). I have created core 3 which is a merge of core1 and core 2.
>
>
> http://localhost:8983/solr/admin/cores?action=mergeindexes&core=core3&srcCore=sitecore_web_index&srcCore=core2
>
> But when someone publishes a page on Sitecore, the sitecore web index gets
> updated but not the merged core. How can I get the real-time data with the
> merge? Is there a way?
>
> Thanks
> Kalpana
>
>
>
>



-- 
Regards,
Shalin Shekhar Mangar.


search custom tags and attributes and get contents in solr

2016-07-07 Thread Valentina Cavazza

I have a different problem so I created a new thread:

I have a custom field type:

positionIncrementGap="1000">













   

In this field I have to search custom tags and their attributes (I mean 
tags like HTML tags). I would like to be able to search:


a tag with an attribute equal to something, like: attribute="ablock">*


a tag with an attribute that contains a certain word, like: attribute="lang" * >word or like *word*


a tag with an attribute that contains another tag that contains a certain 
word: **>word*: in this case it is important to find the final 
match


In the highlighter if I search a div I want to get the contents inside 
the div.


I think I have to change the tokenizer but do not know which one to use. 
The tokenizer must be compatible with ICUFoldingFilterFactory because I 
need to make accent-insensitive searches.





Re: sorlcloud connection issue

2016-07-07 Thread Shalin Shekhar Mangar
Hi Kent,

There is no point sending multiple emails for the same subject. It
distracts people from the other messages, distributes the conversation and
discourages people from helping you.

Please provide more details about your cluster.
1. How many nodes?
2. How many collections?
3. How many shards?
4. What is the replication factor?
5. Are there connection and read timeouts specified on both client and
server? If yes, what are the values?
6. How often do you commit? Is there auto-commit or explicit commits from
clients?
7. What else is happening on your cluster? What is the write and query load?

The important thing is to figure out why these pools are getting exhausted.
10000 max connections and 500 connections per host is already pretty high,
so there is an underlying cause for such stuck requests.

On Wed, Jul 6, 2016 at 4:56 PM, Kent Mu  wrote:

> Hi friends!
> *solr version: 4.9.0*
>
> I came across a problem when use solrcloud, it becomes dead lock, we got
> the java core log, it looks like the http connection pool is exhausted and
> most threads are waiting to get a free connection..
>
> I posted the problem in JIRA, the link is
> https://issues.apache.org/jira/browse/SOLR-9253
> I have increased http connection defaults for the SolrJ client, and also
> configed the connection defaults in solr.xml for all shard servers as
> below.
>
>  class="HttpShardHandlerFactory">
> 6
> 3
> 1
> 500
> 
>
> *besides, we use the singleton pattern to connect solrlcoud.*
>
> public synchronized static CloudSolrServer getSolrCloudReadServer() {
> if (reviewSolrCloudServer == null) {
>  ModifiableSolrParams params = new ModifiableSolrParams();
>  params.set(HttpClientUtil.PROP_MAX_CONNECTIONS, 2000);
>  params.set(HttpClientUtil.PROP_MAX_CONNECTIONS_PER_HOST, 500);
>  HttpClient client = HttpClientUtil.createClient(params);
>  LBHttpSolrServer lbServer = new LBHttpSolrServer(client);
>
>
> lbServer.setConnectionTimeout(ReviewProperties.getCloudConnectionTimeOut());
>lbServer.setSoTimeout(ReviewProperties.getCloudSoTimeOut());
>
>  reviewSolrCloudServer = new
> CloudSolrServer(ReviewProperties.getZkHost(),lbServer);
>  reviewSolrCloudServer.setDefaultCollection("commodityReview");
>
>
> reviewSolrCloudServer.setZkClientTimeout(ReviewProperties.getZkClientTimeout());
>
>
> reviewSolrCloudServer.setZkConnectTimeout(ReviewProperties.getZkConnectTimeout());
>  reviewSolrCloudServer.connect();
> }
> return reviewSolrCloudServer;
> }
>
> *the java stack as below*
>
> "httpShardExecutor-3-thread-541" prio=10 tid=0x7f7b1c02b000 nid=0x20af
> waiting on condition [0x7f79fd49]
>java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for  <0x000605710068> (a
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
> at
>
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
> at org.apache.http.pool.PoolEntryFuture.await(PoolEntryFuture.java:133)
> at
>
> org.apache.http.pool.AbstractConnPool.getPoolEntryBlocking(AbstractConnPool.java:282)
> at
> org.apache.http.pool.AbstractConnPool.access$000(AbstractConnPool.java:64)
> at
>
> org.apache.http.pool.AbstractConnPool$2.getPoolEntry(AbstractConnPool.java:177)
> at
>
> org.apache.http.pool.AbstractConnPool$2.getPoolEntry(AbstractConnPool.java:170)
> at org.apache.http.pool.PoolEntryFuture.get(PoolEntryFuture.java:102)
> at
>
> org.apache.http.impl.conn.PoolingClientConnectionManager.leaseConnection(PoolingClientConnectionManager.java:208)
> at
>
> org.apache.http.impl.conn.PoolingClientConnectionManager$1.getConnection(PoolingClientConnectionManager.java:195)
> at
>
> org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:422)
> at
>
> org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863)
> at
>
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
> at
>
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:106)
> at
>
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57)
> at
>
> org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:452)
> at
>
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
> at
>
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
> at
>
> org.apache.solr.client.solrj.impl.LBHttpSolrServer.doRequest(LBHttpSolrServer.java:340)
> at
>
> org.apache.solr.client.solrj.impl.LBHttpSolrServer.request(LBHttpSolrServer.java:301)
> at
>
> org.apache.solr.handler.component.HttpShardHandlerFactory.makeLoadBalancedRequest(HttpShardHandlerFactory.java:205)
> at
>