Re: Stemmer and stopword Development

2015-09-10 Thread Upayavira


On Thu, Sep 10, 2015, at 04:45 AM, Imtiaz Shakil Siddique wrote:
> Hi,
> 
> I am trying to develop a stemmer and stopword list for the Bengali
> language, which is not shipped with Solr.
> 
> I am trying to do this with a machine learning approach, but I couldn't
> find any good documents to study. It would be very helpful if you could
> shed some light on this matter.

How are you going to do this with machine learning? What corpus are you
going to use to learn from? Do you have some documents that have been
manually stemmed for which you also have the originals?

Upayavira


Re: Can solr ttf functionQuery support ngram (n>2) ?

2015-09-10 Thread Jie Gao
I've fixed a typo in the query URL below.

On 10 September 2015 at 10:25, Jie Gao  wrote:

> Hi,
>
> I'm wondering whether the Solr ttf function query supports (compound-word)
> ngrams (n>2)?
>
> I'm using
> "http://localhost:8983/solr/collection1/select?q=*:*&fl=ttf(content,%22apple%20banana%22)&rows=1"
> to query the total term frequency of bigram tokens in the "content" field
> in the whole index.
>
> However, the result (returned as 20) is not consistent with the result
> queried via
> http://localhost:8983/solr/collection1/select?q=content:%22apple%20banana%22.
> I manually checked that the actual occurrence count is 15.
>
> What is the actual behaviour of the ttf function query (I'm using Solr
> 5.3.0)? The reference guide does not explain the details.
>
> Does it perform a full-text index query on this field, or does it rely on
> the tf values stored by the tvComponent?
>
> I have configured the content field with the following textField type:
>
> <fieldType name="..." class="solr.TextField" positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="..."/>
>     <filter class="solr.StopFilterFactory" words="stopwords.txt"
>             enablePositionIncrements="true" />
>     <filter class="solr.ShingleFilterFactory" maxShingleSize="5"
>             outputUnigrams="true" outputUnigramsIfNoShingles="false"
>             tokenSeparator=" "/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="..."/>
>     <filter class="solr.StopFilterFactory" words="stopwords.txt"
>             enablePositionIncrements="true" />
>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>             ignoreCase="true" expand="true" />
>   </analyzer>
> </fieldType>
>
> Any ideas?
>
> Thanks,
> Jerry
>


Re: Boosting related doubt?

2015-09-10 Thread Upayavira
That's curious. Have a look at both the parsed query and the explain
output for a very simple (even *:*) query. You should see the boost
present there and be able to see whether it is applied once or twice.

Upayavira

On Thu, Sep 10, 2015, at 06:16 AM, Aman Tandon wrote:
> Hi,
> 
> I need to ask: when I am looking at all the parameters of the query
> using *echoParams=ALL*, I am getting the boost parameter twice in the
> information printed on the browser screen.
> 
> So does it mean that it is also applied twice on the data/result set?
> 
> 
> **
> *  0*
> *  66*
> *  *
> **
> *  map(query({!dismax qf=mcatid v=$mc1 pf=""}),0,0,1,2.0)*
> *  map(eff_views,1,2,1.15,1)*
> *  map(query({!dismax qf=titlex v=$ql1 pf=""}),0,0,1,1.5)*
> *  map(query({!dismax qf=titlex v=$ql2 pf=""}),0,0,1,1.5)*
> *  map(query({!dismax qf=attribs v='poorDescription'
> pf=''},0),0,0,1,0.02)*
> *  if(exists(itemprice2),map(query({!dismax qf=itemprice2
> v='0'}),0,0,1.2,1),1)*
> *  map(sdesclen,0,150,1,1.5)*
> *  map(sdesclen,0,0,0.1,1)*
> *  map(CustTypeWt,700,1869,1.1,1)*
> *  map(CustTypeWt,699,699,1.2,1)*
> *  map(CustTypeWt,199,199,1.3,1)*
> *  map(CustTypeWt,0,179,1.35,1)*
> *  map(CustTypeWt,3399,3999,0.07,1)*
> *  map(query({!dismax qf=attribs v='hot'}),0,0,1,1.2)*
> *  map(query({!dismax qf=isphoto v='true'
> pf=""}),0,0,0.05,1)*
> **
> **
> *  mcatid:(1223 6240 825 1936 31235)
> titlex:("imswjutebagimsw")*
> *  attribs:(locprefglobal locprefnational locprefcity
> locprefunknown)*
> *  displayid:4768979112*
> *  +((+datatype:product +attribs:(aprstatus20 aprstatus40
> aprstatus50) +aggregate:true -attribs:liststatusnfl +((+countryiso:IN
> +isfcp:true) (+CustTypeWt:[149 TO 1499]) CustTypeWt:1870))
> (+datatype:company -attribs:liststatusnfl +((+countryiso:IN +isfcp:true)
> (+CustTypeWt:[149 TO 1499]) CustTypeWt:1870)))
> -attribs:liststatusdnf*
> **
> *2-1 470%*
> **
> *  {!ex=cityf}city*
> *  {!ex=datatypef}datatype*
> *  {!ex=biztypef}biztype*
> **
> *default*
> *ALL*
> * name="fl">displayid,datatype,title,smalldescorg,photo,catid,mcatname,companyname,CustTypeWt,glusrid,usrpcatflname,paidurl,fcpurl,city,state,countryname,countryiso,tscode,address,state,zipcode,phone,mobile,contactperson,pns,dupimg,smalldesc,etoofrqty,lastactiondatet,mcatid,isadult,pnsdisabled,membersince,locpref,categoryinfo,distance:geodist($lat,$lon,latlon),iildisplayflag,dispflagval,biztype,datarefid,parentglusrid,itemcode,itemprice,itemcurrency,largedesc,ecom_url,ecom_source_id,moq,moq_type*
> *0*
> *20*
> *true*
> *true*
> *15*
> *true*
> **
> *  mcatnametext^0.2*
> *  titlews^0.5*
> *  smalldesc^0.01*
> *  title_text^1.5*
> *  usrpcatname^0.1*
> *  customspell^0.1*
> **
> *true*
> **
> *  mcatnametext^0.5*
> *  titlews*
> *  title_text^3*
> *  usrpcatname^0.1*
> *  smalldesc^0.01*
> *  customspell^0.1*
> **
> *true*
> *1*
> *10*
> *xml*
> *true*
> *0*
> *parentglusrid*
> *true*
> *true*
> *im.search*
> *2*
> *true*
> *ALL*
> *1*
> *0*
> **
> *  mcatid:(1223 6240 825 1936 31235)
> titlex:("imswjutebagimsw")*
> *  attribs:(locprefglobal locprefnational locprefcity
> locprefunknown)*
> *  displayid:4768979112*
> *  +((+datatype:product +attribs:(aprstatus20 aprstatus40
> aprstatus50) +aggregate:true -attribs:liststatusnfl +((+countryiso:IN
> +isfcp:true) (+CustTypeWt:[149 TO 1499]) CustTypeWt:1870))
> (+datatype:company -attribs:liststatusnfl +((+countryiso:IN +isfcp:true)
> (+CustTypeWt:[149 TO 1499]) CustTypeWt:1870)))
> -attribs:liststatusdnf*
> **
> *20*
> *jute bags*
> *true*
> *"jutebagimsw"*
> *"bagimsw"*
> *"1223"*
> **
> *  map(query({!dismax qf=mcatid v=$mc1 pf=""}),0,0,1,2.0)*
> *  map(eff_views,1,2,1.15,1)*
> *  map(query({!dismax qf=titlex v=$ql1 pf=""}),0,0,1,1.5)*
> *  map(query({!dismax qf=titlex v=$ql2 pf=""}),0,0,1,1.5)*
> *  map(query({!dismax qf=attribs v='poorDescription'
> pf=''},0),0,0,1,0.02)*
> *  if(exists(itemprice2),map(query({!dismax qf=itemprice2
> v='0'}),0,0,1.2,1),1)*
> *  map(sdesclen,0,150,1,1.5)*
> *  map(sdesclen,0,0,0.1,1)*
> *  map(CustTypeWt,700,1869,1.1,1)*
> *  map(CustTypeWt,699,699,1.2,1)*
> *  map(CustTypeWt,199,199,1.3,1)*
> *  map(CustTypeWt,0,179,1.35,1)*
> *  map(CustTypeWt,3399,3999,0.07,1)*
> *  map(query({!dismax qf=attribs v='hot'}),0,0,1,1.2)*
> *  map(query({!dismax qf=isphoto v='true'
> pf=""}),0,0,0.05,1)*
> **
> *xml*
> *0*
> *0.3*
> *synonym_edismax*
> *on*
> *true*
> *  *
> **
> 
> 
> With Regards
> Aman Tandon
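A quick way to check this, assuming a core named collection1 (the core name
here is just a placeholder): debugQuery=true adds both the parsed query and
the per-document explain section to the response:

    http://localhost:8983/solr/collection1/select?q=*:*&debugQuery=true&echoParams=all

If the boost appears only once in the parsed query, it is only being applied
once, however many times echoParams happens to list it.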


Re: Search results differs with sorting on pagination.

2015-09-10 Thread Modassar Ather
Upayavira! I added fl=id,score,[shard] and saw the shards changing in the
response every time; for different shards the response changes, but for the
same shard the result is the same across multiple hits.
When I add a secondary sort field, e.g. score, the shard remains the same
across hits.

On Thu, Sep 10, 2015 at 12:52 PM, Upayavira  wrote:

> Add fl=id,score,[shard] to your query, and show us the results of two
> differing executions.
>
> Perhaps we will be able to see the cause of the difference.
>
> Upayavira
>
> On Thu, Sep 10, 2015, at 05:35 AM, Modassar Ather wrote:
> > Thanks Erick. There are no replicas on my cluster and the indexing is one
> > time. No updates or additions are done to the index and the segments are
> > optimized at the end of indexing.
> > So adding a secondary sort criteria is the only solution for such issue
> > in
> > sort?
> >
> > Regards,
> > Modassar
> >
> > On Wed, Sep 9, 2015 at 8:21 PM, Erick Erickson 
> > wrote:
> >
> > > When the primary sort criteria is identical for two documents,
> > > then the _internal_ Lucene document ID is used to break the
> > > tie. The internal ID for two docs can be not only different, but
> > > in different _order_ on two separate shards. I'm assuming here
> > > that  each of your shards has multiple replicas and/or you're
> > > continuing to index to your cluster.
> > >
> > > The relative internal doc IDs may change even relative to
> > > each other when segments get merged.
> > >
> > > So yes, if you are sorting by something that can be identical
> > > in documents, it's always best to specify a secondary sort
> > > criteria. It's not referenced unless there's a tie so it's
> > > not that expensive. People often use whatever field
> > > is defined for <uniqueKey> since that's _guaranteed_ to
> > > never be the same for two docs.
> > >
> > > Best,
> > > Erick
> > >
> > > On Wed, Sep 9, 2015 at 1:45 AM, Modassar Ather  >
> > > wrote:
> > > > Hi,
> > > >
> > > > Search results are changed every time the following query is hit.
> Please
> > > > note that it is 7 shard cluster of Solr-5.2.1.
> > > >
> > > > Query: q=network=50=50=f_sort
> > > asc=true=id
> > > >
> > > > Following are the fields and their types in my schema.xml.
> > > >
> > > >  > > > stored="false" omitNorms="true"/>
> > > >  sortMissingLast="true"
> > > > stored="false" indexed="true" docValues="true"/>
> > > >
> > > > 
> > > > 
> > > >
> > > > As per my understanding it seems to be the issue of tie among the
> > > document
> > > > as when I added a new sort field like below the result never changed
> > > across
> > > > multiple hits.
> > > > q=network=50=50=f_sort asc, score
> > > > asc=true=id
> > > >
> > > > Kindly let me know if this is an issue or how this can be fixed.
> > > >
> > > > Thanks,
> > > > Modassar
> > >
>


Re: Stemmer and stopword Development

2015-09-10 Thread Imtiaz Shakil Siddique
Thanks for the reply.

Currently I have 20GB of Bengali newspaper data (for corpus building).
I don't have a manually stemmed corpus, but if needed I will build one.

Basically I need guidance on how to do this.
If there are standard approaches to building a stemmer and stopword list
for use with Solr, then please share them.

Thank you Upayavira for your kind help.

Imtiaz Shakil Siddique


On 10 September 2015 at 13:23, Upayavira  wrote:

>
>
> On Thu, Sep 10, 2015, at 04:45 AM, Imtiaz Shakil Siddique wrote:
> > Hi,
> >
> > I am trying to develop a stemmer and stopword list for the Bengali
> > language, which is not shipped with Solr.
> >
> > I am trying to do this with a machine learning approach, but I couldn't
> > find any good documents to study. It would be very helpful if you could
> > shed some light on this matter.
>
> How are you going to do this with machine learning? What corpus are you
> going to use to learn from? Do you have some documents that have been
> manually stemmed for which you also have the originals?
>
> Upayavira
>


Re: How to reordering search result by some function query

2015-09-10 Thread Leonardo Foderaro
Hi Aman,
if you want to sort/filter/boost on a custom function query please take a
look at the Alba Framework, maybe it can be useful.

For example, you can define a new function query (and then
sort/filter/boost on it) simply by adding some annotations to your method,
as explained in the wiki:

https://github.com/leonardofoderaro/alba/wiki/Your-first-queryfunction:-the-title-length

Disclaimer: I'm the author of Alba. Please keep in mind it's a young
project, and it has never been used in production.

Thanks
Leonardo

On Thu, Sep 10, 2015 at 8:33 AM, Aman Tandon 
wrote:

> Hi,
>
> I figured out how to implement this. I will be doing it by using the
> boost parameter,
>
> e.g. http://server:8112/solr/products/select?q=jute&qf=title&boost=product(1,product_guideline_score)
>
> If there is any other alternative then please suggest.
>
> With Regards
> Aman Tandon
>
> On Thu, Sep 10, 2015 at 11:02 AM, Aman Tandon 
> wrote:
>
> > Hi,
> >
> > I have a requirement to reorder the search results by multiplying the
> > *text relevance score* of a product with the *product_guideline_score*,
> > which will be stored in the index and will hold some floating-point
> > number.
> >
> > e.g. On searching the *jute* in title if we got some results ID1 & ID2
> >
> > ID1 -> title = jute
> >   score = 8.0
> > *  product_guideline_score = 2.0*
> >
> > ID2 -> title = jute bags
> >   score = 7.5
> > *  product_guideline_score** = 2.2*
> >
> > So the new score should be like this
> >
> > ID1 -> title = jute
> >   score = *product_score * 8 = 16.0*
> > *  product_guideline_score** = 2.0*
> >
> > ID2 -> title = jute bags
> >   score = *product_score * 7.5 = 16.5*
> > *  product_guideline_score** = 2.2*
> >
> > *So new ordering should be*
> >
> > ID2 -> title = jute bags
> >   score* = 16.5*
> >
> > ID1 -> title = jute
> >   score =* 16.0*
> >
> > How can I do this in single query on runtime in solr.
> >
> > With Regards
> > Aman Tandon
> >
>


Broken highlight truncation for hl.alternateField

2015-09-10 Thread Thibaud Bioulac
Hello everybody,

I have the exact same issue as Arcadius Ahouansou (see his earlier thread,
"Broken highlight truncation for hl.alternateField").

To sum up: when I'm using the highlighting feature, the breakIterator
(boundaryScanner) doesn't seem to be applied to hl.alternateField and
hl.maxAlternateFieldLength.

I was wondering if this feature is now available (the mail from Arcadius
Ahouansou is dated September 2012...) or if there is a workaround.

I'm using Solr 4.6.1.

Thank you all :)

Thibaud


Re: Search results differs with sorting on pagination.

2015-09-10 Thread Upayavira
Add fl=id,score,[shard] to your query, and show us the results of two
differing executions.

Perhaps we will be able to see the cause of the difference.

Upayavira

On Thu, Sep 10, 2015, at 05:35 AM, Modassar Ather wrote:
> Thanks Erick. There are no replicas on my cluster and the indexing is one
> time. No updates or additions are done to the index and the segments are
> optimized at the end of indexing.
> So adding a secondary sort criteria is the only solution for such issue
> in
> sort?
> 
> Regards,
> Modassar
> 
> On Wed, Sep 9, 2015 at 8:21 PM, Erick Erickson 
> wrote:
> 
> > When the primary sort criteria is identical for two documents,
> > then the _internal_ Lucene document ID is used to break the
> > tie. The internal ID for two docs can be not only different, but
> > in different _order_ on two separate shards. I'm assuming here
> > that  each of your shards has multiple replicas and/or you're
> > continuing to index to your cluster.
> >
> > The relative internal doc IDs may change even relative to
> > each other when segments get merged.
> >
> > So yes, if you are sorting by something that can be identical
> > in documents, it's always best to specify a secondary sort
> > criteria. It's not referenced unless there's a tie so it's
> > not that expensive. People often use whatever field
> > is defined for <uniqueKey> since that's _guaranteed_ to
> > never be the same for two docs.
> >
> > Best,
> > Erick
> >
> > On Wed, Sep 9, 2015 at 1:45 AM, Modassar Ather 
> > wrote:
> > > Hi,
> > >
> > > Search results are changed every time the following query is hit. Please
> > > note that it is 7 shard cluster of Solr-5.2.1.
> > >
> > > Query: q=network=50=50=f_sort
> > asc=true=id
> > >
> > > Following are the fields and their types in my schema.xml.
> > >
> > >  > > stored="false" omitNorms="true"/>
> > >  > > stored="false" indexed="true" docValues="true"/>
> > >
> > > 
> > > 
> > >
> > > As per my understanding it seems to be the issue of tie among the
> > document
> > > as when I added a new sort field like below the result never changed
> > across
> > > multiple hits.
> > > q=network=50=50=f_sort asc, score
> > > asc=true=id
> > >
> > > Kindly let me know if this is an issue or how this can be fixed.
> > >
> > > Thanks,
> > > Modassar
> >


Re: Search results differs with sorting on pagination.

2015-09-10 Thread Modassar Ather
To add to my previous observation: I saw the response having results from
multiple shards when the secondary sort field is added, and they remain the
same across hits.
Kindly help me understand this behavior. Why are the results changing? As I
understand it, the results should first be merged together from all shards
and then sorted based on their score.
But here I see that every time I hit the sort query I am getting results
from a different shard, which has different scores.

Thanks,
Modassar

On Thu, Sep 10, 2015 at 2:59 PM, Modassar Ather 
wrote:

> Upayavira! I add the fl=id,score,[shard] and saw the shards changing in
> the response every time and for different shards the response changes but
> for the same shard result is same on multiple hits.
> When I add secondary sort field e.g. score the shard remains same across
> hits.
>
> On Thu, Sep 10, 2015 at 12:52 PM, Upayavira  wrote:
>
>> Add fl=id,score,[shard] to your query, and show us the results of two
>> differing executions.
>>
>> Perhaps we will be able to see the cause of the difference.
>>
>> Upayavira
>>
>> On Thu, Sep 10, 2015, at 05:35 AM, Modassar Ather wrote:
>> > Thanks Erick. There are no replicas on my cluster and the indexing is
>> one
>> > time. No updates or additions are done to the index and the segments are
>> > optimized at the end of indexing.
>> > So adding a secondary sort criteria is the only solution for such issue
>> > in
>> > sort?
>> >
>> > Regards,
>> > Modassar
>> >
>> > On Wed, Sep 9, 2015 at 8:21 PM, Erick Erickson > >
>> > wrote:
>> >
>> > > When the primary sort criteria is identical for two documents,
>> > > then the _internal_ Lucene document ID is used to break the
>> > > tie. The internal ID for two docs can be not only different, but
>> > > in different _order_ on two separate shards. I'm assuming here
>> > > that  each of your shards has multiple replicas and/or you're
>> > > continuing to index to your cluster.
>> > >
>> > > The relative internal doc IDs may change even relative to
>> > > each other when segments get merged.
>> > >
>> > > So yes, if you are sorting by something that can be identical
>> > > in documents, it's always best to specify a secondary sort
>> > > criteria. It's not referenced unless there's a tie so it's
>> > > not that expensive. People often use whatever field
>> > > is defined for <uniqueKey> since that's _guaranteed_ to
>> > > never be the same for two docs.
>> > >
>> > > Best,
>> > > Erick
>> > >
>> > > On Wed, Sep 9, 2015 at 1:45 AM, Modassar Ather <
>> modather1...@gmail.com>
>> > > wrote:
>> > > > Hi,
>> > > >
>> > > > Search results are changed every time the following query is hit.
>> Please
>> > > > note that it is 7 shard cluster of Solr-5.2.1.
>> > > >
>> > > > Query: q=network=50=50=f_sort
>> > > asc=true=id
>> > > >
>> > > > Following are the fields and their types in my schema.xml.
>> > > >
>> > > > > sortMissingLast="true"
>> > > > stored="false" omitNorms="true"/>
>> > > > > sortMissingLast="true"
>> > > > stored="false" indexed="true" docValues="true"/>
>> > > >
>> > > > 
>> > > > 
>> > > >
>> > > > As per my understanding it seems to be the issue of tie among the
>> > > document
>> > > > as when I added a new sort field like below the result never changed
>> > > across
>> > > > multiple hits.
>> > > > q=network=50=50=f_sort asc, score
>> > > > asc=true=id
>> > > >
>> > > > Kindly let me know if this is an issue or how this can be fixed.
>> > > >
>> > > > Thanks,
>> > > > Modassar
>> > >
>>
>
>


Can solr ttf functionQuery support ngram (n>2) ?

2015-09-10 Thread Jie Gao
Hi,

I'm wondering whether the Solr ttf function query supports (compound-word)
ngrams (n>2)?

I'm using
"http://localhost:8983/solr/collection1/select?q=*:*&fl=ttf(content,%22apple%20banana%22)&rows=1"
to query the total term frequency of bigram tokens in the "content" field
in the whole index.

However, the result (returned as 20) is not consistent with the result
queried via
http://localhost:8983/solr/tatasteel/select?q=content:%22apple%20banana%22.
I manually checked that the actual occurrence count is 15.

What is the actual behaviour of the ttf function query (I'm using Solr
5.3.0)? The reference guide does not explain the details.

Does it perform a full-text index query on this field, or does it rely on
the tf values stored by the tvComponent?

I have configured the content field with the following textField type:

<fieldType name="..." class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="..."/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt"
            enablePositionIncrements="true" />
    <filter class="solr.ShingleFilterFactory" maxShingleSize="5"
            outputUnigrams="true" outputUnigramsIfNoShingles="false"
            tokenSeparator=" "/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="..."/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt"
            enablePositionIncrements="true" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true" />
  </analyzer>
</fieldType>

Any ideas?

Thanks,
Jerry


Re: How to reordering search result by some function query

2015-09-10 Thread Upayavira
Aman,

If you are using edismax then what you have written is just fine.

For Lucene query parser queries, wrap them with the boost query parser:

q={!boost b=product_guideline_score v=$qq}&qq=jute

Note in your example you don't need product(), just do
boost=product_guideline_score

Upayavira

On Thu, Sep 10, 2015, at 07:33 AM, Aman Tandon wrote:
> Hi,
> 
> I figured out how to implement this. I will be doing it by using the
> boost parameter,
> 
> e.g. http://server:8112/solr/products/select?q=jute&qf=title&boost=product(1,product_guideline_score)
> 
> If there is any other alternative then please suggest.
> 
> With Regards
> Aman Tandon
> 
> On Thu, Sep 10, 2015 at 11:02 AM, Aman Tandon 
> wrote:
> 
> > Hi,
> >
> > I have a requirement to reorder the search results by multiplying the
> > *text relevance score* of a product with the *product_guideline_score*,
> > which will be stored in the index and will hold some floating-point
> > number.
> >
> > e.g. On searching the *jute* in title if we got some results ID1 & ID2
> >
> > ID1 -> title = jute
> >   score = 8.0
> > *  product_guideline_score = 2.0*
> >
> > ID2 -> title = jute bags
> >   score = 7.5
> > *  product_guideline_score** = 2.2*
> >
> > So the new score should be like this
> >
> > ID1 -> title = jute
> >   score = *product_score * 8 = 16.0*
> > *  product_guideline_score** = 2.0*
> >
> > ID2 -> title = jute bags
> >   score = *product_score * 7.5 = 16.5*
> > *  product_guideline_score** = 2.2*
> >
> > *So new ordering should be*
> >
> > ID2 -> title = jute bags
> >   score* = 16.5*
> >
> > ID1 -> title = jute
> >   score =* 16.0*
> >
> > How can I do this in single query on runtime in solr.
> >
> > With Regards
> > Aman Tandon
> >
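Putting this together with the earlier example, a complete edismax request
could look like the sketch below (host, collection, and field names are the
ones used in this thread; defType=edismax is assumed):

    http://server:8112/solr/products/select?q=jute&defType=edismax&qf=title&boost=product_guideline_score&fl=id,title,score

With edismax, the boost parameter multiplies the function's value into the
relevance score, which is exactly the reordering described above.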


Re: Stemmer and stopword Development

2015-09-10 Thread Upayavira
I haven't heard of any machine learning based stemmers. I'm not really
sure what algorithm you would use to do stemming - what you'd be looking
for is something that says, well, running stemmed to run, walking
stemmed to walk, therefore hopping should stem to hop, but that'd be
quite an algorithm to develop, I'd say.

There are a few ways you could handle this:

1) locate a Bengali linguist who can help you define an algorithm
2) manually stem a large number of documents and use that as a basis
   for stemming

If you had a stemmed corpus, you could simply use synonyms to do it, in
English, you could map:

run,running,runs,ran,runner=>run
walk,walked,walking,walker=>walk

Then all you need to do is generate a synonym file and use the
SynonymFilterFactory with it, in place of a stemmer.

Would that work?

Upayavira

On Thu, Sep 10, 2015, at 09:59 AM, Imtiaz Shakil Siddique wrote:
> Thanks for the reply.
> 
> Currently I have 20GB of Bengali newspaper data (for corpus building).
> I don't have a manually stemmed corpus, but if needed I will build one.
> 
> Basically I need guidance on how to do this.
> If there are standard approaches to building a stemmer and stopword list
> for use with Solr, then please share them.
> 
> Thank you Upayavira for your kind help.
> 
> Imtiaz Shakil Siddique
> 
> 
> On 10 September 2015 at 13:23, Upayavira  wrote:
> 
> >
> >
> > On Thu, Sep 10, 2015, at 04:45 AM, Imtiaz Shakil Siddique wrote:
> > > Hi,
> > >
> > > I am trying to develop a stemmer and stopword list for the Bengali
> > > language, which is not shipped with Solr.
> > >
> > > I am trying to do this with a machine learning approach, but I couldn't
> > > find any good documents to study. It would be very helpful if you could
> > > shed some light on this matter.
> >
> > How are you going to do this with machine learning? What corpus are you
> > going to use to learn from? Do you have some documents that have been
> > manually stemmed for which you also have the originals?
> >
> > Upayavira
> >
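As a sketch of that setup, the analyzer chain could look like the following
(the field type name and synonyms file name are hypothetical; stems_bn.txt
would hold mapping rules in the run,running,runs,ran,runner=>run format shown
above, where every token on the left-hand side is replaced by the stem on the
right):

    <fieldType name="text_bn" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="stems_bn.txt"
                ignoreCase="true"/>
      </analyzer>
    </fieldType>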


Re: Search results differs with sorting on pagination.

2015-09-10 Thread Modassar Ather
If two documents come back from different
shards with the same score, the order would not be predictable

This is fine.

What I am not able to understand is that when I do not give a secondary
sort field I am getting the result from one shard, which changes to
another shard on other hits. Here the results are always from one shard;
e.g. in the first hit all the results are from shard1, and in the next hit
all the results are from shard2.

But when I add the secondary sort field I see results from multiple
shards, e.g. from both shard1 and shard2. This does not change across
multiple hits.

So please help me understand why the same result merge and aggregation
is not happening when a single sort field is given?

Regards,
Modassar



On Thu, Sep 10, 2015 at 5:03 PM, Upayavira  wrote:

> What scores are you getting? If two documents come back from different
> shards with the same score, the order would not be predictable -
> probably down to which shard responds first.
>
> Fix it with something like sort=score,timestamp or some other time
> related field.
>
> Upayavira
>
> On Thu, Sep 10, 2015, at 11:01 AM, Modassar Ather wrote:
> > To add to my previous observation I saw the response having results from
> > multiple shards when the secondary sort field is added and they remain
> > same
> > across hits.
> > Kindly help me understand this behavior. Why the results are changing as
> > I
> > understand that the result should be first clubbed together from all
> > shard
> > and then based on their score it should be sorted.
> > But here I see that every time I hit the sort query I am getting results
> > from different shard which has different scores.
> >
> > Thanks,
> > Modassar
> >
> > On Thu, Sep 10, 2015 at 2:59 PM, Modassar Ather 
> > wrote:
> >
> > > Upayavira! I add the fl=id,score,[shard] and saw the shards changing in
> > > the response every time and for different shards the response changes
> but
> > > for the same shard result is same on multiple hits.
> > > When I add secondary sort field e.g. score the shard remains same
> across
> > > hits.
> > >
> > > On Thu, Sep 10, 2015 at 12:52 PM, Upayavira  wrote:
> > >
> > >> Add fl=id,score,[shard] to your query, and show us the results of two
> > >> differing executions.
> > >>
> > >> Perhaps we will be able to see the cause of the difference.
> > >>
> > >> Upayavira
> > >>
> > >> On Thu, Sep 10, 2015, at 05:35 AM, Modassar Ather wrote:
> > >> > Thanks Erick. There are no replicas on my cluster and the indexing
> is
> > >> one
> > >> > time. No updates or additions are done to the index and the
> segments are
> > >> > optimized at the end of indexing.
> > >> > So adding a secondary sort criteria is the only solution for such
> issue
> > >> > in
> > >> > sort?
> > >> >
> > >> > Regards,
> > >> > Modassar
> > >> >
> > >> > On Wed, Sep 9, 2015 at 8:21 PM, Erick Erickson <
> erickerick...@gmail.com
> > >> >
> > >> > wrote:
> > >> >
> > >> > > When the primary sort criteria is identical for two documents,
> > >> > > then the _internal_ Lucene document ID is used to break the
> > >> > > tie. The internal ID for two docs can be not only different, but
> > >> > > in different _order_ on two separate shards. I'm assuming here
> > >> > > that  each of your shards has multiple replicas and/or you're
> > >> > > continuing to index to your cluster.
> > >> > >
> > >> > > The relative internal doc IDs may change even relative to
> > >> > > each other when segments get merged.
> > >> > >
> > >> > > So yes, if you are sorting by something that can be identical
> > >> > > in documents, it's always best to specify a secondary sort
> > >> > > criteria. It's not referenced unless there's a tie so it's
> > >> > > not that expensive. People often use whatever field
> > >> > > is defined for <uniqueKey> since that's _guaranteed_ to
> > >> > > never be the same for two docs.
> > >> > >
> > >> > > Best,
> > >> > > Erick
> > >> > >
> > >> > > On Wed, Sep 9, 2015 at 1:45 AM, Modassar Ather <
> > >> modather1...@gmail.com>
> > >> > > wrote:
> > >> > > > Hi,
> > >> > > >
> > >> > > > Search results are changed every time the following query is
> hit.
> > >> Please
> > >> > > > note that it is 7 shard cluster of Solr-5.2.1.
> > >> > > >
> > >> > > > Query: q=network=50=50=f_sort
> > >> > > asc=true=id
> > >> > > >
> > >> > > > Following are the fields and their types in my schema.xml.
> > >> > > >
> > >> > > >  > >> sortMissingLast="true"
> > >> > > > stored="false" omitNorms="true"/>
> > >> > > >  > >> sortMissingLast="true"
> > >> > > > stored="false" indexed="true" docValues="true"/>
> > >> > > >
> > >> > > > 
> > >> > > > 
> > >> > > >
> > >> > > > As per my understanding it seems to be the issue of tie among
> the
> > >> > > document
> > >> > > > as when I added a new sort field like below the result never
> changed
> > >> > > across
> > >> > > > multiple hits.
> > >> > > > q=network=50=50=f_sort asc, score
> > >> > > 

Re: Search results differs with sorting on pagination.

2015-09-10 Thread Upayavira
What scores are you getting? If two documents come back from different
shards with the same score, the order would not be predictable -
probably down to which shard responds first. 

Fix it with something like sort=score,timestamp or some other time
related field.

Upayavira

On Thu, Sep 10, 2015, at 11:01 AM, Modassar Ather wrote:
> To add to my previous observation I saw the response having results from
> multiple shards when the secondary sort field is added and they remain
> same
> across hits.
> Kindly help me understand this behavior. Why the results are changing as
> I
> understand that the result should be first clubbed together from all
> shard
> and then based on their score it should be sorted.
> But here I see that every time I hit the sort query I am getting results
> from different shard which has different scores.
> 
> Thanks,
> Modassar
> 
> On Thu, Sep 10, 2015 at 2:59 PM, Modassar Ather 
> wrote:
> 
> > Upayavira! I add the fl=id,score,[shard] and saw the shards changing in
> > the response every time and for different shards the response changes but
> > for the same shard result is same on multiple hits.
> > When I add secondary sort field e.g. score the shard remains same across
> > hits.
> >
> > On Thu, Sep 10, 2015 at 12:52 PM, Upayavira  wrote:
> >
> >> Add fl=id,score,[shard] to your query, and show us the results of two
> >> differing executions.
> >>
> >> Perhaps we will be able to see the cause of the difference.
> >>
> >> Upayavira
> >>
> >> On Thu, Sep 10, 2015, at 05:35 AM, Modassar Ather wrote:
> >> > Thanks Erick. There are no replicas on my cluster and the indexing is
> >> one
> >> > time. No updates or additions are done to the index and the segments are
> >> > optimized at the end of indexing.
> >> > So adding a secondary sort criteria is the only solution for such issue
> >> > in
> >> > sort?
> >> >
> >> > Regards,
> >> > Modassar
> >> >
> >> > On Wed, Sep 9, 2015 at 8:21 PM, Erick Erickson  >> >
> >> > wrote:
> >> >
> >> > > When the primary sort criteria is identical for two documents,
> >> > > then the _internal_ Lucene document ID is used to break the
> >> > > tie. The internal ID for two docs can be not only different, but
> >> > > in different _order_ on two separate shards. I'm assuming here
> >> > > that  each of your shards has multiple replicas and/or you're
> >> > > continuing to index to your cluster.
> >> > >
> >> > > The relative internal doc IDs may change even relative to
> >> > > each other when segments get merged.
> >> > >
> >> > > So yes, if you are sorting by something that can be identical
> >> > > in documents, it's always best to specify a secondary sort
> >> > > criteria. It's not referenced unless there's a tie so it's
> >> > > not that expensive. People often use whatever field
> >> > > is defined for <uniqueKey> since that's _guaranteed_ to
> >> > > never be the same for two docs.
> >> > >
> >> > > Best,
> >> > > Erick
> >> > >
> >> > > On Wed, Sep 9, 2015 at 1:45 AM, Modassar Ather <
> >> modather1...@gmail.com>
> >> > > wrote:
> >> > > > Hi,
> >> > > >
> >> > > > Search results are changed every time the following query is hit.
> >> Please
> >> > > > note that it is 7 shard cluster of Solr-5.2.1.
> >> > > >
> >> > > > Query: q=network=50=50=f_sort
> >> > > asc=true=id
> >> > > >
> >> > > > Following are the fields and their types in my schema.xml.
> >> > > >
> >> > > >  >> sortMissingLast="true"
> >> > > > stored="false" omitNorms="true"/>
> >> > > >  >> sortMissingLast="true"
> >> > > > stored="false" indexed="true" docValues="true"/>
> >> > > >
> >> > > > 
> >> > > > 
> >> > > >
> >> > > > As per my understanding it seems to be the issue of tie among the
> >> > > document
> >> > > > as when I added a new sort field like below the result never changed
> >> > > across
> >> > > > multiple hits.
> >> > > > q=network=50=50=f_sort asc, score
> >> > > > asc=true=id
> >> > > >
> >> > > > Kindly let me know if this is an issue or how this can be fixed.
> >> > > >
> >> > > > Thanks,
> >> > > > Modassar
> >> > >
> >>
> >
> >


Re: Detect term occurrences

2015-09-10 Thread Walter Underwood
Doing a query for each term should work well. Solr is fast for queries. Write a 
script.

I assume you only need to do this once. Running all the queries will probably 
take less time than figuring out a different approach.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


On Sep 10, 2015, at 7:37 AM, Markus Jelsma  wrote:

> If you are interested in just the number of occurrences of an indexed term,
> the TermsComponent will give that answer.
> Markus 
> 
> -Original message-
>> From:Francisco Andrés Fernández 
>> Sent: Thursday 10th September 2015 15:58
>> To: solr-user@lucene.apache.org
>> Subject: Detect term occurrences
>> 
>> Hi all, I'm new to Solr.
>> I want to detect all occurrences of terms from a thesaurus in 1 or
>> more documents.
>> What's the best strategy for doing that?
>> Doing a query for each term doesn't seem to be the best way.
>> Many thanks,
>> 
>> Francisco
>> 
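A minimal sketch of such a script using SolrJ (the core URL, field name, and
thesaurus entries are placeholders; it reports how many documents match each
term):

    import java.util.Arrays;
    import java.util.List;

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class ThesaurusTermCounts {
        public static void main(String[] args) throws Exception {
            // Placeholder core URL and field name -- adjust to your setup.
            SolrClient solr = new HttpSolrClient("http://localhost:8983/solr/collection1");
            List<String> thesaurus = Arrays.asList("apple", "banana", "apple banana");
            for (String term : thesaurus) {
                // Phrase-quote each entry so multi-word terms match as phrases.
                SolrQuery q = new SolrQuery("content:\"" + term + "\"");
                q.setRows(0); // we only need the count, not the documents
                QueryResponse rsp = solr.query(q);
                System.out.println(term + " -> " + rsp.getResults().getNumFound() + " documents");
            }
            solr.close();
        }
    }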



RE: Detect term occurrences

2015-09-10 Thread Markus Jelsma
If you are interested in just the number of occurrences of an indexed term,
the TermsComponent will give that answer.
Markus 
 
-Original message-
> From:Francisco Andrés Fernández 
> Sent: Thursday 10th September 2015 15:58
> To: solr-user@lucene.apache.org
> Subject: Detect term occurrences
> 
> Hi all, I'm new to Solr.
> I want to detect all occurrences of terms from a thesaurus in 1 or
> more documents.
> What's the best strategy for doing that?
> Doing a query for each term doesn't seem to be the best way.
> Many thanks,
> 
> Francisco
> 


Re: Can solr ttf functionQuery support ngram (n>2) ?

2015-09-10 Thread Jie Gao
Please ignore this question.

There is no problem with the ttf function query after all.

My manual check of the tf result was wrong. The row size should be
set to more than the default (10) for the phrase query
"http://localhost:8983/solr/collection1/select?q=content:%22apple%20banana%22&rows=100".

Thanks,
Jerry

Jie Gao,
Research Assistant,
Department of Computer Science, The University of Sheffield,
Regent Court, 211 Portobello, S1 4DP, Sheffield, UK

On 10 September 2015 at 10:27, Jie Gao  wrote:

> I've fixed a typo in the query URL below.
>
> On 10 September 2015 at 10:25, Jie Gao  wrote:
>
>> Hi,
>>
>> I'm wondering whether the Solr ttf function query supports (compound-word)
>> ngrams (n>2)?
>>
>> I'm using
>> "http://localhost:8983/solr/collection1/select?q=*:*&fl=ttf(content,%22apple%20banana%22)&rows=1"
>> to query the total term frequency of bigram tokens in the "content" field
>> in the whole index.
>>
>> However, the result (returned as 20) is not consistent with the result
>> queried via
>> http://localhost:8983/solr/collection1/select?q=content:%22apple%20banana%22.
>> I manually checked that the actual occurrence count is 15.
>>
>> What is the actual behaviour of the ttf function query (I'm using Solr
>> 5.3.0)? The reference guide does not explain the details.
>>
>> Does it perform a full-text index query on this field, or does it rely on
>> the tf values stored by the tvComponent?
>>
>> I have configured the content field with the following textField type:
>>
>> <fieldType name="..." class="solr.TextField" positionIncrementGap="100">
>>   <analyzer type="index">
>>     <tokenizer class="..."/>
>>     <filter class="solr.StopFilterFactory" words="stopwords.txt"
>>             enablePositionIncrements="true" />
>>     <filter class="solr.ShingleFilterFactory" maxShingleSize="5"
>>             outputUnigrams="true" outputUnigramsIfNoShingles="false"
>>             tokenSeparator=" "/>
>>   </analyzer>
>>   <analyzer type="query">
>>     <tokenizer class="..."/>
>>     <filter class="solr.StopFilterFactory" words="stopwords.txt"
>>             enablePositionIncrements="true" />
>>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>>             ignoreCase="true" expand="true" />
>>   </analyzer>
>> </fieldType>
>>
>> Any ideas?
>>
>> Thanks,
>> Jerry
>>
>
>
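Two details are worth separating here. rows only limits how many documents
are returned; numFound in the response header is the full match count, so a
document count can be read directly with rows=0. And ttf is the total number
of occurrences of the indexed term across the whole index, so a document
containing the shingle twice contributes 2, and ttf can legitimately exceed
the matching-document count. A minimal check, reusing the core from this
thread:

    http://localhost:8983/solr/collection1/select?q=content:%22apple%20banana%22&rows=0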


Debugging Angular JS Application

2015-09-10 Thread Esther-Melaine Quansah
Hi, 

Is there a way for me to debug and modify Angular JS code in the Solr Admin UI 
without needing to completely rebuild the server and clear the browser cache?

Thanks,
Esther

Re: Detect term occurrences

2015-09-10 Thread Alexandre Rafalovitch
Can you tell us a bit more about the business case? Not the current
technical one. It is entirely possible Solr can solve the
higher-level problem out of the box without you doing manual term
comparisons, in which case your problem scope is not quite right.

Regards,
   Alex.

Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 10 September 2015 at 09:58, Francisco Andrés Fernández
 wrote:
> Hi all, I'm new to Solr.
> I want to detect all ocurrences of terms existing in a thesaurus into 1 or
> more documents.
> What´s the best strategy to make it?
> Doing a query for each term doesn't seem to be the best way.
> Many thanks,
>
> Francisco


Re: ghostly config issues

2015-09-10 Thread Mark Fenbers

On 9/7/2015 4:52 PM, Shawn Heisey wrote:


The only files that should be in server/lib is jetty and servlet jars.
The only files that should be in server/lib/ext is logging jars (slf4j,
log4j, etc).

In the server/lib directory on Solr 5.3.0:

ext/
javax.servlet-api-3.1.0.jar
jetty-continuation-9.2.11.v20150529.jar
jetty-deploy-9.2.11.v20150529.jar
jetty-http-9.2.11.v20150529.jar
jetty-io-9.2.11.v20150529.jar
jetty-jmx-9.2.11.v20150529.jar
jetty-rewrite-9.2.11.v20150529.jar
jetty-security-9.2.11.v20150529.jar
jetty-server-9.2.11.v20150529.jar
jetty-servlet-9.2.11.v20150529.jar
jetty-servlets-9.2.11.v20150529.jar
jetty-util-9.2.11.v20150529.jar
jetty-webapp-9.2.11.v20150529.jar
jetty-xml-9.2.11.v20150529.jar

In the server/lib/ext directory on Solr 5.3.0:

jcl-over-slf4j-1.7.7.jar
jul-to-slf4j-1.7.7.jar
log4j-1.2.17.jar
slf4j-api-1.7.7.jar
slf4j-log4j12-1.7.7.jar


Excellent!!  Based on this info, I decided to blow away the Solr 
installation and reinstall from the tarball file.  After "tar -xzvf", I 
created a "lib" subdir under /localapps/dev/EventLog and copied my 
postgres jar and the dist/dataImportHandler jar into the "lib".  I 
restarted Solr and "Voila!"  All works as designed!  It even indexed my 
entire database on the first try of a full-import! Woohooo!


Thanks for your help.  I would have abandoned this project without your 
persistence.


Mark
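For reference, the resulting core layout is roughly the following (the jar
names are illustrative; use whatever PostgreSQL JDBC driver you have and the
solr-dataimporthandler jar from dist/ that matches your Solr version):

    /localapps/dev/EventLog/
        conf/
        data/
        lib/
            postgresql-<version>.jar
            solr-dataimporthandler-5.3.0.jar

Jars placed in a core's lib/ directory are loaded automatically when the core
starts, with no extra <lib> directives needed in solrconfig.xml.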


Detect term occurrences

2015-09-10 Thread Francisco Andrés Fernández
Hi all, I'm new to Solr.
I want to detect all ocurrences of terms existing in a thesaurus into 1 or
more documents.
What´s the best strategy to make it?
Doing a query for each term doesn't seem to be the best way.
Many thanks,

Francisco


Re: Search results differs with sorting on pagination.

2015-09-10 Thread Erick Erickson
First, if Upayavira's intuition is correct (and I'm guessing it is),
then the behavior you're seeing is probably an accident of
coding rather than intentional. I think the algorithm is something
like this:

Node1 gets the original query
Node1 sends sub-queries out to each shard.
As the results come back, they're sorted one by one into a final
list.

For simplicity, let's claim _all_ the docs have the exact same score.
The _first_
shard's response will completely fill up the final list. The rest will
be thrown on
the floor as none of the docs from the other 6 shards will have a
higher score than
any doc currently in the list.

Here's the important part. The order that the sub-requests come back varies
due to a zillion possible causes, network latency, a minor GC pause on one
of the shards, whether all the caches are loaded, whatever. So subsequent
calls will happen to get some _other_ shards docs in the list first.

Does that make sense?

On Thu, Sep 10, 2015 at 4:48 AM, Modassar Ather  wrote:
> If two documents come back from different
> shards with the same score, the order would not be predictable
>
> This is fine.
>
> What I am not able to understand is that when I do not give a secondary
> sort field I am getting the result from one shard, which changes to
> another shard on other hits. Here the results are always from one shard;
> e.g. in the first hit all the results are from shard1, and in the next hit
> all the results are from shard2.
>
> But when I add the secondary sort field I see results from multiple
> shards, e.g. from both shard1 and shard2. This does not change across
> multiple hits.
>
> So please help me understand why the same result merge and aggregation
> is not happening when a single sort field is given?
>
> Regards,
> Modassar
>
>
>
> On Thu, Sep 10, 2015 at 5:03 PM, Upayavira  wrote:
>
>> What scores are you getting? If two documents come back from different
>> shards with the same score, the order would not be predictable -
>> probably down to which shard responds first.
>>
>> Fix it with something like sort=score,timestamp or some other time
>> related field.
>>
>> Upayavira
>>
>> On Thu, Sep 10, 2015, at 11:01 AM, Modassar Ather wrote:
>> > To add to my previous observation I saw the response having results from
>> > multiple shards when the secondary sort field is added and they remain
>> > same
>> > across hits.
>> > Kindly help me understand this behavior. Why the results are changing as
>> > I
>> > understand that the result should be first clubbed together from all
>> > shard
>> > and then based on their score it should be sorted.
>> > But here I see that every time I hit the sort query I am getting results
>> > from different shard which has different scores.
>> >
>> > Thanks,
>> > Modassar
>> >
>> > On Thu, Sep 10, 2015 at 2:59 PM, Modassar Ather 
>> > wrote:
>> >
>> > > Upayavira! I add the fl=id,score,[shard] and saw the shards changing in
>> > > the response every time and for different shards the response changes
>> but
>> > > for the same shard result is same on multiple hits.
>> > > When I add secondary sort field e.g. score the shard remains same
>> across
>> > > hits.
>> > >
>> > > On Thu, Sep 10, 2015 at 12:52 PM, Upayavira  wrote:
>> > >
>> > >> Add fl=id,score,[shard] to your query, and show us the results of two
>> > >> differing executions.
>> > >>
>> > >> Perhaps we will be able to see the cause of the difference.
>> > >>
>> > >> Upayavira
>> > >>
>> > >> On Thu, Sep 10, 2015, at 05:35 AM, Modassar Ather wrote:
>> > >> > Thanks Erick. There are no replicas on my cluster and the indexing
>> is
>> > >> one
>> > >> > time. No updates or additions are done to the index and the
>> segments are
>> > >> > optimized at the end of indexing.
>> > >> > So adding a secondary sort criteria is the only solution for such
>> issue
>> > >> > in
>> > >> > sort?
>> > >> >
>> > >> > Regards,
>> > >> > Modassar
>> > >> >
>> > >> > On Wed, Sep 9, 2015 at 8:21 PM, Erick Erickson <
>> erickerick...@gmail.com
>> > >> >
>> > >> > wrote:
>> > >> >
>> > >> > > When the primary sort criteria is identical for two documents,
>> > >> > > then the _internal_ Lucene document ID is used to break the
>> > >> > > tie. The internal ID for two docs can be not only different, but
>> > >> > > in different _order_ on two separate shards. I'm assuming here
>> > >> > > that  each of your shards has multiple replicas and/or you're
>> > >> > > continuing to index to your cluster.
>> > >> > >
>> > >> > > The relative internal doc IDs may change even relative to
>> > >> > > each other when segments get merged.
>> > >> > >
>> > >> > > So yes, if you are sorting by something that can be identical
>> > >> > > in documents, it's always best to specify a secondary sort
>> > >> > > criteria. It's not referenced unless there's a tie so it's
>> > >> > > not that expensive. People often use whatever field
>> > >> > 
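A deterministic version of the query discussed in this thread, then, just
adds the uniqueKey field as a tie-breaker (f_sort and id are the field names
used above):

    sort=f_sort asc, id asc

Since id is unique, no two documents ever tie, so the merged order no longer
depends on which shard answers first.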

Re: How to reordering search result by some function query

2015-09-10 Thread Aman Tandon
>
> boost=product_guideline_score

Thank you, Upayavira.

Leonardo, thanks for the suggestion, but I think the boost parameter will
work great for us. Thank you so much for your help.

With Regards
Aman Tandon

On Thu, Sep 10, 2015 at 5:11 PM, Upayavira  wrote:

> Aman,
>
> If you are using edismax then what you have written is just fine.
>
> For Lucene query parser queries, wrap them with the boost query parser:
>
> q={!boost b=product_guideline_score v=$qq}&qq=jute
>
> Note in your example you don't need product(), just do
> boost=product_guideline_score
>
> Upayavira
>
> On Thu, Sep 10, 2015, at 07:33 AM, Aman Tandon wrote:
> > Hi,
> >
> > I figured out how to implement this. I will be doing it by using the
> > boost parameter,
> >
> > e.g. http://server:8112/solr/products/select?q=jute&qf=title&boost=product(1,product_guideline_score)
> >
> > If there is any other alternative then please suggest.
> >
> > With Regards
> > Aman Tandon
> >
> > On Thu, Sep 10, 2015 at 11:02 AM, Aman Tandon 
> > wrote:
> >
> > > Hi,
> > >
> > > I have a requirement to reorder the search results by multiplying the
> > > *text relevance score* of a product with the *product_guideline_score*,
> > > which will be stored in the index and will hold some floating-point
> > > number.
> > >
> > > e.g. On searching the *jute* in title if we got some results ID1 & ID2
> > >
> > > ID1 -> title = jute
> > >   score = 8.0
> > > *  product_guideline_score = 2.0*
> > >
> > > ID2 -> title = jute bags
> > >   score = 7.5
> > > *  product_guideline_score** = 2.2*
> > >
> > > So the new score should be like this
> > >
> > > ID1 -> title = jute
> > >   score = *product_score * 8 = 16.0*
> > > *  product_guideline_score** = 2.0*
> > >
> > > ID2 -> title = jute bags
> > >   score = *product_score * 7.5 = 16.5*
> > > *  product_guideline_score** = 2.2*
> > >
> > > *So new ordering should be*
> > >
> > > ID2 -> title = jute bags
> > >   score* = 16.5*
> > >
> > > ID1 -> title = jute
> > >   score =* 16.0*
> > >
> > > How can I do this in single query on runtime in solr.
> > >
> > > With Regards
> > > Aman Tandon
> > >
>


Re: Stemmer and stopword Development

2015-09-10 Thread Imtiaz Shakil Siddique
Hi Upayavira,

Thank you for your kind assistance Sir.
If that is the requirement for stemming then I will do it.

My next question is: how can I build a stopword list for the Bengali language?
The options that I've thought about are:

1. Calculate idf values for all the stemmed words inside the 20GB of crawled
data.
2. Find the words that have a very low inverse document frequency (i.e. that
occur in almost every document) and mark them as stopwords.

If you have any better solution then please help!
Thank you Sir,
Imtiaz Shakil Siddique


On 10 September 2015 at 17:38, Upayavira  wrote:

> I haven't heard of any machine learning based stemmers. I'm not really
> sure what algorithm you would use to do stemming - what you'd be looking
> for is something that says, well, running stemmed to run, walking
> stemmed to walk, therefore hopping should stem to hop, but that'd be
> quite an algorithm to develop, I'd say.
>
> There are a few ways you could handle this:
>
> 1) locate a Bengali linguist who can help you define an algorithm
> 2) manually stem a large number of documents and use that as a basis
>    for stemming
>
> If you had a stemmed corpus, you could simply use synonyms to do it, in
> English, you could map:
>
> run,running,runs,ran,runner=>run
> walk,walked,walking,walker=>walk
>
> Then all you need to do is generate a synonym file and use the
> SynonymFilterFactory with it, in place of a stemmer.
>
> Would that work?
>
> Upayavira
>
> On Thu, Sep 10, 2015, at 09:59 AM, Imtiaz Shakil Siddique wrote:
> > Thanks for the reply.
> >
> > Currently I have 20GB of Bengali newspaper data (for corpus building).
> > I don't have a manually stemmed corpus, but if needed I will build one.
> >
> > Basically I need guidance on how to do this.
> > If there are standard approaches to building a stemmer and stopword list
> > for use with Solr, then please share them.
> >
> > Thank you Upayavira for your kind help.
> >
> > Imtiaz Shakil Siddique
> >
> >
> > On 10 September 2015 at 13:23, Upayavira  wrote:
> >
> > >
> > >
> > > On Thu, Sep 10, 2015, at 04:45 AM, Imtiaz Shakil Siddique wrote:
> > > > Hi,
> > > >
> > > > I am trying to develop a stemmer and stopword list for the Bengali
> > > > language, which is not shipped with Solr.
> > > >
> > > > I am trying to do this with a machine learning approach, but I
> > > > couldn't find any good documents to study. It would be very helpful
> > > > if you could shed some light on this matter.
> > >
> > > How are you going to do this with machine learning? What corpus are you
> > > going to use to learn from? Do you have some documents that have been
> > > manually stemmed for which you also have the originals?
> > >
> > > Upayavira
> > >
>
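One low-effort way to get those frequencies straight from a Solr index is the
TermsComponent, assuming the crawled data is indexed into a field named
content and a /terms handler is configured (as in the example solrconfig.xml):

    http://localhost:8983/solr/collection1/terms?terms.fl=content&terms.sort=count&terms.limit=200

Each term is returned with the number of documents it occurs in, so terms
whose counts approach the collection size are the strongest stopword
candidates.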


Re: Debugging Angular JS Application

2015-09-10 Thread Erik Hatcher
With the exploded structure, maybe we can move the webapp source underneath 
server/solr-webapp (and let the build just fill in the binary Java stuff, and 
avoid overwriting anything).  Then we can keep the source in the same place as 
the “dist”, keeping it nice and DRY and easily 
debuggable/refreshable-without-a-build?   Would that work?

Erik




> On Sep 10, 2015, at 11:36 AM, Shawn Heisey  wrote:
> 
> On 9/10/2015 9:03 AM, Esther-Melaine Quansah wrote:
>> Is there a way for me to debug and modify Angular JS code in the Solr Admin 
>> UI without needing to completely rebuild the server and clearing browser 
>> cache?
> 
> I'm not sure about browser caching.  That might be a problem, but if it
> is, it's going to be a problem regardless of Solr version.  In Firefox,
> holding down the shift key while you click the reload button (or
> pressing Shift-F5 in Windows) should invalidate the cache for the page
> you're on, so you don't need to dive into Options to clear the cache. 
> Other browsers should have something similar, but it might not be
> exactly the same key(s).
> 
> In Solr 5.2.x, the code you're talking about is in the war file, which
> gets extracted to server/solr-webapp/webapp.  You can modify the
> extracted contents.  I think that if that directory already exists on
> startup, jetty skips the extraction, so your changes might survive a
> restart ... but I am not positive, you'd have to test.
> 
> Starting with Solr 5.3.x, the war file is gone.  The application is
> pre-extracted in server/solr-webapp/webapp.  Any changes you make to the
> application will definitely survive a Solr restart.
> 
> So if you just change the running copy of the server that you downloaded
> or built, you can modify the Angular code and test the changes.  If it's
> 5.3, I *know* that your changes won't be overwritten when you restart
> Solr, and if it's 5.2, I *think* they won't be overwritten.
> 
> Thanks,
> Shawn
> 



Re: error while running query on solr slave

2015-09-10 Thread shahper

Sorry for the late reply.

I am facing one more issue now.

1. When I shut down my master and start working with my slave, I am
not able to fetch any data. The data folder in my core is the same as on
the master, but I am still not able to get any data when I run any
query:


"error":{
"msg":"undefined field ENTITYTYPE",
"code":400}}




2. I came across this issue while testing on my master: when I update 
any entity in my database, it does not get updated in my indexes.



Solr version: 5.2.1
Running with Jetty.


On Wednesday 09 September 2015 08:15 PM, Erick Erickson wrote:

Please review:
http://wiki.apache.org/solr/UsingMailingLists

You've essentially said "it doesn't work". There's not enough
information to say _anything_ intelligent.

How does it fail? Any messages in the log file? What is
  the query you're sending? Does the slave start up without
error?

Best,
Erick

On Wed, Sep 9, 2015 at 3:44 AM, shahper  wrote:

Hi ,

I have set up master-slave Solr version 5.2.1.

I have done indexing on master .

And replication is done.

When I try to run any query on the slave it shows me an error; it's not
running.


Shahper

--
Shahper Jamil

System Administrator

Tel: +91 124 4548383 Ext- 1033
UK: +44 845 0047 142 Ext- 5133

Techblue Software Pvt. Ltd
The Palms, Plot No 73, Sector 5, IMT Manesar,
Gurgaon- 122050 (Hr.)

www.techbluesoftware.co.in 





Re: Debugging Angular JS Application

2015-09-10 Thread Shawn Heisey
On 9/10/2015 9:03 AM, Esther-Melaine Quansah wrote:
> Is there a way for me to debug and modify Angular JS code in the Solr Admin 
> UI without needing to completely rebuild the server and clearing browser 
> cache?

I'm not sure about browser caching.  That might be a problem, but if it
is, it's going to be a problem regardless of Solr version.  In Firefox,
holding down the shift key while you click the reload button (or
pressing Shift-F5 in Windows) should invalidate the cache for the page
you're on, so you don't need to dive into Options to clear the cache. 
Other browsers should have something similar, but it might not be
exactly the same key(s).

In Solr 5.2.x, the code you're talking about is in the war file, which
gets extracted to server/solr-webapp/webapp.  You can modify the
extracted contents.  I think that if that directory already exists on
startup, jetty skips the extraction, so your changes might survive a
restart ... but I am not positive, you'd have to test.

Starting with Solr 5.3.x, the war file is gone.  The application is
pre-extracted in server/solr-webapp/webapp.  Any changes you make to the
application will definitely survive a Solr restart.

So if you just change the running copy of the server that you downloaded
or built, you can modify the Angular code and test the changes.  If it's
5.3, I *know* that your changes won't be overwritten when you restart
Solr, and if it's 5.2, I *think* they won't be overwritten.

Thanks,
Shawn



Re: How to secure Admin UI with Basic Auth in Solr 5.3.x

2015-09-10 Thread Imtiaz Shakil Siddique
If you are using a Linux server you can always use iptables to restrict access
to the Solr admin panel.
On Sep 9, 2015 3:05 PM, "Merlin Morgenstern" 
wrote:

> I just installed Solr Cloud 5.3.x and found that the way to secure the
> admin UI has changed. Apparently there is a new plugin which does role-based
> authentication, and all the info on how to secure the admin UI found on the
> net is outdated.
>
> I do not need role-based authentication but simply want to put basic
> authentication on the admin UI.
>
> How do I configure Solr Cloud 5.3.x in order to restrict access to the
> admin UI via basic authentication?
>
> Thank you for any help
>
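For completeness, a minimal iptables sketch of that approach (8983 is Solr's
default port; the allowed subnet is only an example):

    # allow the trusted network to reach Solr, drop everyone else
    iptables -A INPUT -p tcp --dport 8983 -s 192.168.1.0/24 -j ACCEPT
    iptables -A INPUT -p tcp --dport 8983 -j DROP

Note that this restricts who can reach Solr at all; it complements rather
than replaces authentication.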


Re: Debugging Angular JS Application

2015-09-10 Thread Upayavira
That would be fantastic, Erik. I've got a somewhat complex setup where I
rsync between folders. Being able to serve directly from the SVN
location would be very handy.

Upayavira

On Thu, Sep 10, 2015, at 04:58 PM, Erik Hatcher wrote:
> With the exploded structure, maybe we can move the webapp source
> underneath server/solr-webapp (and let the build just fill in the binary
> Java stuff, and avoid overwriting anything).  Then we can keep the source
> in the same place as the “dist”, keeping it nice and DRY and easily
> debuggable/refreshable-without-a-build?   Would that work?
> 
>   Erik
> 
> 
> 
> 
> > On Sep 10, 2015, at 11:36 AM, Shawn Heisey  wrote:
> > 
> > On 9/10/2015 9:03 AM, Esther-Melaine Quansah wrote:
> >> Is there a way for me to debug and modify Angular JS code in the Solr 
> >> Admin UI without needing to completely rebuild the server and clearing 
> >> browser cache?
> > 
> > I'm not sure about browser caching.  That might be a problem, but if it
> > is, it's going to be a problem regardless of Solr version.  In Firefox,
> > holding down the shift key while you click the reload button (or
> > pressing Shift-F5 in Windows) should invalidate the cache for the page
> > you're on, so you don't need to dive into Options to clear the cache. 
> > Other browsers should have something similar, but it might not be
> > exactly the same key(s).
> > 
> > In Solr 5.2.x, the code you're talking about is in the war file, which
> > gets extracted to server/solr-webapp/webapp.  You can modify the
> > extracted contents.  I think that if that directory already exists on
> > startup, jetty skips the extraction, so your changes might survive a
> > restart ... but I am not positive, you'd have to test.
> > 
> > Starting with Solr 5.3.x, the war file is gone.  The application is
> > pre-extracted in server/solr-webapp/webapp.  Any changes you make to the
> > application will definitely survive a Solr restart.
> > 
> > So if you just change the running copy of the server that you downloaded
> > or built, you can modify the Angular code and test the changes.  If it's
> > 5.3, I *know* that your changes won't be overwritten when you restart
> > Solr, and if it's 5.2, I *think* they won't be overwritten.
> > 
> > Thanks,
> > Shawn
> > 
> 


Re: Debugging Angular JS Application

2015-09-10 Thread Upayavira


On Thu, Sep 10, 2015, at 04:03 PM, Esther-Melaine Quansah wrote:
> Hi, 
> 
> Is there a way for me to debug and modify Angular JS code in the Solr
> Admin UI without needing to completely rebuild the server and clearing
> browser cache?

I just edit the files in server/solr-webapp/webapp, and refresh my
browser when I edit them. I've never had any issue with that, doing most
of my development in Chrome, because I find its dev tools to be better.

I then rsync those files into webapp/web in order to commit them.

Upayavira


Re: How to secure Admin UI with Basic Auth in Solr 5.3.x

2015-09-10 Thread Noble Paul
Check this https://cwiki.apache.org/confluence/display/solr/Securing+Solr

There are a couple of bugs in 5.3.0, and a bug fix release is coming up
over the next few days.

We don't provide any specific means to restrict access to the admin UI
itself. However, we let users specify fine-grained ACLs on various
operations such as collection-admin-edit, read, etc.

On Wed, Sep 9, 2015 at 2:35 PM, Merlin Morgenstern
 wrote:
> I just installed solr cloud 5.3.x and found that the way to secure the
> admin UI has changed. Apparently there is a new plugin which does
> role-based authentication, and all info on how to secure the admin UI
> found on the net is outdated.
>
> I do not need role-based authentication but simply want to put basic
> authentication on the Admin UI.
>
> How do I configure solr cloud 5.3.x in order to restrict access to the
> Admin UI via Basic Authentication?
>
> Thank you for any help



-- 
-
Noble Paul
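
For reference, a minimal security.json for basic authentication, along the
lines of the example on the Securing Solr page, looks like the following
(the user "solr" and the credential hash are the well-known example values
from that page, not anything site-specific):

{
  "authentication": {
    "class": "solr.BasicAuthPlugin",
    "credentials": {
      "solr": "IV0EHq1OnNrj6gvRCwvFwTrZ1+z1oBbnQdiVC3otuq0= Ndd7LKvVBAaZIF0QAVi1ekCfAJXr1GGfLtRUXhgrF8c="
    }
  },
  "authorization": {
    "class": "solr.RuleBasedAuthorizationPlugin",
    "user-role": { "solr": "admin" },
    "permissions": [ { "name": "security-edit", "role": "admin" } ]
  }
}

It can be uploaded to ZooKeeper with zkcli, e.g.:

server/scripts/cloud-scripts/zkcli.sh -zkhost localhost:9983 \
  -cmd putfile /security.json /path/to/security.json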


Issue while adding Long.MAX_VALUE to a TrieLong field

2015-09-10 Thread Pushkar Raste
Hi,
I am trying to add the following document (the value for price.long is
Long.MAX_VALUE):

<add>
  <doc>
    <field name="id">411</field>
    <field name="name">one</field>
    <field name="price.long">9223372036854775807</field>
  </doc>
</add>

However, upon querying my collection, the value I get back for
"price.long" is 9223372036854776000.

Definitions for the 'price.long' field and the 'long' type look like the
following:

<field name="price.long" type="long" indexed="true" stored="true"/>
<fieldType name="long" class="solr.TrieLongField" precisionStep="0"
    positionIncrementGap="0"/>

My test case shows that the max value Solr can store without losing
precision is 18014398509481982. This is equivalent to
'(Long.MAX_VALUE >> 9) - 1' (not really sure if this computation really
means something).


Can someone help me understand why TrieLong can't accept values >
18014398509481982?


Re: Using join with edismax

2015-09-10 Thread Upayavira


On Thu, Sep 10, 2015, at 10:51 PM, Steven White wrote:
> Hi everyone,
> 
> Does anyone know if "join" across cores is supported with edismax?

Why wouldn't it be?

To unpack the question more though, edismax is a query parser, join is a
query parser. You can certainly have an edismax query in the main query
and a join in a filter. You could combine multiple queries to have an
edismax clause and a join clause.

Depends on what you're trying to do but as far as you have phrased the
question, I don't see any issues.

Upayavira
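
To make that concrete, here is a small SolrJ sketch (core, field, and join
relations are made up for illustration) that combines an edismax main
query with a cross-core join in a filter query:

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class EdismaxJoinSketch {
  public static void main(String[] args) throws Exception {
    SolrClient client = new HttpSolrClient("http://localhost:8983/solr/products");
    SolrQuery q = new SolrQuery("jute bags");  // main query, parsed by edismax
    q.set("defType", "edismax");
    q.set("qf", "title description");
    // cross-core join as a filter: keep products whose supplier_id matches
    // the id of an active supplier document in the "suppliers" core
    q.addFilterQuery("{!join from=id to=supplier_id fromIndex=suppliers}status:active");
    QueryResponse rsp = client.query(q);
    System.out.println("hits: " + rsp.getResults().getNumFound());
    client.close();
  }
}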


Re: Issue while adding Long.MAX_VALUE to a TrieLong field

2015-09-10 Thread Pushkar Raste
Thank you Yonik, looks like I missed your previous reply. This seems
logical, as the max safe integer in JavaScript is (2^53 - 1), which is the
max value I can insert and validate through the Admin UI. Never thought
the Admin UI itself would trick me, though.


On Thu, Sep 10, 2015 at 6:01 PM, Yonik Seeley  wrote:

> On Thu, Sep 10, 2015 at 5:43 PM, Pushkar Raste 
> wrote:
>
> Did you see my previous response to you today?
> http://markmail.org/message/wt6db4ocqmty5a42
>
> Try querying a different way, like from the command line using curl,
> or from your browser, but not through the solr admin.
>
> [...]
> > My test case shows that the max value Solr can store without losing
> > precision is 18014398509481982. This is equivalent to '2 * (2^53 - 1)'
> > (not really sure if this computation really means something).
>
> 53 happens to be the effective number of mantissa bits in a 64 bit
> double precision floating point ;-)
>
> -Yonik
>


Issue while adding Long.MAX_VALUE to a TrieLong field

2015-09-10 Thread Pushkar Raste
I am trying to add the following document (the value for price.long is
Long.MAX_VALUE):

<add>
  <doc>
    <field name="id">411</field>
    <field name="name">one</field>
    <field name="price.long">9223372036854775807</field>
  </doc>
</add>

However, upon querying my collection, the value I get back for
"price.long" is 9223372036854776000.
(I got the same behavior when I used a JSON file.)

Definitions for the 'price.long' field and the 'long' type look like the
following:

<field name="price.long" type="long" indexed="true" stored="true"/>
<fieldType name="long" class="solr.TrieLongField" precisionStep="0"
    positionIncrementGap="0"/>

My test case shows that the max value Solr can store without losing
precision is 18014398509481982. This is equivalent to '2 * (2^53 - 1)'
(not really sure if this computation really means something).

I wrote a test using SolrTestHarness and it successfully saved the value
9223372036854775807 to Solr.

Can someone help me understand why TrieLong can't accept values >
18014398509481982 when I try to use an XML/JSON file to add a document?


Re: Debugging Angular JS Application

2015-09-10 Thread Erik Hatcher
Upayavira, could you give this a try and see if this works (patch is for 
trunk): https://issues.apache.org/jira/browse/SOLR-8035 


And when do we make the Angular UI the default?   :)

Erik





> On Sep 10, 2015, at 12:26 PM, Upayavira  wrote:
> 
> That would be fantastic, Erik. I've got a somewhat complex setup where I
> rsync between folders. Being able to serve directly from the SVN
> location would be very handy.
> 
> Upayavira
> 
> On Thu, Sep 10, 2015, at 04:58 PM, Erik Hatcher wrote:
>> With the exploded structure, maybe we can move the webapp source
>> underneath server/solr-webapp (and let the build just fill in the binary
>> Java stuff, and avoid overwriting anything).  Then we can keep the source
>> in the same place as the “dist”, keeping it nice and DRY and easily
>> debuggable/refreshable-without-a-build?   Would that work?
>> 
>>  Erik
>> 
>> 
>> 
>> 
>>> On Sep 10, 2015, at 11:36 AM, Shawn Heisey  wrote:
>>> 
>>> On 9/10/2015 9:03 AM, Esther-Melaine Quansah wrote:
 Is there a way for me to debug and modify Angular JS code in the Solr 
 Admin UI without needing to completely rebuild the server and clearing 
 browser cache?
>>> 
>>> I'm not sure about browser caching.  That might be a problem, but if it
>>> is, it's going to be a problem regardless of Solr version.  In Firefox,
>>> holding down the shift key while you click the reload button (or
>>> pressing Shift-F5 in Windows) should invalidate the cache for the page
>>> you're on, so you don't need to dive into Options to clear the cache. 
>>> Other browsers should have something similar, but it might not be
>>> exactly the same key(s).
>>> 
>>> In Solr 5.2.x, the code you're talking about is in the war file, which
>>> gets extracted to server/solr-webapp/webapp.  You can modify the
>>> extracted contents.  I think that if that directory already exists on
>>> startup, jetty skips the extraction, so your changes might survive a
>>> restart ... but I am not positive, you'd have to test.
>>> 
>>> Starting with Solr 5.3.x, the war file is gone.  The application is
>>> pre-extracted in server/solr-webapp/webapp.  Any changes you make to the
>>> application will definitely survive a Solr restart.
>>> 
>>> So if you just change the running copy of the server that you downloaded
>>> or built, you can modify the Angular code and test the changes.  If it's
>>> 5.3, I *know* that your changes won't be overwritten when you restart
>>> Solr, and if it's 5.2, I *think* they won't be overwritten.
>>> 
>>> Thanks,
>>> Shawn
>>> 
>> 



Using join with edismax

2015-09-10 Thread Steven White
Hi everyone,

Does anyone know if "join" across cores is supported with edismax?

Thanks!!!

Steve,


Re: Issue while adding Long.MAX_VALUE to a TrieLong field

2015-09-10 Thread Yonik Seeley
On Thu, Sep 10, 2015 at 5:43 PM, Pushkar Raste  wrote:

Did you see my previous response to you today?
http://markmail.org/message/wt6db4ocqmty5a42

Try querying a different way, like from the command line using curl,
or from your browser, but not through the solr admin.

[...]
> My test case shows that the max value Solr can store without losing
> precision is 18014398509481982. This is equivalent to '2 * (2^53 - 1)'
> (not really sure if this computation really means something).

53 happens to be the effective number of mantissa bits in a 64 bit
double precision floating point ;-)

-Yonik


Re: Debugging Angular JS Application

2015-09-10 Thread Upayavira


On Thu, Sep 10, 2015, at 10:52 PM, Erik Hatcher wrote:
> Upayavira, could you give this a try and see if this works (patch is for
> trunk): https://issues.apache.org/jira/browse/SOLR-8035
> 

Will look :-)
 
> And when do we make the Angular UI the default?   :)

When people tell me it is looking good enough!!

I'm gonna commit some exception handling improvements shortly. I've also
got collections UI stuff looking good (create/delete collection,
create/delete alias - I just want to add add/delete replica then it is
done). With that, I'd be happy for the UI to be default in 5.4. I'd
also, when doing that, find a place in the UI to make a link to the old
UI, relatively prominently, so that if people find things wrong, then
they can still get stuff done.

Upayavira


Re: Detect term occurrences

2015-09-10 Thread Erick Erickson
_Assuming_ this isn't a high-throughput scenario _and_ the leaflet text isn't too big...

Index the thesaurus and fire all the terms of the query in a big OR
clause against the index as a _query_. Perhaps turn highlighting on
and highlight the entire leaflet text.

Note, this is just "off the top of my head"; I really haven't thought
it through too far, and a lot depends on how many leaflets you have to
process and how often.

Best,
Erick

On Thu, Sep 10, 2015 at 7:21 PM, Francisco Andrés Fernández
 wrote:
> Yes.
> I have many drug product leaflets, each corresponding to 1 product. On
> the other hand, we have a medical dictionary with about 10^5 terms.
> I want to detect all the occurrences of those terms in any leaflet
> document.
> Could you give me a clue about the best way to perform it?
> Perhaps, the best way is (as Walter suggests) to do all the queries every
> time, as needed.
> Regards,
>
> Francisco
>
> On Thu., Sept. 10, 2015 at 11:14 a.m., Alexandre Rafalovitch <
> arafa...@gmail.com> wrote:
>
>> Can you tell us a bit more about the business case? Not the current
>> technical one. Because it is entirely possible Solr can solve the
>> higher level problem out of the box without you doing manual term
>> comparisons.In which case, your problem scope is not quite right.
>>
>> Regards,
>>Alex.
>> 
>> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
>> http://www.solr-start.com/
>>
>>
>> On 10 September 2015 at 09:58, Francisco Andrés Fernández
>>  wrote:
>> > Hi all, I'm new to Solr.
>> > I want to detect all occurrences of terms existing in a thesaurus in 1
>> > or more documents.
>> > What's the best strategy to do it?
>> > Doing a query for each term doesn't seem to be the best way.
>> > Many thanks,
>> >
>> > Francisco
>>
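
A rough SolrJ sketch of that big-OR-plus-highlighting idea, assuming the
leaflets are indexed with their body in a field named "text" (the field
name, core name, and sample terms are all made up; with ~10^5 dictionary
terms you would send the terms in batches):

import java.util.Arrays;
import java.util.List;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class TermOccurrenceSketch {
  public static void main(String[] args) throws Exception {
    SolrClient client = new HttpSolrClient("http://localhost:8983/solr/leaflets");
    // one batch of dictionary terms, OR'ed into a single query
    List<String> terms = Arrays.asList("aspirin", "ibuprofen", "paracetamol");
    SolrQuery q = new SolrQuery("text:(" + String.join(" OR ", terms) + ")");
    q.setHighlight(true);         // highlight the occurrences in each leaflet
    q.set("hl.fl", "text");
    q.set("hl.snippets", "100");  // surface many occurrences per document
    QueryResponse rsp = client.query(q);
    rsp.getHighlighting().forEach((docId, fields) ->
        System.out.println(docId + " -> " + fields.get("text")));
    client.close();
  }
}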


Re: Issue Using Solr 5.3 Authentication and Authorization Plugins

2015-09-10 Thread Dan Davis
Kevin & Noble,

I've manually verified the fix for SOLR-8000, but not yet for SOLR-8004.

I reproduced the initial problem with reloading security.json after
restarting both Solr and ZooKeeper.   I verified using zkcli.sh that
ZooKeeper does retain the changes to the file after using
/solr/admin/authorization, and that therefore the problem was Solr.

After building solr-5.3.1-SNAPSHOT.tgz with ant package (because I don't
know how to give parameters to ant server), I expanded it, copied in the
core data, and then started it.   I was prompted for a password, and it let
me in once the password was given.

I'll probably get to SOLR-8004 shortly, since I have both environments
built and working.

It also occurs to me that it might be better to forbid all permissions and
grant specific permissions to specific roles.   Is there a comprehensive
list of the permissions available?


On Tue, Sep 8, 2015 at 1:07 PM, Kevin Lee  wrote:

> Thanks Dan!  Please let us know what you find.  I’m interested to know if
> this is an issue with anyone else’s setup or if I have an issue in my local
> configuration that is still preventing it to work on start/restart.
>
> - Kevin
>
> > On Sep 5, 2015, at 8:45 AM, Dan Davis  wrote:
> >
> > Kevin & Noble,
> >
> > I'll take it on to test this.   I've built from source before, and I've
> > wanted this authorization capability for awhile.
> >
> > On Fri, Sep 4, 2015 at 9:59 AM, Kevin Lee 
> wrote:
> >
> >> Noble,
> >>
> >> Does SOLR-8000 need to be re-opened?  Has anyone else been able to test
> >> the restart fix?
> >>
> >> At startup, these are the log messages that say there is no security
> >> configuration and the plugins aren’t being used even though
> security.json
> >> is in Zookeeper:
> >> 2015-09-04 08:06:21.205 INFO  (main) [   ] o.a.s.c.CoreContainer
> Security
> >> conf doesn't exist. Skipping setup for authorization module.
> >> 2015-09-04 08:06:21.205 INFO  (main) [   ] o.a.s.c.CoreContainer No
> >> authentication plugin used.
> >>
> >> Thanks,
> >> Kevin
> >>
> >>> On Sep 4, 2015, at 5:47 AM, Noble Paul  wrote:
> >>>
> >>> There are no download links for 5.3.x branch  till we do a bug fix
> >> release
> >>>
> >>> If you wish to download the trunk nightly (which is not same as 5.3.0)
> >>> check here
> >>
> https://builds.apache.org/job/Solr-Artifacts-trunk/lastSuccessfulBuild/artifact/solr/package/
> >>>
> >>> If you wish to get the binaries for 5.3 branch you will have to make it
> >>> (you will need to install svn and ant)
> >>>
> >>> Here are the steps
> >>>
> >>> svn checkout
> >> http://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_5_3/
> >>> cd lucene_solr_5_3/solr
> >>> ant server
> >>>
> >>>
> >>>
> >>> On Fri, Sep 4, 2015 at 4:11 PM, davidphilip cherian
> >>>  wrote:
>  Hi Kevin/Noble,
> 
>  What is the download link to take the latest? What are the steps to
> >> compile
>  it, test and use?
>  We also have a use case to have this feature in solr too. Therefore,
> >> wanted
>  to test and above info would help a lot to get started.
> 
>  Thanks.
> 
> 
>  On Fri, Sep 4, 2015 at 1:45 PM, Kevin Lee 
> >> wrote:
> 
> > Thanks, I downloaded the source and compiled it and replaced the jar
> >> file
> > in the dist and solr-webapp’s WEB-INF/lib directory.  It does seem to
> >> be
> > protecting the Collections API reload command now as long as I upload
> >> the
> > security.json after startup of the Solr instances.  If I shutdown and
> >> bring
> > the instances back up, the security is no longer in place and I have
> to
> > upload the security.json again for it to take effect.
> >
> > - Kevin
> >
> >> On Sep 3, 2015, at 10:29 PM, Noble Paul 
> wrote:
> >>
> >> Both these are committed. If you could test with the latest 5.3
> branch
> >> it would be helpful
> >>
> >> On Wed, Sep 2, 2015 at 5:11 PM, Noble Paul 
> >> wrote:
> >>> I opened a ticket for the same
> >>> https://issues.apache.org/jira/browse/SOLR-8004
> >>>
> >>> On Wed, Sep 2, 2015 at 1:36 PM, Kevin Lee
>  >>>
> > wrote:
>  I’ve found that completely exiting Chrome or Firefox and opening
> it
> > back up re-prompts for credentials when they are required.  It was
> > re-prompting with the /browse path where authentication was working
> >> each
> > time I completely exited and started the browser again, however it
> >> won’t
> > re-prompt unless you exit completely and close all running instances
> >> so I
> > closed all instances each time to test.
> 
>  However, to make sure I ran it via the command line via curl as
> > suggested and it still does not give any authentication error when
> >> trying
> 

Re: Search results differs with sorting on pagination.

2015-09-10 Thread Modassar Ather
Thanks Erick and Upayavira for the responses. One thing I noticed in the
context of a single sort field is that the scores differ in each shard's
response. No score is identical within the response of one shard, and they
differ in the responses from the other shards too. I got the scores using
fl=score.

Regards,
Modassar

On Thu, Sep 10, 2015 at 8:45 PM, Erick Erickson 
wrote:

> First, if Upayavira's intuition is correct (and I'm guessing it is),
> then the behavior you're seeing is probably an accident of
> coding rather than intentional. I think the algorithm is something
> like this:
>
> Node1 gets the original query
> Node1 sends sub-queries out to each shard.
> As the results come back, they're sorted one by one into a final
> list.
>
> For simplicity, let's claim _all_ the docs have the exact same score.
> The _first_
> shard's response will completely fill up the final list. The rest will
> be thrown on
> the floor as none of the docs from the other 6 shards will have a
> higher score than
> any doc currently in the list.
>
> Here's the important part. The order that the sub-requests come back varies
> due to a zillion possible causes, network latency, a minor GC pause on one
> of the shards, whether all the caches are loaded, whatever. So subsequent
> calls will happen to get some _other_ shards docs in the list first.
>
> Does that make sense?
>
> On Thu, Sep 10, 2015 at 4:48 AM, Modassar Ather 
> wrote:
> > If two documents come back from different
> > shards with the same score, the order would not be predictable
> >
> > This is fine.
> >
> > What I am not able to understand is that when I do not give a secondary
> > field for sort I am getting the result from one shard which changes to
> > other shard in other hits. Here the results are always from one shard.
> > E.g. in the first hit all the results are from shard1 and in the next
> > hit all the results are from shard2.
> >
> > But when I add the secondary sort field I see the results from multiple
> > shards. E.g It has results from shard1 and shard2 both. This does not
> > change in multiple hits.
> >
> > So please help me understand why the same result merge and aggregation
> > is not happening when a single sort field is given?
> >
> > Regards,
> > Modassar
> >
> >
> >
> > On Thu, Sep 10, 2015 at 5:03 PM, Upayavira  wrote:
> >
> >> What scores are you getting? If two documents come back from different
> >> shards with the same score, the order would not be predictable -
> >> probably down to which shard responds first.
> >>
> >> Fix it with something like sort=score,timestamp or some other time
> >> related field.
> >>
> >> Upayavira
> >>
> >> On Thu, Sep 10, 2015, at 11:01 AM, Modassar Ather wrote:
> >> > To add to my previous observation I saw the response having results
> from
> >> > multiple shards when the secondary sort field is added and they remain
> >> > same
> >> > across hits.
> >> > Kindly help me understand this behavior. Why the results are changing
> as
> >> > I
> >> > understand that the result should be first clubbed together from all
> >> > shard
> >> > and then based on their score it should be sorted.
> >> > But here I see that every time I hit the sort query I am getting
> results
> >> > from different shard which has different scores.
> >> >
> >> > Thanks,
> >> > Modassar
> >> >
> >> > On Thu, Sep 10, 2015 at 2:59 PM, Modassar Ather <
> modather1...@gmail.com>
> >> > wrote:
> >> >
> >> > > Upayavira! I add the fl=id,score,[shard] and saw the shards
> changing in
> >> > > the response every time and for different shards the response
> changes
> >> but
> >> > > for the same shard result is same on multiple hits.
> >> > > When I add secondary sort field e.g. score the shard remains same
> >> across
> >> > > hits.
> >> > >
> >> > > On Thu, Sep 10, 2015 at 12:52 PM, Upayavira  wrote:
> >> > >
> >> > >> Add fl=id,score,[shard] to your query, and show us the results of
> two
> >> > >> differing executions.
> >> > >>
> >> > >> Perhaps we will be able to see the cause of the difference.
> >> > >>
> >> > >> Upayavira
> >> > >>
> >> > >> On Thu, Sep 10, 2015, at 05:35 AM, Modassar Ather wrote:
> >> > >> > Thanks Erick. There are no replicas on my cluster and the
> indexing
> >> is
> >> > >> one
> >> > >> > time. No updates or additions are done to the index and the
> >> segments are
> >> > >> > optimized at the end of indexing.
> >> > >> > So adding a secondary sort criteria is the only solution for such
> >> issue
> >> > >> > in
> >> > >> > sort?
> >> > >> >
> >> > >> > Regards,
> >> > >> > Modassar
> >> > >> >
> >> > >> > On Wed, Sep 9, 2015 at 8:21 PM, Erick Erickson <
> >> erickerick...@gmail.com
> >> > >> >
> >> > >> > wrote:
> >> > >> >
> >> > >> > > When the primary sort criteria is identical for two documents,
> >> > >> > > then the _internal_ Lucene document ID is used to break the
> >> > >> > > tie. The internal ID for two docs can be not only 

Re: Detect term occurrences

2015-09-10 Thread Francisco Andrés Fernández
Yes.
I have many drug product leaflets, each corresponding to 1 product. On the
other hand, we have a medical dictionary with about 10^5 terms.
I want to detect all the occurrences of those terms in any leaflet
document.
Could you give me a clue about the best way to perform it?
Perhaps, the best way is (as Walter suggests) to do all the queries every
time, as needed.
Regards,

Francisco

On Thu., Sept. 10, 2015 at 11:14 a.m., Alexandre Rafalovitch <
arafa...@gmail.com> wrote:

> Can you tell us a bit more about the business case? Not the current
> technical one. Because it is entirely possible Solr can solve the
> higher level problem out of the box without you doing manual term
> comparisons.In which case, your problem scope is not quite right.
>
> Regards,
>Alex.
> 
> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
> http://www.solr-start.com/
>
>
> On 10 September 2015 at 09:58, Francisco Andrés Fernández
>  wrote:
> > Hi all, I'm new to Solr.
> > I want to detect all occurrences of terms existing in a thesaurus in 1
> > or more documents.
> > What's the best strategy to do it?
> > Doing a query for each term doesn't seem to be the best way.
> > Many thanks,
> >
> > Francisco
>


Re: Issue Using Solr 5.3 Authentication and Authorization Plugins

2015-09-10 Thread Dan Davis
SOLR-8004 also appears to work for me.   I manually edited security.json
and did putfile.   I didn't bother with the browse permission, because it
was Kevin's workaround.   solr-5.3.1-SNAPSHOT did challenge me for
credentials when running curl against
http://localhost:8983/solr/admin/collections?action=CREATE and so on...

On Thu, Sep 10, 2015 at 11:10 PM, Dan Davis  wrote:

> Kevin & Noble,
>
> I've manually verified the fix for SOLR-8000, but not yet for SOLR-8004.
>
> I reproduced the initial problem with reloading security.json after
> restarting both Solr and ZooKeeper.   I verified using zkcli.sh that
> ZooKeeper does retain the changes to the file after using
> /solr/admin/authorization, and that therefore the problem was Solr.
>
> After building solr-5.3.1-SNAPSHOT.tgz with ant package (because I don't
> know how to give parameters to ant server), I expanded it, copied in the
> core data, and then started it.   I was prompted for a password, and it let
> me in once the password was given.
>
> I'll probably get to SOLR-8004 shortly, since I have both environments
> built and working.
>
> It also occurs to me that it might be better to forbid all permissions and
> grant specific permissions to specific roles.   Is there a comprehensive
> list of the permissions available?
>
>
> On Tue, Sep 8, 2015 at 1:07 PM, Kevin Lee 
> wrote:
>
>> Thanks Dan!  Please let us know what you find.  I’m interested to know if
>> this is an issue with anyone else’s setup or if I have an issue in my local
>> configuration that is still preventing it to work on start/restart.
>>
>> - Kevin
>>
>> > On Sep 5, 2015, at 8:45 AM, Dan Davis  wrote:
>> >
>> > Kevin & Noble,
>> >
>> > I'll take it on to test this.   I've built from source before, and I've
>> > wanted this authorization capability for awhile.
>> >
>> > On Fri, Sep 4, 2015 at 9:59 AM, Kevin Lee 
>> wrote:
>> >
>> >> Noble,
>> >>
>> >> Does SOLR-8000 need to be re-opened?  Has anyone else been able to test
>> >> the restart fix?
>> >>
>> >> At startup, these are the log messages that say there is no security
>> >> configuration and the plugins aren’t being used even though
>> security.json
>> >> is in Zookeeper:
>> >> 2015-09-04 08:06:21.205 INFO  (main) [   ] o.a.s.c.CoreContainer
>> Security
>> >> conf doesn't exist. Skipping setup for authorization module.
>> >> 2015-09-04 08:06:21.205 INFO  (main) [   ] o.a.s.c.CoreContainer No
>> >> authentication plugin used.
>> >>
>> >> Thanks,
>> >> Kevin
>> >>
>> >>> On Sep 4, 2015, at 5:47 AM, Noble Paul  wrote:
>> >>>
>> >>> There are no download links for 5.3.x branch  till we do a bug fix
>> >> release
>> >>>
>> >>> If you wish to download the trunk nightly (which is not same as 5.3.0)
>> >>> check here
>> >>
>> https://builds.apache.org/job/Solr-Artifacts-trunk/lastSuccessfulBuild/artifact/solr/package/
>> >>>
>> >>> If you wish to get the binaries for 5.3 branch you will have to make
>> it
>> >>> (you will need to install svn and ant)
>> >>>
>> >>> Here are the steps
>> >>>
>> >>> svn checkout
>> >> http://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_5_3/
>> >>> cd lucene_solr_5_3/solr
>> >>> ant server
>> >>>
>> >>>
>> >>>
>> >>> On Fri, Sep 4, 2015 at 4:11 PM, davidphilip cherian
>> >>>  wrote:
>>  Hi Kevin/Noble,
>> 
>>  What is the download link to take the latest? What are the steps to
>> >> compile
>>  it, test and use?
>>  We also have a use case to have this feature in solr too. Therefore,
>> >> wanted
>>  to test and above info would help a lot to get started.
>> 
>>  Thanks.
>> 
>> 
>>  On Fri, Sep 4, 2015 at 1:45 PM, Kevin Lee > >
>> >> wrote:
>> 
>> > Thanks, I downloaded the source and compiled it and replaced the jar
>> >> file
>> > in the dist and solr-webapp’s WEB-INF/lib directory.  It does seem
>> to
>> >> be
>> > protecting the Collections API reload command now as long as I
>> upload
>> >> the
>> > security.json after startup of the Solr instances.  If I shutdown
>> and
>> >> bring
>> > the instances back up, the security is no longer in place and I
>> have to
>> > upload the security.json again for it to take effect.
>> >
>> > - Kevin
>> >
>> >> On Sep 3, 2015, at 10:29 PM, Noble Paul 
>> wrote:
>> >>
>> >> Both these are committed. If you could test with the latest 5.3
>> branch
>> >> it would be helpful
>> >>
>> >> On Wed, Sep 2, 2015 at 5:11 PM, Noble Paul 
>> >> wrote:
>> >>> I opened a ticket for the same
>> >>> https://issues.apache.org/jira/browse/SOLR-8004
>> >>>
>> >>> On Wed, Sep 2, 2015 at 1:36 PM, Kevin Lee
>> > >>>
>> > wrote:
>>  I’ve found that completely exiting Chrome or Firefox and opening
>> it
>> > 

Loading Solr Analyzer from RuntimeLib Blob

2015-09-10 Thread Steve Davids
Accidentally sent this on the java-users list instead of solr-users...


Hi,

I am attempting to migrate our deployment process over to using the
recently added "Blob Store API", which should simplify things a bit when
it comes to cloud infrastructures for us. Unfortunately, after loading the
jar into the .system collection and adding it to our runtimeLib config
overlay, analyzers from our schema don't appear to be aware of our custom
code. Is there a way to specify runtimeLib="true" on the schema, or
perhaps an alternate method to make sure that the jar is loaded on the
classpath before the schema is loaded?

Thanks for the help,

-Steve


Re: Issue while adding Long.MAX_VALUE to a TrieLong field

2015-09-10 Thread Yonik Seeley
On Thu, Sep 10, 2015 at 2:21 PM, Pushkar Raste  wrote:
> Hi,
> I am trying to add the following document (the value for price.long is
> Long.MAX_VALUE):
>
> <add>
>   <doc>
>     <field name="id">411</field>
>     <field name="name">one</field>
>     <field name="price.long">9223372036854775807</field>
>   </doc>
> </add>
>
> However, upon querying my collection, the value I get back for
> "price.long" is 9223372036854776000

The value probably isn't actually rounded in solr, but in the client.
If you are looking at this from the admin console, then it's the
javascript there that is unfortunately rounding the displayed value.

http://stackoverflow.com/questions/1379934/large-numbers-erroneously-rounded-in-javascript

https://issues.apache.org/jira/browse/SOLR-6364

We should really fix the admin somehow... this has bitten quite a few people.

-Yonik
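
The effect is easy to reproduce outside a browser. A tiny Java sketch of
what happens when a 64-bit long is squeezed through a double (which, like
a JavaScript number, has 53 mantissa bits):

public class LongPrecisionDemo {
  public static void main(String[] args) {
    long max = Long.MAX_VALUE;               // 9223372036854775807
    double viaDouble = (double) max;         // what a JS-based client sees
    System.out.println(max);                 // prints 9223372036854775807
    System.out.printf("%.0f%n", viaDouble);  // prints 9223372036854775808
    // the largest even integer below 2^54, matching the max "safe" value
    // observed in this thread:
    System.out.println((1L << 54) - 2);      // prints 18014398509481982
  }
}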


IOException occured when talking to server

2015-09-10 Thread ku3ia
Hi all!
Sometimes this ERROR appears in the logs:
ERROR - 2015-09-10 11:52:19.940; org.apache.solr.common.SolrException;
org.apache.solr.common.SolrException:
org.apache.solr.client.solrj.SolrServerException: IOException occured when
talking to server at: http://x.x.x.x:8080/solr/corename_shard1_replica1
at
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:306)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1954)
at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:774)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:418)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)
at
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
...

Now some words about the configuration. I have 3 servers, with 3 Jetty
instances on each server. I'm using Jetty 8 and Apache Solr 4.8.
There are 8 collections: 4 collections have 3 shards each (12 shards in
total), and 4 collections have 1 shard each (4 shards in total). No
replicas. The schema.xml files are similar - I grouped my data by dates,
so as a result there are quite a few collections. I'm using aliases to run
queries across collections; for example, the query
http://x.x.x.x:8080/solr/corename_ALL/select?q=*:*
will run across all collections and shards.

I had applied the patch:
https://issues.apache.org/jira/browse/SOLR-6931

and changed the solr.xml file as:

<shardHandlerFactory name="shardHandlerFactory" class="HttpShardHandlerFactory">
  <int name="socketTimeout">${socketTimeout:0}</int>
  <int name="connTimeout">${connTimeout:0}</int>
  <bool name="useRetries">true</bool>
</shardHandlerFactory>

I had read similar topics about this exception:
http://lucene.472066.n3.nabble.com/IOException-occured-when-talking-to-solr-server-td4175554.html
http://lucene.472066.n3.nabble.com/IOException-occured-when-talking-to-server-td4170253.html
and changed jetty.xml as:

<Set name="ThreadPool">
  <New class="org.eclipse.jetty.util.thread.QueuedThreadPool">
    <Set name="minThreads">10</Set>
    <Set name="maxThreads">2000</Set>
    <Set name="detailedDump">false</Set>
  </New>
</Set>

but all these actions didn't help me; this exception is still in the logs.
I configured HAProxy to try to understand which OSI layer the problem is
in, but unfortunately this hasn't helped me yet either.

If anyone has any ideas about this exception, please advise.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/IOException-occured-when-talking-to-server-tp4228405.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Stemmer and stopword Development

2015-09-10 Thread Upayavira
It depends on why you want stopwords. Stopwords were an important thing
back in the day - they helped performance. Now, with a decent CPU and
TF/IDF on your side, they don't do so much harm, in fact, avoiding them
can save the day:

q=to be or not to be

would not locate anything if we'd used stopwords. However:

q=jack and jill

will score docs that have "jack" or "jill" or preferably both way above
docs that just have "and".

If I needed stopwords, I'd do something like you suggested, then show
the results to a native speaker and see what they think.

Upayavira

On Thu, Sep 10, 2015, at 03:16 PM, Imtiaz Shakil Siddique wrote:
> Hi Upayavira,
> 
> Thank you for your kind assistance Sir.
> If that is the requirement for stemming then I will do it.
> 
> My next question is how can I build a stopword list for Bengali language?
> The option that I've thought about are
> 
> 1. Calculate idf values for all the stemmed words inside 20GB crawled
> data.
> 2. Find the words that have a very low inverse document frequency (i.e.
> that occur in most documents) and mark them as stopwords.
> 
> If you have any better solution then please help!
> Thank you Sir,
> Imtiaz Shakil Siddique
> 
> 
> On 10 September 2015 at 17:38, Upayavira  wrote:
> 
> > I haven't heard of any machine learning based stemmers. I'm not really
> > sure what algorithm you would use to do stemming - what you'd be looking
> > for is something that says, well, running stemmed to run, walking
> > stemmed to walk, therefore hopping should stem to hop, but that'd be
> > quite an algorithm to develop, I'd say.
> >
> > There are a few ways you could handle this:
> >
> > 1) locate a Bengali linguist who can help you define an algorithm
> >  2) manually stem a large number of documents and use that as a basis
> >  for stemming
> >
> > If you had a stemmed corpus, you could simply use synonyms to do it, in
> > English, you could map:
> >
> > run,running,runs,ran,runner=>run
> > walk,walked,walking,walker=>walk
> >
> > Then all you need to do is generate a synonym file and use the
> > SynonymFilterFactory with it, in place of a stemmer.
> >
> > Would that work?
> >
> > Upayavira
> >
> > On Thu, Sep 10, 2015, at 09:59 AM, Imtiaz Shakil Siddique wrote:
> > > Thanks for the reply.
> > >
> > > Currently I have 20GB Bengali newspaper data ( for corpus building )
> > > I don't have manual stemmed corpus but if needed I will build one.
> > >
> > > Basically I need guidance regarding how to do this.
> > > If there are some standard approaches of building stemmer and stopword
> > > for
> > > use with solr then please
> > > share it .
> > >
> > > Thank you Upayavira for your kind help.
> > >
> > > Imtiaz Shakil Siddique
> > >
> > >
> > > On 10 September 2015 at 13:23, Upayavira  wrote:
> > >
> > > >
> > > >
> > > > On Thu, Sep 10, 2015, at 04:45 AM, Imtiaz Shakil Siddique wrote:
> > > > > Hi,
> > > > >
> > > > > I am trying to develop stemmer and stopword for Bengaly language
> > which is
> > > > > not shipped with solr.
> > > > >
> > > > > I am trying to make this with machine learning approach but I
> > couldn't
> > > > > find
> > > > > any good documents to study. It would be very helpful if you could
> > shed
> > > > > some lights into this matter.
> > > >
> > > > How are you going to do this with machine learning? What corpus are you
> > > > going to use to learn from? Do you have some documents that have been
> > > > manually stemmed for which you also have the originals?
> > > >
> > > > Upayavira
> > > >
> >
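
If it helps, a rough Java sketch of the document-frequency scan against a
Lucene 5.x index (the index path, field name, and the "more than half the
documents" threshold are all arbitrary choices here); terms with a very
high document frequency, i.e. a very low IDF, are the stopword candidates:

import java.nio.file.Paths;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiFields;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.BytesRef;

public class StopwordCandidates {
  public static void main(String[] args) throws Exception {
    try (IndexReader reader = DirectoryReader.open(
        FSDirectory.open(Paths.get("/path/to/index")))) {
      int numDocs = reader.numDocs();
      Terms terms = MultiFields.getTerms(reader, "content");
      TermsEnum te = terms.iterator();
      BytesRef term;
      while ((term = te.next()) != null) {
        // a term occurring in more than half the documents is a candidate
        if (te.docFreq() > numDocs / 2) {
          System.out.println(term.utf8ToString() + "\t" + te.docFreq());
        }
      }
    }
  }
}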


Re: Stemmer and stopword Development

2015-09-10 Thread Doug Turnbull
I've used stopwords to reduce the index size considerably to improve search
performance (same with stemming, etc). For relevance I've often preferred
to leave stop words in for the reasons Upayavira mentions. There are all
kinds of confusing things that can happen with stopwords, such that
sometimes they're not worth the trouble.

For an example of something confusing that happens when you take out
stopwords from the index, it interacts a bit unintuitively with min should
match
http://opensourceconnections.com/blog/2013/04/15/querying-more-fields-more-results-stop-wording-and-solrs-mm-min-should-match-argument/

Cheers
-Doug




On Thu, Sep 10, 2015 at 4:50 PM, Upayavira  wrote:

> It depends on why you want stopwords. Stopwords were an important thing
> back in the day - they helped performance. Now, with a decent CPU and
> TF/IDF on your side, they don't do so much harm, in fact, avoiding them
> can save the day:
>
> q=to be or not to be
>
> would not locate anything if we'd used stopwords. However:
>
> q=jack and jill
>
> will score docs that have "jack" or "jill" or preferably both way above
> docs that just have "and".
>
> If I needed stopwords, I'd do something like you suggested, then show
> the results to a native speaker and see what they think.
>
> Upayavira
>
> On Thu, Sep 10, 2015, at 03:16 PM, Imtiaz Shakil Siddique wrote:
> > Hi Upayavira,
> >
> > Thank you for your kind assistance Sir.
> > If that is the requirement for stemming then I will do it.
> >
> > My next question is how can I build a stopword list for Bengali language?
> > The option that I've thought about are
> >
> > 1. Calculate idf values for all the stemmed words inside 20GB crawled
> > data.
> > 2. Find the words that have a very low inverse document frequency (i.e.
> > that occur in most documents) and mark them as stopwords.
> >
> > If you have any better solution then please help!
> > Thank you Sir,
> > Imtiaz Shakil Siddique
> >
> >
> > On 10 September 2015 at 17:38, Upayavira  wrote:
> >
> > > I haven't heard of any machine learning based stemmers. I'm not really
> > > sure what algorithm you would use to do stemming - what you'd be
> looking
> > > for is something that says, well, running stemmed to run, walking
> > > stemmed to walk, therefore hopping should stem to hop, but that'd be
> > > quite an algorithm to develop, I'd say.
> > >
> > > There are a few ways you could handle this:
> > >
> > > 1) locate a Bengali linguist who can help you define an algorithm
> > >  2) manually stem a large number of documents and use that as a basis
> > >  for stemming
> > >
> > > If you had a stemmed corpus, you could simply use synonyms to do it, in
> > > English, you could map:
> > >
> > > run,running,runs,ran,runner=>run
> > > walk,walked,walking,walker=>walk
> > >
> > > Then all you need to do is generate a synonym file and use the
> > > SynonymFilterFactory with it, in place of a stemmer.
> > >
> > > Would that work?
> > >
> > > Upayavira
> > >
> > > On Thu, Sep 10, 2015, at 09:59 AM, Imtiaz Shakil Siddique wrote:
> > > > Thanks for the reply.
> > > >
> > > > Currently I have 20GB Bengali newspaper data ( for corpus building )
> > > > I don't have manual stemmed corpus but if needed I will build one.
> > > >
> > > > Basically I need guidance regarding how to do this.
> > > > If there are some standard approaches of building stemmer and
> stopword
> > > > for
> > > > use with solr then please
> > > > share it .
> > > >
> > > > Thank you Upayavira for your kind help.
> > > >
> > > > Imtiaz Shakil Siddique
> > > >
> > > >
> > > > On 10 September 2015 at 13:23, Upayavira  wrote:
> > > >
> > > > >
> > > > >
> > > > > On Thu, Sep 10, 2015, at 04:45 AM, Imtiaz Shakil Siddique wrote:
> > > > > > Hi,
> > > > > >
> > > > > > I am trying to develop stemmer and stopword for Bengaly language
> > > which is
> > > > > > not shipped with solr.
> > > > > >
> > > > > > I am trying to make this with machine learning approach but I
> > > couldn't
> > > > > > find
> > > > > > any good documents to study. It would be very helpful if you
> could
> > > shed
> > > > > > some lights into this matter.
> > > > >
> > > > > How are you going to do this with machine learning? What corpus
> are you
> > > > > going to use to learn from? Do you have some documents that have
> been
> > > > > manually stemmed for which you also have the originals?
> > > > >
> > > > > Upayavira
> > > > >
> > >
>



-- 
*Doug Turnbull **| *Search Relevance Consultant | OpenSource Connections
, LLC | 240.476.9983
Author: Relevant Search 
This e-mail and all contents, including attachments, is considered to be
Company Confidential unless explicitly stated otherwise, regardless
of whether attachments are marked as such.


Re: Boosting related doubt?

2015-09-10 Thread Shawn Heisey
On 9/9/2015 11:16 PM, Aman Tandon wrote:
> I need to ask that when I am looking for all the parameters of the
> query using *echoParams=ALL*, I am getting the boost parameter twice in
> the information printed on the browser screen.

If you see a parameter twice in the "echoParams=all" output, and you
aren't including it twice in the URL and/or request body, this usually
means that the parameter is in the "defaults" section of the handler
definition in solrconfig.xml, and also in the request itself.  In that
situation, the parameter in the request will be the one that actually
takes effect, but there is no obvious indication which one is which in
the response.  I think it will be the last one listed, but I am not sure.

I wonder if we could extend the echoed parameter structure so that it
would be obvious whether the parameter comes from invariants, appends,
defaults, or the request.  Or maybe we change it so that at the existing
level in the structure, the EFFECTIVE parameter list is displayed rather
than all parameters (removing duplicates for those situations where a
duplicate will override instead of merge), and the parameters for each
of the four possible sources are displayed in a separate part of the
structure.

Thanks,
Shawn
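
For illustration (the handler's defaults and the boost function here are
made up, not taken from the original poster's config), a duplicate shows
up when a handler defines a parameter in its defaults and the request
sends the same parameter again:

<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">all</str>
    <str name="boost">product(popularity,1)</str>
  </lst>
</requestHandler>

A request such as /select?q=*:*&boost=sqrt(popularity) would then echo two
boost values, with the one from the request being the one that takes
effect, as described above.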



RE: Stemmer and stopword Development

2015-09-10 Thread Davis, Daniel (NIH/NLM) [C]
Stop words for international indexing don't seem too useful to me at this
point. To use them, you definitely have to know what language you are in at
all times, and that doesn't happen with unstructured data (e.g. a bunch of
PDF/Word files that happen to be linked from a bunch of web pages). I'm
currently working on something where I do have structured data, but
diacritics show up in fields clearly identified as English - structured
data can also be messy.

-Original Message-
From: Doug Turnbull [mailto:dturnb...@opensourceconnections.com] 
Sent: Thursday, September 10, 2015 4:55 PM
To: solr-user@lucene.apache.org
Subject: Re: Stemmer and stopword Development

I've used stopwords to reduce the index size considerably to improve search 
performance (same with stemming, etc). For relevance I've often preferred to 
leave stop words in for the reasons Upayavira mentions. There are all kinds of 
confusing things that can happen with stopwords, such that sometimes they're not 
worth the trouble.

For an example of something confusing that happens when you take out stopwords 
from the index, it interacts a bit unintuitively with min should match 
http://opensourceconnections.com/blog/2013/04/15/querying-more-fields-more-results-stop-wording-and-solrs-mm-min-should-match-argument/

Cheers
-Doug




On Thu, Sep 10, 2015 at 4:50 PM, Upayavira  wrote:

> It depends on why you want stopwords. Stopwords were an important 
> thing back in the day - they helped performance. Now, with a decent 
> CPU and TF/IDF on your side, they don't do so much harm, in fact, 
> avoiding them can save the day:
>
> q=to be or not to be
>
> would not locate anything if we'd used stopwords. However:
>
> q=jack and jill
>
> will score docs that have "jack" or "jill" or preferably both way 
> above docs that just have "and".
>
> If I needed stopwords, I'd do something like you suggested, then show 
> the results to a native speaker and see what they think.
>
> Upayavira
>
> On Thu, Sep 10, 2015, at 03:16 PM, Imtiaz Shakil Siddique wrote:
> > Hi Upayavira,
> >
> > Thank you for your kind assistance Sir.
> > If that is the requirement for stemming then I will do it.
> >
> > My next question is how can I build a stopword list for Bengali language?
> > The option that I've thought about are
> >
> > 1. Calculate idf values for all the stemmed words inside 20GB 
> > crawled data.
> > 2. Find the words that have a very low inverse document frequency (i.e. 
> > that occur in most documents) and mark them as stopwords.
> >
> > If you have any better solution then please help!
> > Thank you Sir,
> > Imtiaz Shakil Siddique
> >
> >
> > On 10 September 2015 at 17:38, Upayavira  wrote:
> >
> > > I haven't heard of any machine learning based stemmers. I'm not 
> > > really sure what algorithm you would use to do stemming - what 
> > > you'd be
> looking
> > > for is something that says, well, running stemmed to run, walking 
> > > stemmed to walk, therefore hopping should stem to hop, but that'd 
> > > be quite an algorithm to develop, I'd say.
> > >
> > > There are a few ways you could handle this:
> > >
> > > 1) locate a Bengali linguist who can help you define an algorithm
> > >  2) manually stem a large number of documents and use that as a 
> > > basis  for stemming
> > >
> > > If you had a stemmed corpus, you could simply use synonyms to do 
> > > it, in English, you could map:
> > >
> > > run,running,runs,ran,runner=>run
> > > walk,walked,walking,walker=>walk
> > >
> > > Then all you need to do is generate a synonym file and use the 
> > > SynonymFilterFactory with it, in place of a stemmer.
> > >
> > > Would that work?
> > >
> > > Upayavira
> > >
> > > On Thu, Sep 10, 2015, at 09:59 AM, Imtiaz Shakil Siddique wrote:
> > > > Thanks for the reply.
> > > >
> > > > Currently I have 20GB Bengali newspaper data ( for corpus 
> > > > building ) I don't have manual stemmed corpus but if needed I will 
> > > > build one.
> > > >
> > > > Basically I need guidance regarding how to do this.
> > > > If there are some standard approaches of building stemmer and
> stopword
> > > > for
> > > > use with solr then please
> > > > share it .
> > > >
> > > > Thank you Upayavira for your kind help.
> > > >
> > > > Imtiaz Shakil Siddique
> > > >
> > > >
> > > > On 10 September 2015 at 13:23, Upayavira  wrote:
> > > >
> > > > >
> > > > >
> > > > > On Thu, Sep 10, 2015, at 04:45 AM, Imtiaz Shakil Siddique wrote:
> > > > > > Hi,
> > > > > >
> > > > > > I am trying to develop stemmer and stopword for Bengaly 
> > > > > > language
> > > which is
> > > > > > not shipped with solr.
> > > > > >
> > > > > > I am trying to make this with machine learning approach but 
> > > > > > I
> > > couldn't
> > > > > > find
> > > > > > any good documents to study. It would be very helpful if you
> could
> > > shed
> > > > > > some lights into this matter.
> > > > >
> > > > > How are you going to do this with machine learning? What 
> > > > > corpus
> are you
> > > > > 

Re: How to reordering search result by some function query

2015-09-10 Thread Aman Tandon
Hi,

I figured out how to implement this. I will be doing it by using the
boost parameter:

e.g. http://server:8112/solr/products/select?q=jute&qf=title&boost=product(1,product_guideline_score)

If there is any other alternative, please suggest it.

With Regards
Aman Tandon

On Thu, Sep 10, 2015 at 11:02 AM, Aman Tandon 
wrote:

> Hi,
>
> I have a requirement to reorder the search results by multiplying the *text 
> relevance
> score* of a product with the *product_guideline_score*, which will be
> stored in the index and will hold some floating-point number.
>
> e.g. on searching *jute* in the title, if we get some results ID1 & ID2
>
> ID1 -> title = jute
>   score = 8.0
> *  product_guideline_score = 2.0*
>
> ID2 -> title = jute bags
>   score = 7.5
> *  product_guideline_score** = 2.2*
>
> So the new score should be like this
>
> ID1 -> title = jute
>   score = *product_score * 8 = 16.0*
> *  product_guideline_score** = 2.0*
>
> ID2 -> title = jute bags
>   score = *product_score * 7.5 = 16.5*
> *  product_guideline_score** = 2.2*
>
> *So new ordering should be*
>
> ID2 -> title = jute bags
>   score* = 16.5*
>
> ID1 -> title = jute
>   score =* 16.0*
>
> How can I do this in single query on runtime in solr.
>
> With Regards
> Aman Tandon
>
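
A follow-up note on alternatives: since product(1,x) is just x, the same
multiplicative boost can be written directly (host, port, and field names
follow the example above; defType=edismax is assumed, since boost is an
edismax parameter):

http://server:8112/solr/products/select?defType=edismax&q=jute&qf=title&boost=product_guideline_score

With the plain lucene query parser, the equivalent would be the {!boost}
wrapper, e.g. q={!boost b=product_guideline_score}title:jute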