Re: error reporting during indexing

2015-09-29 Thread Matteo Grolla
Hi Erick,
it's a curiosity question. When I add a document it's buffered by Solr
and can be (and apparently is) parsed to verify it matches the schema. But it's
not written to a segment file until a commit is issued. If there is a
problem writing the segment, such as a permission error, isn't this a case where
I would report everything OK when in fact the documents are not there?

thanks

2015-09-29 2:12 GMT+02:00 Erick Erickson :

> You shouldn't be losing errors with HttpSolrServer. Are you
> seeing evidence that you are or is this mostly a curiosity question?
>
> Do note that it's better to batch up docs; your throughput will increase
> a LOT. That said, when you do batch (e.g. send 500 docs per update
> or whatever) and you get an error back, you're not quite sure what
> doc failed. So what people do is retry a failed batch one document
> at a time when the batch has errors and rely on Solr overwriting
> any docs in the batch that were indexed the first time.
>
> Best,
> Erick
>
> On Mon, Sep 28, 2015 at 2:27 PM, Matteo Grolla 
> wrote:
> > Hi,
> > > if I need fine-grained error reporting I use HttpSolrServer and
> > > send 1 doc per request using the add method.
> > > I report errors on exceptions of the add method.
> > > I'm using autocommit so I'm not seeing errors related to commit.
> > > Am I losing some errors? Is there a better way?
> >
> > Thanks
>
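
A minimal SolrJ sketch of the batch-then-retry pattern Erick describes (the
core URL and the "id" field are assumptions; HttpSolrServer is the SolrJ
client of that era):

import java.util.List;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BatchIndexer {
    private final HttpSolrServer server =
            new HttpSolrServer("http://localhost:8983/solr/collection1");

    public void indexBatch(List<SolrInputDocument> batch) {
        try {
            server.add(batch); // one round trip for the whole batch
        } catch (Exception batchError) {
            // The batch failed, but we don't know which doc was at fault.
            // Retry one document at a time; Solr overwrites (by uniqueKey)
            // any docs that were already indexed in the failed attempt.
            for (SolrInputDocument doc : batch) {
                try {
                    server.add(doc);
                } catch (Exception docError) {
                    System.err.println("Bad document: "
                            + doc.getFieldValue("id") + " -> " + docError);
                }
            }
        }
    }
}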


ConcurrentUpdateSolrServer with timeout on flushing?

2015-09-29 Thread gsus
Is there any way to make the SolrJ ConcurrentUpdateSolrServer flush
its queue on either of two conditions: the queue is full OR a timeout
occurs? (e.g. no new documents for 2 minutes, so let's flush)

best regards 





Re: MoreLikeThisHandler with multiple input documents

2015-09-29 Thread Upayavira
Let's take a step back. So, you have 3000 or so docs, and you want to
know which documents are similar to these.

Why do you want to know this? What feature do you need to build that
will use that information? Knowing this may help us to arrive at the
right technology for you.

For example, you might want to investigate offline clustering algorithms
(e.g. [1], which might be a bit dense to follow). A good book on machine
learning, if you are okay with Python, is "Programming Collective
Intelligence", as it explains the usual algorithms with simple for loops,
making them very clear.

Or, you could do searches, and then cluster the results at search time
(so if you search for 100 docs, it will identify clusters within those
100 matching documents). That might get you there. See [2].

So, if you let us know what the end-goal is, perhaps we can suggest an
alternative approach, rather than burying ourselves neck-deep in MLT
problems.

Upayavira

[1]
http://mylazycoding.blogspot.co.uk/2012/03/cluster-apache-solr-data-using-apache_13.html
[2] https://cwiki.apache.org/confluence/display/solr/Result+Clustering
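
For the search-time route, a hedged SolrJ sketch (it assumes a /clustering
handler wired to the clustering component as in [2]; the URL is illustrative):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class ClusterSearch {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr =
                new HttpSolrServer("http://localhost:8983/solr/collection1");
        SolrQuery q = new SolrQuery("your query");
        q.setRequestHandler("/clustering"); // handler with the clustering component
        q.setRows(100);                     // cluster within the top 100 matches
        QueryResponse rsp = solr.query(q);
        // the clustering component adds a "clusters" section to the response
        Object clusters = rsp.getResponse().get("clusters");
        System.out.println(clusters);
    }
}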

On Tue, Sep 29, 2015, at 12:42 PM, Szűcs Roland wrote:
> Hello Upayavira,
> 
> Thanks for dealing with my issue. I have already applied termVectors=true
> to all fields involved in the more like this calculation. I have just
> 3,000 documents, each of them represented by a relatively big term vector
> with more than 20,000 unique terms. If I run the more like this handler
> for a solr doc it takes close to 1 sec to get back the first 10 similar
> documents. After this I have to pass the doc ids to my other application,
> which finds the cover of the e-book and other metadata and puts it on the
> web. The end-to-end process takes too much time from the customer's
> perspective; that is why I tried to find a solution for offline more like
> this calculation. But if my app has to call the MoreLikeThisHandler for
> each doc, it puts overhead on the offline calculation.
> 
> Best Regards,
> Roland
> 
> 2015-09-29 13:01 GMT+02:00 Upayavira :
> 
> > If MoreLikeThis is slow for large documents that are indexed, have you
> > enabled term vectors on the similarity fields?
> >
> > Basically, what more like this does is this:
> >
> > * decide on what terms in the source doc are "interesting", and pick the
> > 25 most interesting ones
> > * build and execute a boolean query using these interesting terms.
> >
> > Looking at the first phase of this in more detail:
> >
> > If you pass in a document using stream.body, it will analyse this
> > document into terms, and then calculate the most interesting terms from
> > that.
> >
> > If you reference a document in your index with a field that is stored, it
> > will take the stored version, and analyse it and identify the
> > interesting terms from there.
> >
> > If, however, you have stored term vectors against that field, this work
> > is not needed. You have already done much of the work, and the
> > identification of your "interesting terms" will be much faster.
> >
> > Thus, on the content field of your documents, add termVectors="true" in
> > your schema, and re-index. Then you could well find MLT becoming a lot
> > more efficient.
> >
> > Upayavira
> >
> > On Tue, Sep 29, 2015, at 10:39 AM, Szűcs Roland wrote:
> > > Hi Alessandro,
> > >
> > > My original goal was to get offline suggestions on content-based
> > > similarity for every e-book we have. We wanted to run a bulk more like
> > > this calculation in the evening, when the usage of our site is low and
> > > we submit a new e-book. Real-time more like this can take a while, as
> > > we typically have long documents (2-5MB text) with all the content
> > > indexed.
> > >
> > > When we upload a new document we wanted to recalculate the more like
> > > this suggestions and a tf-idf based tag cloud. Both of them are
> > > delivered by the MoreLikeThisHandler, but only for one document, as
> > > you wrote.
> > >
> > > The text input is not good for us because we need the similar doc list
> > > for each of the matched documents. If I put together the text of 10
> > > documents I cannot separate which suggestion relates to which matched
> > > document, and the tag cloud will belong to the mixed text.
> > >
> > > Most likely we will use the MoreLikeThisHandler for each of the
> > > documents, parse the JSON response, and store the result in a SQL
> > > database.
> > >
> > > Thanks for your help.
> > >
> > > 2015-09-29 11:18 GMT+02:00 Alessandro Benedetti
> > > 
> > > :
> > >
> > > > Hi Roland,
> > > > what is your exact requirement?
> > > > Do you want to basically build a "description" for a set of documents
> > > > and then find documents in the index, similar to this description?
> > > >
> > > > By default, based on my experience (and on the code), this is the
> > > > entry point for the Lucene More Like This:
> > > >
> > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >

Re: error reporting during indexing

2015-09-29 Thread Alessandro Benedetti
Hi Matteo, at this point I would suggest you read this post by Erick:

https://lucidworks.com/blog/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/

If I am not wrong, when a document is indexed (simplifying):
1) The document is added to the current segment in memory
2) When a soft commit happens, we get visibility (no flush happens to
the disk, but the document is searchable)
3) When a hard commit happens, we get durability: we truncate the
segment in memory and flush it to disk. So if a problem happens
here, you should see an error on the Solr side, but this does not mean
that indexing the document failed; only the last flush failed.

Regarding point 3, I am not sure what Solr's reaction to this failure is.
I should investigate.
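
A small SolrJ illustration of that visibility/durability split (a sketch
only; the core URL is an assumption):

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class CommitDemo {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr =
                new HttpSolrServer("http://localhost:8983/solr/collection1");
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "1");
        solr.add(doc);                  // buffered, not yet searchable
        solr.commit(true, true, true);  // soft commit: visibility, no flush to disk
        solr.commit(true, true, false); // hard commit: durability, segment flushed
    }
}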

Cheers



2015-09-29 8:53 GMT+01:00 Matteo Grolla :

> Hi Erick,
> it's a curiosity question. When I add a document it's buffered by Solr
> and can be (and apparently is) parsed to verify it matches the schema. But it's
> not written to a segment file until a commit is issued. If there is a
> problem writing the segment, such as a permission error, isn't this a case where
> I would report everything OK when in fact the documents are not there?
>
> thanks
>
> 2015-09-29 2:12 GMT+02:00 Erick Erickson :
>
> > You shouldn't be losing errors with HttpSolrServer. Are you
> > seeing evidence that you are or is this mostly a curiosity question?
> >
> > Do note that it's better to batch up docs; your throughput will increase
> > a LOT. That said, when you do batch (e.g. send 500 docs per update
> > or whatever) and you get an error back, you're not quite sure what
> > doc failed. So what people do is retry a failed batch one document
> > at a time when the batch has errors and rely on Solr overwriting
> > any docs in the batch that were indexed the first time.
> >
> > Best,
> > Erick
> >
> > On Mon, Sep 28, 2015 at 2:27 PM, Matteo Grolla 
> > wrote:
> > > Hi,
> > > if I need fine-grained error reporting I use HttpSolrServer and
> > > send 1 doc per request using the add method.
> > > I report errors on exceptions of the add method.
> > > I'm using autocommit so I'm not seeing errors related to commit.
> > > Am I losing some errors? Is there a better way?
> > >
> > > Thanks
> >
>



-- 
--

Benedetti Alessandro
Visiting card - http://about.me/alessandro_benedetti
Blog - http://alexbenedetti.blogspot.co.uk

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England


Re: MoreLikeThisHandler with multiple input documents

2015-09-29 Thread Szűcs Roland
Hi Alessandro,

My original goal was to get offline suggestions on content-based similarity
for every e-book we have. We wanted to run a bulk more like this
calculation in the evening, when the usage of our site is low and we submit
a new e-book. Real-time more like this can take a while, as we typically
have long documents (2-5MB text) with all the content indexed.

When we upload a new document we wanted to recalculate the more like this
suggestions and a tf-idf based tag cloud. Both of them are delivered by
the MoreLikeThisHandler, but only for one document, as you wrote.

The text input is not good for us because we need the similar doc list for
each of the matched documents. If I put together the text of 10 documents
I cannot separate which suggestion relates to which matched document, and
the tag cloud will belong to the mixed text.

Most likely we will use the MoreLikeThisHandler for each of the documents,
parse the JSON response, and store the result in a SQL database.

Thanks for your help.
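
A sketch of that per-document loop in SolrJ, which also avoids hand-parsing
the JSON (the handler path, core name, and params mirror this thread; treat
them as assumptions):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class OfflineMlt {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/bandwhu");
        SolrQuery q = new SolrQuery("id:8");      // one source doc per request
        q.setRequestHandler("/mlt");              // the MoreLikeThisHandler
        q.set("mlt.fl", "title,content");
        q.set("mlt.interestingTerms", "details"); // needed for the tag cloud
        q.setRows(5);                             // top 5 similar e-books
        QueryResponse rsp = solr.query(q);
        for (SolrDocument similar : rsp.getResults()) {
            System.out.println(similar.getFieldValue("id")); // persist these ids
        }
    }
}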

2015-09-29 11:18 GMT+02:00 Alessandro Benedetti 
:

> Hi Roland,
> what is your exact requirement?
> Do you want to basically build a "description" for a set of documents and
> then find documents in the index, similar to this description?
>
> By default, based on my experience (and on the code), this is the entry
> point for the Lucene More Like This:
>
>
> > org.apache.lucene.queries.mlt.MoreLikeThis:
> >
> > /**
> >  * Return a query that will return docs like the passed lucene document ID.
> >  *
> >  * @param docNum the documentID of the lucene doc to generate the
> >  *        'More Like This' query for.
> >  * @return a query that will return docs like the passed lucene document ID.
> >  */
> > public Query like(int docNum) throws IOException {
> >   if (fieldNames == null) {
> >     // gather list of valid fields from lucene
> >     Collection<String> fields = MultiFields.getIndexedFields(ir);
> >     fieldNames = fields.toArray(new String[fields.size()]);
> >   }
> >   return createQuery(retrieveTerms(docNum));
> > }
>
> It means that, in terms of "documents", you can feed in only one Solr doc.
>
> But you can also feed the MLT with simple text.
>
> So you should study your use case more closely and understand which option
> fits better:
>
> 1) customising the MLT component starting from Lucene
>
> 2) doing some processing client-side and using the "text" similarity feature.
>
>
> Cheers
>
>
> 2015-09-29 10:05 GMT+01:00 Roland Szűcs :
>
> > Hi all,
> >
> > Is it possible to feed multiple Solr ids to a MoreLikeThisHandler?
> >
> > 
> > 
> > false
> > details
> > title,content
> > 4
> > title^12 content^1
> > 2
> > 10
> > true
> > json
> > true
> > 
> >   
> >
> > when I call this: http://localhost:8983/solr/bandwhu/mlt?q=id:8=id
> > it works fine. Is there any way to have a kind of "bulk" call of the more
> > like this handler? I need the interesting terms as well, and as far as I
> > know, if I use more like this as a search component it does not return
> > them, so it is not an alternative.
> >
> > Thanks in advance,
> >
> >
> > --
> > Roland Szűcs
> > Connect with me on Linkedin
> > <https://www.linkedin.com/pub/roland-sz%C5%B1cs/28/226/24/hu>
> > CEO | Phone: +36 1 210 81 13 | Bookandwalk.hu
> > 
> >
>
>
>
> --
> --
>
> Benedetti Alessandro
> Visiting card - http://about.me/alessandro_benedetti
> Blog - http://alexbenedetti.blogspot.co.uk
>
> "Tyger, tyger burning bright
> In the forests of the night,
> What immortal hand or eye
> Could frame thy fearful symmetry?"
>
> William Blake - Songs of Experience -1794 England
>



-- 
Szűcs Roland
Connect with me on Linkedin
Managing Director | Phone: +36 1 210 81 13 | Bookandwalk.hu



Re: MoreLikeThisHandler with multiple input documents

2015-09-29 Thread Upayavira
If MoreLikeThis is slow for large documents that are indexed, have you
enabled term vectors on the similarity fields?

Basically, what more like this does is this:

* decide on what terms in the source doc are "interesting", and pick the
25 most interesting ones
* build and execute a boolean query using these interesting terms.

Looking at the first phase of this in more detail:

If you pass in a document using stream.body, it will analyse this
document into terms, and then calculate the most interesting terms from
that.

If you reference a document in your index with a field that is stored, it
will take the stored version, and analyse it and identify the
interesting terms from there.

If, however, you have stored term vectors against that field, this work
is not needed. You have already done much of the work, and the
identification of your "interesting terms" will be much faster.

Thus, on the content field of your documents, add termVectors="true" in
your schema, and re-index. Then you could well find MLT becoming a lot
more efficient.

Upayavira

On Tue, Sep 29, 2015, at 10:39 AM, Szűcs Roland wrote:
> Hi Alessandro,
> 
> My original goal was to get offline suggestions on content-based
> similarity for every e-book we have. We wanted to run a bulk more like
> this calculation in the evening, when the usage of our site is low and we
> submit a new e-book. Real-time more like this can take a while, as we
> typically have long documents (2-5MB text) with all the content indexed.
> 
> When we upload a new document we wanted to recalculate the more like this
> suggestions and a tf-idf based tag cloud. Both of them are delivered by
> the MoreLikeThisHandler, but only for one document, as you wrote.
> 
> The text input is not good for us because we need the similar doc list
> for each of the matched documents. If I put together the text of 10
> documents I cannot separate which suggestion relates to which matched
> document, and the tag cloud will belong to the mixed text.
> 
> Most likely we will use the MoreLikeThisHandler for each of the
> documents, parse the JSON response, and store the result in a SQL
> database.
> 
> Thanks for your help.
> 
> 2015-09-29 11:18 GMT+02:00 Alessandro Benedetti
> 
> :
> 
> > Hi Roland,
> > what is your exact requirement?
> > Do you want to basically build a "description" for a set of documents and
> > then find documents in the index, similar to this description?
> >
> > By default, based on my experience (and on the code), this is the entry
> > point for the Lucene More Like This:
> >
> >
> > > org.apache.lucene.queries.mlt.MoreLikeThis:
> > >
> > > /**
> > >  * Return a query that will return docs like the passed lucene document ID.
> > >  *
> > >  * @param docNum the documentID of the lucene doc to generate the
> > >  *        'More Like This' query for.
> > >  * @return a query that will return docs like the passed lucene document ID.
> > >  */
> > > public Query like(int docNum) throws IOException {
> > >   if (fieldNames == null) {
> > >     // gather list of valid fields from lucene
> > >     Collection<String> fields = MultiFields.getIndexedFields(ir);
> > >     fieldNames = fields.toArray(new String[fields.size()]);
> > >   }
> > >   return createQuery(retrieveTerms(docNum));
> > > }
> >
> > It means that, in terms of "documents", you can feed in only one Solr doc.
> >
> > But you can also feed the MLT with simple text.
> >
> > So you should study your use case more closely and understand which option
> > fits better:
> >
> > 1) customising the MLT component starting from Lucene
> >
> > 2) doing some processing client-side and using the "text" similarity feature.
> >
> >
> > Cheers
> >
> >
> > 2015-09-29 10:05 GMT+01:00 Roland Szűcs :
> >
> > > Hi all,
> > >
> > > Is it possible to feed multiple Solr ids to a MoreLikeThisHandler?
> > >
> > > 
> > > 
> > > false
> > > details
> > > title,content
> > > 4
> > > title^12 content^1
> > > 2
> > > 10
> > > true
> > > json
> > > true
> > > 
> > >   
> > >
> > > when I call this: http://localhost:8983/solr/bandwhu/mlt?q=id:8=id
> > > it works fine. Is there any way to have a kind of "bulk" call of the
> > > more like this handler? I need the interesting terms as well, and as
> > > far as I know, if I use more like this as a search component it does
> > > not return them, so it is not an alternative.
> > >
> > > Thanks in advance,
> > >
> > >
> > > --
> > > Roland Szűcs
> > > Connect with me on Linkedin
> > > <https://www.linkedin.com/pub/roland-sz%C5%B1cs/28/226/24/hu>
> > > CEO | Phone: +36 1 210 81 13 | Bookandwalk.hu
> > > 
> > >
> >
> >
> >
> > --
> > --
> >
> > Benedetti Alessandro
> > Visiting card - http://about.me/alessandro_benedetti
> > Blog - http://alexbenedetti.blogspot.co.uk
> 

Re: MoreLikeThisHandler with multiple input documents

2015-09-29 Thread Alessandro Benedetti
Hi Roland,
what is your exact requirement?
Do you want to basically build a "description" for a set of documents and
then find documents in the index, similar to this description?

By default, based on my experience (and on the code), this is the entry
point for the Lucene More Like This:


> org.apache.lucene.queries.mlt.MoreLikeThis:
>
> /**
>  * Return a query that will return docs like the passed lucene document ID.
>  *
>  * @param docNum the documentID of the lucene doc to generate the
>  *        'More Like This' query for.
>  * @return a query that will return docs like the passed lucene document ID.
>  */
> public Query like(int docNum) throws IOException {
>   if (fieldNames == null) {
>     // gather list of valid fields from lucene
>     Collection<String> fields = MultiFields.getIndexedFields(ir);
>     fieldNames = fields.toArray(new String[fields.size()]);
>   }
>   return createQuery(retrieveTerms(docNum));
> }

It means that, in terms of "documents", you can feed in only one Solr doc.

But you can also feed the MLT with simple text.

So you should study your use case more closely and understand which option
fits better:

1) customising the MLT component starting from Lucene

2) doing some processing client-side and using the "text" similarity feature.


Cheers


2015-09-29 10:05 GMT+01:00 Roland Szűcs :

> Hi all,
>
> Is it possible to feed multiple Solr ids to a MoreLikeThisHandler?
>
> 
> 
> false
> details
> title,content
> 4
> title^12 content^1
> 2
> 10
> true
> json
> true
> 
>   
>
> when I call this: http://localhost:8983/solr/bandwhu/mlt?q=id:8=id
> it works fine. Is there any way to have a kind of "bulk" call of the more
> like this handler? I need the interesting terms as well, and as far as I
> know, if I use more like this as a search component it does not return
> them, so it is not an alternative.
>
> Thanks in advance,
>
>
> --
> Roland Szűcs
> Connect with me on Linkedin
> <https://www.linkedin.com/pub/roland-sz%C5%B1cs/28/226/24/hu>
> CEO | Phone: +36 1 210 81 13 | Bookandwalk.hu
> 
>



-- 
--

Benedetti Alessandro
Visiting card - http://about.me/alessandro_benedetti
Blog - http://alexbenedetti.blogspot.co.uk

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England


Re: MoreLikeThisHandler with multiple input documents

2015-09-29 Thread Szűcs Roland
Hello Upayavira,

Thanks for dealing with my issue. I have already applied termVectors=true
to all fields involved in the more like this calculation. I have just 3,000
documents, each of them represented by a relatively big term vector with
more than 20,000 unique terms. If I run the more like this handler for a
solr doc it takes close to 1 sec to get back the first 10 similar
documents. After this I have to pass the doc ids to my other application,
which finds the cover of the e-book and other metadata and puts it on the
web. The end-to-end process takes too much time from the customer's perspective;
that is why I tried to find a solution for offline more like this
calculation. But if my app has to call the MoreLikeThisHandler for each doc,
it puts overhead on the offline calculation.

Best Regards,
Roland

2015-09-29 13:01 GMT+02:00 Upayavira :

> If MoreLikeThis is slow for large documents that are indexed, have you
> enabled term vectors on the similarity fields?
>
> Basically, what more like this does is this:
>
> * decide on what terms in the source doc are "interesting", and pick the
> 25 most interesting ones
> * build and execute a boolean query using these interesting terms.
>
> Looking at the first phase of this in more detail:
>
> If you pass in a document using stream.body, it will analyse this
> document into terms, and then calculate the most interesting terms from
> that.
>
> If you reference a document in your index with a field that is stored, it
> will take the stored version, and analyse it and identify the
> interesting terms from there.
>
> If, however, you have stored term vectors against that field, this work
> is not needed. You have already done much of the work, and the
> identification of your "interesting terms" will be much faster.
>
> Thus, on the content field of your documents, add termVectors="true" in
> your schema, and re-index. Then you could well find MLT becoming a lot
> more efficient.
>
> Upayavira
>
> On Tue, Sep 29, 2015, at 10:39 AM, Szűcs Roland wrote:
> > Hi Alessandro,
> >
> > My original goal was to get offline suggestions on content-based
> > similarity for every e-book we have. We wanted to run a bulk more like
> > this calculation in the evening, when the usage of our site is low and
> > we submit a new e-book. Real-time more like this can take a while, as
> > we typically have long documents (2-5MB text) with all the content indexed.
> >
> > When we upload a new document we wanted to recalculate the more like this
> > suggestions and a tf-idf based tag cloud. Both of them are delivered by
> > the MoreLikeThisHandler, but only for one document, as you wrote.
> >
> > The text input is not good for us because we need the similar doc list
> > for each of the matched documents. If I put together the text of 10
> > documents I cannot separate which suggestion relates to which matched
> > document, and the tag cloud will belong to the mixed text.
> >
> > Most likely we will use the MoreLikeThisHandler for each of the
> > documents, parse the JSON response, and store the result in a SQL
> > database.
> >
> > Thanks for your help.
> >
> > 2015-09-29 11:18 GMT+02:00 Alessandro Benedetti
> > 
> > :
> >
> > > Hi Roland,
> > > what is your exact requirement?
> > > Do you want to basically build a "description" for a set of documents
> > > and then find documents in the index, similar to this description?
> > >
> > > By default, based on my experience (and on the code), this is the
> > > entry point for the Lucene More Like This:
> > >
> > >
> > > > org.apache.lucene.queries.mlt.MoreLikeThis:
> > > >
> > > > /**
> > > >  * Return a query that will return docs like the passed lucene document ID.
> > > >  *
> > > >  * @param docNum the documentID of the lucene doc to generate the
> > > >  *        'More Like This' query for.
> > > >  * @return a query that will return docs like the passed lucene document ID.
> > > >  */
> > > > public Query like(int docNum) throws IOException {
> > > >   if (fieldNames == null) {
> > > >     // gather list of valid fields from lucene
> > > >     Collection<String> fields = MultiFields.getIndexedFields(ir);
> > > >     fieldNames = fields.toArray(new String[fields.size()]);
> > > >   }
> > > >   return createQuery(retrieveTerms(docNum));
> > > > }
> > >
> > > It means that, in terms of "documents", you can feed in only one Solr doc.
> > >
> > > But you can also feed the MLT with simple text.
> > >
> > > So you should study your use case more closely and understand which
> > > option fits better:
> > >
> > > 1) customising the MLT component starting from Lucene
> > >
> > > 2) doing some processing client-side and using the "text" similarity
> > > feature.
> > >
> > >
> > > Cheers
> > >
> > >
> > > 2015-09-29 10:05 GMT+01:00 Roland Szűcs  >:
> > >
> > > > Hi all,
> > > >
> > > > Is it possible to feed multiple Solr ids to a MoreLikeThisHandler?
> > > >
> > > > 
> > > > 
> > > > false
> > 

MoreLikeThisHandler with multiple input documents

2015-09-29 Thread Roland Szűcs
Hi all,

Is it possible to feed multiple Solr ids to a MoreLikeThisHandler?



false
details
title,content
4
title^12 content^1
2
10
true
json
true

  

when I call this: http://localhost:8983/solr/bandwhu/mlt?q=id:8=id
it works fine. Is there any way to have a kind of "bulk" call of the more like
this handler? I need the interesting terms as well, and as far as I know, if
I use more like this as a search component it does not return them, so it
is not an alternative.

Thanks in advance,


-- 
Roland Szűcs
Connect with me on Linkedin
CEO | Phone: +36 1 210 81 13 | Bookandwalk.hu



Re: highlighting

2015-09-29 Thread Upayavira
You can change the strings that are inserted into the text, and could
place markers that you use to identify the start/end of highlighting
elements. Does that work?

Upayavira
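
A hedged sketch of that idea in SolrJ, using the standard hl.simple.pre and
hl.simple.post parameters to swap the <em>/</em> tags for sentinel characters
you can locate yourself (the marker choice is illustrative):

import org.apache.solr.client.solrj.SolrQuery;

public class HighlightMarkers {
    public static void main(String[] args) {
        SolrQuery q = new SolrQuery("content:term");
        q.setHighlight(true);
        q.setParam("hl.fl", "content");
        q.setParam("hl.simple.pre", "\u0001");  // start-of-highlight marker
        q.setParam("hl.simple.post", "\u0002"); // end-of-highlight marker
        // After querying, scan each snippet for the markers to compute
        // offsets and lengths, strip them, and style the ranges in SWT.
    }
}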

On Mon, Sep 28, 2015, at 09:55 PM, Mark Fenbers wrote:
> Greetings!
> 
> I have highlighting turned on in my Solr searches, but what I get back 
> is <em> tags surrounding the found term.  Since I use a SWT StyledText 
> widget to display my search results, what I really want is the offset 
> and length of each found term, so that I can highlight it in my own way 
> without HTML.  Is there a way to configure Solr to do that?  I couldn't 
> find it.  If not, how do I go about posting this as a feature request?
> 
> Thanks,
> Mark


Re: MoreLikeThisHandler with multiple input documents

2015-09-29 Thread Alessandro Benedetti
Hi Roland,
you said "The main goal is that when a customer is on the pruduct page ".
But if you are in a  product page, I guess you have the product Id.
If you have the product id , you can simply execute the MLT request with
the single Doc Id in input.

Why do you need to calculate beforehand?

Cheers

2015-09-29 15:44 GMT+01:00 Szűcs Roland :

> Hello Upayavira,
>
> The main goal is that when a customer is on the product page of an e-book
> and somehow does not like it, I want to immediately offer her/him
> alternative e-books on the same topic. If I expect the customer to click
> on a button like "similar e-books" I lose half of them, as they are lazy
> to click anywhere. So I would like to present the alternative e-books on
> the product pages without any clicking.
>
> I assumed the best idea was to calculate the similar e-books for all the
> others (n*(n-1) similarity calculations) and present only the top 5. I
> planned to do it when our server is not busy. At this point I found the
> description of mlt as a search component, which seemed to be a good
> candidate as it calculates the similar documents for the whole result set
> of the query. So if I say q=*:* and the mlt component is enabled, I get
> similar documents for my entire document set. The only problem with this
> approach was that the mlt search component does not give back the
> interesting terms for my tag cloud calculation.
>
> That's why I tried to mix the flexibility of the mlt component (multiple
> docs accepted as input) with the robustness of the MoreLikeThisHandler
> (having interesting terms).
>
> If there is no solution, I will use the mlt component and solve the tag
> cloud calculation another way. By the way, if I am not mistaken, the 5.3.1
> version takes the union of the feature sets of the mlt component and
> handler.
>
> Best Regards,
> Roland
>
>
>
> 2015-09-29 14:38 GMT+02:00 Upayavira :
>
> > Let's take a step back. So, you have 3000 or so docs, and you want to
> > know which documents are similar to these.
> >
> > Why do you want to know this? What feature do you need to build that
> > will use that information? Knowing this may help us to arrive at the
> > right technology for you.
> >
> > For example, you might want to investigate offline clustering algorithms
> > (e.g. [1], which might be a bit dense to follow). A good book on machine
> > learning, if you are okay with Python, is "Programming Collective
> > Intelligence", as it explains the usual algorithms with simple for loops,
> > making them very clear.
> >
> > Or, you could do searches, and then cluster the results at search time
> > (so if you search for 100 docs, it will identify clusters within those
> > 100 matching documents). That might get you there. See [2].
> >
> > So, if you let us know what the end-goal is, perhaps we can suggest an
> > alternative approach, rather than burying ourselves neck-deep in MLT
> > problems.
> >
> > Upayavira
> >
> > [1]
> >
> >
> http://mylazycoding.blogspot.co.uk/2012/03/cluster-apache-solr-data-using-apache_13.html
> > [2] https://cwiki.apache.org/confluence/display/solr/Result+Clustering
> >
> > On Tue, Sep 29, 2015, at 12:42 PM, Szűcs Roland wrote:
> > > Hello Upayavira,
> > >
> > > Thanks for dealing with my issue. I have already applied
> > > termVectors=true to all fields involved in the more like this
> > > calculation. I have just 3,000 documents, each of them represented by
> > > a relatively big term vector with more than 20,000 unique terms. If I
> > > run the more like this handler for a solr doc it takes close to 1 sec
> > > to get back the first 10 similar documents. After this I have to pass
> > > the doc ids to my other application, which finds the cover of the
> > > e-book and other metadata and puts it on the web. The end-to-end
> > > process takes too much time from the customer's perspective; that is
> > > why I tried to find a solution for offline more like this calculation.
> > > But if my app has to call the MoreLikeThisHandler for each doc, it
> > > puts overhead on the offline calculation.
> > >
> > > Best Regards,
> > > Roland
> > >
> > > 2015-09-29 13:01 GMT+02:00 Upayavira :
> > >
> > > > If MoreLikeThis is slow for large documents that are indexed, have
> you
> > > > enabled term vectors on the similarity fields?
> > > >
> > > > Basically, what more like this does is this:
> > > >
> > > > * decide on what terms in the source doc are "interesting", and pick
> > the
> > > > 25 most interesting ones
> > > > * build and execute a boolean query using these interesting terms.
> > > >
> > > > Looking at the first phase of this in more detail:
> > > >
> > > > If you pass in a document using stream.body, it will analyse this
> > > > document into terms, and then calculate the most interesting terms
> from
> > > > that.
> > > >
> > > > If you reference a document in your index with a field that is stored,
> it
> > > > will take the stored version, and analyse 

Re-label terms from a shard?

2015-09-29 Thread Dan Bolser
Hi,

I'm using sharding 'off label' to integrate data from various remote sites
running a common schema.

One issue is that the remote sites sometimes use synonyms of the allowed
terms in a given field. i.e. we specify that a certain field may only carry
the values x, y, and z, but the remote indexes decide to use X, Y, and Z
instead.

In my 'hub' (the server configured to query over all shards), can I
configure a mapping such that the facet only shows x, y and z, instead of
x, X, y, Y, z, and Z?

I'm not sure how a facet selection would 'magically' filter on the list of
all synonyms defined in the mapping.

I should have defined this field as an enumeration, but I think the cat's
out of the bag now!


Many thanks,
Dan.


Re: MoreLikeThisHandler with multiple input documents

2015-09-29 Thread Szűcs Roland
Hello Upayavira,

The main goal is that when a customer is on the product page of an e-book
and somehow does not like it, I want to immediately offer her/him
alternative e-books on the same topic. If I expect the customer to click
on a button like "similar e-books" I lose half of them, as they are lazy
to click anywhere. So I would like to present the alternative e-books on
the product pages without any clicking.

I assumed the best idea was to calculate the similar e-books for all the
others (n*(n-1) similarity calculations) and present only the top 5. I
planned to do it when our server is not busy. At this point I found the
description of mlt as a search component, which seemed to be a good
candidate as it calculates the similar documents for the whole result set
of the query. So if I say q=*:* and the mlt component is enabled, I get
similar documents for my entire document set. The only problem with this
approach was that the mlt search component does not give back the
interesting terms for my tag cloud calculation.

That's why I tried to mix the flexibility of the mlt component (multiple
docs accepted as input) with the robustness of the MoreLikeThisHandler
(having interesting terms).

If there is no solution, I will use the mlt component and solve the tag
cloud calculation another way. By the way, if I am not mistaken, the 5.3.1
version takes the union of the feature sets of the mlt component and handler.

Best Regards,
Roland



2015-09-29 14:38 GMT+02:00 Upayavira :

> Let's take a step back. So, you have 3000 or so docs, and you want to
> know which documents are similar to these.
>
> Why do you want to know this? What feature do you need to build that
> will use that information? Knowing this may help us to arrive at the
> right technology for you.
>
> For example, you might want to investigate offline clustering algorithms
> (e.g. [1], which might be a bit dense to follow). A good book on machine
> learning, if you are okay with Python, is "Programming Collective
> Intelligence", as it explains the usual algorithms with simple for loops,
> making them very clear.
>
> Or, you could do searches, and then cluster the results at search time
> (so if you search for 100 docs, it will identify clusters within those
> 100 matching documents). That might get you there. See [2].
>
> So, if you let us know what the end-goal is, perhaps we can suggest an
> alternative approach, rather than burying ourselves neck-deep in MLT
> problems.
>
> Upayavira
>
> [1]
>
> http://mylazycoding.blogspot.co.uk/2012/03/cluster-apache-solr-data-using-apache_13.html
> [2] https://cwiki.apache.org/confluence/display/solr/Result+Clustering
>
> On Tue, Sep 29, 2015, at 12:42 PM, Szűcs Roland wrote:
> > Hello Upayavira,
> >
> > Thanks for dealing with my issue. I have already applied termVectors=true
> > to all fields involved in the more like this calculation. I have just
> > 3,000 documents, each of them represented by a relatively big term vector
> > with more than 20,000 unique terms. If I run the more like this handler
> > for a solr doc it takes close to 1 sec to get back the first 10 similar
> > documents. After this I have to pass the doc ids to my other application,
> > which finds the cover of the e-book and other metadata and puts it on the
> > web. The end-to-end process takes too much time from the customer's
> > perspective; that is why I tried to find a solution for offline more like
> > this calculation. But if my app has to call the MoreLikeThisHandler for
> > each doc, it puts overhead on the offline calculation.
> >
> > Best Regards,
> > Roland
> >
> > 2015-09-29 13:01 GMT+02:00 Upayavira :
> >
> > > If MoreLikeThis is slow for large documents that are indexed, have you
> > > enabled term vectors on the similarity fields?
> > >
> > > Basically, what more like this does is this:
> > >
> > > * decide on what terms in the source doc are "interesting", and pick
> the
> > > 25 most interesting ones
> > > * build and execute a boolean query using these interesting terms.
> > >
> > > Looking at the first phase of this in more detail:
> > >
> > > If you pass in a document using stream.body, it will analyse this
> > > document into terms, and then calculate the most interesting terms from
> > > that.
> > >
> > > If you reference a document in your index with a field that is stored, it
> > > will take the stored version, and analyse it and identify the
> > > interesting terms from there.
> > >
> > > If, however, you have stored term vectors against that field, this work
> > > is not needed. You have already done much of the work, and the
> > > identification of your "interesting terms" will be much faster.
> > >
> > > Thus, on the content field of your documents, add termVectors="true" in
> > > your schema, and re-index. Then you could well find MLT becoming a lot
> > > more efficient.
> > >
> > > Upayavira
> > >
> > > On Tue, Sep 29, 2015, at 10:39 AM, Szűcs Roland wrote:
> > > > Hi Alessandro,
> > > >
> > > > 

Re: Re-label terms from a shard?

2015-09-29 Thread Upayavira


On Tue, Sep 29, 2015, at 03:38 PM, Dan Bolser wrote:
> Hi,
> 
> I'm using sharding 'off label' to integrate data from various remote
> sites
> running a common schema.
> 
> One issue is that the remote sites sometimes use synonyms of the allowed
> terms in a given field. i.e. we specify that a certain field may only
> carry
> the values x, y, and z, but the remote indexes decide to use X, Y, and Z
> instead.
> 
> In my 'hub' (the server configured to query over all shards), can I
> configure a mapping such that the facet only shows x, y and z, instead of
> x, X, y, Y, z, and Z?
> 
> I'm not sure how a facet selection would 'magically' filter on the list
> of
> all synonyms defined in the mapping.
> 
> I should have defined this field as an enumeration, but I think the cat's
> out of the bag now!

I'm not sure there's anything you can do here (without a substantial
programming effort) other than add a layer in front of Solr that adds
x+X, y+Y and z+Z.
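
A sketch of what that layer might do for facets: merge the counts for
synonym labels client-side after the distributed query returns (the field
name and the lower-casing rule are illustrative):

import java.util.HashMap;
import java.util.Map;
import org.apache.solr.client.solrj.response.FacetField;
import org.apache.solr.client.solrj.response.QueryResponse;

public class FacetMerger {
    // Collapse X/x, Y/y, Z/z facet buckets into canonical labels.
    static Map<String, Long> merge(QueryResponse rsp) {
        Map<String, Long> merged = new HashMap<>();
        for (FacetField.Count c : rsp.getFacetField("my_field").getValues()) {
            String label = c.getName().toLowerCase(); // synonym -> canonical
            merged.merge(label, c.getCount(), Long::sum);
        }
        return merged;
    }
}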

As such, Solr doesn't have an enumeration data type - you'd have to just
use a string field and enforce it outside of Solr.

Upayavira


Re: Passing Basic Auth info to HttpSolrClient

2015-09-29 Thread Steven White
Hi,

Re-posting to see if anyone can help.  If my question is not clear, let me
know.

Thanks!

Steve

On Mon, Sep 28, 2015 at 5:15 PM, Steven White  wrote:

> Hi,
>
> I'm using HttpSolrClient to connect to Solr.  Everything works until
> I enable basic authentication in Jetty.  My question is, how do I pass the
> basic auth info to SolrJ so that I don't get a 401 error?
>
> Thanks in advance
>
> Steve
>
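
A sketch of one approach with the 5.x-era SolrJ: build an HttpClient that
carries the credentials and hand it to HttpSolrClient; authentication then
happens on Jetty's 401 challenge. Host, core, and credentials below are
placeholders:

import org.apache.http.auth.AuthScope;
import org.apache.http.auth.UsernamePasswordCredentials;
import org.apache.http.impl.client.BasicCredentialsProvider;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class BasicAuthClient {
    public static void main(String[] args) {
        BasicCredentialsProvider creds = new BasicCredentialsProvider();
        creds.setCredentials(AuthScope.ANY,
                new UsernamePasswordCredentials("solr", "secret"));
        CloseableHttpClient http = HttpClients.custom()
                .setDefaultCredentialsProvider(creds).build();
        HttpSolrClient client =
                new HttpSolrClient("http://localhost:8983/solr/core1", http);
        // use client.query(...) / client.add(...) as usual
    }
}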


SolrCloud and HTTP caching using eTag/if-none-match

2015-09-29 Thread Arcadius Ahouansou
Hello.

- Would you be kind enough to share your experience using SolrCloud with
HTTP Caching to return 304 status as described in the wiki
https://cwiki.apache.org/confluence/display/solr/RequestDispatcher+in+SolrConfig#RequestDispatcherinSolrConfig-httpCachingElement
?

- Looking at the SolrJ API, it seems there is no obvious way to set the
if-none-match header before submitting the search to SolrCloud.

Thank you very much.

Arcadius.


Re: How can I get a monotonically increasing field value for docs?

2015-09-29 Thread Gili Nachum
Hoss,

Good point, didn't know about cursor mark when we designed this a year ago
:(

Small potato: I assume cursor mark breaks when the number of shards changes,
while keeping the original values doesn't, since the relative position is
encoded per shard... But that's an edge case.

Looking forward to http://yonik.com/solr-cross-data-center-replication/

On Tue, Sep 29, 2015 at 10:20 PM, Chris Hostetter 
wrote:

>
>
> You're basically re-implementing Solr's cursors.
>
> you can change your system of reading docs from the old collection to
> use...
>
> cursorMark=*&sort=timestamp+asc,id+asc
>
> ...and then instead of keeping track of the last timestamp & id values and
> constructing a filter, you can just keep track of the nextCursorMark and
> pass it the next time you want to check for newer documents...
>
> https://cwiki.apache.org/confluence/display/solr/Pagination+of+Results
>
>
>
>
>
> : Date: Mon, 21 Sep 2015 21:32:33 +0300
> : From: Gili Nachum 
> : Reply-To: solr-user@lucene.apache.org
> : To: solr-user@lucene.apache.org
> : Subject: Re: How can I get a monotonically increasing field value for
> docs?
> :
> : Thanks for the in-depth explanation!
> :
> : The secondary sort by uuid would allow me to read a series of docs with
> : identical time over multiple batches by specifying filtering
> : time>timeOnLastReadDoc or (time=timeOnLastReadDoc and
> : uuid>uuidOnLastReadDoc) which essentially creates a unique sorted value to
> : track progress over.
> : On Sep 21, 2015 19:56, "Shawn Heisey"  wrote:
> :
> : > On 9/21/2015 9:01 AM, Gili Nachum wrote:
> : > > TimestampUpdateProcessorFactory takes place only on the leader
> shard, or
> : > on
> : > > each shard replica?
> : > > if on each replica then I would get different values on each replica.
> : > >
> : > > My alternative would be to perform secondary sort on a UUID to ensure
> : > order.
> : >
> : > If the update chain is configured properly, it runs on the leader, so
> : > all replicas get the same timestamp.
> : >
> : > Without SolrCloud, the way to create an "indexed at" time field is in
> : > the schema -- specify a default value of NOW on the field definition
> and
> : > don't send the field when indexing.  The old master/slave replication
> : > copies the actual index contents, so the indexed values in all replicas
> : > are the same.
> : >
> : > The problem with NOW in the schema when running SolrCloud is that each
> : > replica indexes the document independently, so each replica can have a
> : > different timestamp.  This is why the timestamp update processor exists
> : > -- to set the timestamp to a specific value before the document is
> : > duplicated to each replica, eliminating the problem.
> : >
> : > FYI, secondary sort parameters affect the order when the primary sort
> : > field is identical between two documents.  It may not do what you are
> : > intending because of that.
> : >
> : > Thanks,
> : > Shawn
> : >
> : >
> :
>
> -Hoss
> http://www.lucidworks.com/
>
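
For reference, the cursor loop Hoss describes, sketched in SolrJ (the
collection URL and the timestamp field name are assumptions):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.params.CursorMarkParams;

public class CursorReader {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr =
                new HttpSolrServer("http://localhost:8983/solr/collection1");
        SolrQuery q = new SolrQuery("*:*");
        q.setRows(100);
        q.setSort(SolrQuery.SortClause.asc("timestamp"));
        q.addSort(SolrQuery.SortClause.asc("id")); // uniqueKey tie-break, required
        String cursor = CursorMarkParams.CURSOR_MARK_START; // "*"
        while (true) {
            q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursor);
            QueryResponse rsp = solr.query(q);
            // ... process rsp.getResults() ...
            String next = rsp.getNextCursorMark();
            if (cursor.equals(next)) break; // caught up; save the mark, resume later
            cursor = next;
        }
    }
}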


Re: Using dynamically calculated value for sorting

2015-09-29 Thread Chris Hostetter

: sorting. We are planning to introduce discounts based on login credentials
: and we have to dynamically calculate price (using base price in SOLR feed)
: based on a specific discount returned by an API. Now after the discount is
: calculated we want to sort based on the new price (discounted price).
: 
:  What is the best way to do that? Any ideas would be appreciated.

It's hard to provide a completely generic answer to your question w/o more 
details in terms of how you define "discount".

if the discount is a fixed amount, or fixed percentage, for all documents 
returned from the index for that user, then you can use simple functions 
to do something like "sort=max(0,sub(price,flat_discount)) asc, popularity 
desc"

https://cwiki.apache.org/confluence/display/solr/Function+Queries
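
For the flat-discount case, a hedged SolrJ snippet showing the per-user
constant passed as a request parameter and dereferenced inside the sort
function (the parameter name is illustrative):

import org.apache.solr.client.solrj.SolrQuery;

public class DiscountSort {
    public static void main(String[] args) {
        SolrQuery q = new SolrQuery("*:*");
        q.set("userDiscount", "5"); // looked up from the user's login credentials
        q.set("sort", "max(0,sub(price,$userDiscount)) asc, popularity desc");
    }
}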

If the discount depends on other factors, then you have to explain what 
those other factors are -- if they can be represented as a numerical 
calculation based entirely on a single constant (or a small number of constants)
you pass in to the query based on the userid, plus some numerical values in 
every document, then you can probably expand on that type of solution even 
more.

But w/o knowing the details, it's hard to give you any further guidance.




-Hoss
http://www.lucidworks.com/


Re: CloudSolrClient timeout settings

2015-09-29 Thread Arcadius Ahouansou
Thank you very much Shawn.

Arcadius.

On 29 September 2015 at 01:41, Shawn Heisey  wrote:

> On 9/28/2015 4:04 PM, Arcadius Ahouansou wrote:
> > CloudSolrClient has zkClientTimeout/zkConnectTimeout  for access to
> > zookeeper.
> >
> > It would be handy to also have the possibility to set something like
> >  soTimeout/connectTimeout for accessing the solr nodes similarly to the
> old
> > non-cloud client.
> >
> > Currently, in order to set a timeout for the client to connect to solr,
> one
> > has to create a custom HttpClient.
> > Same goes for maxConnection.
>
> Currently SolrJ is using HttpClient classes and methods that are
> deprecated as of HC 4.3.  SolrJ in the latest version is using HC 4.5,
> which still has all that deprecated code.
>
> In order to remove HC deprecations, SolrJ must move to immutable
> HttpClient objects, so the current methods on HttpSolrClient that modify
> the HttpClient object, like setConnectionTimeout, will no longer work.
>
> All of these settings (and more) can be configured on the *requests*
> made through HttpClient, and this is the way that the HttpComponents
> project recommends writing code using their library, so what we really
> need to have is simply numbers for these settings stored in the class
> implementing SolrClient (like HttpSolrClient, CloudSolrClient, etc),
> which get passed down to the internal code that makes the request.  We
> have an issue for removing HC deprecations, but because it's a massive
> undertaking requiring a fair amount of experience with new
> HttpComponents classes/methods, nobody has attempted to do it:
>
> https://issues.apache.org/jira/browse/SOLR-5604
>
> Thanks,
> Shawn
>
>


-- 
Arcadius Ahouansou
Menelic Ltd | Information is Power
M: 07908761999
W: www.menelic.com
---


Solr 4.8 - Updating zkhost list in solr.xml without requiring a restart

2015-09-29 Thread pramodEbay
Hi,

Is there an example I could use to upload solr.xml to zookeeper and
change zkhost entries on the fly and have solr instances be updated via
zookeeper? This would prevent us from restarting each solr node every time a
new zookeeper host is added or deleted.

We are on Solr 4.8.

Thanks,
Pramod





Re: solrcloud in an inconsistent state

2015-09-29 Thread Arcadius Ahouansou
Hello Renning.

Sounds like
https://issues.apache.org/jira/browse/SOLR-6246

A workaround that may not be very appealing is to create a new collection
and to use aliases to point to it in your code/calls.


Thanks.


On 30 September 2015 at 01:44, r b  wrote:

> lately, my workflow has been 1) make some config changes, 2) upload to
> zookeeper, 3) use collections API to reload config for the collection.
> this has been working pretty well.
>
> starting last week, i started using the AnalyzingInfixLookupFactory in
> a SuggestComponent (up until then, it was just the
> FuzzyLookupFactory). the Infix lookup requires an indexPath where it
> keeps an index on disk.
>
> first couple times i used this and went through my cycle, it was no
> problem. but then i started getting some strange errors:
>
> LockObtainFailedException: Lock obtain timed out:
> NativeFSLock@
> /opt/solr-5.2.1/server/solr/myCollection_shard4_replica6/data/myInfixLookup/write.lock
>
> now when i go and try to update the config and reload, i do not get a
> response back and the connection drops after a minute.
>
> when i run other collections API commands, i notice them queueing up
> in the overseer collection work queue. after a seemingly long while,
> they disappear. i assumed it was just that some solrcloud nodes were
> just taking a while, but when playing with the suggest component's
> handler, i notice that not all of the nodes get the new config
> changes.
>
> has anyone else seen this before? maybe there is something wrong with
> my workflow that caused this?
>
> -renning
>



-- 
Arcadius Ahouansou
Menelic Ltd | Information is Power
M: 07908761999
W: www.menelic.com
---


Re: Can StandardTokenizerFactory work well for Chinese and English (Bilingual)?

2015-09-29 Thread Zheng Lin Edwin Yeo
Hi Charlie,

I've checked that Paoding's code is written for Solr 3 and Solr 4 versions.
It is not written for Solr 5, thus I was unable to use it in my Solr 5.x
version.

Have you tried to use HMMChineseTokenizer and JiebaTokenizer as well?

Regards,
Edwin


On 25 September 2015 at 18:46, Charlie Hull  wrote:

> On 25/09/2015 11:43, Zheng Lin Edwin Yeo wrote:
>
>> Hi Charlie,
>>
>> Thanks for your comment. I faced the compatibility issues with Paoding
>> when
>> I tried it in Solr 5.1.0 and Solr 5.2.1, and I found out that the code was
>> optimised for Solr 3.6.
>>
>> Which version of Solr are you using when you tried on the Paoding?
>>
>
> Solr v4.6 I believe.
>
> Charlie
>
>
>> Regards,
>> Edwin
>>
>>
>> On 25 September 2015 at 16:43, Charlie Hull  wrote:
>>
>> On 23/09/2015 16:23, Alexandre Rafalovitch wrote:
>>>
>>> You may find the following articles interesting:


 http://discovery-grindstone.blogspot.ca/2014/01/searching-in-solr-analyzing-results-and.html
 ( a whole epic journey)
 https://dzone.com/articles/indexing-chinese-solr


>>> The latter article is great and we drew on it when helping a recent
>>> client
>>> with Chinese indexing. However, if you do use Paoding bear in mind that
>>> it
>>> has few if any tests and all the comments are in Chinese. We found a
>>> problem with it recently (it breaks the Lucene highlighters) and have
>>> submitted a patch:
>>> http://git.oschina.net/zhzhenqin/paoding-analysis/issues/1
>>>
>>> Cheers
>>>
>>> Charlie
>>>
>>>
>>> Regards,
  Alex.
 
 Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
 http://www.solr-start.com/


 On 23 September 2015 at 10:41, Zheng Lin Edwin Yeo <
 edwinye...@gmail.com>
 wrote:

 Hi,
>
> Would like to check: will StandardTokenizerFactory work well for
> indexing both English and Chinese (bilingual) documents, or do we need
> tokenizers that are customised for Chinese (e.g. HMMChineseTokenizerFactory)?
>
>
> Regards,
> Edwin
>
>

>>> --
>>> Charlie Hull
>>> Flax - Open Source Enterprise Search
>>>
>>> tel/fax: +44 (0)8700 118334
>>> mobile:  +44 (0)7767 825828
>>> web: www.flax.co.uk
>>>
>>>
>>
>
> --
> Charlie Hull
> Flax - Open Source Enterprise Search
>
> tel/fax: +44 (0)8700 118334
> mobile:  +44 (0)7767 825828
> web: www.flax.co.uk
>


Re: Solr 4.8 - Updating zkhost list in solr.xml without requiring a restart

2015-09-29 Thread Shawn Heisey
On 9/29/2015 5:59 PM, pramodEbay wrote:
> Is there an example I could use to upload solr.xml to zookeeper and
> change zkhost entries on the fly and have solr instances be updated via
> zookeeper? This would prevent us from restarting each solr node every time a
> new zookeeper host is added or deleted.
> 
> We are on Solr 4.8.

Support in zookeeper for dynamically changing the cluster membership has
been added to the 3.5 version, which is currently only available as an
alpha release.

https://issues.apache.org/jira/browse/ZOOKEEPER-107

This feature has been under development for a REALLY long time.  The
comments are a discussion that is very technical in nature and difficult
to follow; it looks like it took a very long time to come up with a
usable design.

I don't know anything about how they have implemented the dynamic
cluster support, so I do not know whether Solr requires code changes to
use it.  A quick scan of the first few comments suggests that they are
trying to make this server-side, with all clients updating
automatically, so Solr might not need any code changes.  Let's hope that
this is the case.

Before we even think about upgrading the zookeeper functionality in
Solr, we must wait for the official 3.5 release from the zookeeper
project.  Alpha (or Beta) software will not be included in Solr unless
it is the only way to fix a very serious bug.  This is a new feature,
not a bug.

Thanks,
Shawn



solrcloud in an inconsistent state

2015-09-29 Thread r b
lately, my workflow has been 1) make some config changes, 2) upload to
zookeeper, 3) use collections API to reload config for the collection.
this has been working pretty well.

starting last week, i started using the AnalyzingInfixLookupFactory in
a SuggestComponent (up until then, it was just the
FuzzyLookupFactory). the Infix lookup requires an indexPath where it
keeps an index on disk.

first couple times i used this and went through my cycle, it was no
problem. but then i started getting some strange errors:

LockObtainFailedException: Lock obtain timed out:
NativeFSLock@/opt/solr-5.2.1/server/solr/myCollection_shard4_replica6/data/myInfixLookup/write.lock

now when i go and try to update the config and reload, i do not get a
response back and the connection drops after a minute.

when i run other collections API commands, i notice them queueing up
in the overseer collection work queue. after a seemingly long while,
they disappear. i assumed it was just that some solrcloud nodes were
just taking a while, but when playing with the suggest component's
handler, i notice that not all of the nodes get the new config
changes.

has anyone else seen this before? maybe there is something wrong with
my workflow that caused this?

-renning


Re: Solr 4.8 - Updating zkhost list in solr.xml without requiring a restart

2015-09-29 Thread pramodmm

> Before we even think about upgrading the zookeeper functionality in
> Solr, we must wait for the official 3.5 release from the zookeeper
> project.  Alpha (or Beta) software will not be included in Solr unless
> it is the only way to fix a very serious bug.  This is a new feature,
> not a bug.

In the meantime, please help me validate that what we are doing is right.
Currently, our zookeeper instances are running on vmware machines, and when
one of them dies and we get a new machine as a replacement, we install
zookeeper and make it a part of the ensemble. Then we manually go to every
individual solr instance in the solr cloud, edit its solr.xml, remove the
entry of the dead machine from zkhost, and replace it with the new hostname,
thus keeping the list up-to-date. Then we restart the solr box.

Are these the right steps?

Thanks,
Pramod





Re: firstSearcher cache warming with own QuerySenderListener

2015-09-29 Thread Chris Hostetter

You haven't really provided us enough info to make any meaningful 
suggestions.

You've got at least 2 custom plugins -- but you don't give us any idea 
what the implementations of those plugins look like, or how you've 
configured them.  Maybe there is a bug in your code?  maybe it's 
misconfigured?

You said that initial queries seem a little faster when you use your 
custom plugin(s), but not as fast as if you manually warm those queries from 
a browser first -- what do the queries look like? how fast is fast? ... 

w/o specifics it's impossible to guess where the added time (or added time 
savings when using the browser to warm them) may be coming from ... and 
again: maybe the issue is that the code in your custom plugin is only 
partially right? maybe it's giving you a slight bit of warming just by 
executing a query to get some index data structures into ram, but it's 
actually executing the wrong query?

Show us the details of a single query, and tell us how *exactly* the 
timing compares between: no warming; warming just that query with your 
custom plugin; warming just that query from your browser?

show us the *logs* from solr in all of those cases as well so we can see 
what is actually getting executed under the hood.


As far as caching goes: all of the cache statistics are easily available 
from the plugin UI / handler...

https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=32604180
https://cwiki.apache.org/confluence/display/solr/MBean+Request+Handler

what do you see in terms of insertions/hits/misses on all of the caches in 
each of the above scenarios?



: Date: Fri, 25 Sep 2015 17:31:30 +0200
: From: Christian Reuschling 
: Reply-To: solr-user@lucene.apache.org
: To: "solr-user@lucene.apache.org" 
: Subject: firstSearcher cache warming with own QuerySenderListener
: 
: Hey all,
: 
: we want to avoid cold start performance issues when the caches are cleared 
: after a server restart.
: 
: For this, we have written a SearchComponent that saves least recently used 
: queries. These are written to a file inside a closeHook of a SolrCoreAware 
: at server shutdown.
: 
: The plan is to perform these queries at server startup to warm up the 
: caches. For this, we have written a derivative of the QuerySenderListener 
: and configured it as firstSearcher listener in solrconfig.xml. The only 
: difference to the original QuerySenderListener is that it gets its queries 
: from the formerly dumped LRU queries rather than getting them from the 
: config file.
: 
: It seems that everything is called correctly, and we have the impression 
: that the query response times for the dumped queries are sometimes slightly 
: better than without this warming.
: 
: Nevertheless, there is still a huge difference against the times when we 
: manually perform the same queries once, e.g. from a browser. If we do this, 
: the second time we perform these queries they respond much faster (up to 10 
: times) than the response times after the implemented warming.
: 
: It seems that not all caches are warmed up during our warming. And because 
: of these huge differences, I doubt we missed something.
: 
: The index has about 25M documents, and is split into two shards in a cloud 
: configuration; both shards are on the same server instance for now, for 
: testing purposes.
: 
: Does anybody have an idea? I tried to disable lazy field loading as a 
: potential issue, but with no success.
: 
: 
: Cheers,
: 
: Christian
: 
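
For comparison, a stock firstSearcher listener -- the hook the custom class 
replaces -- is wired up in solrconfig.xml roughly like this (the queries 
themselves are placeholders):

  <listener event="firstSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
      <lst>
        <str name="q">some frequent query</str>
        <str name="sort">price asc</str>
      </lst>
    </arr>
  </listener>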

-Hoss
http://www.lucidworks.com/


Re: Cost of having multiple search handlers?

2015-09-29 Thread Jeff Wartes

At the risk of going increasingly off-thread, yes, please do.
I’ve been using this:
https://dropwizard.github.io/metrics/3.1.0/manual/jetty/, which is
convenient, but doesn’t even have request-handler-level resolution.
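
For anyone wiring that up from scratch, it's roughly the following -- a 
sketch from memory against metrics 3.1, not a drop-in:

import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.jetty9.InstrumentedHandler;
import org.eclipse.jetty.server.Server;

public class MetricsWiring {
  // Wrap Jetty's existing handler chain so dropwizard-metrics times every
  // request; note this is server-wide, with no per-request-handler breakdown.
  static void instrument(Server server) {
    MetricRegistry registry = new MetricRegistry();
    InstrumentedHandler instrumented = new InstrumentedHandler(registry, "solr");
    instrumented.setHandler(server.getHandler());
    server.setHandler(instrumented);
  }
}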

Something I’ve started doing for issues that don’t seem likely to get
pulled in but also don’t really need changes in solr/lucene source code is
to publish a free-standing project (with tests) that builds the necessary
jar. For example, https://github.com/whitepages/SOLR-4449.
Seems like a decent middle ground where people can easily use or
contribute changes, and then if it gets popular enough, that’s a strong
signal it should be in the solr distribution itself.



On 9/28/15, 6:05 PM, "Walter Underwood"  wrote:

>We built our own because there was no movement on that. Don’t hold your
>breath.
>
>Glad to contribute it. We’ve been running it in production for a year,
>but the config is pretty manual.
>
>wunder
>Walter Underwood
>wun...@wunderwood.org
>http://observer.wunderwood.org/  (my blog)
>
>
>> On Sep 28, 2015, at 4:41 PM, Jeff Wartes  wrote:
>> 
>> 
>> One would hope that https://issues.apache.org/jira/browse/SOLR-4735 will
>> be done by then.
>> 
>> 
>> On 9/28/15, 11:39 AM, "Walter Underwood"  wrote:
>> 
>>> We did the same thing, but reporting performance metrics to Graphite.
>>> 
>>> But we won’t be able to add servlet filters in 6.x, because it won’t be
>>> a webapp.
>>> 
>>> wunder
>>> Walter Underwood
>>> wun...@wunderwood.org
>>> http://observer.wunderwood.org/  (my blog)
>>> 
>>> 
 On Sep 28, 2015, at 11:32 AM, Gili Nachum  wrote:
 
 A different solution to the same need: I'm measuring response times of
 different collections, measuring online/batch queries apart, using New
 Relic. I've added a servlet filter that analyses the request and makes
 this info available to New Relic over a request argument.
 
 The built-in New Relic Solr plug-in doesn't provide much.
 On Sep 28, 2015 17:16, "Shawn Heisey"  wrote:
 
> On 9/28/2015 6:30 AM, Oliver Schrenk wrote:
>> I want to register multiple but identical search handlers to have
>> multiple buckets to measure performance for our different apis and
>> consumers (and to find out who is actually using Solr).
>> 
>> Are there some costs associated with having multiple search
>> handlers? Are they negligible?
> 
> Unless you are creating hundreds or thousands of them, I doubt you'll
> notice any significant increase in resource usage from additional
> handlers.  Each handler definition creates an additional URL endpoint
> within the servlet container, additional object creation within Solr,
> and perhaps an additional thread pool and threads to go with it, so it's
> not free, but I doubt that it's significant.  The resources required for
> actually handling a request are likely to dwarf what's required for more
> handlers.
> 
> Disclaimer: I have not delved into the code to figure out exactly what
> gets created with a search handler config, so I don't know exactly what
> happens.  I'm basing this on general knowledge about how Java programs
> are constructed by expert developers, not specifics about Solr.
> 
> There are others on the list who have a much better idea than I do, so
> if I'm wrong, I'm sure one of them will let me know.
> 
> Thanks,
> Shawn
> 
> 
>>> 
>> 
>
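
For Oliver's original question, the extra buckets are just duplicate handler 
entries in solrconfig.xml -- hypothetical names below; each endpoint then 
keeps its own requests/errors/avgTimePerRequest stats in the mbeans:

  <requestHandler name="/select-mobile" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="df">text</str>
    </lst>
  </requestHandler>
  <requestHandler name="/select-webapp" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="df">text</str>
    </lst>
  </requestHandler>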



Re: Using dynamically calculated value for sorting

2015-09-29 Thread Leonardo Foderaro
Hi,
please take a look at Alba, a framework which simplifies the development of
new Solr plugins. You can write a plugin (e.g. a custom function to boost or
sort your docs, or a custom Response Writer) in literally five lines of
code.

More specifically I think these two examples could be useful.

This is the basic example, which shows you how to write a function to
calculate the length of a text field and how to use that information for
boosting/sorting/filtering:
https://github.com/leonardofoderaro/alba/wiki/Your-first-Function-Query:-the-title-length

Here is a slightly more advanced example: how to apply a discount to books
older than X years:
https://github.com/leonardofoderaro/alba/wiki/discounting-old-books
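
If you'd rather not deploy a plugin at all, note that plain function-query 
sorting may already cover the single-multiplier case -- pass the discount 
your API computed as a request parameter (hypothetical field and value):

  sort=product(price,$discount) asc&discount=0.85

That only works when one multiplier applies to the whole result set, though; 
per-document discounts would still need a function plugin like the above.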

Disclaimer 1: it has never been used in production yet ;-)

Disclaimer 2: I'm the author of the Alba Framework

Should you have any questions / issues / hints / PRs on how to improve it,
please let me know.

best regards,
Leonardo

On Tue, Sep 29, 2015 at 8:36 PM, bbarani  wrote:

> Hi,
>
> We have a price field in our SOLR XML feed that we currently use for
> sorting. We are planning to introduce discounts based on login credentials
> and we have to dynamically calculate price (using base price in SOLR feed)
> based on a specific discount returned by an API. Now after the discount is
> calculated we want to sort based on the new price (discounted price).
>
>  What is the best way to do that? Any ideas would be appreciated.
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Using-dynamically-calculated-value-for-sorting-tp4231950.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: How to preserve 0 after decimal point?

2015-09-29 Thread bbarani
Thanks for your response.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-to-preserve-0-after-decimal-point-tp4159295p4231961.html
Sent from the Solr - User mailing list archive at Nabble.com.


Using dynamically calculated value for sorting

2015-09-29 Thread bbarani
Hi,

We have a price field in our SOLR XML feed that we currently use for
sorting. We are planning to introduce discounts based on login credentials
and we have to dynamically calculate price (using base price in SOLR feed)
based on a specific discount returned by an API. Now after the discount is
calculated we want to sort based on the new price (discounted price).

 What is the best way to do that? Any ideas would be appreciated.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Using-dynamically-calculated-value-for-sorting-tp4231950.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: How can I get a monotonically increasing field value for docs?

2015-09-29 Thread Chris Hostetter


You're basically re-implementing Solr's cursors.

you can change your system of reading docs from the old collection to 
use...

cursorMark=*&sort=timestamp+asc,id+asc

...and then instead of keeping track of the last timestamp & id values and 
constructing a filter, you can just keep track of the nextCursorMark and 
pass it the next time you want to check for newer documents...

https://cwiki.apache.org/confluence/display/solr/Pagination+of+Results
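
A minimal SolrJ sketch of that loop (collection name, field names, and page 
size are assumptions; error handling omitted):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrQuery.SortClause;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.params.CursorMarkParams;

public class CursorTail {
  static void tailOldCollection() throws Exception {
    HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/oldCollection");
    SolrQuery q = new SolrQuery("*:*");
    q.setRows(500);
    q.setSort(SortClause.asc("timestamp"));
    q.addSort(SortClause.asc("id"));          // tie-breaker for identical timestamps
    String cursor = CursorMarkParams.CURSOR_MARK_START;  // "*"
    while (true) {
      q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursor);
      QueryResponse rsp = client.query(q);
      // ... hand rsp.getResults() to the indexing side ...
      String next = rsp.getNextCursorMark();
      if (cursor.equals(next)) break;  // caught up; resume later with the same mark
      cursor = next;
    }
    client.close();
  }
}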





: Date: Mon, 21 Sep 2015 21:32:33 +0300
: From: Gili Nachum 
: Reply-To: solr-user@lucene.apache.org
: To: solr-user@lucene.apache.org
: Subject: Re: How can I get a monotonically increasing field value for docs?
: 
: Thanks for the indepth explanation!
: 
: The secondary sort by uuid would allow me to read a series of docs with
: identical time over multiple batches by specifying filtering
: time>timeOnLastReadDoc or (time=timeOnLastReadDoc and
: uuid>uuidOnLastReadDoc) which essentially creates a unique sorted value to
: track progress over.
: On Sep 21, 2015 19:56, "Shawn Heisey"  wrote:
: 
: > On 9/21/2015 9:01 AM, Gili Nachum wrote:
: > > TimestampUpdateProcessorFactory takes place only on the leader shard,
: > > or on each shard replica?
: > > if on each replica then I would get different values on each replica.
: > >
: > > My alternative would be to perform secondary sort on a UUID to ensure
: > > order.
: >
: > If the update chain is configured properly, it runs on the leader, so
: > all replicas get the same timestamp.
: >
: > Without SolrCloud, the way to create an "indexed at" time field is in
: > the schema -- specify a default value of NOW on the field definition and
: > don't send the field when indexing.  The old master/slave replication
: > copies the actual index contents, so the indexed values in all replicas
: > are the same.
: >
: > The problem with NOW in the schema when running SolrCloud is that each
: > replica indexes the document independently, so each replica can have a
: > different timestamp.  This is why the timestamp update processor exists
: > -- to set the timestamp to a specific value before the document is
: > duplicated to each replica, eliminating the problem.
: >
: > FYI, secondary sort parameters affect the order when the primary sort
: > field is identical between two documents.  It may not do what you are
: > intending because of that.
: >
: > Thanks,
: > Shawn
: >
: >
: 
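
For reference, one way to configure the chain so the timestamp is stamped 
once, before the document fans out to the replicas -- field and chain names 
are placeholders:

  <updateRequestProcessorChain name="add-timestamp" default="true">
    <processor class="solr.TimestampUpdateProcessorFactory">
      <str name="fieldName">timestamp</str>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory"/>
    <processor class="solr.DistributedUpdateProcessorFactory"/>
    <processor class="solr.RunUpdateProcessorFactory"/>
  </updateRequestProcessorChain>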

-Hoss
http://www.lucidworks.com/