Re: optimize boosting parameters

2020-12-08 Thread Derek Poh
We monitor the response time (Pingdom) of the page that uses these 
boosting parameters. Since the addition of these boosting parameters and 
of an additional field to search on (which I will create a separate thread 
about on the mailing list), the page's average response time has increased 
by 1-2 seconds.

Management has given feedback on this.


If it does turn out to be the boosting (and IIRC the
map function can be expensive), can you pre-compute some
number of the boosts? Your requirements look
like they can be computed at index time, then boost
by just the value of the pre-computed field.
I have gone through the list of functions and the map function is the only 
one that can meet the requirements.

Or is there a less expensive function that I missed?

By pre-compute some number, do you mean: before the indexing, at the 
preparation stage, check the value of P_SupplierResponseRate, and if the 
value = 3, specify 'boost="0.4"' for the field of the document?
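For example, compute one combined boost per document at preparation time, 
index it as a numeric docValues field, and replace all six bf parameters 
with a single one? (Untested sketch; P_CombinedBoost is a made-up field 
name. A document with response rate 3, response time 4, MWS score 85 and 
ranking 5 would get 0.4 + 0.4 + 1.6 + 0.9 = 3.3.)

bf=field(P_CombinedBoost)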



BTW, boosts < 1.0
_reduce_ the score. I mention that just in case that’s a surprise ;)
Oh it is to reduce the score?! Not increase (multiply or add) the score 
by less than 1?



You use termfreq, which changes of course, but
1> if your corpus is updated often enough, the termfreqs will be relatively 
stable. In that case you can pre-compute them too.
We do incremental indexing every half an hour on this collection. 
Average of 50K-100K documents during each indexing. Collection has 7+ 
million documents.

So the entire corpus does not get updated in every indexing.


2> your problem statement has nothing to do with termfreq so why are you
  using it in the first place?
I read up on the termfreq function again. It returns the number of times the 
term appears in the field for that document. It does not really fit the 
requirements. Thank you for pointing it out.

I should use map instead?
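For example, would a single nested map like this work for requirement 4 
(untested sketch, assuming map accepts a nested function as its default 
value)?

bf=map(P_SupplierRanking,3,3,0.3,map(P_SupplierRanking,4,4,0.6,map(P_SupplierRanking,5,5,0.9,map(P_SupplierRanking,6,6,1.2,0))))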

Derek

On 8/12/2020 9:48 pm, Erick Erickson wrote:

Before worrying about it too much, exactly _how_ much has
the performance changed?

I’ve just been in too many situations where there’s
no objective measure of performance before and after, just
someone saying “it seems slower” and had those performance
changes disappear when a rigorous test is done. Then spent
a lot of time figuring out that the person reporting the
problem hadn’t had coffee yet. Or the network was slow.
Or….

If it does turn out to be the boosting (and IIRC the
map function can be expensive), can you pre-compute some
number of the boosts? Your requirements look
like they can be computed at index time, then boost
by just the value of the pre-computed field. BTW, boosts < 1.0
_reduce_ the score. I mention that just in case that’s a surprise ;)
Of course that means that to change the boosting you need
to re-index.

You use termfreq, which changes of course, but
1> if your corpus is updated often enough, the termfreqs will be relatively 
stable. In that case you can pre-compute them too.


2> your problem statement has nothing to do with termfreq so why are you
  using it in the first place?

Best,
Erick


On Dec 8, 2020, at 12:46 AM, Radu Gheorghe  wrote:

Hi Derek,

Ah, then my reply was completely off :)

I don’t really see a better way. Maybe other than changing termfreq to field, 
if the numeric field has docValues? That may be faster, but I don’t know for 
sure.
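Something along these lines, maybe (untested, and assuming the eq 
comparison function that Solr 7 added):

bf=if(eq(P_SupplierRanking,3),0.3,if(eq(P_SupplierRanking,4),0.6,if(eq(P_SupplierRanking,5),0.9,if(eq(P_SupplierRanking,6),1.2,0))))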

Best regards,
Radu
--
Sematext Cloud - Full Stack Observability - https://sematext.com
Solr and Elasticsearch Consulting, Training and Production Support


On 8 Dec 2020, at 06:17, Derek Poh  wrote:

Hi Radu

Apologies for not making myself clear.

I would like to know if there is a simpler or more efficient way to craft the 
boosting parameters based on the requirements.

For example, I am using the 'if', 'map' and 'termfreq' functions in the bf 
parameters.

Is there a more efficient or simpler function that can be used instead? Or a 
way to craft the 'formula' more efficiently?

On 7/12/2020 10:05 pm, Radu Gheorghe wrote:

Hi Derek,

It’s hard to tell whether your boosts can be made better without knowing your 
data and what users expect of it. Which is a problem in itself.

I would suggest gathering judgements, like if a user queries for X, what doc 
IDs do you expect to get back?

Once you have enough of these judgements, you can experiment with boosts and 
see how the query results change. There are measures such as nDCG (
https://en.wikipedia.org/wiki/Discounted_cumulative_gain#Normalized_DCG
) that can help you measure that per query, and you can average this score 
across all your judgements to get an overall measure of how well you’re doing.

Or even better, you can have something like Quaerite play with boost values for 
you:

https://github.com/tballison/quaerite/blob/main/quaerite-examples/README.md#genetic-algorithms-ga-runga


Best regards,
Radu
--
Sematext Cloud - Full Stack Observability -
https://sematext.com

Solr and Elasticsearch Consulting, Training and Production Support



On 7 Dec 2020

Re: optimize boosting parameters

2020-12-07 Thread Derek Poh

Hi Radu

Apologies for not making myself clear.

I would like to know if there is a simpler or more efficient way to craft 
the boosting parameters based on the requirements.


For example, I am using the 'if', 'map' and 'termfreq' functions in the bf 
parameters.


Is there a more efficient or simpler function that can be used instead? Or 
a way to craft the 'formula' more efficiently?


On 7/12/2020 10:05 pm, Radu Gheorghe wrote:

Hi Derek,

It’s hard to tell whether your boosts can be made better without knowing your 
data and what users expect of it. Which is a problem in itself.

I would suggest gathering judgements, like if a user queries for X, what doc 
IDs do you expect to get back?

Once you have enough of these judgements, you can experiment with boosts and 
see how the query results change. There are measures such as nDCG 
(https://en.wikipedia.org/wiki/Discounted_cumulative_gain#Normalized_DCG) that 
can help you measure that per query, and you can average this score across all 
your judgements to get an overall measure of how well you’re doing.

Or even better, you can have something like Quaerite play with boost values for 
you:
https://github.com/tballison/quaerite/blob/main/quaerite-examples/README.md#genetic-algorithms-ga-runga

Best regards,
Radu
--
Sematext Cloud - Full Stack Observability - https://sematext.com
Solr and Elasticsearch Consulting, Training and Production Support


On 7 Dec 2020, at 10:51, Derek Poh  wrote:

Hi

I have added the following boosting requirements to the search query of a page. 
Feedback from the monitoring team is that the overall response time of the page 
has increased since then.
I am trying to find out if the added boosting parameters (below) could have 
contributed to the increase.

The boosting is working as per requirements.

May I know if the implemented boosting parameters can be enhanced or optimized 
further?
Hopefully that will improve the response time of the query and the page.

Requirements:
1. If P_SupplierResponseRate is:
a. 3, boost by 0.4
b. 2, boost by 0.2

2. If P_SupplierResponseTime is:
a. 4, boost by 0.4
b. 3, boost by 0.2

3. If P_MWSScore is:
a. between 80-100, boost by 1.6
b. between 60-79, boost by 0.8

4. If P_SupplierRanking is:
a. 3, boost by 0.3
b. 4, boost by 0.6
c. 5, boost by 0.9
d. 6, boost by 1.2

Boosting parameters implemented:
bf=map(P_SupplierResponseRate,3,3,0.4,0)
bf=map(P_SupplierResponseRate,2,2,0.2,0)

bf=map(P_SupplierResponseTime,4,4,0.4,0)
bf=map(P_SupplierResponseTime,3,3,0.2,0)

bf=map(P_MWSScore,80,100,1.6,0)
bf=map(P_MWSScore,60,79,0.8,0)

bf=if(termfreq(P_SupplierRanking,3),0.3,if(termfreq(P_SupplierRanking,4),0.6,if(termfreq(P_SupplierRanking,5),0.9,if(termfreq(P_SupplierRanking,6),1.2,0))))


I am using Solr 7.7.2


optimize boosting parameters

2020-12-07 Thread Derek Poh

Hi

I have added the following boosting requirements to the search query of 
a page. Feedback from the monitoring team is that the overall response time 
of the page has increased since then.
I am trying to find out if the added boosting parameters (below) could 
have contributed to the increase.


The boosting is working as per requirements.

May I know if the implemented boosting parameters can be enhanced or 
optimized further?

Hopefully that will improve the response time of the query and the page.

Requirements:
1. If P_SupplierResponseRate is:
   a. 3, boost by 0.4
   b. 2, boost by 0.2

2. If P_SupplierResponseTime is:
   a. 4, boost by 0.4
   b. 3, boost by 0.2

3. If P_MWSScore is:
   a. between 80-100, boost by 1.6
   b. between 60-79, boost by 0.8

4. If P_SupplierRanking is:
   a. 3, boost by 0.3
   b. 4, boost by 0.6
   c. 5, boost by 0.9
   d. 6, boost by 1.2

Boosting parameters implemented:
bf=map(P_SupplierResponseRate,3,3,0.4,0)
bf=map(P_SupplierResponseRate,2,2,0.2,0)

bf=map(P_SupplierResponseTime,4,4,0.4,0)
bf=map(P_SupplierResponseTime,3,3,0.2,0)

bf=map(P_MWSScore,80,100,1.6,0)
bf=map(P_MWSScore,60,79,0.8,0)

bf=if(termfreq(P_SupplierRanking,3),0.3,if(termfreq(P_SupplierRanking,4),0.6,if(termfreq(P_SupplierRanking,5),0.9,if(termfreq(P_SupplierRanking,6),1.2,0))))


I am using Solr 7.7.2




Re: advice on whether to use stopwords for use case

2020-10-01 Thread Derek Poh
Yes, the requirement (for now) is not to return any results. I think 
they may change the requirements, pending their return from the holidays.



If so, then check for those words in the query before sending it to Solr.

That is what I think too.

Thinking further, using stopwords for this, there will still be results 
returned when the search keywords contain more words than just the 
stopwords.


On 1/10/2020 2:57 am, Walter Underwood wrote:

I’m not clear on the requirements. It sounds like the query “cigar” or “cuban 
cigar”
should return zero results. Is that right?

If so, then check for those words in the query before sending it to Solr.

But the stopwords approach seems like the requirement is different. Could you 
give
some examples?

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


On Sep 30, 2020, at 11:53 AM, Alexandre Rafalovitch  wrote:

You may also want to look at something like: https://docs.querqy.org/index.html

ApacheCon had (is having..) a presentation on it that seemed quite
relevant to your needs. The videos should be live in a week or so.

Regards,
   Alex.

On Tue, 29 Sep 2020 at 22:56, Alexandre Rafalovitch  wrote:

I am not sure why you think stop words are your first choice. Maybe I
misunderstand the question. I read it as: you need to completely exclude
a set of documents that include specific keywords when the search is
called from a specific module.

If I wanted to differentiate the searches from a specific module, I
would give that module a different end-point (Request Query Handler)
instead of /select. So, /nocigs or whatever.

Then, in that end-point, you could do all sorts of extra things, such
as setting appends or even invariants parameters, which would include
filter query to exclude any documents matching specific keywords. I
assume it is ok to return documents that are matching for other
reasons.
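A minimal sketch of such an end-point in solrconfig.xml (the handler name,
flag field and value are made up):

<requestHandler name="/nocigs" class="solr.SearchHandler">
  <lst name="invariants">
    <str name="fq">-cigs_flag:true</str>
  </lst>
</requestHandler>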

Ideally, you would mark the cigs documents during indexing with a
binary or enumeration flag and then during search you just need to
check against that flag. In that case, you could copyField  your text
and run it against something like
https://lucene.apache.org/solr/guide/8_6/filter-descriptions.html#keep-word-filter
combined with Shingles for multiwords. Or similar. And just transform
it as index-only so that the result is basically a yes/no flag.
Similar thing could be done with UpdateRequestProcessor pipeline if
you want to end up with a true boolean flag. The idea is the same,
just to have an index-only flag that you force lock into for any
request from specific module.
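A sketch of that copyField approach in the schema (all names and the word
file are made up, and untested):

<fieldType name="cigs_words" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ShingleFilterFactory" minShingleSize="2" maxShingleSize="3" outputUnigrams="true"/>
    <filter class="solr.KeepWordFilterFactory" words="cig-words.txt"/>
  </analyzer>
</fieldType>
<field name="cigs_check" type="cigs_words" indexed="true" stored="false"/>
<copyField source="P_ProductName" dest="cigs_check"/>

Any document whose cigs_check field ends up with at least one token is
cigarette-related, so the /nocigs end-point can filter on that.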

Or even with something like ElevationSearchComponent. Same idea.

Hope this helps.

Regards,
   Alex.

On Tue, 29 Sep 2020 at 22:28, Derek Poh  wrote:

Hi

I have read in the mailing list that we should try to avoid using stop
words.

I have a use case where I would like to know if there are other
alternative solutions besides using stop words.

There is a business requirement to return zero results when the search is
for cigarette related words and the search is coming from a particular
module on our site. It does not apply to all searches from our site.
There is a list of these cigarette related words. This list contains
single words, multiple words (Electronic cigar), and multiple words with
punctuation (e-cigarette case).
I am planning to copy to a different set of search fields, which will
include the stopword filter in the index and query stage, for this
module to use.

For this use case, other than using stop words to handle it, is there
any alternative solution?

Derek


Re: advice on whether to use stopwords for use case

2020-10-01 Thread Derek Poh

Hi Alex

The business requirement (for now) is not to return any result when the 
search keywords are cigarette related. The business user team will 
provide the list of the cigarette related keywords.


Will digest, explore and research your suggestions. Thank you.

On 30/9/2020 10:56 am, Alexandre Rafalovitch wrote:

I am not sure why you think stop words are your first choice. Maybe I
misunderstand the question. I read it as: you need to completely exclude
a set of documents that include specific keywords when the search is
called from a specific module.

If I wanted to differentiate the searches from a specific module, I
would give that module a different end-point (Request Query Handler)
instead of /select. So, /nocigs or whatever.

Then, in that end-point, you could do all sorts of extra things, such
as setting appends or even invariants parameters, which would include
filter query to exclude any documents matching specific keywords. I
assume it is ok to return documents that are matching for other
reasons.

Ideally, you would mark the cigs documents during indexing with a
binary or enumeration flag and then during search you just need to
check against that flag. In that case, you could copyField  your text
and run it against something like
https://lucene.apache.org/solr/guide/8_6/filter-descriptions.html#keep-word-filter
combined with Shingles for multiwords. Or similar. And just transform
it as index-only so that the result is basically a yes/no flag.
Similar thing could be done with UpdateRequestProcessor pipeline if
you want to end up with a true boolean flag. The idea is the same,
just to have an index-only flag that you force lock into for any
request from specific module.

Or even with something like ElevationSearchComponent. Same idea.

Hope this helps.

Regards,
Alex.

On Tue, 29 Sep 2020 at 22:28, Derek Poh  wrote:

Hi

I have read in the mailing list that we should try to avoid using stop
words.

I have a use case where I would like to know if there are other
alternative solutions besides using stop words.

There is a business requirement to return zero results when the search is
for cigarette related words and the search is coming from a particular
module on our site. It does not apply to all searches from our site.
There is a list of these cigarette related words. This list contains
single words, multiple words (Electronic cigar), and multiple words with
punctuation (e-cigarette case).
I am planning to copy to a different set of search fields, which will
include the stopword filter in the index and query stage, for this
module to use.

For this use case, other than using stop words to handle it, is there
any alternative solution?

Derek


advice on whether to use stopwords for use case

2020-09-29 Thread Derek Poh

Hi

I have read in the mailing list that we should try to avoid using stop 
words.


I have a use case where I would like to know if there are other 
alternative solutions besides using stop words.


There is a business requirement to return zero results when the search is 
for cigarette related words and the search is coming from a particular 
module on our site. It does not apply to all searches from our site.
There is a list of these cigarette related words. This list contains 
single words, multiple words (Electronic cigar), and multiple words with 
punctuation (e-cigarette case).
I am planning to copy to a different set of search fields, which will 
include the stopword filter in the index and query stage, for this 
module to use.


For this use case, other than using stop words to handle it, is there 
any alternative solution?


Derek


combined multiple bf into a single bf

2020-06-09 Thread Derek Poh

I have the following boost requirements using bf

response_rate is 3, boost by ^0.6
response_rate is 2, boost by ^0.3
response_time is 4, boost by ^0.6
response_time is 3, boost by ^0.3

I am using a bf for each of the boost requirements,

bf=map(response_rate,3,3,0.6,0)&bf=map(response_rate,2,2,0.3,0)&bf=map(response_time,4,4,0.6,0)&bf=map(response_time,3,3,0.3,0)

I am trying to reduce the number of parameters in the query.

Is it possible to combine them into 1 or 2 bf?
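For example, would a single additive bf along these lines work (untested 
sketch, on the assumption that bf contributions are simply summed)?

bf=sum(map(response_rate,3,3,0.6,0),map(response_rate,2,2,0.3,0),map(response_time,4,4,0.6,0),map(response_time,3,3,0.3,0))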

Running Solr 4.10.4.

Derek




alternative suggestions on how to store product attributes in collection

2019-08-29 Thread Derek Poh

Hi

I would like to know if there are suggestions on how I can handle my 
task below. Please pardon the lengthy description.


I need to store product attributes in a collection.
Attributes like Size, Color, Material etc.

Each product can have up to a max of 5 attributes.
Between products, their attributes can be different.
Attributes can be added and deleted from the source system.

A simple example of possible product attributes information
Product    Attribute    Value
P1    Size  M
P1    Size  L
P1    Color    Red
P2    Size  M
P2    Color    Blue
P3    Material    Plastic
P4    Amp 12

I have come up with 2 approaches to it:

1.
If I store each attribute as a field in a collection, there will be a lot 
of fields to create.
Furthermore, as attributes can be added and deleted, maintaining the 
attribute fields in solr will be difficult.
However, with one field per attribute, the product attribute facets 
will be easy and straightforward.

Example,
Size facet:
M - 2
L - 1

Color facet:
Red - 1
Blue - 1

2.
Another approach is to create only one field to store the attributes and 
attribute values of a product.

This field will be multi-valued.
Solr does not need to bother with new and deleted attributes.

Eg. P_ProductAttribute

P1:

Size-M
Size-L
Color-Red


However, the product attribute facet with this approach will require the 
UI to iterate through the facet, extract the attributes and their values, 
and display them as individual attribute facets on the search result page.

Eg, of P_ProductAttribute facet:

Color-Blue>1
Color-Red>1
Size-L>1
Size-M>2
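(The facet request itself stays a single parameter set, e.g. 
q=*:*&facet=true&facet.field=P_ProductAttribute&facet.mincount=1, and the 
UI splits each returned value on the first '-' to regroup it under its 
attribute.)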

Any other suggestion on how I can approach this?

Regards,
Derek


Re: TolerantUpdateProcessorFactory maxErrors=-1 issue

2018-09-23 Thread Derek Poh

Hi Tomas

I moved TolerantUpdateProcessorFactory to the beginning of the chain and 
reloaded the collection.

The indexing process still aborts.

On 22/9/2018 4:28 AM, Tomás Fernández Löbbe wrote:

Hi Derek,
I suspect you need to move the TolerantUpdateProcessorFactory to the
beginning of the chain

On Thu, Sep 20, 2018 at 6:17 PM Derek Poh  wrote:


Does anyone have any idea what could be the cause of this?

On 19/9/2018 11:40 AM, Derek Poh wrote:

In addition, I tried with maxErrors=3 and with only 1 error document; 
the indexing process still gets aborted.

Could it be the way I defined the TolerantUpdateProcessorFactory in 
solrconfig.xml?

On 18/9/2018 3:13 PM, Derek Poh wrote:

Hi

I am using CSV formatted index updates to index a tab delimited file.

I have defined "TolerantUpdateProcessorFactory" with "maxErrors=-1" in
the solrconfig.xml to skip any document update error and proceed to
update the remaining documents without failing.
However it does not seem to be working, as there is a document in the
tab delimited file with an additional number of fields and this caused
the indexing to abort instead.

This is how I start the indexing,
curl -o /apps/search/logs/indexing.log
"

http://localhost:8983/solr/$collection/update?update.chain=$updateChainName=true=%09=^=$fieldnames$splitOptions;


--data-binary "@/apps/search/feed/$csvFilePath/$csvFileName" -H
'Content-type:application/csv'

This is how the TolerantUpdateProcessorFactory is defined in the
solrconfig.xml,

   
<updateRequestProcessorChain name="$updateChainName">
  <processor class="solr.CloneFieldUpdateProcessorFactory">
    <str name="source">P_SupplierId</str>
    <str name="source">P_TradeShowId</str>
    <str name="source">P_ProductId</str>
    <str name="dest">id</str>
  </processor>
  <processor class="solr.ConcatFieldUpdateProcessorFactory">
    <str name="fieldName">id</str>
    <str name="delimiter"/>
  </processor>
  <processor class="solr.TolerantUpdateProcessorFactory">
    <int name="maxErrors">-1</int>
  </processor>
  <processor class="solr.processor.DocExpirationUpdateProcessorFactory">
    <null name="ttlFieldName"/>
    <null name="ttlParamName"/>
    <int name="autoDeletePeriodSeconds">43200</int>
    <str name="expirationFieldName">P_TradeShowOnlineEndDateUTC</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>


Solr version is 6.6.2.

Derek


Re: TolerantUpdateProcessorFactory maxErrors=-1 issue

2018-09-20 Thread Derek Poh

Does anyone have any idea what could be the cause of this?

On 19/9/2018 11:40 AM, Derek Poh wrote:
In addition, I tried with maxErrors=3 and with only 1 error document; 
the indexing process still gets aborted.


Could it be the way I defined the TolerantUpdateProcessorFactory in 
solrconfig.xml?


On 18/9/2018 3:13 PM, Derek Poh wrote:

Hi

I am using CSV formatted index updates to index a tab delimited file.

I have defined "TolerantUpdateProcessorFactory" with "maxErrors=-1" in 
the solrconfig.xml to skip any document update error and proceed to 
update the remaining documents without failing.
However it does not seem to be working, as there is a document in the 
tab delimited file with an additional number of fields and this caused 
the indexing to abort instead.


This is how I start the indexing,
curl -o /apps/search/logs/indexing.log 
"http://localhost:8983/solr/$collection/update?update.chain=$updateChainName=true=%09=^=$fieldnames$splitOptions; 
--data-binary "@/apps/search/feed/$csvFilePath/$csvFileName" -H 
'Content-type:application/csv'


This is how the TolerantUpdateProcessorFactory is defined in the 
solrconfig.xml,


  
<updateRequestProcessorChain name="$updateChainName">
  <processor class="solr.CloneFieldUpdateProcessorFactory">
    <str name="source">P_SupplierId</str>
    <str name="source">P_TradeShowId</str>
    <str name="source">P_ProductId</str>
    <str name="dest">id</str>
  </processor>
  <processor class="solr.ConcatFieldUpdateProcessorFactory">
    <str name="fieldName">id</str>
    <str name="delimiter"/>
  </processor>
  <processor class="solr.TolerantUpdateProcessorFactory">
    <int name="maxErrors">-1</int>
  </processor>
  <processor class="solr.processor.DocExpirationUpdateProcessorFactory">
    <null name="ttlFieldName"/>
    <null name="ttlParamName"/>
    <int name="autoDeletePeriodSeconds">43200</int>
    <str name="expirationFieldName">P_TradeShowOnlineEndDateUTC</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>


Solr version is 6.6.2.

Derek


Re: TolerantUpdateProcessorFactory maxErrors=-1 issue

2018-09-18 Thread Derek Poh
In addition, I tried with maxErrors=3 and with only 1 error document; the 
indexing process still gets aborted.


Could it be the way I defined the TolerantUpdateProcessorFactory in 
solrconfig.xml?


On 18/9/2018 3:13 PM, Derek Poh wrote:

Hi

I am using CSV formatted index updates to index a tab delimited file.

I have defined "TolerantUpdateProcessorFactory" with "maxErrors=-1" in 
the solrconfig.xml to skip any document update error and proceed to 
update the remaining documents without failing.
However it does not seem to be working, as there is a document in the tab 
delimited file with an additional number of fields and this caused the 
indexing to abort instead.


This is how I start the indexing,
curl -o /apps/search/logs/indexing.log 
"http://localhost:8983/solr/$collection/update?update.chain=$updateChainName=true=%09=^=$fieldnames$splitOptions; 
--data-binary "@/apps/search/feed/$csvFilePath/$csvFileName" -H 
'Content-type:application/csv'


This is how the TolerantUpdateProcessorFactory is defined in the 
solrconfig.xml,


  
<updateRequestProcessorChain name="$updateChainName">
  <processor class="solr.CloneFieldUpdateProcessorFactory">
    <str name="source">P_SupplierId</str>
    <str name="source">P_TradeShowId</str>
    <str name="source">P_ProductId</str>
    <str name="dest">id</str>
  </processor>
  <processor class="solr.ConcatFieldUpdateProcessorFactory">
    <str name="fieldName">id</str>
    <str name="delimiter"/>
  </processor>
  <processor class="solr.TolerantUpdateProcessorFactory">
    <int name="maxErrors">-1</int>
  </processor>
  <processor class="solr.processor.DocExpirationUpdateProcessorFactory">
    <null name="ttlFieldName"/>
    <null name="ttlParamName"/>
    <int name="autoDeletePeriodSeconds">43200</int>
    <str name="expirationFieldName">P_TradeShowOnlineEndDateUTC</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>


Solr version is 6.6.2.

Derek


TolerantUpdateProcessorFactory maxErrors settings issue

2018-09-18 Thread Derek Poh

Hi

I am using CSV formatted index updates to index a tab delimited file.

I have defined "TolerantUpdateProcessorFactory" with "maxErrors=-1" in 
the solrconfig.xml to skip any document update error and proceed to 
update the remaining documents without failing.
However it does not seem to be working, as there is a document in the tab 
delimited file with an additional number of fields and this caused the 
indexing to abort instead.


This is how I start the indexing,
curl -o /apps/search/logs/indexing.log 
"http://localhost:8983/solr/$collection/update?update.chain=$updateChainName=true=%09=^=$fieldnames$splitOptions; 
--data-binary "@/apps/search/feed/$csvFilePath/$csvFileName" -H 
'Content-type:application/csv'


This is how the TolerantUpdateProcessorFactory is defined in the 
solrconfig.xml,


  
<updateRequestProcessorChain name="$updateChainName">
  <processor class="solr.CloneFieldUpdateProcessorFactory">
    <str name="source">P_SupplierId</str>
    <str name="source">P_TradeShowId</str>
    <str name="source">P_ProductId</str>
    <str name="dest">id</str>
  </processor>
  <processor class="solr.ConcatFieldUpdateProcessorFactory">
    <str name="fieldName">id</str>
    <str name="delimiter"/>
  </processor>
  <processor class="solr.TolerantUpdateProcessorFactory">
    <int name="maxErrors">-1</int>
  </processor>
  <processor class="solr.processor.DocExpirationUpdateProcessorFactory">
    <null name="ttlFieldName"/>
    <null name="ttlParamName"/>
    <int name="autoDeletePeriodSeconds">43200</int>
    <str name="expirationFieldName">P_TradeShowOnlineEndDateUTC</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>


Solr version is 6.6.2.

Derek


TolerantUpdateProcessorFactory maxErrors=-1 issue

2018-09-18 Thread Derek Poh

Hi

I am using CSV formatted index updates to index a tab delimited file.

I have defined "TolerantUpdateProcessorFactory" with "maxErrors=-1" in 
the solrconfig.xml to skip any document update error and proceed to 
update the remaining documents without failing.
However it does not seem to be working, as there is a document in the tab 
delimited file with an additional number of fields and this caused the 
indexing to abort instead.


This is how I start the indexing,
curl -o /apps/search/logs/indexing.log 
"http://localhost:8983/solr/$collection/update?update.chain=$updateChainName=true=%09=^=$fieldnames$splitOptions; 
--data-binary "@/apps/search/feed/$csvFilePath/$csvFileName" -H 
'Content-type:application/csv'


This is how the TolerantUpdateProcessorFactory is defined in the 
solrconfig.xml,


  
<updateRequestProcessorChain name="$updateChainName">
  <processor class="solr.CloneFieldUpdateProcessorFactory">
    <str name="source">P_SupplierId</str>
    <str name="source">P_TradeShowId</str>
    <str name="source">P_ProductId</str>
    <str name="dest">id</str>
  </processor>
  <processor class="solr.ConcatFieldUpdateProcessorFactory">
    <str name="fieldName">id</str>
    <str name="delimiter"/>
  </processor>
  <processor class="solr.TolerantUpdateProcessorFactory">
    <int name="maxErrors">-1</int>
  </processor>
  <processor class="solr.processor.DocExpirationUpdateProcessorFactory">
    <null name="ttlFieldName"/>
    <null name="ttlParamName"/>
    <int name="autoDeletePeriodSeconds">43200</int>
    <str name="expirationFieldName">P_TradeShowOnlineEndDateUTC</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>


Solr version is 6.6.2.

Derek


Re: change DocExpirationUpdateProcessorFactory deleteByQuery NOW parameter time zone

2018-09-03 Thread Derek Poh

SG refers to Singapore and the time is UTC+8.

That means I need to set the P_TradeShowOnlineEndDate date to UTC 
instead of UTC+8 as a workaround.
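(For example, a show ending 2018-08-31 00:00:00 Singapore time would be 
indexed as 2018-08-30T16:00:00Z.)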


On 31/8/2018 10:16 PM, Shawn Heisey wrote:

On 8/30/2018 7:26 PM, Derek Poh wrote:
Can the timezone of the NOW parameter in the deleteByQuery of the 
DocExpirationUpdateProcessorFactory be changed to my timezone?


I am in SG and using solr 6.5.1.


I do not know what SG is.

The timezone cannot be changed.  Solr *always* handles dates in UTC.  
You can assign a timezone when doing date math, but this is only used 
to determine when a new day or week starts -- the dates themselves 
will be in UTC.


Thanks,
Shawn






change DocExpirationUpdateProcessorFactory deleteByQuery NOW parameter time zone

2018-08-30 Thread Derek Poh

Hi

Can the timezone of the NOW parameter in the deleteByQuery of the 
DocExpirationUpdateProcessorFactory be changed to my timezone?


I am in SG and using solr 6.5.1.

The timestamp of the entries in the solr.log is in my timezone but the 
NOW parameter of the deleteByQuery is in a different timezone (UTC?).


The deleteByQuery entry in the solr.log:

2018-08-30 16:34:03.941 INFO  (qtp834133664-3600) [c:exhibitor_product_2 
s:shard1 r:core_node1 x:exhibitor_product_2_shard1_replica2] 
o.a.s.u.p.LogUpdateProcessorFactory 
[exhibitor_product_2_shard1_replica2]  webapp=/solr path=/update 
params={update.distrib=FROMLEADER&_version_=-1610212229046599680&distrib.from=http://192.168.83.152:8983/solr/exhibitor_product_2_shard1_replica1/&wt=javabin&version=2}{deleteByQuery={!cache=false}P_TradeShowOnlineEndDate:[* 
TO 2018-08-30T08:34:06.804Z] (-1610212229046599680)} 0 23



DocExpirationUpdateProcessorFactory definition in solrconfig.xml:


  
<updateRequestProcessorChain name="...">
  <processor class="solr.CloneFieldUpdateProcessorFactory">
    <str name="source">P_SupplierId</str>
    <str name="source">P_TradeShowId</str>
    <str name="source">P_ProductId</str>
    <str name="dest">id</str>
  </processor>
  <processor class="solr.ConcatFieldUpdateProcessorFactory">
    <str name="fieldName">id</str>
    <str name="delimiter"/>
  </processor>
  <processor class="solr.TolerantUpdateProcessorFactory">
    <int name="maxErrors">-1</int>
  </processor>
  <processor class="solr.processor.DocExpirationUpdateProcessorFactory">
    <null name="ttlFieldName"/>
    <null name="ttlParamName"/>
    <int name="autoDeletePeriodSeconds">86400</int>
    <str name="expirationFieldName">P_TradeShowOnlineEndDate</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>



stored="true" multiValued="false"/>


Derek


collections replicas still in Recovery Mode after restarting Solr

2018-08-15 Thread Derek Poh

Hi
We have a setup of 2 servers, running Solr 6.6.2, on production.
There are 5 collections.
All collections are created as 1 shard x 2 replicas.

4 of the collections have this issue.
A replica of each of these 4 collections is in Recovery Mode. The 
affected replicas are on the same server or node.
I noticed there is no Leader node indicated for these 4 collections in 
the Solr Admin. This is the screenshot of the Solr Admin: 
http://imagebucket.net/pmndqkijla5c/solr_admin.PNG

These are the commands I used to stop and start the solr process:
bin/solr stop -p 8983
bin/solr start -cloud -p 8983 -s "/apps/search/solr-6.6.2/home" -z hktszk1:2181,hktszk2:2181,hktszk3:2181

May I know how I can bring up these replicas?

Derek
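Would something like the collections API FORCELEADER action be appropriate 
here (sketch; the collection name is a placeholder)?

curl "http://localhost:8983/solr/admin/collections?action=FORCELEADER&collection=mycollection&shard=shard1"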



Re: How to find out which search terms have matches in a search

2018-06-18 Thread Derek Poh

Hi Erik

I have explored the facet.query but it does not really help. Thank you for 
your suggestion.


On 12/6/2018 7:49 PM, Erik Hatcher wrote:

Derek -

One trick I like to do is try various forms of a query all in one go.   With 
facet=on, you can:

   facet.query=big brown bear
   facet.query=big brown
   facet.query=brown bear
   facet.query=big
   facet.query=brown
   facet.query=bear

The returned counts give you an indication of what queries matched docs in the 
result set, and which didn’t.   If you did this with q=*:* you’d see how each 
of those matched across the entire collection.

Grouping and group.query could be used similarly.

I’ve used facet.query to do some Venn diagramming of overlap of search results like 
that.   An oldie but a goodie: 
https://www.slideshare.net/lucenerevolution/hatcher-erik-rapid-prototyping-with-solr/12

4.10.4?   woah

Erik Hatcher
Senior Solutions Architect, Lucidworks.com



On Jun 11, 2018, at 11:16 PM, Derek Poh  wrote:

Hi

How can I find out which search terms have matches in a search?

Eg.
The search terms are "big brown bear". And only "big" and "brown" have matches 
in the search result.
Can Solr return this information that "big" and "brown" have matches in the 
search result?
I want to use this information to display on the search result page that "big" and 
"brown" have matches.
Something like "big brown bear".

Am using solr 4.10.4.

Derek


Re: How to find out which search terms have matches in a search

2018-06-18 Thread Derek Poh
Seems like the Highlight feature could help, but with some workaround. 
Will need to explore it more. Thank you.


On 12/6/2018 5:32 PM, Alessandro Benedetti wrote:

I would recommend looking into the Highlight feature [1].
There are a few implementations and they should be all right for your user
requirement.

Regards

[1] https://lucene.apache.org/solr/guide/7_3/highlighting.html



--
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html






Re: How to find out which search terms have matches in a search

2018-06-12 Thread Derek Poh
Sorry, I realized the strike-through on the term "bear" in "big brown 
bear" cannot be displayed accordingly in the mailing list.
My aim is to have the search terms "big brown bear" display on the 
search result page with the term "bear" struck through, since it does 
not have a match in the search result.



On 12/6/2018 11:16 AM, Derek Poh wrote:

Hi

How can I find out which search terms have matches in a search?

Eg.
The search terms are "big brown bear". And only "big" and "brown" have 
matches in the search result.
Can Solr return this information that "big" and "brown" have matches 
in the search result?
I want to use this information to display on the search result page 
that "big" and "brown" have matches.

Something like "big brown bear".

Am using solr 4.10.4.

Derek


How to find out which search terms have matches in a search

2018-06-11 Thread Derek Poh

Hi

How can I find out which search terms have matches in a search?

Eg.
The search terms are "big brown bear". And only "big" and "brown" have 
matches in the search result.
Can Solr return this information that "big" and "brown" have matches in 
the search result?
I want to use this information to display on the search result page that 
"big" and "brown" have matches.

Something like "big brown bear".

Am using solr 4.10.4.

Derek


edit gc parameters in solr.in.sh or solr?

2018-03-26 Thread Derek Poh

Hi

From your experience, I would like to know whether it is advisable to change 
the GC parameters in solr.in.sh or in the solr file.
It is mentioned in the documentation to edit solr.in.sh, but I would like 
to know which file you actually edit.


I am using Solr 6.6.2 at the moment.

Regards,
Derek



Re: ways to check if document is in a huge search result set

2017-09-13 Thread Derek Poh

I see. Thank you.

On 9/13/2017 2:36 PM, Michael Kuhlmann wrote:

On 13.09.2017 at 04:04, Derek Poh wrote:

Hi Michael

"Then continue using binary search depending on the returned score
values."

May I know what do you mean by using binary search?

An example algorithm is in Java method java.util.Arrays::binarySearch.

Or more detailed: https://en.wikipedia.org/wiki/Binary_search_algorithm

Best,
Michael






Re: ways to check if document is in a huge search result set

2017-09-12 Thread Derek Poh

Hi Michael

"Then continue using binary search depending on the returned score values."

May I know what do you mean by using binary search?

On 9/12/2017 3:08 PM, Michael Kuhlmann wrote:

So you're looking for a solution to validate the result output.

You have two ways:
1. Assuming you're sorting by the default "score" sort option:
Find the result you're looking for by setting the fq filter clause
accordingly, and add "score" to the fl field list.
Then do the normal unfiltered search, still including "score", and start
with page, let's say, 50,000.
Then continue using binary search depending on the returned score values.

2. Set fl to return only the supplier id, then you'll probably be able
to return several ten-thousand results at once.


But be warned, the result position of these elements can vary with every
single commit, esp. when there're lots of documents with the same score
value.
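(Or, instead of binary searching, one could count the documents that 
outscore the target with an frange filter; untested sketch with made-up 
names:

q=the original query&fq=P_SupplierId:sup123&fl=id,score&rows=1
  -> note the returned score, say 4.7
q=the original query&fq={!frange l=4.7 incl=false}query($q)&rows=0
  -> numFound is then the number of documents ranked above it, so the page 
is roughly numFound/20 + 1.)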

-Michael


On 12.09.2017 at 03:21, Derek Poh wrote:

Some additional information.

I have a query from a user that a supplier's product(s) is not in the
search result.
I debugged by adding a fq on the supplier id to the query to verify
the supplier's product is in the search result. The products do exist in
the search result.
I want to tell the user in which page of the search result the supplier's
products appear. To do this I go through each page of the search
result to find the supplier's product.
It is still fine if the search result has a few hundred products but
it will be a chore if the result has thousands. In this case there
are more than 100,000 products in the result.

Any advice on easier ways to check which page the supplier's product
appears in?

On 9/11/2017 2:44 PM, Mikhail Khludnev wrote:

You can request facet field, query facet, filter or even explainOther.

On Mon, Sep 11, 2017 at 5:12 AM, Derek Poh <d...@globalsources.com>
wrote:


Hi

I have a collection of product documents.
Each product document has supplier information in it.

I need to check if a supplier's products are returned in a search
result containing over 100,000 products, and in which page (assuming
pagination is 20 products per page).
It is time-consuming and "labour-intensive" to go through each page
to look for the products of the supplier.

Would like to know if you guys have any better and easier ways to do
this?

Derek


Re: ways to check if document is in a huge search result set

2017-09-11 Thread Derek Poh

Some additional information.

I have a query from a user that a supplier's product(s) is not in the 
search result.
I debugged by adding a fq on the supplier id to the query to verify the 
supplier's product is in the search result. The products do exist in the 
search result.
I want to tell the user in which page of the search result the supplier's 
products appear. To do this I go through each page of the search 
result to find the supplier's product.
It is still fine if the search result has a few hundred products but it 
will be a chore if the result has thousands. In this case there are 
more than 100,000 products in the result.


Any advice on easier ways to check which page the supplier's product or 
document appears in?


On 9/11/2017 2:44 PM, Mikhail Khludnev wrote:

You can request facet field, query facet, filter or even explainOther.
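(For explainOther, e.g. add debugQuery=true&explainOther=P_SupplierId:sup123 
to the original query; the field name and value here are made up.)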

On Mon, Sep 11, 2017 at 5:12 AM, Derek Poh <d...@globalsources.com> wrote:


Hi

I have a collection of product documents.
Each product document has supplier information in it.

I need to check if a supplier's products are returned in a search
result containing over 100,000 products and in which page (assuming
pagination is 20 products per page).
It is time-consuming and "labour-intensive" to go through each page to look
for the products of the supplier.

Would like to know if you guys have any better and easier ways to do this?

Derek


ways to check if document is in a huge search result set

2017-09-10 Thread Derek Poh

Hi

I have a collection of product documents.
Each product document has supplier information in it.

I need to check whether a supplier's products are returned in a search 
result containing over 100,000 products, and on which page (assuming 
pagination is 20 products per page).
It is time-consuming and "labour-intensive" to go through each page to 
look for the products of the supplier.


Would like to know if you guys have any better and easier ways to do this?

Derek


Re: different length/size of unique 'id' field value in a collection.

2017-05-22 Thread Derek Poh

Hi Rick

My apologies, I did not make myself clear on the values of the fields. They 
are numbers.
I used 'ts1', 'sup1' and 'pdt1' for simplicity and for ease of 
understanding instead of the actual numbers.


You mentioned this design has the potential for (in error cases) 
concatenating id's incorrectly. Could you explain more on this?
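
For illustration, a hypothetical case of the risk Rick describes below: 
without a delimiter between the two parts, different records can collapse 
to the same id.

    supplier_id="sup1",  product_id="2pdt3"  ->  id "sup12pdt3"
    supplier_id="sup12", product_id="pdt3"   ->  id "sup12pdt3"

Concatenating with a separator that cannot appear in either field, e.g. 
id = supplier_id + "_" + product_id, avoids this.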


On 5/22/2017 6:12 PM, Rick Leir wrote:

On 2017-05-22 02:25 AM, Derek Poh wrote:

Hi

Due to the source data structure, I need to concatenate the values of 
2 fields ('supplier_id' and 'product_id') to form the unique 'id' of 
each document.
However there are cases where some documents only have the 'supplier_id' 
field.
This will result in some documents with a longer/larger 'id' field value 
(having both 'supplier_id' and 'product_id') and some with a 
shorter/smaller 'id' field value (having only 'supplier_id').


Please refer to the simplified representation of the records below.
The 3rd record only has a supplier id.
ts1 sup1 pdt1
ts1 sup1 pdt2
ts1 sup2
ts1 sup3 pdt3
ts1 sup4 pdt5
ts1 sup4 pdt6

I understand the unique 'id' is used during indexing to check whether 
a document already exists. Create if it does not exist, else update 
if it exists.


Are there any implications if the unique 'id' field value is of 
different size/length among documents of a collection?

No

Is it advisable to have such design?

Derek
You need unique ID's. This design has the potential for (in error 
cases) concatenating id's incorrectly. It might be better to have ID's 
which are just a number. That said, my current project has ID's which 
are not just a number, YMMV.

cheers -- Rick


Derek







different length/size of unique 'id' field value in a collection.

2017-05-22 Thread Derek Poh

Hi

Due to the source data structure, I need to concatenate the values of 2 
fields ('supplier_id' and 'product_id') to form the unique 'id' of each 
document.

However there are cases where some documents only have the 'supplier_id' field.
This will result in some documents with a longer/larger 'id' field value (having 
both 'supplier_id' and 'product_id') and some with a shorter/smaller 
'id' field value (having only 'supplier_id').


Please refer to the simplified representation of the records below.
The 3rd record only has a supplier id.
ts1 sup1 pdt1
ts1 sup1 pdt2
ts1 sup2
ts1 sup3 pdt3
ts1 sup4 pdt5
ts1 sup4 pdt6

I understand the unique 'id' is used during indexing to check whether a 
document already exists. Create if it does not exist, else update if it 
exists.


Are there any implications if the unique 'id' field value is of 
different size/length among documents of a collection?

Is it advisable to have such design?

Derek




Re: 1 main collection or multiple smaller collections?

2017-04-27 Thread Derek Poh

Richard

I am considering the same option as your suggestion, to put them in 1 single 
collection of product documents, with a product doc containing the supplier info.
In this option, the supplier info will get repeated in each of the 
supplier's product docs. I may be influenced by DB concepts. Guess it's a 
trade-off for this option.


On 4/28/2017 1:01 AM, Rick Leir wrote:

Does it make sense to use nested documents here? Products could be nested in a 
supplier document perhaps.

Alternately, consider de-normalizing "til it hurts". A product doc might be 
able to contain supplier info.

On April 27, 2017 8:50:59 AM EDT, Shawn Heisey <apa...@elyograg.org> wrote:

On 4/26/2017 11:57 PM, Derek Poh wrote:

There are some common fields between them.
At the source data end (database), the supplier info and product info
are updated separately. In this regard, should I separate them?
If it's in 1 single collection, when there are updates to only the
supplier info, the product info will be indexed again even though there
are no updates to it. Is my reasoning valid?


On 4/27/2017 1:33 PM, Walter Underwood wrote:

Do they have the same fields or different fields? Are they updated
separately or together?

If they have the same fields and are updated together, I’d put them
in the same collection. Otherwise, probably separate.

Walter's statements are right on the money, you just might need a
little
more detail.

There are two critical details that decide whether you even CAN
combine different data in a single index: One is that all types of
records must use the same field (the uniqueKey field) to determine
uniqueness, and the value of this field must be unique across the
entire
dataset.  The other is that there SHOULD be a field with a name like
"type" that your search client can use to differentiate the different
kinds of documents.  This type field is not necessary, but it does make
things easier.

Assuming you CAN combine documents, there is still the question of
whether you SHOULD.  If the fields that you will commonly search are
the
same between the different kinds of documents, and if people want to be
able to do one search and get more than one of the document types you
are indexing, then it is something you should consider.  If people will
only ever search one type of document, you should probably keep them in
separate indexes to keep things cleaner.

Thanks,
Shawn
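
For illustration, a hedged sketch of what Shawn's combined layout might 
look like (all field names besides "id" are illustrative):

    { "id": "supplier_S1", "type": "supplier", "supplier_name": "XXX" }
    { "id": "product_P1",  "type": "product",  "supplier_id": "S1",
      "product_description": "XXX" }

A supplier-only search would then filter with fq=type:supplier.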




Re: 1 main collection or multiple smaller collections?

2017-04-27 Thread Derek Poh

Hi Shawn

1 set of data is suppliers info and 1 set is the suppliers' products info.
Users can either do a product search or a supplier search.

1 option I am thinking of is to put them in 1 single collection with each 
product as a document. Each product document will have the supplier info 
in it.

Product id will be the uniqueKey field.
With this option, the same supplier info will be in every product document 
of the supplier.


A simplified example:
doc:
product id: P1
product description: XXX
supplier id: S1
supplier name: XXX
supplier address: XXX

doc:
product id: P2
product description: XXXYYY
supplier id: S1
supplier name: XXX
supplier address: XXX

I may be influenced by DB concepts. Is such a design logical?


On 4/27/2017 8:50 PM, Shawn Heisey wrote:

On 4/26/2017 11:57 PM, Derek Poh wrote:

There are some common fields between them.
At the source data end (database), the supplier info and product info
are updated separately. In this regard, should I separate them?
If it's in 1 single collection, when there are updates to only the
supplier info, the product info will be indexed again even though there
are no updates to it. Is my reasoning valid?


On 4/27/2017 1:33 PM, Walter Underwood wrote:

Do they have the same fields or different fields? Are they updated
separately or together?

If they have the same fields and are updated together, I’d put them
in the same collection. Otherwise, probably separate.

Walter's statements are right on the money, you just might need a little
more detail.

There are two critical details that decide whether you even CAN
combine different data in a single index: One is that all types of
records must use the same field (the uniqueKey field) to determine
uniqueness, and the value of this field must be unique across the entire
dataset.  The other is that there SHOULD be a field with a name like
"type" that your search client can use to differentiate the different
kinds of documents.  This type field is not necessary, but it does make
things easier.

Assuming you CAN combine documents, there is still the question of
whether you SHOULD.  If the fields that you will commonly search are the
same between the different kinds of documents, and if people want to be
able to do one search and get more than one of the document types you
are indexing, then it is something you should consider.  If people will
only ever search one type of document, you should probably keep them in
separate indexes to keep things cleaner.

Thanks,
Shawn






Re: 1 main collection or multiple smaller collections?

2017-04-26 Thread Derek Poh

There are some common fields between them.
At the source data end (database), the supplier info and product info 
are updated separately. In this regard, should I separate them?
If it's in 1 single collection, when there are updates to only the 
supplier info, the product info will be indexed again even though there 
are no updates to it. Is my reasoning valid?



On 4/27/2017 1:33 PM, Walter Underwood wrote:

Do they have the same fields or different fields? Are they updated separately 
or together?

If they have the same fields and are updated together, I’d put them in the same 
collection. Otherwise, probably separate.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)



On Apr 26, 2017, at 10:25 PM, Derek Poh <d...@globalsources.com> wrote:

Hi
I am planning for a migration of a legacy search engine to Solr.
Basically the data can be categorised into suppliers info, suppliers products info 
and products category info. These sets of data are related to each other.
Suppliers products data, which is the largest, has around 300,000 records 
currently and is projected to increase.

Should I put these data in 1 single collection or in separate collections - e.g. 
1 collection for suppliers info, 1 collection for suppliers products info and 1 
collection for products categories info?
What should I consider and plan for when deciding which option to take?

Derek


1 main collection or multiple smaller collections?

2017-04-26 Thread Derek Poh

Hi
I am planning for a migration of a legacy search engine to Solr.
Basically the data can be categorised into suppliers info, suppliers 
products info and products category info. These sets of data are related 
to each other.
Suppliers products data, which is the largest, has around 300,000 
records currently and is projected to increase.


Should I put these data in 1 single collection or in separate 
collections - e.g. 1 collection for suppliers info, 1 collection for 
suppliers products info and 1 collection for products categories info?

What should I consider and plan for when deciding which option to take?

Derek


Re: format data at source or format data during indexing?

2017-03-30 Thread Derek Poh

Hi Alex

The business use case for the field is
- exact match
- singular-plural stemming on each term in the field
Eg. a search for "dvd cases" must match "dvd case" and "dvds case".

This is the field type currently, and it satisfies the business use case.
The 1 drawback of this is that I need to add those words that cannot be 
singular-plural stemmed correctly by EnglishMinimalStemFilter to the 
'plural_singular.txt' of StemmerOverrideFilter as and when users 
report those words.


<fieldType name="..." class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.PatternReplaceCharFilterFactory"
      pattern="^(.*)$" replacement="z01x $1 z01x" />
    <tokenizer class="..." />
    <filter class="..." />
    <filter class="solr.StemmerOverrideFilterFactory"
      dictionary="plural_singular.txt" />
    <filter class="solr.EnglishMinimalStemFilterFactory" />
  </analyzer>
</fieldType>


I am wondering if it is advisable to let Solr append the code 'z01x' 
during indexing, or to append the code at the source data end and feed it 
to Solr.
For the query aspect, I will let Solr append the code to the query 
search words.
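
A minimal sketch of that split, assuming the charFilter is kept on the 
query side only (option 1) and the index side receives data already 
formatted with the code (tokenizer and other filters elided):

    <analyzer type="index">
      <tokenizer class="..."/>
      ...
    </analyzer>
    <analyzer type="query">
      <charFilter class="solr.PatternReplaceCharFilterFactory"
        pattern="^(.*)$" replacement="z01x $1 z01x"/>
      <tokenizer class="..."/>
      ...
    </analyzer>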


On 3/30/2017 7:28 PM, Alexandre Rafalovitch wrote:

What's your actual business use case?

On 30 Mar 2017 1:53 AM, "Derek Poh" <d...@globalsources.com> wrote:


Hi Erick

So I could also not use the query analyzer stage to append the code to the
search keyword?
Have the front-end application append the code for every query it issues
instead?


On 3/30/2017 12:20 PM, Erick Erickson wrote:


I generally prefer index-time work to query-time work on the theory
that the index-time work is done once and the query time work is done
for each query.

That said, for a corpus this size (and presumably without a large
query rate) I doubt you'd be able to measure any difference.

So basically choose the easiest to implement IMO.

Best,
Erick

On Wed, Mar 29, 2017 at 8:43 PM, Alexandre Rafalovitch
<arafa...@gmail.com> wrote:


I am not sure I can tell how to decide on one or another. However, I
wanted to mention that you also have an option of doing it in the
UpdateRequestProcessor chain. That's still within Solr (and therefore
is consistent with multiple clients feeding into Solr) but is before
individual field processing (so will survive - for example - a
copyField).

Regards,
 Alex.

http://www.solr-start.com/ - Resources for Solr users, new and
experienced


On 29 March 2017 at 23:38, Derek Poh <d...@globalsources.com> wrote:


Hi

I need to create a field that will be prefixed and suffixed with the code
'z01x'. This field needs to have the code in the index and during query.
I can either
1.
have the source data of the field formatted with the code before indexing
(outside solr).
use a charFilter in the query stage of the field type to add the code
during query.



OR

2.
use the charFilter before the tokenizer class during the index and query
analyzer stages of the field type.

The collection has between 100k - 200k documents currently but it may
increase in the future.
The indexing time with option 2 and the current indexing time are almost the
same, between 2-3 minutes.

Which option would you advise?

Derek


Re: format data at source or format data during indexing?

2017-03-30 Thread Derek Poh

Hi Alex

Thank you for pointing out the UpdateRequestProcessor option.
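
For reference, a hedged sketch of that option using Solr's stock regex 
processor (the chain and field names are illustrative):

    <updateRequestProcessorChain name="add-z01x">
      <processor class="solr.RegexReplaceProcessorFactory">
        <str name="fieldName">my_field</str>
        <str name="pattern">^(.*)$</str>
        <str name="replacement">z01x $1 z01x</str>
        <bool name="literalReplacement">false</bool>
      </processor>
      <processor class="solr.LogUpdateProcessorFactory"/>
      <processor class="solr.RunUpdateProcessorFactory"/>
    </updateRequestProcessorChain>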

On 3/30/2017 11:43 AM, Alexandre Rafalovitch wrote:

I am not sure I can tell how to decide on one or another. However, I
wanted to mention that you also have an option of doing it in the
UpdateRequestProcessor chain. That's still within Solr (and therefore
is consistent with multiple clients feeding into Solr) but is before
individual field processing (so will survive - for example - a
copyField).

Regards,
Alex.

http://www.solr-start.com/ - Resources for Solr users, new and experienced


On 29 March 2017 at 23:38, Derek Poh <d...@globalsources.com> wrote:

Hi

I need to create a field that will be prefixed and suffixed with the code 'z01x'.
This field needs to have the code in the index and during query.
I can either
1.
have the source data of the field formatted with the code before indexing
(outside solr).
use a charFilter in the query stage of the field type to add the code during
query.



OR

2.
use the charFilter before the tokenizer class during the index and query analyzer
stages of the field type.

The collection has between 100k - 200k documents currently but it may
increase in the future.
The indexing time with option 2 and the current indexing time are almost the same,
between 2-3 minutes.

Which option would you advise?

Derek


Re: format data at source or format data during indexing?

2017-03-29 Thread Derek Poh

Hi Erick

So I could also not use the query analyzer stage to append the code to 
the search keyword?
Have the front-end application append the code for every query it issues 
instead?



On 3/30/2017 12:20 PM, Erick Erickson wrote:

I generally prefer index-time work to query-time work on the theory
that the index-time work is done once and the query time work is done
for each query.

That said, for a corpus this size (and presumably without a large
query rate) I doubt you'd be able to measure any difference.

So basically choose the easiest to implement IMO.

Best,
Erick

On Wed, Mar 29, 2017 at 8:43 PM, Alexandre Rafalovitch
<arafa...@gmail.com> wrote:

I am not sure I can tell how to decide on one or another. However, I
wanted to mention that you also have an option of doing it in the
UpdateRequestProcessor chain. That's still within Solr (and therefore
is consistent with multiple clients feeding into Solr) but is before
individual field processing (so will survive - for example - a
copyField).

Regards,
Alex.

http://www.solr-start.com/ - Resources for Solr users, new and experienced


On 29 March 2017 at 23:38, Derek Poh <d...@globalsources.com> wrote:

Hi

I need to create a field that will be prefixed and suffixed with the code 'z01x'.
This field needs to have the code in the index and during query.
I can either
1.
have the source data of the field formatted with the code before indexing
(outside solr).
use a charFilter in the query stage of the field type to add the code during
query.



OR

2.
use the charFilter before the tokenizer class during the index and query analyzer
stages of the field type.

The collection has between 100k - 200k documents currently but it may
increase in the future.
The indexing time with option 2 and the current indexing time are almost the same,
between 2-3 minutes.

Which option would you advise?

Derek


format data at source or format data during indexing?

2017-03-29 Thread Derek Poh

Hi

I need to create a field that will be prefixed and suffixed with the code 
'z01x'. This field needs to have the code in the index and during query.

I can either
1.
have the source data of the field formatted with the code before 
indexing (outside solr).
use a charFilter in the query stage of the field type to add the 
code during query.


<charFilter class="solr.PatternReplaceCharFilterFactory"
  pattern="^(.*)$" replacement="z01x $1 z01x" />


OR

2.
use the charFilter before the tokenizer class during the index and query 
analyzer stages of the field type.


The collection has between 100k - 200k documents currently but it may 
increase in the future.
The indexing time with option 2 and the current indexing time are almost the 
same, between 2-3 minutes.


Which option would you advise?

Derek


Re: to handle expired documents: collection alias or delete by id query

2017-03-26 Thread Derek Poh

Hi Tom

The moving alias design is interesting, will explore it.

Regarding the method of creating the collection on a node for indexing 
only and adding replicas of it to other nodes for querying upon completion 
of indexing:
Am I right to say this is used in conjunction with a collection alias or 
the moving alias you mentioned?
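
For reference, a hedged sketch of that flow with the Collections API 
(collection, alias and node names are illustrative):

    # 1) create a dated collection, placing its shards on the indexing node
    /admin/collections?action=CREATE&name=products_20170326&numShards=1&createNodeSet=indexer:8983_solr

    # 2) when indexing completes, repoint the alias the queries use
    /admin/collections?action=CREATEALIAS&name=products&collections=products_20170326

    # 3) expand onto the query nodes
    /admin/collections?action=ADDREPLICA&collection=products_20170326&shard=shard1&node=query1:8983_solr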



On 3/24/2017 10:23 PM, Tom Evans wrote:

On Thu, Mar 23, 2017 at 6:10 AM, Derek Poh <d...@globalsources.com> wrote:

Hi

I have collections of products. I am doing indexing 3-4 times daily.
Every day there are products that expire and I need to remove them from
these collections daily.

I can think of 2 ways to do this.
1. using a collection alias to switch between a main and temp collection.
- clear and index the temp collection
- create alias to temp collection.
- clear and index the main collection.
- create alias to main collection.

this way requires additional collections.


Another way of doing this is to have a moving alias (not constantly
clearing the "temp" collection). If you reindex daily, your index
would be called "products_mmdd" with an alias to "products". The
advantage of this is that you can roll back to a previous version of
the index if there are problems, and each index is guaranteed to be
freshly created with no artifacts.

The biggest consideration for me would be how long indexing your full
corpus takes you. If you can do it in a small period of time, then
full indexes would be preferable. If it takes a very long time,
deleting is preferable.

If you are doing a cloud setup, full indexes are even more appealing.
You can create the new collection on a single node (even if sharded;
just place each shard on the same node). This would only place the
indexing cost on that one node, whilst other nodes would be unaffected
by indexing degrading regular query response time. You also don't have
to distribute the documents around the cluster. There is no
distributed indexing in Solr, each replica has to index each document
again, even if it is not the leader.

Once indexing is complete, you can expand the collection by adding
replicas of that shard on other nodes - perhaps even removing it from
the node that did the indexing. We have a node that solely does
indexing, before the collection is queried for anything it is added to
the querying nodes.

You can do this manually, or you can automate it using the collections API.

Cheers

Tom





--
CONFIDENTIALITY NOTICE 

This e-mail (including any attachments) may contain confidential and/or privileged information. If you are not the intended recipient or have received this e-mail in error, please inform the sender immediately and delete this e-mail (including any attachments) from your computer, and you must not use, disclose to anyone else or copy this e-mail (including any attachments), whether in whole or in part. 


This e-mail and any reply to it may be monitored for security, legal, 
regulatory compliance and/or other appropriate reasons.

Re: to handle expired documents: collection alias or delete by id query

2017-03-23 Thread Derek Poh

Erick

Generally the products have a contracted date, but it could be extended 
and they could also expire prematurely.
We will need additional processing to cater for these scenarios and 
update the 'expiry date' fields accordingly.

Will go through the documentation again and see if it can fit our use case.

Thank you,
Derek
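
For reference, a minimal sketch of the TTL setup Erick points to below 
(the field, chain name and delete period are illustrative):

    <updateRequestProcessorChain name="expire-products" default="true">
      <processor class="solr.processor.DocExpirationUpdateProcessorFactory">
        <str name="expirationFieldName">expire_at</str>
        <str name="ttlFieldName">_ttl_</str>
        <int name="autoDeletePeriodSeconds">86400</int>
      </processor>
      <processor class="solr.LogUpdateProcessorFactory"/>
      <processor class="solr.RunUpdateProcessorFactory"/>
    </updateRequestProcessorChain>

Documents indexed with an expire_at value (or a _ttl_ such as +30DAYS) 
are then deleted automatically once a day.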

On 3/23/2017 11:12 PM, Erick Erickson wrote:

have you considered using TTL (Time To Live)?
You have to know at index time when the doc will expire.
If you do, Solr will delete the doc for you when its
life is over.

See: https://lucidworks.com/2014/05/07/document-expiration/
Also the Ref guide:
https://cwiki.apache.org/confluence/display/solr/Update+Request+Processors#UpdateRequestProcessors-UpdateRequestProcessorFactories
particularly DocExpirationUpdateProcessorFactory

Best,
Erick

On Thu, Mar 23, 2017 at 5:28 AM, Emir Arnautovic
<emir.arnauto...@sematext.com> wrote:

Hi Derek,

There are both pros and cons for both approaches:

1. if you are doing full reindexing PRO is that you have clean index all the
time and even if something goes wrong, you don't have to switch alias to
updated index so your users will not notice issues. CON is that you are
doing a full reindex all the time even if the amount of changes is minimal. Also,
this approach is not real time friendly if you plan to have more frequent
update cycles.

2. If you delete in the existing index, you make minimal changes. But note that
deleted docs are just flagged in the index as deleted and removed when segments
are merged. This can result in skewed statistics and if you have replicas
and sort by score, can result in different ordering depending on replicas'
merge cycles. Using optimize after update is done would solve this issue.

In order to make the right decision, you have to look at size of your
collection, number of deleted items etc. You can even combine approaches,
e.g. delete daily and do full reindex once a week.

HTH,
Emir



On 23.03.2017 07:10, Derek Poh wrote:

Hi

I have collections of products. I am doing indexing 3-4 times daily.
Every day there are products that expire and I need to remove them from
these collections daily.

I can think of 2 ways to do this.
1. using a collection alias to switch between a main and temp collection.
- clear and index the temp collection
- create alias to temp collection.
- clear and index the main collection.
- create alias to main collection.

this way requires additional collections.

2. get a list of expired products and generate delete by id queries to the
collections.

Would like to get some advice on which way I should adopt?


Derek



--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/







Re: to handle expired documents: collection alias or delete by id query

2017-03-23 Thread Derek Poh

Hi Emir

Thank you for pointing out that deleted docs will still exist in the index 
till it is optimized, and that they will skew statistics. We do sort by score.
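
For reference, a hedged example of the cleanup Emir describes, run after 
the daily deletes (host and collection are illustrative):

    curl "http://localhost:8983/solr/products/update?commit=true&expungeDeletes=true"

expungeDeletes merges away segments' deleted documents on commit, a 
lighter alternative to a full optimize.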


These new collections are part of a new business initiative and we do not know 
as yet what their size will be like.


Will go ponder on your inputs. Thank you.

Derek

On 3/23/2017 8:28 PM, Emir Arnautovic wrote:

Hi Derek,

There are both pros and cons for both approaches:

1. if you are doing full reindexing PRO is that you have clean index 
all the time and even if something goes wrong, you don't have to 
switch alias to updated index so your users will not notice issues. 
CON is that you are doing a full reindex all the time even if the amount 
of changes is minimal. Also, this approach is not real time friendly if 
you plan to have more frequent update cycles.


2. If you delete in the existing index, you make minimal changes. But note that 
deleted docs are just flagged in the index as deleted and removed when 
segments are merged. This can result in skewed statistics and if you 
have replicas and sort by score, can result in different ordering 
depending on replicas' merge cycles. Using optimize after update is 
done would solve this issue.


In order to make the right decision, you have to look at size of your 
collection, number of deleted items etc. You can even combine 
approaches, e.g. delete daily and do full reindex once a week.


HTH,
Emir


On 23.03.2017 07:10, Derek Poh wrote:

Hi

I have collections of products. I am doing indexing 3-4 times daily.
Every day there are products that expire and I need to remove them 
from these collections daily.


I can think of 2 ways to do this.
1. using a collection alias to switch between a main and temp collection.
- clear and index the temp collection
- create alias to temp collection.
- clear and index the main collection.
- create alias to main collection.

this way requires additional collections.

2. get a list of expired products and generate delete by id queries to 
the collections.


Would like to get some advice on which way I should adopt?


Derek







to handle expired documents: collection alias or delete by id query

2017-03-23 Thread Derek Poh

Hi

I have collections of products. I am doing indexing 3-4 times daily.
Every day there are products that expire and I need to remove them from 
these collections daily.


I can think of 2 ways to do this.
1. using a collection alias to switch between a main and temp collection.
- clear and index the temp collection
- create alias to temp collection.
- clear and index the main collection.
- create alias to main collection.

this way requires additional collections.

2. get a list of expired products and generate delete by id queries to the 
collections.


Would like to get some advice on which way I should adopt?


Derek


Re: Break up a supplier's documents (products) from dominating search result.

2016-12-01 Thread Derek Poh
While testing with the group param (I think it applies to field collapse as 
well), I encountered a scenario where the number of suppliers in a 
result is less than the number of items to display per page (user-selected).


Eg. Products per page to display is 80.
The search result has 182 matching products which belong to 13 suppliers.
Grouping by supplier id with 1 product per supplier, only 13 products will 
be returned. Issuing another query to get more products to fill up the page 
will not help as there are no more suppliers.


Initial query parameters,
start=0&rows=80&q=grout&fq=P_SupplierSource:(1)&group=true&group.field=P_SupplierId&group.format=simple

issue another query to get more products to fill up. this will not 
return any result.
start=80&rows=80&q=grout&fq=P_SupplierSource:(1)&group=true&group.field=P_SupplierId&group.format=simple


Any suggestions/advice on how to address this scenario?
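
For what it's worth, a hedged sketch of the Collapse/Expand variant 
Alexandre describes below (parameter values follow the queries above):

    q=grout&fq=P_SupplierSource:(1)&fq={!collapse field=P_SupplierId}&expand=true&expand.rows=6&rows=80

With 13 suppliers this returns 13 collapsed hits plus up to 6 more 
products per supplier in the separate "expanded" section of the response, 
which the UI can interleave to fill out the 80-item page.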

On 11/29/2016 11:01 AM, Alexandre Rafalovitch wrote:

You can use expand and it will provide several documents per group
(but in a different data structure in the response).

Then it is up to you how to sequence or interleave the results in your
UI. You do need to deal with edge-cases like what happens if you say 3
products per group, but then one group has only one and you don't have
enough items in a list, etc.

Regards,
Alex.

http://www.solr-start.com/ - Resources for Solr users, new and experienced


On 29 November 2016 at 12:56, Derek Poh <d...@globalsources.com> wrote:

Hi Walter

You used field collapsing for your case as well?

For my case the search result page is a listing of products. There is an option
to select the number of products to display per page.
Let's say 40 products per page is selected. A search result has 100 matching
products but these products belong to only 20 suppliers. The page will only
display 20 products (1 product per supplier).
We still need to fill up the remaining 20 empty slots.
How can I handle this scenario?


On 11/29/2016 8:26 AM, Walter Underwood wrote:

We had a similar feature in the Ultraseek search engine. One of our
customers
was a magazine publisher, and they wanted the best hit from each magazine
on the first page.

I expect that field collapsing would work for this.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)



On Nov 28, 2016, at 4:19 PM, Derek Poh <d...@globalsources.com> wrote:

Alex

Hope I understand what you meant by positive business requirements.
With a few suppliers' products dominating the first page of a search
result, the sales team will not be able to convince prospective or existing
clients to sign up.
They would like the results to feature other suppliers' products as well.
To the extreme case, they were thinking of displaying the results in
such an order:
Supplier A product
Supplier B product
Supplier C product
Supplier A product
Supplier B product
Supplier C product
...

They are alright with implementing this logic on the first page only,
and subsequent pages will be as per current logic, if it is not possible
to implement it for the entire search result.

Will take a look at Collapse and Expand to see if it can help.

On 11/28/2016 6:04 PM, Alexandre Rafalovitch wrote:

You have described your _negative_ business requirements, but not the
_positive_ ones. So, it is hard to see what they want to happen. It is
easy enough to promote or demote a particular filter's matches. But you
want to partially limit them. On a first page? What about on the
second?

I suspect you would have to have a slightly different interface to do
this effectively. And, most likely, using Collapse and Expand:

https://cwiki.apache.org/confluence/display/solr/Collapse+and+Expand+Results
.

Regards,
 Alex.

http://www.solr-start.com/ - Resources for Solr users, new and
experienced


On 28 November 2016 at 20:09, Derek Poh <d...@globalsources.com> wrote:

Hi

We have a business requirement to break up a supplier's products from
dominating the search result, so as to allow other suppliers' products in
the search result to have exposure.
Business users are open to implementing this for the first page of the
search result if it is not possible to apply it to the entire search result.

From the sample keywords users have provided, I also discovered that most
of the time a supplier's products that are listed consecutively in the
result all have the same score.

Any advice/suggestions on how I can do it?

Please let me know if more information is required. Thank you.

Derek


Re: Break up a supplier's documents (products) from dominating search result.

2016-11-28 Thread Derek Poh

Is there a way that does not require changing the page UI?

This is the search page for your reference.
http://www.globalsources.com/gsol/GeneralManager?hostname=www.globalsources.com_search=on=search%2FProductSearchResults_search=off==PRODUCT=en=new=denim+fabric=en_id=300149681_id=23844==t=N=ProdSearch=GetPoint=DoFreeTextSearch_search=on_search=off=grid 



On 11/29/2016 10:04 AM, Walter Underwood wrote:

We used something like field collapsing, but it wasn’t with Solr or Lucene.
They had not been invented at the time. This was a feature of the Ultraseek
engine from Infoseek, probably in 1997 or 1998.

With field collapsing, you provide a link to show more results from that source.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)



On Nov 28, 2016, at 5:56 PM, Derek Poh <d...@globalsources.com> wrote:

Hi Walter

You used field collapsing for your case as well?

For my case the search result page is a listing of products. There is an option to 
select the number of products to display per page.
Let's say 40 products per page is selected. A search result has 100 matching 
products but these products belong to only 20 suppliers. The page will only 
display 20 products (1 product per supplier).
We still need to fill up the remaining 20 empty slots.
How can I handle this scenario?

On 11/29/2016 8:26 AM, Walter Underwood wrote:

We had a similar feature in the Ultraseek search engine. One of our customers
was a magazine publisher, and they wanted the best hit from each magazine
on the first page.

I expect that field collapsing would work for this.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)



On Nov 28, 2016, at 4:19 PM, Derek Poh <d...@globalsources.com> wrote:

Alex

Hope I understand what you meant by positive business requirements.
With a few suppliers' products dominating the first page of a search result, 
the sales team will not be able to convince prospective or existing clients to 
sign up.
They would like the results to feature other suppliers' products as well.
To the extreme case, they were thinking of displaying the results in such 
an order:
Supplier A product
Supplier B product
Supplier C product
Supplier A product
Supplier B product
Supplier C product
...

They are alright with implementing this logic on the first page only, 
and subsequent pages will be as per current logic, if it is not possible to 
implement it for the entire search result.

Will take a look at Collapse and Expand to see if it can help.

On 11/28/2016 6:04 PM, Alexandre Rafalovitch wrote:

You have described your _negative_ business requirements, but not the
_positive_ ones. So, it is hard to see what they want to happen. It is
easy enough to promote or demote a particular filter's matches. But you
want to partially limit them. On a first page? What about on the
second?

I suspect you would have to have a slightly different interface to do
this effectively. And, most likely, using Collapse and Expand:
https://cwiki.apache.org/confluence/display/solr/Collapse+and+Expand+Results
.

Regards,
Alex.

http://www.solr-start.com/ - Resources for Solr users, new and experienced


On 28 November 2016 at 20:09, Derek Poh <d...@globalsources.com> wrote:

Hi

We have a business requirement to break up a supplier's products from
dominating the search result, so as to allow other suppliers' products in the
search result to have exposure.
Business users are open to implementing this for the first page of the
search result if it is not possible to apply it to the entire search result.

From the sample keywords users have provided, I also discovered that most of
the time a supplier's products that are listed consecutively in the result
all have the same score.

Any advice/suggestions on how I can do it?

Please let me know if more information is required. Thank you.

Derek


Re: Break up a supplier's documents (products) from dominating search result.

2016-11-28 Thread Derek Poh

Hi Walter

You used field collapsing for your case as well?

For my case the search result page is a listing of products. There is an 
option to select the number of products to display per page.
Let's say 40 products per page is selected. A search result has 100 
matching products but these products belong to only 20 suppliers. The 
page will only display 20 products (1 product per supplier).

We still need to fill up the remaining 20 empty slots.
How can I handle this scenario?

On 11/29/2016 8:26 AM, Walter Underwood wrote:

We had a similar feature in the Ultraseek search engine. One of our customers
was a magazine publisher, and they wanted the best hit from each magazine
on the first page.

I expect that field collapsing would work for this.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)



On Nov 28, 2016, at 4:19 PM, Derek Poh <d...@globalsources.com> wrote:

Alex

Hope I understand what you meant by positive business requirements.
With a few suppliers' products dominating the first page of a search result, 
the sales team will not be able to convince prospective or existing clients to 
sign up.
They would like the results to feature other suppliers' products as well.
To the extreme case, they were thinking of displaying the results in such 
an order:
Supplier A product
Supplier B product
Supplier C product
Supplier A product
Supplier B product
Supplier C product
...

They are alright with implementing this logic on the first page only, 
and subsequent pages will be as per current logic, if it is not possible to 
implement it for the entire search result.

Will take a look at Collapse and Expand to see if it can help.

On 11/28/2016 6:04 PM, Alexandre Rafalovitch wrote:

You have described your _negative_ business requirements, but not the
_positive_ ones. So, it is hard to see what they want to happen. It is
easy enough to promote or demote a particular filter's matches. But you
want to partially limit them. On a first page? What about on the
second?

I suspect you would have to have a slightly different interface to do
this effectively. And, most likely, using Collapse and Expand:
https://cwiki.apache.org/confluence/display/solr/Collapse+and+Expand+Results
.

Regards,
Alex.

http://www.solr-start.com/ - Resources for Solr users, new and experienced


On 28 November 2016 at 20:09, Derek Poh <d...@globalsources.com> wrote:

Hi

We have a business requirement to break up a supplier's products from
dominating the search result, so as to allow other suppliers' products in the
search result to have exposure.
Business users are open to implementing this for the first page of the
search result if it is not possible to apply it to the entire search result.

From the sample keywords users have provided, I also discovered that most of
the time a supplier's products that are listed consecutively in the result
all have the same score.

Any advice/suggestions on how I can do it?

Please let me know if more information is required. Thank you.

Derek




Re: Break up a supplier's documents (products) from dominating search result.

2016-11-28 Thread Derek Poh

Alex

Hope I understand what you meant by positive business requirements.
With a few suppliers' products dominating the first page of a search 
result, the sales team will not be able to convince prospective or existing 
clients to sign up.

They would like the results to feature other suppliers' products as well.
To the extreme case, they were thinking of displaying the results in 
such an order:

Supplier A product
Supplier B product
Supplier C product
Supplier A product
Supplier B product
Supplier C product
...

They are alright with implementing this logic on the first page only, 
and subsequent pages will be as per current logic, if it is not possible 
to implement it for the entire search result.


Will take a look at Collapse and Expand to see if it can help.

On 11/28/2016 6:04 PM, Alexandre Rafalovitch wrote:

You have described your _negative_ business requirements, but not the
_positive_ ones. So, it is hard to see what they want to happen. It is
easy enough to promote or demote a particular filter's matches. But you
want to partially limit them. On a first page? What about on the
second?

I suspect you would have to have a slightly different interface to do
this effectively. And, most likely, using Collapse and Expand:
https://cwiki.apache.org/confluence/display/solr/Collapse+and+Expand+Results
.

Regards,
Alex.

http://www.solr-start.com/ - Resources for Solr users, new and experienced


On 28 November 2016 at 20:09, Derek Poh <d...@globalsources.com> wrote:

Hi

We have a business requirement to break up a supplier's products from
dominating the search result, so as to allow other suppliers' products in the
search result to have exposure.
Business users are open to implementing this for the first page of the
search result if it is not possible to apply it to the entire search result.

From the sample keywords users have provided, I also discovered that most of
the time a supplier's products that are listed consecutively in the result
all have the same score.

Any advice/suggestions on how I can do it?

Please let me know if more information is required. Thank you.

Derek


Break up a supplier's documents (products) from dominating search result.

2016-11-28 Thread Derek Poh

Hi

We have a business requirement to break up a supplier's products from 
dominating the search result, so as to allow other suppliers' products in the 
search result to have exposure.
Business users are open to implementing this for the first page of the 
search result if it is not possible to apply it to the entire search result.


From the sample keywords users have provided, I also discovered that most 
of the time a supplier's products that are listed consecutively in the 
result all have the same score.


Any advice/suggestions on how I can do it?

Please let me know if more information is required. Thank you.

Derek


Re: Split words with period in between into separate tokens

2016-10-12 Thread Derek Poh
Why didn't I think of that. That's another alternative. Thank you for 
your suggestion. Appreciate it.


On 10/13/2016 5:41 AM, Georg Sorst wrote:

You could use a PatternReplaceCharFilter before your tokenizer to replace
the dot with a space character.
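
A minimal sketch of that char filter, placed before the tokenizer in the 
analyzer (the pattern and replacement shown are one reasonable choice, 
not the only one):

<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\." replacement=" "/>

Since char filters run on the raw input before tokenization, "Co.Ltd" 
becomes "Co Ltd", and the tokenizer then produces "Co" and "Ltd".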

Derek Poh <d...@globalsources.com> wrote on Wed., 12 Oct. 2016 at 11:38:


Seems like LetterTokenizerFactory tokenises/discards on numbers as well. The
field does have values with numbers in them, therefore it is not applicable.
Thank you.


On 10/12/2016 4:22 PM, Dheerendra Kulkarni wrote:

You can use LetterTokenizerFactory instead.

Regards,
Dheerendra Kulkarni

On Wed, Oct 12, 2016 at 6:24 AM, Derek Poh <d...@globalsources.com>

wrote:

Hi

How can I split words with a period in between into separate tokens?
E.g. "Co.Ltd" => "Co" "Ltd".

I am using StandardTokenizerFactory and it does not split on these
periods: periods (dots) that are not followed by whitespace are kept as
part of the token, including Internet domain names.

This is the field definition. [The schema XML was stripped by the list
archive; the surviving fragments show a query-time SynonymFilterFactory
with synonyms="synonyms.txt" ignoreCase="true" expand="true".]

Solr version is 4.10.4.

Derek











Re: Split words with period in between into separate tokens

2016-10-12 Thread Derek Poh

Seems like LetterTokenizerFactory tokenises/discards on numbers as well. The 
field does have values with numbers in them, therefore it is not applicable. 
Thank you.


On 10/12/2016 4:22 PM, Dheerendra Kulkarni wrote:

You can use LetterTokenizerFactory instead.

Regards,
Dheerendra Kulkarni

On Wed, Oct 12, 2016 at 6:24 AM, Derek Poh <d...@globalsources.com> wrote:


Hi

How can I split words with a period in between into separate tokens?
E.g. "Co.Ltd" => "Co" "Ltd".

I am using StandardTokenizerFactory and it does not split on these
periods: periods (dots) that are not followed by whitespace are kept as
part of the token, including Internet domain names.

This is the field definition. [The schema XML was stripped entirely by the
list archive.]

Solr version is 4.10.4.

Derek










Re: Split words with period in between ("Co.Ltd") into separate tokens

2016-10-12 Thread Derek Poh

Thank you for pointing out the flags.
I set generateWordParts=1 and the term is split up.

On 10/12/2016 3:26 PM, Modassar Ather wrote:

Hi,

The flags set in your WordDelimiterFilterFactory definition are all 0.
You can try with generateWordParts=1 and splitOnCaseChange=1 and see if it
breaks as per your requirement.
You can also try with other available flags enabled.
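
A sketch of the adjusted filter definition with those two flags enabled
(the other flags kept at 0, as in the original definition):

<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
        generateNumberParts="0" catenateWords="0" catenateNumbers="0"
        catenateAll="0" splitOnCaseChange="1"/>

With generateWordParts=1 the filter splits on intra-word delimiters such
as the period, so "Co.Ltd" yields "Co" and "Ltd".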

Best,
Modassar

On Wed, Oct 12, 2016 at 12:44 PM, Derek Poh <d...@globalsources.com> wrote:


I tried adding the Word Delimiter Filter to the field but it either does not
process the term "Co.Ltd" or it truncates it away.



On 10/12/2016 8:54 AM, Derek Poh wrote:


Hi

How can I split words with a period in between into separate tokens?
E.g. "Co.Ltd" => "Co" "Ltd".

I am using StandardTokenizerFactory and it does not split on these
periods: periods (dots) that are not followed by whitespace are kept as
part of the token, including Internet domain names.

This is the field definition. [The schema XML was stripped entirely by the
list archive.]

Solr version is 4.10.4.

Derek









Re: Split words with period in between ("Co.Ltd") into separate tokens

2016-10-12 Thread Derek Poh
I tried adding the Word Delimiter Filter to the field but it either does not 
process the term "Co.Ltd" or it truncates it away.


<filter class="solr.WordDelimiterFilterFactory" generateWordParts="0"
generateNumberParts="0" catenateWords="0" catenateNumbers="0"
catenateAll="0" splitOnCaseChange="0"/>


On 10/12/2016 8:54 AM, Derek Poh wrote:

Hi

How can I split words with a period in between into separate tokens?
E.g. "Co.Ltd" => "Co" "Ltd".

I am using StandardTokenizerFactory and it does not split on these
periods: periods (dots) that are not followed by whitespace are kept as
part of the token, including Internet domain names.


This is the field definition. [The schema XML was stripped by the list
archive; the surviving fragments show a text field with
positionIncrementGap="100", a StopFilterFactory (words="stopwords.txt") in
both analyzers, and a query-time SynonymFilterFactory
(synonyms="synonyms.txt" ignoreCase="true" expand="true").]

Solr version is 4.10.4.

Derek






Split words with period in between into separate tokens

2016-10-11 Thread Derek Poh

Hi

How can I split words with a period in between into separate tokens?
E.g. "Co.Ltd" => "Co" "Ltd".

I am using StandardTokenizerFactory and it does not split on these
periods: periods (dots) that are not followed by whitespace are kept as
part of the token, including Internet domain names.


This is the field definition. [The schema XML was stripped by the list
archive; the surviving fragments show a text field with
positionIncrementGap="100", a StopFilterFactory (words="stopwords.txt") in
both analyzers, and a query-time SynonymFilterFactory
(synonyms="synonyms.txt" ignoreCase="true" expand="true").]

Solr version is 4.10.4.
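
For readability, a plausible reconstruction of the stripped definition
above (the fieldType name and the LowerCaseFilterFactory are assumptions;
the StandardTokenizerFactory is named in the message, and the stop and
synonym filter attributes survive in the archive):

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>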

Derek


display filter based on existence of facet

2016-08-10 Thread Derek Poh
I have a couple of filters that are text-input based, where the user will 
input a value into the text boxes of these filters.
The condition is that these filters will only be displayed if the facets 
exist in the search result.
E.g. the Min Order Qty filter will be displayed if the Min Order Qty facet 
exists in the solr result.


To display this filter, I only need to 'know' there is a value to filter on.
Currently all the possible terms and counts of the Min Order Qty field are 
returned for this facet.


Any suggestions on how I can avoid the computation of the possible terms 
and their counts for the facet field, and hence reduce the computational 
time of the query?

I just need to know there is 'a value to filter on'.
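
One possible sketch using per-field facet parameters (these exist in 4.10;
they trim the response to at most one term per field, though whether the
underlying counting work is reduced depends on the facet method):

facet=true&facet.field=P_MinOrderQty&f.P_MinOrderQty.facet.limit=1&f.P_MinOrderQty.facet.mincount=1

If a bucket comes back for P_MinOrderQty, there is a value to filter on,
and the full term/count list is never sent over the wire.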

These are the parameters of the query that is used to display the list of 
filters.

[The query string was mangled by the list archive, which stripped the '&'
separators and parameter names. Recoverable values: group.field=P_SupplierId,
q=coffee, a filter on P_SupplierSource:(1), and facet fields P_CNState,
P_BusinessType, P_CombinedBusTypeFlat, P_CombinedCompCertFlat,
P_CombinedExportCountryFlat, P_CombinedProdCertFlat, P_Country,
P_CSFParticipant, P_FOBPriceMinFlag, P_FOBPriceMaxFlag, P_HasAuditInfo,
P_HasCreditInfo, P_LeadTime, P_Microsite, P_MinOrderQty,
P_MonthlyCapacityFlag, P_OEMServices, P_PSEParticipant, P_SupplierRanking,
P_SupplierUpcomingTradeShow, P_YearsInBusiness, P_SmallOrderFlag.]

Using solr 4.10.4

Thank you,
Derek


Re: moving leader to another replica of a collection?

2016-07-10 Thread Derek Poh

Hi Shawn

Got it.
Will delete all replicas on that server first before shutting down solr on it.

Thank you,
Derek



On 7/11/2016 9:43 AM, Shawn Heisey wrote:

On 7/10/2016 7:34 PM, Derek Poh wrote:

I need to remove a server from the cluster of servers running solr in
my production environment. One of the collection's replicas is a leader on
this server.
The collection is set up as 1 shard with 5 replicas, with each replica
residing on a physical server.

How can I move or assign another replica as the leader on another server?
Or should I just go ahead and stop the solr process on this server, and
solr or zookeeper will elect another replica as leader?

If you shut down that Solr server, the remaining servers will elect a
new leader.

There is the preferred leader functionality, but this is really only
something that's needed if you have a very large number of
collections/shards and need to distribute the leader roles evenly among
multiple servers.  For a small number, having leaders concentrated on
one server does not represent a performance problem.

If the server will be permanently decommissioned, you should probably
use DELETEREPLICA on the collections API to remove all replicas on that
server before shutting it down.  That can also initiate leader election.
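
A sketch of that Collections API call (host, collection, shard and replica
names are placeholders; the core_node name for the server being removed can
be read from CLUSTERSTATUS or the Cloud screen of the admin UI):

http://localhost:8983/solr/admin/collections?action=DELETEREPLICA&collection=mycollection&shard=shard1&replica=core_node5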

Thanks,
Shawn






moving leader to another replica of a collection?

2016-07-10 Thread Derek Poh

Hi

This is my situation.
I need to remove a server from the cluster of servers running solr in my 
production environment. One of the collection's replicas is a leader on this 
server.
The collection is set up as 1 shard with 5 replicas, with each replica 
residing on a physical server.


How can I move or assign another replica as the leader on another server?
Or should I just go ahead and stop the solr process on this server, and 
solr or zookeeper will elect another replica as leader?


Derek




Re: Define search query parameters in Solr or let clients applications craft them?

2016-06-14 Thread Derek Poh

Hi Scott, thank you for sharing your solution, appreciate it.

To me, in terms of maintainability, I think it will be better to define all 
the parameters either at the client end or the solr end.


On 6/15/2016 9:47 AM, scott.chu wrote:

In my case, I wrote a HTTP gateway between the applications and the Solr
engine. This existed long before I used Solr as the SE. Back then, I figured
that one day I might replace our old SE, and that would cause two dilemmas:
1> If our applications directly call the API of THE search engine, then when
we replace it with another SE, all the calling statements have to be
rewritten. It would be a very hard job for us, especially as the number and
size of applications get bigger.
2> We have applications written in different languages, and from time to time
our system engineers need to manually test the status of the SE.
Furthermore, we want to fix some default parameters in the gateway for
simplicity and security reasons (e.g. shortening the size of the HTTP call,
preventing db names, field names, etc. from showing in the HTTP call).
And these considerations ended up with a gateway design.

For your question, IMHO, I wouldn't define query parameters in Solr unless you 
think they WOULD BE GLOBALIZED. You can consider our solution.


scott.chu,scott@udngroup.com
2016/6/15 (Wed)
- Original Message -
From: Derek Poh
To: solr-user
CC:
Date: 2016/6/13 (Mon) 11:21
Subject: Define search query parameters in Solr or let clients 
applications craft them?


Hi

Would like to get some advice: should the query parameters be defined
in Solr, or should the client applications define and pass the query
parameters to Solr?

Regards,
Derek









Re: Define search query parameters in Solr or let clients applications craft them?

2016-06-14 Thread Derek Poh

Hi Emir

Ya, guess one way is to implement a policy where new queries from client 
applications have to be reviewed, coupled with periodic search log grooming 
as you have suggested.


On 6/14/2016 4:12 PM, Emir Arnautovic wrote:

Hi Derek,
Unless you lock all your parameters, there will always be a chance of 
inefficient queries. The only way to fight that is to have full control of 
the Solr interface and provide some search API, or to do regular search 
log grooming.


Emir

On 14.06.2016 03:05, Derek Poh wrote:

Hi Emir

Thank you for pointing out the cons of defining them in Solr config.

One of the things I am worried about in letting the client application 
define the parameters is that developers will use or include 
unnecessary, wrong and resource-intensive parameters.



On 6/13/2016 5:50 PM, Emir Arnautovic wrote:

Hi Derek,
Maybe I am looking at this from the perspective of someone who works with 
other people's setups, but I prefer when it is defined in Solr configs: I 
can get a sense of the queries from looking at the configs, you have a 
mechanism to lock some parameters, updates are centralized... However, it 
does come with some cons: it is less expressive than what you can do in 
client code, you have to reload cores when you want to change it, and 
people tend to override it from the client so you get configs in two 
places.


HTH,
Emir

On 13.06.2016 05:21, Derek Poh wrote:

Hi

Would like to get some advice: should the query parameters be defined 
in Solr, or should the client applications define and pass the query 
parameters to Solr?


Regards,
Derek














Re: Define search query parameters in Solr or let clients applications craft them?

2016-06-13 Thread Derek Poh

Hi Emir

Thank you for pointing out the cons of defining them in Solr config.

One of the things I am worried about in letting the client application 
define the parameters is that developers will use or include 
unnecessary, wrong and resource-intensive parameters.



On 6/13/2016 5:50 PM, Emir Arnautovic wrote:

Hi Derek,
Maybe I am looking at this from the perspective of someone who works with 
other people's setups, but I prefer when it is defined in Solr configs: I 
can get a sense of the queries from looking at the configs, you have a 
mechanism to lock some parameters, updates are centralized... However, it 
does come with some cons: it is less expressive than what you can do in 
client code, you have to reload cores when you want to change it, and 
people tend to override it from the client so you get configs in two places.
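
A sketch of that locking mechanism in solrconfig.xml (the handler name and
parameter values here are placeholders): parameters under "invariants"
cannot be overridden by the client, while "defaults" apply only when the
client sends nothing.

<requestHandler name="/productsearch" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="rows">20</str>
  </lst>
  <lst name="invariants">
    <str name="fl">P_ProductId,score</str>
  </lst>
</requestHandler>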


HTH,
Emir

On 13.06.2016 05:21, Derek Poh wrote:

Hi

Would like to get some advice: should the query parameters be defined 
in Solr, or should the client applications define and pass the query 
parameters to Solr?


Regards,
Derek









Define search query parameters in Solr or let clients applications craft them?

2016-06-12 Thread Derek Poh

Hi

Would like to get some advice: should the query parameters be defined 
in Solr, or should the client applications define and pass the query 
parameters to Solr?


Regards,
Derek




Re: float or string type for a field with whole number and decimal number values?

2016-05-31 Thread Derek Poh

Sorry about that.

Thank you for your explanation. I still have some questions on using and 
setting up a collection alias for my current situation. I will start a new 
thread on this.


On 5/31/2016 11:21 PM, Erick Erickson wrote:

First, when changing the topic of the thread, please start a new thread. This
is called "thread hijacking" and makes it difficult to find threads later.

Collection aliasing does not do _anything_ about adding/deleting/whatever.
It's just a way to do exactly what you want. Your clients point to
mycollection.

You use the CREATEALIAS command to point mycollection to mycollection_1.
Thereafter you can do anything you want to mycollection_1 using either name.

That is, you can address mycollection_1 explicitly. You can use mycollection. It
doesn't matter.

Then you can create mycollection_2. So far you can _only_ address mycollection_2
explicitly. You then use the CREATEALIAS to point mycollection at
mycollection_2.
At that point, anybody using mycollection will start working with
mycollection_2.

Meanwhile, mycollection_1 is still addressable (presumably by the back end) by
addressing it explicitly rather than through an alias. It has _not_ been changed
in any way by creating the new alias.
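
A sketch of the alias commands described above (the host is a placeholder):

http://localhost:8983/solr/admin/collections?action=CREATEALIAS&name=mycollection&collections=mycollection_1
http://localhost:8983/solr/admin/collections?action=CREATEALIAS&name=mycollection&collections=mycollection_2

Re-issuing CREATEALIAS with the same name simply repoints the alias, so
clients that query "mycollection" switch to mycollection_2 without any
change on their side.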

Best,
Erick

On Mon, May 30, 2016 at 11:15 PM, Derek Poh <d...@globalsources.com> wrote:

Hi Erick

Thank you for pointing out the sort behaviour of numbers in a string field.
I did not think of that. Will use float.

Would like to know how you guys would handle the usage of a collection alias
in my case.
I have a 'product' collection and I create a new collection 'product_tmp' for
this field type change and index into it. I create an alias 'product' on
this new 'product_tmp' collection.
If I were to index to or delete documents from the 'product' collection, SOLR
will index on and delete from the 'product_tmp' collection, am I right?
That means the 'product' collection cannot be used anymore?
Even if I were to create an alias 'product_old' on the 'product'
collection; issue a delete all documents or index on 'product_old', SOLR will
delete or index on the 'product_tmp' collection instead?

My intention is to avoid having to update the clients' servers to point to
the 'product_tmp' collection.


On 5/31/2016 10:57 AM, Erick Erickson wrote:

bq: Should I change the field type to "float" or "string"?

I'd go with float. Let's assume you want to sort by
this field. 10.00 sorts before 9.0 if you
just use Strings. Plus floats are generally much more
compact.

bq: do I need to delete all documents in the index and do a full indexing

That's the way I'd do it. You can always index to a _new_ collection
(assuming SolrCloud) and use collection aliasing to switch your
search all at once

Best,
Erick

On Sun, May 29, 2016 at 12:56 AM, Derek Poh <d...@globalsources.com>
wrote:

I am using solr 4.10.4.


On 5/29/2016 3:52 PM, Derek Poh wrote:

Hi

I have a field that is of "int" type currently and its values are whole
numbers.



Due to a change in business requirements, this field will need to take in
decimal numbers as well.
This field is sorted on and filtered by range (field:[1 TO *]).

Should I change the field type to "float" or "string"?
For the change to take effect, do I need to delete all documents in the
index and do a full indexing? Or can I just do a full indexing without
the need to delete all documents first?

Derek









Re: float or string type for a field with whole number and decimal number values?

2016-05-31 Thread Derek Poh

Hi Erick

Thank you for pointing out the sort behaviour of numbers in a string 
field. I did not think of that. Will use float.


Would like to know how you guys would handle the usage of a collection 
alias in my case.
I have a 'product' collection and I create a new collection 'product_tmp' 
for this field type change and index into it. I create an alias 
'product' on this new 'product_tmp' collection.
If I were to index to or delete documents from the 'product' collection, 
SOLR will index on and delete from the 'product_tmp' collection, am I right?

That means the 'product' collection cannot be used anymore?
Even if I were to create an alias 'product_old' on the 'product' 
collection; issue a delete all documents or index on 'product_old', SOLR 
will delete or index on the 'product_tmp' collection instead?


My intention is to avoid having to update the clients' servers to point to 
the 'product_tmp' collection.



On 5/31/2016 10:57 AM, Erick Erickson wrote:

bq: Should I change the field type to "float" or "string"?

I'd go with float. Let's assume you want to sort by
this field. 10.00 sorts before 9.0 if you
just use Strings. Plus floats are generally much more
compact.

bq: do I need to delete all documents in the index and do a full indexing

That's the way I'd do it. You can always index to a _new_ collection
(assuming SolrCloud) and use collection aliasing to switch your
search all at once

Best,
Erick

On Sun, May 29, 2016 at 12:56 AM, Derek Poh <d...@globalsources.com> wrote:

I am using solr 4.10.4.


On 5/29/2016 3:52 PM, Derek Poh wrote:

Hi

I have a field that is of "int" type currently and its values are whole
numbers.



Due to a change in business requirements, this field will need to take in
decimal numbers as well.
This field is sorted on and filtered by range (field:[1 TO *]).

Should I change the field type to "float" or "string"?
For the change to take effect, do I need to delete all documents in the
index and do a full indexing? Or can I just do a full indexing without
the need to delete all documents first?

Derek










Re: float or string type for a field with whole number and decimal number values?

2016-05-29 Thread Derek Poh

I am using solr 4.10.4.

On 5/29/2016 3:52 PM, Derek Poh wrote:

Hi

I have a field that is of "int" type currently and its values are 
whole numbers.


[The field definition XML was stripped by the list archive; the surviving
fragment: stored="true" multiValued="false".]


Due to a change in business requirements, this field will need to take in 
decimal numbers as well.

This field is sorted on and filtered by range (field:[1 TO *]).

Should I change the field type to "float" or "string"?
For the change to take effect, do I need to delete all documents in 
the index and do a full indexing? Or can I just do a full indexing 
without the need to delete all documents first?


Derek





float or string type for a field with whole number and decimal number values?

2016-05-29 Thread Derek Poh

Hi

I have a field that is of "int" type currently and its values are whole 
numbers.


[The field definition XML was stripped by the list archive; the surviving
fragment: stored="true" multiValued="false".]


Due to a change in business requirements, this field will need to take in 
decimal numbers as well.

This field is sorted on and filtered by range (field:[1 TO *]).

Should I change the field type to "float" or "string"?
For the change to take effect, do I need to delete all documents in the 
index and do a full indexing? Or can I just do a full indexing without 
the need to delete all documents first?
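
For illustration, a sketch of the changed definition (the field name is a
placeholder, since the archive stripped it; "float" is the trie float type
in the 4.10 example schema):

<field name="myNumericField" type="float" indexed="true" stored="true" multiValued="false"/>

A full re-index, or indexing into a new collection and switching with
CREATEALIAS as discussed in the other thread, would then pick up the new
type.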


Derek


Re: Advice to add additional non-related fields to a collection or create a subset of it?

2016-05-16 Thread Derek Poh

Mikhail
It was caused by an endless loop in the page's code that is triggered 
only under certain conditions.


On 5/11/2016 4:07 PM, Mikhail Khludnev wrote:

On Wed, May 11, 2016 at 10:16 AM, Derek Poh <d...@globalsources.com> wrote:


Hi Erick

Yes, we have identified and fixed the page's slow loading.


Derek,
Can you elaborate more? What did you fix?



I was wondering if there are any best practices when it comes to deciding
to create a single collection that stores all information in it or to create
multiple sub-collections. I understand now it depends on the use-case.
My apologies for not giving it much thought before asking the questions.
Thank you for your patience.

- Derek


On 5/10/2016 12:10 PM, Erick Erickson wrote:


Not quite sure where you are at with this. It sounds
like your slow loading is fixed and was a coding
issue on your part, that happens to us all.

bq: Is it advisable to have as few
queries to solr in a page as possible?

Of course it is advisable to have as few Solr queries
executed to display a page as possible. Every one
costs you at least _some_ turnaround time. You can
mitigate this (assuming your Solr server isn't running
flat out) by issuing the subsequent queries in parallel
threads.

But it's not really a question to me of advisability, it's a
question of what your application needs to deliver. The
use-case drives all. You can do some tricks like display
partial pages and fill in the rest behind the scenes to
display when your user clicks something and the like.

bq: In my case, by denormalizing, that means putting the
product and supplier information into one collection?
The supplier information is stored but not indexed in the collection.

It Depends(tm). If all you want to do is provide supplier
information when people do product searches then stored-only
is fine.

If you want to perform queries like "show me all the products
supplied by supplier X", then you need to index at least
some values too.

Best,
Erick

On Sun, May 8, 2016 at 10:36 PM, Derek Poh <d...@globalsources.com>
wrote:


Hi Erick

In my case, by denormalizing, that means putting the product and supplier
information into one collection?
The supplier information is stored but not indexed in the collection.

We have identified it was a combination of a loop and bad source data that
caused an endless loop under a certain scenario.

Is it advisable to have as few queries to solr in a page as possible?


On 5/6/2016 11:17 PM, Erick Erickson wrote:


Denormalizing the data is usually the first thing to try. That's
certainly the preferred option if it doesn't bloat the index
unacceptably.

But my real question is what have you done to try to figure out _why_
it's slow? Do you have some loop
like
for (each found document)
  extract all the supplier IDs and query Solr for them)

? That's a fundamental design decision that will be expensive.

Have you examined the time each query takes to see if Solr is really
the bottleneck or whether it's "something else"? Mind you, I have no
clue what "something else" is here

Do you ever return lots of rows (i.e. thousands)?

Solr serves queries very quickly, so I'd concentrate on identifying what
is slow before jumping to a solution

Best,
Erick

On Wed, May 4, 2016 at 10:28 PM, Derek Poh <d...@globalsources.com>
wrote:


Hi

We have a "product" collection and a "supplier" collection.
The "product" collection contains products information and "supplier"
collection contains the product's suppliers information.
We have a subsidiary page that query on "product" collection for the
search.
The display result include product and supplier information.
This page will query the "product" collection to get the matching
product
records.
   From this query a list of the matching product's supplier id is
extracted
and used in a filter query against the "supplier" collection to get the
necessary supplier's information.

The loading of this page is very slow; it leads to timeouts at times as
well.
Besides looking at tweaking the code of the page, we are also looking at
what tweaking can be done on the solr side. Reducing the number of queries
generated by this page was one of the options to try.

The main "product" collection is also use by our site main search page
and
other subsidiary pages as well. So the query load on it is substantial.
It has about 6.5 million documents and index size of 38-39 GB.
It is setup as 1 shard with 5 replicas. Each replica is on it's own
server.
Total of 5 servers.
There are other smaller collections with similar 1 shard 5 replicas
setup
residing on these servers as well.

I am thinking of either
1. Index supplier information into the "product" collection.
2. Create another similar "product" collection for this page to use. This
collection will have fewer product fields and will include the required
supplier fields. But

Re: Advice to add additional non-related fields to a collection or create a subset of it?

2016-05-11 Thread Derek Poh

Hi Erick

Yes, we have identified and fixed the page's slow loading.

I was wondering if there are any best practices when it comes to 
deciding to create a single collection that stores all information in it 
or to create multiple sub-collections. I understand now it depends on the 
use-case.

My apologies for not giving it much thought before asking the questions.
Thank you for your patience.

- Derek

On 5/10/2016 12:10 PM, Erick Erickson wrote:

Not quite sure where you are at with this. It sounds
like your slow loading is fixed and was a coding
issue on your part, that happens to us all.

bq: Is it advisable to have as few
queries to solr in a page as possible?

Of course it is advisable to have as few Solr queries
executed to display a page as possible. Every one
costs you at least _some_ turnaround time. You can
mitigate this (assuming your Solr server isn't running
flat out) by issuing the subsequent queries in parallel
threads.

But it's not really a question to me of advisability, it's a
question of what your application needs to deliver. The
use-case drives all. You can do some tricks like display
partial pages and fill in the rest behind the scenes to
display when your user clicks something and the like.

bq: In my case, by denormalizing, that means putting the
product and supplier information into one collection?
The supplier information is stored but not indexed in the collection.

It Depends(tm). If all you want to do is provide supplier
information when people do product searches then stored-only
is fine.

If you want to perform queries like "show me all the products
supplied by supplier X", then you need to index at least
some values too.

Best,
Erick

On Sun, May 8, 2016 at 10:36 PM, Derek Poh <d...@globalsources.com> wrote:

Hi Erick

In my case, by denormalizing, that means putting the product and supplier
information into one collection?
The supplier information is stored but not indexed in the collection.

We have identified it was a combination of a loop and bad source data that
caused an endless loop under a certain scenario.

Is it advisable to have as few queries to solr in a page as possible?


On 5/6/2016 11:17 PM, Erick Erickson wrote:

Denormalizing the data is usually the first thing to try. That's
certainly the preferred option if it doesn't bloat the index
unacceptably.

But my real question is what have you done to try to figure out _why_
it's slow? Do you have some loop
like
for (each found document)
 extract all the supplier IDs and query Solr for them)

? That's a fundamental design decision that will be expensive.

Have you examined the time each query takes to see if Solr is really
the bottleneck or whether it's "something else"? Mind you, I have no
clue what "something else" is here

Do you ever return lots of rows (i.e. thousands)?

Solr serves queries very quickly, so I'd concentrate on identifying what
is slow before jumping to a solution

Best,
Erick

On Wed, May 4, 2016 at 10:28 PM, Derek Poh <d...@globalsources.com> wrote:

Hi

We have a "product" collection and a "supplier" collection.
The "product" collection contains products information and "supplier"
collection contains the product's suppliers information.
We have a subsidiary page that query on "product" collection for the
search.
The display result include product and supplier information.
This page will query the "product" collection to get the matching product
records.
  From this query a list of the matching product's supplier id is
extracted
and used in a filter query against the "supplier" collection to get the
necessary supplier's information.

The loading of this page is very slow; it leads to timeouts at times as
well.
Besides looking at tweaking the code of the page, we are also looking at
what tweaking can be done on the solr side. Reducing the number of queries
generated by this page was one of the options to try.

The main "product" collection is also use by our site main search page
and
other subsidiary pages as well. So the query load on it is substantial.
It has about 6.5 million documents and index size of 38-39 GB.
It is setup as 1 shard with 5 replicas. Each replica is on it's own
server.
Total of 5 servers.
There are other smaller collections with similar 1 shard 5 replicas setup
residing on these servers as well.

I am thinking of either
1. Index supplier information into the "product" collection.
2. Create another similar "product" collection for this page to use. This
collection will have fewer product fields and will include the required
supplier fields. But the number of documents in it will be the same as the
main "product" collection. The index size will be smaller though.

With either of the 2 options we do not need to query the "supplier"
collection. So there is one less query and hopefully it will improve the
performance of this page.

W

Re: Advice to add additional non-related fields to a collection or create a subset of it?

2016-05-08 Thread Derek Poh

Hi Erick

In my case, by denormalizing, that means putting the product and supplier 
information into one collection?

The supplier information is stored but not indexed in the collection.

We have identified it was a combination of a loop and bad source data that 
caused an endless loop under a certain scenario.


Is it advisable to have as few queries to solr in a page as possible?


On 5/6/2016 11:17 PM, Erick Erickson wrote:

Denormalizing the data is usually the first thing to try. That's
certainly the preferred option if it doesn't bloat the index
unacceptably.

But my real question is what have you done to try to figure out _why_
it's slow? Do you have some loop
like
for (each found document)
extract all the supplier IDs and query Solr for them)

? That's a fundamental design decision that will be expensive.

Have you examined the time each query takes to see if Solr is really
the bottleneck or whether it's "something else"? Mind you, I have no
clue what "something else" is here

Do you ever return lots of rows (i.e. thousands)?

Solr serves queries very quickly, so I'd concentrate on identifying what
is slow before jumping to a solution

Best,
Erick

On Wed, May 4, 2016 at 10:28 PM, Derek Poh <d...@globalsources.com> wrote:

Hi

We have a "product" collection and a "supplier" collection.
The "product" collection contains products information and "supplier"
collection contains the product's suppliers information.
We have a subsidiary page that query on "product" collection for the search.
The display result include product and supplier information.
This page will query the "product" collection to get the matching product
records.
 From this query a list of the matching product's supplier id is extracted
and used in a filter query against the "supplier" collection to get the
necessary supplier's information.

The loading of this page is very slow; it leads to timeouts at times as well.
Besides looking at tweaking the code of the page, we are also looking at what
tweaking can be done on the solr side. Reducing the number of queries generated
by this page was one of the options to try.

The main "product" collection is also use by our site main search page and
other subsidiary pages as well. So the query load on it is substantial.
It has about 6.5 million documents and index size of 38-39 GB.
It is setup as 1 shard with 5 replicas. Each replica is on it's own server.
Total of 5 servers.
There are other smaller collections with similar 1 shard 5 replicas setup
residing on these servers as well.

I am thinking of either
1. Index supplier information into the "product" collection.
2. Create another similar "product" collection for this page to use. This
collection will have fewer product fields and will include the required
supplier fields. But the number of documents in it will be the same as the
main "product" collection. The index size will be smaller though.

With either of the 2 options we do not need to query the "supplier"
collection. So there is one less query and hopefully it will improve the
performance of this page.

What is the advise between the 2 options?
Any other advice or options?

Derek







Advice to add additional non-related fields to a collection or create a subset of it?

2016-05-04 Thread Derek Poh

Hi

We have a "product" collection and a "supplier" collection.
The "product" collection contains products information and "supplier" 
collection contains the product's suppliers information.
We have a subsidiary page that query on "product" collection for the 
search. The display result include product and supplier information.
This page will query the "product" collection to get the matching 
product records.
From this query a list of the matching product's supplier id is 
extracted and used in a filter query against the "supplier" collection 
to get the necessary supplier's information.
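
For illustration, a sketch of that second query (the P_SupplierId field
name and the ids are assumptions for the example):

q=*:*&fq=P_SupplierId:(1000123 OR 1000456 OR 1000789)&rows=3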


The loading of this page is very slow; it leads to timeouts at times as 
well. Besides looking at tweaking the code of the page, we are also 
looking at what tweaking can be done on the solr side. Reducing the number 
of queries generated by this page was one of the options to try.


The main "product" collection is also use by our site main search page 
and other subsidiary pages as well. So the query load on it is substantial.

It has about 6.5 million documents and index size of 38-39 GB.
It is setup as 1 shard with 5 replicas. Each replica is on it's own 
server. Total of 5 servers.
There are other smaller collections with similar 1 shard 5 replicas 
setup residing on these servers as well.


I am thinking of either
1. Index supplier information into the "product" collection.
2. Create another similar "product" collection for this page to use. 
This collection will have fewer product fields and will include the 
required supplier fields. But the number of documents in it will be the 
same as the main "product" collection. The index size will be smaller though.


With either of the 2 options we do not need to query the "supplier" 
collection. So there is one less query and hopefully it will improve the 
performance of this page.


What is the advice between the 2 options?
Any other advice or options?

Derek


Re: make document with more matches rank higher with edismax parser?

2016-04-03 Thread Derek Poh
Will try the "tie" parameterand see if it satisfy business user 
requirements. Thank you.


On 4/2/2016 7:15 AM, Alexandre Rafalovitch wrote:

Have you tried 'tie' parameter?

https://cwiki.apache.org/confluence/display/solr/The+DisMax+Query+Parser#TheDisMaxQueryParser-Thetie%28TieBreaker%29Parameter
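
For illustration, the parameter added to the query from this thread
(tie=0.1 is just a common starting value, not a recommendation):

q=Bancos del poder&defType=edismax&qf=P_CatConcatKeyword^3.0 P_NewShortDescription^2.0 P_NewVeryShortDescription^1.0&tie=0.1

With tie=0.0 only the best-matching field contributes to the score; with
tie=0.1 every other matching field adds a tenth of its score, so a
document that matches in more fields scores higher; tie=1.0 sums all
field scores.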

Regards,
Alex.

Newsletter and resources for Solr beginners and intermediates:
http://www.solr-start.com/


On 1 April 2016 at 14:03, Derek Poh <d...@globalsources.com> wrote:

Hi

Correct me if I am wrong; my understanding of the edismax parser is that it
uses the max score of the matches in a doc.

How do I make docs with more matches rank higher with edismax?

These 2 docs are from the same query result and this is their order in the
result.

P_ProductId: 1116393488
P_CatConcatKeyword: Bancos del poder
P_NewShortDescription: Accione el banco, 10,400mAh, 5.0V DC entran
P_VeryShortDescription: Accione el banco

score: 0.83850163

P_ProductId: 1124048475
P_CatConcatKeyword: Bancos del poder
P_NewShortDescription: Banco del poder con el altavoz
P_VeryShortDescription: Banco del poder

score: 0.83850163

q=Bancos del poder
qf=P_CatConcatKeyword^3.0 P_NewShortDescription^2.0
P_NewVeryShortDescription^1.0

From the debug info, both docs' max-score match is from the P_CatConcatKeyword
field. Debug info of both docs attached.
Comparing the field matches between both, the 2nd doc has more fields with
matches. How can I make the 2nd doc rank higher based on this?
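(For reference, a minimal sketch of applying the "tie" suggestion to the 
query above. With tie=0.0 dismax keeps only the max field score; with 
tie=1.0 it adds in the scores of the other matching fields, so a doc 
matching in more fields scores higher. The value here is just an assumption 
to tune:)

q=Bancos del poder
defType=edismax
qf=P_CatConcatKeyword^3.0 P_NewShortDescription^2.0 P_NewVeryShortDescription^1.0
tie=1.0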











make document with more matches rank higher with edismax parser?

2016-03-31 Thread Derek Poh

Hi

Correct me if I am wrong, my understanding of the edismax parser is that it uses 
the max score of the matches in a doc.


How do I make docs with more matches rank higher with edismax?

These 2 docs are from the same query result and this is their order in 
the result.


P_ProductId: 1116393488
P_CatConcatKeyword: Bancos del poder
P_NewShortDescription: Accione el banco, 10,400mAh, 5.0V DC entran
P_VeryShortDescription: Accione el banco

score: 0.83850163

P_ProductId: 1124048475
P_CatConcatKeyword: Bancos del poder
P_NewShortDescription: Banco del poder con el altavoz
P_VeryShortDescription: Banco del poder

score: 0.83850163

q=Bancos del poder
qf=P_CatConcatKeyword^3.0 P_NewShortDescription^2.0 
P_NewVeryShortDescription^1.0


From the debug info, both docs' max-score match is from the 
P_CatConcatKeyword field. Debug info of both docs attached.
Comparing the field matches between both, the 2nd doc has more fields 
with matches. How can I make the 2nd doc rank higher based on this?


1124048475

0.83850163 = (MATCH) sum of:
  0.004233816 = (MATCH) sum of:
0.0019395099 = (MATCH) max of:
  8.000289E-9 = (MATCH) weight(spp_keyword:banc^1.0E-5 in 6088628) 
[DefaultSimilarity], result of:
8.000289E-9 = score(doc=6088628,freq=1.0), product of:
  1.74163E-9 = queryWeight, product of:
1.0E-5 = boost
9.187129 = idf(docFreq=1868, maxDocs=6717914)
1.8957282E-5 = queryNorm
  4.5935645 = fieldWeight in 6088628, product of:
1.0 = tf(freq=1.0), with freq of:
  1.0 = termFreq=1.0
9.187129 = idf(docFreq=1868, maxDocs=6717914)
0.5 = fieldNorm(doc=6088628)
  5.8594847E-4 = (MATCH) weight(P_NewShortDescription:banco in 6088628) 
[DefaultSimilarity], result of:
5.8594847E-4 = score(doc=6088628,freq=1.0), product of:
  1.0539445E-4 = queryWeight, product of:
5.559576 = idf(docFreq=70312, maxDocs=6717914)
1.8957282E-5 = queryNorm
  5.559576 = fieldWeight in 6088628, product of:
1.0 = tf(freq=1.0), with freq of:
  1.0 = termFreq=1.0
5.559576 = idf(docFreq=70312, maxDocs=6717914)
1.0 = fieldNorm(doc=6088628)
  0.0012108017 = (MATCH) weight(P_VeryShortDescription:banco^2.0 in 
6088628) [DefaultSimilarity], result of:
0.0012108017 = score(doc=6088628,freq=1.0), product of:
  2.1425923E-4 = queryWeight, product of:
2.0 = boost
5.6511064 = idf(docFreq=64162, maxDocs=6717914)
1.8957282E-5 = queryNorm
  5.6511064 = fieldWeight in 6088628, product of:
1.0 = tf(freq=1.0), with freq of:
  1.0 = termFreq=1.0
5.6511064 = idf(docFreq=64162, maxDocs=6717914)
1.0 = fieldNorm(doc=6088628)
  0.0019395099 = (MATCH) weight(P_CatConcatKeyword:banco^3.0 in 6088628) 
[DefaultSimilarity], result of:
0.0019395099 = score(doc=6088628,freq=1.0), product of:
  3.3211973E-4 = queryWeight, product of:
3.0 = boost
5.8397913 = idf(docFreq=53129, maxDocs=6717914)
1.8957282E-5 = queryNorm
  5.8397913 = fieldWeight in 6088628, product of:
1.0 = tf(freq=1.0), with freq of:
  1.0 = termFreq=1.0
5.8397913 = idf(docFreq=53129, maxDocs=6717914)
1.0 = fieldNorm(doc=6088628)
4.8392292E-4 = (MATCH) max of:
  3.6249184E-9 = (MATCH) weight(spp_keyword:del^1.0E-5 in 6088628) 
[DefaultSimilarity], result of:
3.6249184E-9 = score(doc=6088628,freq=1.0), product of:
  1.1723361E-9 = queryWeight, product of:
1.0E-5 = boost
6.184094 = idf(docFreq=37653, maxDocs=6717914)
1.8957282E-5 = queryNorm
  3.092047 = fieldWeight in 6088628, product of:
1.0 = tf(freq=1.0), with freq of:
  1.0 = termFreq=1.0
6.184094 = idf(docFreq=37653, maxDocs=6717914)
0.5 = fieldNorm(doc=6088628)
  4.699589E-5 = (MATCH) weight(P_NewShortDescription:del in 6088628) 
[DefaultSimilarity], result of:
4.699589E-5 = score(doc=6088628,freq=1.0), product of:
  2.9848188E-5 = queryWeight, product of:
1.5744972 = idf(docFreq=3782103, maxDocs=6717914)
1.8957282E-5 = queryNorm
  1.5744972 = fieldWeight in 6088628, product of:
1.0 = tf(freq=1.0), 

Re: Filter factory to reduce word from plural forms to singular forms correctly?

2016-02-29 Thread Derek Poh

Hi Alex

Can you advise how I can make use of copyField to handle this issue?

NLP lemmatisation will be the last resort, subject to budget and the 
business users' decision.


Derek


On 3/1/2016 8:13 AM, Alexandre Rafalovitch wrote:

On 29 February 2016 at 20:40, Derek Poh <d...@globalsources.com> wrote:

Is there another filter factory that can reduce plural to singular correctly?

English is not an easy language and most of the heuristic filters have
issues. You could try copyField and multiple approaches.

Or, if this is a really Really big issue for you, there are commercial
companies that do NLP lemmatisation properly and integrate with Solr.
But they are not cheap.

Regards,
Alex.


Newsletter and resources for Solr beginners and intermediates:
http://www.solr-start.com/
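(For reference, a minimal sketch of the copyField approach Alexandre 
describes: index the same source text into two fields with different 
stemmers and search across both, so a match from either stemmer is found. 
All field and type names here are assumptions:)

<field name="keyword" type="text_en_minimal" indexed="true" stored="true"/>
<field name="keyword_porter" type="text_en_porter" indexed="true" stored="false"/>
<copyField source="keyword" dest="keyword_porter"/>

(then search both, e.g. qf=keyword keyword_porter)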






Re: Filter factory to reduce word from plural forms to singular forms correctly?

2016-02-29 Thread Derek Poh

Hi Emir

For my use case, it is to do an exact match (enclosing the search keyword in 
double quotes) on a search field. A search on "power banks" should return 
matches for "power bank" and "power banks", singular and plural forms.
I will need to do further testing with PorterStemFilter to ensure it meets 
the business use case.


On 2/29/2016 7:07 PM, Emir Arnautovic wrote:

Hi Derek,
Why does aggressive stemming worry you? You might have false 
positives, but that is desired behavior in most cases. In your case 
"iphone" documents will also be returned for "iphon" query. Is this 
something that is not desired behavior? You can have more than one 
field if you want to prefer matches with exact wording, but that is 
unnecessary overhead in most cases.


Regards,
Emir





Filter factory to reduce word from plural forms to singular forms correctly?

2016-02-29 Thread Derek Poh

Hi

I am using EnglishMinimalStemFilterFactory to reduce words in plural 
forms to singular forms.
The filter factory is not reducing the plural form of 'es' to the singular 
form correctly. It reduces the plural form of 's' correctly.

"boxes" is reduced to "boxe" instead of "box"
"glasses" to "glasse" instead of "glass" etc.

I tried PorterStemFilterFactory; it is able to reduce the plural 
'es' form to the singular form correctly. However, it reduced "iphones" to 
"iphon" instead.


Is there another filter factory that can reduce plural to singular correctly?

The field type definition of the field:

<fieldType name="..." class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer .../>
    ...
    <filter class="solr.EnglishMinimalStemFilterFactory"/>
    ...
  </analyzer>
</fieldType>


Re: "pf" not supported by edismax?

2016-02-14 Thread Derek Poh

Hi Jack

Sorry I am confused.

For my case, it seems that "pf" only works with dismax.

with dismax:


+((spp_keyword_exact:dvd) (spp_keyword_exact:bracket))
*(spp_keyword_exact:dvd bracket)*



with edismax:


+((spp_keyword_exact:dvd) (spp_keyword_exact:bracket)) ()
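(An illustration of Jack's point below, with the second analysis chain 
assumed for comparison: KeywordTokenizerFactory turns the whole input into 
a single token, so there is no multi-term phrase for pf to build:)

KeywordTokenizerFactory:     "dvd bracket" -> ["dvd bracket"]    (one token)
whitespace-style tokenizer:  "dvd bracket" -> ["dvd", "bracket"] (two tokens, a phrase query is possible)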




On 2/15/2016 1:26 PM, Jack Krupansky wrote:

Maybe because the tokenized phrase produces only a single term it is
ignored. In any case, it won't be a phrase. pf only does something useful
for phrases. IOW, where a PhraseQuery can be generated. A PhraseQuery for
more than a single term would never match when the field value is a single
term.

-- Jack Krupansky

On Mon, Feb 15, 2016 at 12:11 AM, Derek Poh <d...@globalsources.com> wrote:


It is using KeywordTokenizerFactory. Is it still considered as tokenized?

Here's the field definition:

<field name="spp_keyword_exact" type="gs_keyword_exact" multiValued="true"/>

<fieldType name="gs_keyword_exact" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    ...
  </analyzer>
</fieldType>


On 2/15/2016 12:43 PM, Jack Krupansky wrote:


pf stands for phrase boosting, which implies tokenized text...
spp_keyword_exact sounds like it is not tokenized.

-- Jack Krupansky

On Sun, Feb 14, 2016 at 10:08 PM, Derek Poh <d...@globalsources.com>
wrote:

Hi

Correct me if I am wrong, edismax is an extension of dismax, so it will
support "pf".
But from my testing I noticed "pf" is not working with edismax.
  From the debug information of a query using "pf" with edismax, there is
no
phrase match for the "pf" field "spp_keyword_exact".
If I changed to dismax, it is doing a phrase match on the field.

Is this normal?

We are running Solr 4.10.4.

Below are the queries and their debug information.

Query using "pf" with edismax and the debug statement:


http://hkenedcdg1.globalsources.com:8983/solr/product/select?q=dvd%20bracket=spp_keyword_exact=P_SPPKW,P_NewShortDescription.P_CatConCatKeyword,P_VeryShortDescription=spp_keyword_exact=query=edismax

dvd bracket
dvd bracket

(+(DisjunctionMaxQuery((spp_keyword_exact:dvd))
DisjunctionMaxQuery((spp_keyword_exact:bracket))) ())/no_coord


+((spp_keyword_exact:dvd) (spp_keyword_exact:bracket)) ()

ExtendedDismaxQParser


Query using "pf" with dismax and the debug statement:


http://hkenedcdg1.globalsources.com:8983/solr/product/select?q=dvd%20bracket=spp_keyword_exact=P_SPPKW,P_NewShortDescription.P_CatConCatKeyword,P_VeryShortDescription=spp_keyword_exact=query=dismax

dvd bracket
dvd bracket

(+(DisjunctionMaxQuery((spp_keyword_exact:dvd))
DisjunctionMaxQuery((spp_keyword_exact:bracket)))
DisjunctionMaxQuery((spp_keyword_exact:dvd bracket)))/no_coord


+((spp_keyword_exact:dvd) (spp_keyword_exact:bracket))
(spp_keyword_exact:dvd bracket)

DisMaxQParser

Derek









Re: "pf" not supported by edismax?

2016-02-14 Thread Derek Poh

It is using KeywordTokenizerFactory. Is it still considered as tokenized?

Here's the field definition:

<field name="spp_keyword_exact" type="gs_keyword_exact" multiValued="true"/>

<fieldType name="gs_keyword_exact" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    ...
  </analyzer>
</fieldType>


On 2/15/2016 12:43 PM, Jack Krupansky wrote:

pf stands for phrase boosting, which implies tokenized text...
spp_keyword_exact sounds like it is not tokenized.

-- Jack Krupansky

On Sun, Feb 14, 2016 at 10:08 PM, Derek Poh <d...@globalsources.com> wrote:


Hi

Correct me if I am wrong, edismax is an extension of dismax, so it will
support "pf".
But from my testing I noticed "pf" is not working with edismax.
 From the debug information of a query using "pf" with edismax, there is no
phrase match for the "pf" field "spp_keyword_exact".
If I changed to dismax, it is doing a phrase match on the field.

Is this normal?

We are running Solr 4.10.4.

Below are the queries and their debug information.

Query using "pf" with edismax and the debug statement:

http://hkenedcdg1.globalsources.com:8983/solr/product/select?q=dvd%20bracket=spp_keyword_exact=P_SPPKW,P_NewShortDescription.P_CatConCatKeyword,P_VeryShortDescription=spp_keyword_exact=query=edismax

dvd bracket
dvd bracket

(+(DisjunctionMaxQuery((spp_keyword_exact:dvd))
DisjunctionMaxQuery((spp_keyword_exact:bracket))) ())/no_coord


+((spp_keyword_exact:dvd) (spp_keyword_exact:bracket)) ()

ExtendedDismaxQParser


Query using "pf" with dismax and the debug statement:

http://hkenedcdg1.globalsources.com:8983/solr/product/select?q=dvd%20bracket=spp_keyword_exact=P_SPPKW,P_NewShortDescription.P_CatConCatKeyword,P_VeryShortDescription=spp_keyword_exact=query=dismax

dvd bracket
dvd bracket

(+(DisjunctionMaxQuery((spp_keyword_exact:dvd))
DisjunctionMaxQuery((spp_keyword_exact:bracket)))
DisjunctionMaxQuery((spp_keyword_exact:dvd bracket)))/no_coord


+((spp_keyword_exact:dvd) (spp_keyword_exact:bracket))
(spp_keyword_exact:dvd bracket)

DisMaxQParser

Derek






"pf" not supported by edismax?

2016-02-14 Thread Derek Poh

Hi

Correct me if I am wrong, edismax is an extension of dismax, so it will 
support "pf".

But from my testing I noticed "pf" is not working with edismax.
From the debug information of a query using "pf" with edismax, there is 
no phrase match for the "pf" field "spp_keyword_exact".

If I changed to dismax, it is doing a phrase match on the field.

Is this normal?

We are running Solr 4.10.4.

Below are the queries and their debug information.

Query using "pf" with edismax and the debug statement:
http://hkenedcdg1.globalsources.com:8983/solr/product/select?q=dvd%20bracket=spp_keyword_exact=P_SPPKW,P_NewShortDescription.P_CatConCatKeyword,P_VeryShortDescription=spp_keyword_exact=query=edismax

dvd bracket
dvd bracket

(+(DisjunctionMaxQuery((spp_keyword_exact:dvd)) 
DisjunctionMaxQuery((spp_keyword_exact:bracket))) ())/no_coord



+((spp_keyword_exact:dvd) (spp_keyword_exact:bracket)) ()

ExtendedDismaxQParser


Query using "pf" with dismax and the debug statement:
http://hkenedcdg1.globalsources.com:8983/solr/product/select?q=dvd%20bracket=spp_keyword_exact=P_SPPKW,P_NewShortDescription.P_CatConCatKeyword,P_VeryShortDescription=spp_keyword_exact=query=dismax

dvd bracket
dvd bracket

(+(DisjunctionMaxQuery((spp_keyword_exact:dvd)) 
DisjunctionMaxQuery((spp_keyword_exact:bracket))) 
DisjunctionMaxQuery((spp_keyword_exact:dvd bracket)))/no_coord



+((spp_keyword_exact:dvd) (spp_keyword_exact:bracket)) 
(spp_keyword_exact:dvd bracket)


DisMaxQParser

Derek


Re: implement exact match for one of the search fields only?

2016-02-04 Thread Derek Poh

Hi Erick

<<
The manual way of doing this would be to construct an elaborate query, 
like q=spp_keyword_exact:"dvd bracket" OR P_ShortDescription:(dvd 
bracket) OR NOTE: the parens are necessary or the last part of the 
above would be parsed as P_ShortDescription:dvd default_searchfield:bracket

>>

Your suggestion to construct the query like q=spp_keyword_exact:"dvd 
bracket" OR P_ShortDescription:(dvd bracket) OR ... does not fit into our 
current implementation.
The front-end pages will only pass the "q=search keywords" in the query 
to Solr. The list of search fields (qf) is pre-defined in Solr.


Do you have any alternatives to implement your suggestion without making 
changes to the front-end?
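(For what it's worth, one common way to keep the front-end passing only "q" 
is to pre-define everything else as request-handler defaults in 
solrconfig.xml; a minimal sketch, where the handler name and field lists 
are assumptions:)

<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="qf">P_VeryShortDescription P_ShortDescription P_CatConcatKeyword</str>
    <str name="pf">spp_keyword_exact</str>
  </lst>
</requestHandler>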


On 1/29/2016 1:49 AM, Erick Erickson wrote:

bq: if you are interested phrase query, you should use String field

If you do this, you will NOT be able to search within the string. I.e.
if the doc field is "my dog has fleas" you cannot match
"dog has" with a string-based field.

If you want to match the _entire_ string or you want prefix-only
matching, then string might work, i.e. if you _only_ want to be able
to match

"my dog has fleas"
"my dog*"
but not
"dog has fleas".

On to the root question though.

I really think you want to look at edismax. What you're trying to do
is apply the same search term to individual fields. In particular,
the pf parameter will automatically apply the search terms _as a phrase_
against the field specified, relieving you of having to enclose things
in quotes.

The manual way of doing this would be to construct an elaborate query, like
q=spp_keyword_exact:"dvd bracket" OR P_ShortDescription:(dvd bracket) OR

NOTE: the parens are necessary or the last part of the above would be
parsed as
P_ShortDescription:dvd default_searchfield:bracket

And the debug=query trick will show you exactly how things are actually
searched; it's invaluable.

Best,
Erick

On Thu, Jan 28, 2016 at 5:08 AM, Mugeesh Husain  wrote:

Hi,
if you are interested in phrase queries, you should use a String field instead of
a text field in the schema, like:
  ...

this will solve your problem.

if you are missing anything else, please share



--
View this message in context: 
http://lucene.472066.n3.nabble.com/implement-exact-match-for-one-of-the-search-fields-only-tp4253786p4253827.html
Sent from the Solr - User mailing list archive at Nabble.com.







Re: implement exact match for one of the search fields only?

2016-01-28 Thread Derek Poh

Hi Emir

For the other search fields, if they have matches they should be returned.

On 1/28/2016 8:17 PM, Emir Arnautovic wrote:

Hi Derek,
It is not clear what you are trying to achieve: "one of the search 
fields is an exact phrase match while the rest of the search fields 
can be exact or partial matches". What does "while" mean - does it have to 
match in the other fields as well, or should the result just be scored 
better if it does, without it being mandatory to match?

For exact match you can use string type instead of text.
For querying multiple fields you can take a look at (e)dismax query 
parser.


Regards,
Emir





Re: implement exact match for one of the search fields only?

2016-01-28 Thread Derek Poh
Do you mean for the spp_keyword_exact field, I should use a String field 
with a keyword tokenizer and a lowercase token filter?


On 1/28/2016 10:54 PM, Alessandro Benedetti wrote:

I think you are overthinking the problem:
I agree the described one is the most obvious solution in your case.
The only addition is to use a keyword-tokenised field type, lowercase token
filtered if you want to be case-insensitive.

Cheers

On 28 January 2016 at 13:08, Mugeesh Husain  wrote:


Hi,
if you are interested in phrase queries, you should use a String field instead of
a text field in the schema, like:
  ...

this will solve your problem.

if you are missing anything else, please share



--
View this message in context:
http://lucene.472066.n3.nabble.com/implement-exact-match-for-one-of-the-search-fields-only-tp4253786p4253827.html
Sent from the Solr - User mailing list archive at Nabble.com.








implement exact match for one of the search fields only?

2016-01-28 Thread Derek Poh

Hi

First of all, sorry for the long post.

How do I implement or structure the query such that one of the search 
fields is an exact phrase match while the rest of the search fields can 
be exact or partial matches? Is this possible?


I have the following search fields
- P_VeryShortDescription
- P_ShortDescription
- P_CatConcatKeyword
- spp_keyword_exact

For the spp_keyword_exact field, I want to apply an exact match to it.

I have a document with the following information. If I search for 'dvd', 
this document should not match. However if I search for 'dvd bracket', 
this document should match.

Right now when I search for 'dvd', it is not returned, which is correct.
I want it to be returned when I search for 'dvd bracket' but it is not.
I tried enclosing it in double quotes "dvd bracket" but it is still not 
returned. Then again I can't enclose the search terms in double quotes 
"dvd bracket", as those documents with the words 'dvd' and 'bracket' in 
the other fields will not be matched, am I right?


doc:

Re: implement exact match for one of the search fields only?

2016-01-28 Thread Derek Poh

Hi Erick and all

Yes I am trying to apply the same search term to all the 4 search 
fields, and 1 of the search fields must be an exact match.


You mentioned "In particular, the pf parameter will automatically apply 
the search terms _as a phrase_ against the field specified, relieving 
you of having to enclose things in quotes."

I tried but it is not returning the document.

http://hkenedcdg1.globalsources.com:8983/solr/product/select?q=dvd%20bracket=spp_keyword_exact=edismax=query=spp_keyword_exact=P_ProductId,spp_keyword_exact,P_SPPKW

I may have misunderstood.


On 1/29/2016 1:49 AM, Erick Erickson wrote:

bq: if you are interested phrase query, you should use String field

If you do this, you will NOT be able to search within the string. I.e.
if the doc field is "my dog has fleas" you cannot match
"dog has" with a string-based field.

If you want to match the _entire_ string or you want prefix-only
matching, then string might work, i.e. if you _only_ want to be able
to match

"my dog has fleas"
"my dog*"
but not
"dog has fleas".

On to the root question though.

I really think you want to look at edismax. What you're trying to do
is apply the same search term to individual fields. In particular,
the pf parameter will automatically apply the search terms _as a phrase_
against the field specified, relieving you of having to enclose things
in quotes.

The manual way of doing this would be to construct an elaborate query, like
q=spp_keyword_exact:"dvd bracket" OR P_ShortDescription:(dvd bracket) OR

NOTE: the parens are necessary or the last part of the above would be
parsed as
P_ShortDescription:dvd default_searchfield:bracket

And the debug=query trick will show you exactly how things are actually
searched; it's invaluable.

Best,
Erick

On Thu, Jan 28, 2016 at 5:08 AM, Mugeesh Husain  wrote:

Hi,
if you are interested in phrase queries, you should use a String field instead of
a text field in the schema, like:
  ...

this will solve your problem.

if you are missing anything else, please share



--
View this message in context: 
http://lucene.472066.n3.nabble.com/implement-exact-match-for-one-of-the-search-fields-only-tp4253786p4253827.html
Sent from the Solr - User mailing list archive at Nabble.com.






Re: StringIndexOutOfBoundsException using spellcheck and synonyms

2015-11-17 Thread Derek Poh

Hi

Any advice on how to resolve or work around this issue?


On 11/17/2015 8:28 AM, Derek Poh wrote:

Hi Scott

I am using Solr 4.10.4.

On 11/16/2015 10:06 PM, Scott Stults wrote:

Hi Derek,

Could you please add what version of Solr you see this in? I didn't see a
related Jira, so this might warrant a new one.


k/r,
Scott

On Sun, Nov 15, 2015 at 11:01 PM, Derek Poh <d...@globalsources.com> 
wrote:



Hi
I am using spellcheck and synonyms. I am getting
"java.lang.StringIndexOutOfBoundsException: String index out of range: -1"
for some keywords.

I think I managed to narrow down the likely cause of it.
I have this line of entry in the synonyms.txt file,

body spray,cologne,parfum,parfume,perfume,purfume,toilette

When I search for 'cologne' it will hit the exception.
If I remove the 'body spray' from the line, I will not hit the 
exception.


cologne,parfum,parfume,perfume,purfume,toilette

It seems like it could be due to multi terms in the synonyms files, but
there are some keywords with multi terms in synonyms that do not have the
issue.
This line has a multi-term "paint ball" in it; when I search for paintball
or paintballs it does not hit the exception.

paintball,paintballs,paint ball


Any advice on how I can resolve this issue?


The field used for spellcheck:

<field name="..." type="..." multiValued="true"/>

<fieldType name="..." class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer .../>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" />
    ...
  </analyzer>
  <analyzer type="query">
    <tokenizer .../>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    ...
  </analyzer>
</fieldType>


Exception stacktrace:
2015-11-16T07:06:43,055 - ERROR 
[qtp744979286-193443:SolrException@142] -
null:java.lang.StringIndexOutOfBoundsException: String index out of 
range:

-1
 at
java.lang.AbstractStringBuilder.replace(AbstractStringBuilder.java:789)
 at java.lang.StringBuilder.replace(StringBuilder.java:266)
 at
org.apache.solr.spelling.SpellCheckCollator.getCollation(SpellCheckCollator.java:235) 


 at
org.apache.solr.spelling.SpellCheckCollator.collate(SpellCheckCollator.java:92) 


 at
org.apache.solr.handler.component.SpellCheckComponent.addCollationsToResponse(SpellCheckComponent.java:230) 


 at
org.apache.solr.handler.component.SpellCheckComponent.process(SpellCheckComponent.java:197) 


 at
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:218) 


 at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135) 


 at org.apache.solr.core.SolrCore.execute(SolrCore.java:1976)
 at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:777) 


 at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:418) 


 at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207) 


 at
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652) 


 at
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585) 


 at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143) 


 at
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577) 


 at
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223) 


 at
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127) 


 at
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515) 


 at
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185) 


 at
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061) 


 at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) 


 at
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215) 


 at
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:110) 


 at
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97) 


 at org.eclipse.jetty.server.Server.handle(Server.java:497)
 at
org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:310)
 at
org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:257) 


 at org.eclipse.jetty.io
.AbstractConnection$2.run(AbstractConnection.java:540)
 at
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635) 


 at
org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555) 


 at java.lang.Thread.run(Thread.java:722)

Derek


Re: StringIndexOutOfBoundsException using spellcheck and synonyms

2015-11-16 Thread Derek Poh

Hi Scott

I am using Solr 4.10.4.

On 11/16/2015 10:06 PM, Scott Stults wrote:

Hi Derek,

Could you please add what version of Solr you see this in? I didn't see a
related Jira, so this might warrant a new one.


k/r,
Scott

On Sun, Nov 15, 2015 at 11:01 PM, Derek Poh <d...@globalsources.com> wrote:


Hi
I am using spellcheck and synonyms. I am getting
"java.lang.StringIndexOutOfBoundsException: String index out of range: -1"
for some keywords.

I think I managed to narrow down the likely cause of it.
I have this line of entry in the synonyms.txt file,

body spray,cologne,parfum,parfume,perfume,purfume,toilette

When I search for 'cologne' it will hit the exception.
If I remove the 'body spray' from the line, I will not hit the exception.

cologne,parfum,parfume,perfume,purfume,toilette

It seems like it could be due to multi terms in the synonyms files, but
there are some keywords with multi terms in synonyms that do not have the
issue.
This line has a multi-term "paint ball" in it; when I search for paintball
or paintballs it does not hit the exception.

paintball,paintballs,paint ball


Any advice on how I can resolve this issue?


The field used for spellcheck:

<field name="..." type="..." multiValued="true"/>

<fieldType name="..." class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer .../>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" />
    ...
  </analyzer>
  <analyzer type="query">
    <tokenizer .../>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    ...
  </analyzer>
</fieldType>


Exception stacktrace:
2015-11-16T07:06:43,055 - ERROR [qtp744979286-193443:SolrException@142] -
null:java.lang.StringIndexOutOfBoundsException: String index out of range:
-1
 at
java.lang.AbstractStringBuilder.replace(AbstractStringBuilder.java:789)
 at java.lang.StringBuilder.replace(StringBuilder.java:266)
 at
org.apache.solr.spelling.SpellCheckCollator.getCollation(SpellCheckCollator.java:235)
 at
org.apache.solr.spelling.SpellCheckCollator.collate(SpellCheckCollator.java:92)
 at
org.apache.solr.handler.component.SpellCheckComponent.addCollationsToResponse(SpellCheckComponent.java:230)
 at
org.apache.solr.handler.component.SpellCheckComponent.process(SpellCheckComponent.java:197)
 at
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:218)
 at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
 at org.apache.solr.core.SolrCore.execute(SolrCore.java:1976)
 at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:777)
 at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:418)
 at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)
 at
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
 at
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
 at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
 at
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)
 at
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)
 at
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
 at
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
 at
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
 at
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
 at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
 at
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
 at
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:110)
 at
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
 at org.eclipse.jetty.server.Server.handle(Server.java:497)
 at
org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:310)
 at
org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
 at org.eclipse.jetty.io
.AbstractConnection$2.run(AbstractConnection.java:540)
 at
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
 at
org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
 at java.lang.Thread.run(Thread.java:722)

Derek


StringIndexOutOfBoundsException using spellcheck and synonyms

2015-11-15 Thread Derek Poh

Hi
I am using spellcheck and synonyms. I am getting 
"java.lang.StringIndexOutOfBoundsException: String index out of range: -1" 
for some keywords.


I think I managed to narrow down the likely cause of it.
I have this line of entry in the synonyms.txt file,

body spray,cologne,parfum,parfume,perfume,purfume,toilette

When I search for 'cologne' it will hit the exception.
If I remove the 'body spray' from the line, I will not hit the exception.

cologne,parfum,parfume,perfume,purfume,toilette

It seems like it could be due to multi terms in the synonyms files, but 
there are some keywords with multi terms in synonyms that do not have 
the issue.
This line has a multi-term "paint ball" in it; when I search for 
paintball or paintballs it does not hit the exception.


paintball,paintballs,paint ball


Any advice on how I can resolve this issue?


The field used for spellcheck:

<field name="..." type="..." multiValued="true"/>

<fieldType name="..." class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer .../>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" />
    ...
  </analyzer>
  <analyzer type="query">
    <tokenizer .../>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    ...
  </analyzer>
</fieldType>


Exception stacktrace:
2015-11-16T07:06:43,055 - ERROR [qtp744979286-193443:SolrException@142] 
- null:java.lang.StringIndexOutOfBoundsException: String index out of 
range: -1
at 
java.lang.AbstractStringBuilder.replace(AbstractStringBuilder.java:789)

at java.lang.StringBuilder.replace(StringBuilder.java:266)
at 
org.apache.solr.spelling.SpellCheckCollator.getCollation(SpellCheckCollator.java:235)
at 
org.apache.solr.spelling.SpellCheckCollator.collate(SpellCheckCollator.java:92)
at 
org.apache.solr.handler.component.SpellCheckComponent.addCollationsToResponse(SpellCheckComponent.java:230)
at 
org.apache.solr.handler.component.SpellCheckComponent.process(SpellCheckComponent.java:197)
at 
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:218)
at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)

at org.apache.solr.core.SolrCore.execute(SolrCore.java:1976)
at 
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:777)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:418)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)
at 
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
at 
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
at 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
at 
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)
at 
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)
at 
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
at 
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
at 
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
at 
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
at 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
at 
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
at 
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:110)
at 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)

at org.eclipse.jetty.server.Server.handle(Server.java:497)
at 
org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:310)
at 
org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
at 
org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540)
at 
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
at 
org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)

at java.lang.Thread.run(Thread.java:722)

Derek


Re: 'missing content stream' issuing expungeDeletes=true

2015-09-02 Thread Derek Poh

There are around 6+ million documents in the collection.

Each document (or product record) is unique in the collection.
When we found out the document had a docfreq of 2, we did a query on the 
document's product id and indeed 2 documents were returned.
We suspect 1 of them was deleted but not removed from the index. We tried 
optimizing. Only 1 document is returned when we query again, and the 
document's docfreq is 1.

We checked the source data and the document is not duplicated.
It could be the way we index (full index every time) that results in this 
scenario of having 2 of the same document in the index.


On 9/2/2015 12:11 PM, Erick Erickson wrote:

How many documents total in your corpus? And how many do you
intend to have?

My point is that if you are testing this with a small corpus, the results
are very likely different than when you test on a reasonable corpus.
So if you expect your "real" index will contain many more docs than
what you're testing, this is likely a red herring.

But something isn't making a lot of sense here. You say you've traced it
to having a docfreq of 2 that changes to 1. But that means that the
value is unique in your entire corpus, which kind of indicates you're
trying to boost on unique values which is unusual.

If you're confident in your model though, the only way to guarantee
what you want is to optimize/expungeDeletes.

Best,
Erick

On Tue, Sep 1, 2015 at 7:51 PM, Derek Poh <d...@globalsources.com> wrote:

Erick

Yes, we see documents changing their position in the list due to having
deleted docs.
In our search result, we apply a higher boost (bq) to a group of matched
documents to have them display at the top tier of the result.
At times 1 or 2 of these documents are not returned in the top tier; they are
relegated down to the lower tier of the result. We discovered that these
documents have a lower score due to docFreq=2.
After we do an optimize, these 1-2 documents are back in the top tier result
order and their docFreq is 1.



On 9/1/2015 11:40 PM, Erick Erickson wrote:

Derek:

Why do you care? What evidence do you have that this matters
_practically_?

If you've looked at scoring with a small number of documents, you'll see
significant differences due to deleted documents. In most cases, as you get
a larger number of documents, the ranking of documents in an index with no
deletions vs. indexes that have deletions is usually not noticeable.

I'm suggesting that this is a red herring. Your specific situation may
be different
of course, but since scoring is really only about ranking docs
relative to each other,
unless the relative positions change enough to be noticeable it's not a
problem.

Note that I'm saying "relative rankings", NOT "absolute score". Document
scores
have no meaning outside comparisons to other docs _in the same query_. So
unless you see documents changing their position in the list due to
having deleted
docs, it's not worth spending time on IMO.

Best,
Erick

On Tue, Sep 1, 2015 at 12:45 AM, Upayavira <u...@odoko.co.uk> wrote:

I wonder if this resolves it [1]. It has been applied to trunk, but not
to the 5.x release branch.

If you needed it in 5.x, I wonder if there's a way that particular
choice could be made configurable.

Upayavira

[1] https://issues.apache.org/jira/browse/LUCENE-6711
On Tue, Sep 1, 2015, at 02:43 AM, Derek Poh wrote:

Hi Upayavira

In fact we are using optimize currently but was advised to use expunge
deletes as it is less resource intensive.
So expunge deletes will only remove deleted documents, it will not merge
all index segments into one?

If we don't use optimize, the deleted documents in the index will affect
the scores (with docFreq=2) of the matched documents which will affect
the relevancy of the search result.

Derek

On 9/1/2015 12:05 AM, Upayavira wrote:

If you really must expunge deletes, use optimize. That will merge all
index segments into one, and in the process will remove any deleted
documents.

Why do you need to expunge deleted documents anyway? It is generally
done in the background for you, so you shouldn't need to worry about
it.

Upayavira

On Mon, Aug 31, 2015, at 06:46 AM, davidphilip cherian wrote:

Hi,

The below curl command worked without error, you can try.

curl http://localhost:8983/solr/techproducts/update?commit=true -H
"Content-Type: text/xml" --data-binary ''

However, after executing this, I could still see the same deleted counts on
the dashboard: Deleted Docs: 6
I am not sure whether that means the command did not take effect, or it
took effect but did not reflect in the dashboard view.





On Mon, Aug 31, 2015 at 8:51 AM, Derek Poh <d...@globalsources.com>
wrote:


Hi

I tried doing an expungeDeletes=true with the following but get the message
'missing content stream'. What am I missing? Do I need to provide additional
parameters?

curl
'http://127.0.0.1:8983/solr/supplier/update/json?expungeDeletes=true
';

Thanks,
Derek


Re: 'missing content stream' issuing expungeDeletes=true

2015-09-01 Thread Derek Poh

Erick

Yes, we see documents changing their position in the list due to having 
deleted docs.
In our search result, we apply a higher boost (bq) to a group of matched 
documents to have them display at the top tier of the result.
At times 1 or 2 of these documents are not returned in the top tier; they 
are relegated down to the lower tier of the result. We discovered that 
these documents have a lower score due to docFreq=2.
After we do an optimize, these 1-2 documents are back in the top tier 
result order and their docFreq is 1.




On 9/1/2015 11:40 PM, Erick Erickson wrote:

Derek:

Why do you care? What evidence do you have that this matters _practically_?

If you've looked at scoring with a small number of documents, you'll see
significant differences due to deleted documents. In most cases, as you get
a larger number of documents, the ranking of documents in an index with no
deletions vs. indexes that have deletions is usually not noticeable.

I'm suggesting that this is a red herring. Your specific situation may
be different
of course, but since scoring is really only about ranking docs
relative to each other,
unless the relative positions change enough to be noticeable it's not a problem.

Note that I'm saying "relative rankings", NOT "absolute score". Document scores
have no meaning outside comparisons to other docs _in the same query_. So
unless you see documents changing their position in the list due to
having deleted
docs, it's not worth spending time on IMO.

Best,
Erick

On Tue, Sep 1, 2015 at 12:45 AM, Upayavira <u...@odoko.co.uk> wrote:

I wonder if this resolves it [1]. It has been applied to trunk, but not
to the 5.x release branch.

If you needed it in 5.x, I wonder if there's a way that particular
choice could be made configurable.

Upayavira

[1] https://issues.apache.org/jira/browse/LUCENE-6711
On Tue, Sep 1, 2015, at 02:43 AM, Derek Poh wrote:

Hi Upayavira

In fact we are using optimize currently but was advised to use expunge
deletes as it is less resource intensive.
So expunge deletes will only remove deleted documents, it will not merge
all index segments into one?

If we don't use optimize, the deleted documents in the index will affect
the scores (with docFreq=2) of the matched documents which will affect
the relevancy of the search result.

Derek

On 9/1/2015 12:05 AM, Upayavira wrote:

If you really must expunge deletes, use optimize. That will merge all
index segments into one, and in the process will remove any deleted
documents.

Why do you need to expunge deleted documents anyway? It is generally
done in the background for you, so you shouldn't need to worry about it.

Upayavira

On Mon, Aug 31, 2015, at 06:46 AM, davidphilip cherian wrote:

Hi,

The below curl command worked without error, you can try.

curl http://localhost:8983/solr/techproducts/update?commit=true -H
"Content-Type: text/xml" --data-binary ''

However, after executing this, I could still see the same deleted counts on
the dashboard: Deleted Docs: 6
I am not sure whether that means the command did not take effect, or it
took effect but did not reflect in the dashboard view.





On Mon, Aug 31, 2015 at 8:51 AM, Derek Poh <d...@globalsources.com>
wrote:


Hi

I tried doing an expungeDeletes=true with the following but get the message
'missing content stream'. What am I missing? Do I need to provide additional
parameters?

curl 'http://127.0.0.1:8983/solr/supplier/update/json?expungeDeletes=true
';

Thanks,
Derek











Re: 'missing content stream' issuing expungeDeletes=true

2015-08-31 Thread Derek Poh

Hi Upayavira

In fact we are using optimize currently, but were advised to use expunge 
deletes as it is less resource intensive.
So expungeDeletes will only remove deleted documents; it will not merge 
all index segments into one?


If we don't use optimize, the deleted documents in the index will affect 
the scores (with docFreq=2) of the matched documents, which will affect 
the relevancy of the search result.


Derek

On 9/1/2015 12:05 AM, Upayavira wrote:

If you really must expunge deletes, use optimize. That will merge all
index segments into one, and in the process will remove any deleted
documents.

Why do you need to expunge deleted documents anyway? It is generally
done in the background for you, so you shouldn't need to worry about it.

Upayavira

On Mon, Aug 31, 2015, at 06:46 AM, davidphilip cherian wrote:

Hi,

The below curl command worked without error, you can try.

curl http://localhost:8983/solr/techproducts/update?commit=true -H
"Content-Type: text/xml" --data-binary ''

However, after executing this, I could still see the same deleted counts on
the dashboard: Deleted Docs: 6
I am not sure whether that means the command did not take effect, or it
took effect but did not reflect in the dashboard view.





On Mon, Aug 31, 2015 at 8:51 AM, Derek Poh <d...@globalsources.com>
wrote:


Hi

I tried doing an expungeDeletes=true with the following but get the message
'missing content stream'. What am I missing? Do I need to provide additional
parameters?

curl 'http://127.0.0.1:8983/solr/supplier/update/json?expungeDeletes=true
';

Thanks,
Derek









'missing content stream' issuing expungeDeletes=true

2015-08-30 Thread Derek Poh

Hi

I tried doing an expungeDeletes=true with the following but get the 
message 'missing content stream'. What am I missing? Do I need to provide 
additional parameters?


curl 'http://127.0.0.1:8983/solr/supplier/update/json?expungeDeletes=true';

Thanks,
Derek
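(For reference: the update handler expects a content stream in the request 
body, which is why the bare URL above returns 'missing content stream'. A 
minimal sketch of sending expungeDeletes inside a commit message to the 
same core:)

curl 'http://127.0.0.1:8983/solr/supplier/update' -H 'Content-Type: text/xml' --data-binary '<commit expungeDeletes="true"/>'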




No query fields MATCH weight info for a doc in debug?

2015-07-11 Thread Derek Poh

Hi

I came across a document that does not have the query field MATCH weight 
information in the debug, but the query fields do have the search term in 
them.

What can cause this?

Here is some info of the document.

Search keyword is LED

The search fields and their values,
P_CatConcatKeyword = E14 LED bulbs
P_NewShortDescription = TIWIN TUV GS CE RoHS Certified LED Bulb, 3W 5W 
7W 9W 11W 13W

P_VeryShortDescription = LED Bulb
spp_keyword_exact =  LED,led bulb,led bulb light,solar garden lights

This is the partial debug info for the document,
2.9332387 = (MATCH) sum of:
  2.516484E-7 = (MATCH) max of:
    2.516484E-7 = (MATCH) weight(spp_keyword_exact:led^1.0E-5 in 1775174) [DefaultSimilarity], result of:
      2.516484E-7 = score(doc=1775174,freq=1.0), product of:
        3.8561673E-8 = queryWeight, product of:
          1.0E-5 = boost
          13.051737 = idf(docFreq=38, maxDocs=6684479)
          2.9545242E-4 = queryNorm
        6.5258684 = fieldWeight in 1775174, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          13.051737 = idf(docFreq=38, maxDocs=6684479)
          0.5 = fieldNorm(doc=1775174)
    2.8821251 = (MATCH) weight(P_ProductId:1119054943^38.0 in 1775174) [DefaultSimilarity], result of:
      2.8821251 = score(doc=1775174,freq=1.0), product of:
        0.17988378 = queryWeight, product of:
          38.0 = boost
          16.022152 = idf(docFreq=1, maxDocs=6684479)
          2.9545242E-4 = queryNorm
        16.022152 = fieldWeight in 1775174, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          16.022152 = idf(docFreq=1, maxDocs=6684479)
          1.0 = fieldNorm(doc=1775174)

Thanks,
Derek




Re: understanding collapsingQParser with facet vs group.facet

2015-06-19 Thread Derek Poh

Hi Upayavira

Thank you for your explanation on the difference between traditional 
grouping and collapsingQParser. I understand more now.


On 6/19/2015 7:11 PM, Upayavira wrote:

On Fri, Jun 19, 2015, at 06:20 AM, Derek Poh wrote:

Hi

I read that collapsingQParser returns facet counts the same way as
group.truncate=true, and has the issue that the facet counts and the
after-filter facet counts are not the same.
Using group.facet does not have this issue, but its performance is very
bad compared to collapsingQParser.

I am trying to understand why collapsingQParser behaves this way, as I will
need to explain it to management.

Can someone explain how collapsingQParser calculates the facet
counts compared to group.facet?

I'm not familiar with group.facet. But to compare traditional grouping
to the collapsingQParser - in traditional grouping, all matching
documents remain in the result set, but they are grouped for output
purposes. However, the collapsingQParser is actually a query filter. It
will reduce the number of matching results. Any faceting that happens
will happen on the filtered results.

I wonder if you can use this syntax to achieve faceting alongside
collapsing:

q=whatever
fq={!collapse tag=collapse field=blah}
facet.field={!ex=collapse}my_facet_field

This way, you get the benefits of the CollapsingQParserPlugin, with full
faceting on the uncollapsed resultset.
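
Spelled out as a single request, and adding facet=true, which facet.field
needs in order to take effect (still a sketch with the same placeholder
names):

q=whatever&fq={!collapse tag=collapse field=blah}&facet=true&facet.field={!ex=collapse}my_facet_field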

I've no idea how this would perform, but I'd expect it to be better than
the grouping option.

Upayavira






Re: understanding collapsingQParser with facet vs group.facet

2015-06-19 Thread Derek Poh

Hi Joel

By group heads, are you referring to the document that is used to represent 
each group in the main result section?


E.g. using the below 3 documents, and we collapse on field supplier_id:

supplier_id:S1
product_id:P1

supplier_id:S2
product_id:P2

supplier_id:S2
product_id:P3

With collapse on supplier_id, the result in the main section is as follows:

supplier_id:S1
product_id:P1

supplier_id:S2
product_id:P3

The group head of supplier_id:S1 is P1, and supplier_id:S2 will be P3?

Facets (and even sort) are calculated on P1 and P3?
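
For concreteness, the collapsing query assumed in this example would be
something like:

q=*:*&fq={!collapse field=supplier_id}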

-Derek

On 6/19/2015 7:05 PM, Joel Bernstein wrote:

The CollapsingQParserPlugin currently doesn't calculate facets at all. It
simply collapses the document set. The facets are then calculated only on
the group heads.

Grouping has special faceting code built into it that supports the
group.facet functionality.

Joel Bernstein
http://joelsolr.blogspot.com/

On Fri, Jun 19, 2015 at 6:20 AM, Derek Poh d...@globalsources.com wrote:


Hi

I read that collapsingQParser returns facet counts the same way as
group.truncate=true, and has the issue that the facet counts and the
after-filter facet counts are not the same.
Using group.facet does not have this issue, but its performance is very
bad compared to collapsingQParser.

I am trying to understand why collapsingQParser behaves this way, as I will
need to explain it to management.

Can someone explain how collapsingQParser calculates the facet counts
compared to group.facet?

Thank you,
Derek







understanding collapsingQParser with facet vs group.facet

2015-06-19 Thread Derek Poh

Hi

I read that collapsingQParser returns facet counts the same way as 
group.truncate=true, and has the issue that the facet counts and the 
after-filter facet counts are not the same.
Using group.facet does not have this issue, but its performance is very 
bad compared to collapsingQParser.


I am trying to understand why collapsingQParser behaves this way, as I will 
need to explain it to management.


Can someone explain how collapsingQParser calculates the facet 
counts compared to group.facet?


Thank you,
Derek




sort on fields that are not mandatory in each document

2015-05-27 Thread Derek Poh

Hi

I am trying to sort on multiple fields. These fields do not necessarily 
exist in every document.

sort=sppddrank asc, ddrank asc

From the sorted result, it seems that documents which do not have the 
sppddrank field are at the top.

How can I make the documents that have the sppddrank field appear on top, 
sorted by it, with the documents that do not have the field below them?


-Derek



Re: sort on fields that are not mandatory in each document

2015-05-27 Thread Derek Poh

Hi Ahmet

The sortMissingLast and sortMissingFirst attributes are defined at the 
field or fieldType level?


<field name="P_TSRank" type="int" indexed="true" stored="true" 
multiValued="false"/>

<fieldType name="int" class="solr.TrieIntField" precisionStep="0" 
positionIncrementGap="0"/>
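
For illustration, a sketch of how the attribute would look if it belongs on
the type (whether field or fieldType level is correct is exactly what I am
asking):

<fieldType name="int" class="solr.TrieIntField" precisionStep="0"
positionIncrementGap="0" sortMissingLast="true"/>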


On 5/27/2015 4:43 PM, Ahmet Arslan wrote:

Hi,
I think you are looking for sortMissing* attributes:

The sortMissingLast and sortMissingFirst attributes are optional attributes
that are currently supported on types that are sorted internally as strings,
and on numeric types.

Ahmet

On Wednesday, May 27, 2015 11:36 AM, Derek Poh d...@globalsources.com wrote:
Hi

I am trying to sort on multiple fields. These fields do not necessarily
exist in every document.
sort=sppddrank asc, ddrank asc

From the sorted result, it seems that documents which do not have the
sppddrank field are at the top.

How can I make the documents that have the sppddrank field appear on top,
sorted by it, with the documents that do not have the field below them?

-Derek






Re: sort on fields that are not mandatory in each document

2015-05-27 Thread Derek Poh

Got it. Thank you Rajani.
On 5/27/2015 5:34 PM, Rajani Maski wrote:

Hi Derek,

They are at the fieldType Level. You might find some reference examples in
schema.xml using them.

https://cwiki.apache.org/confluence/display/solr/Field+Type+Definitions+and+Properties

On Wed, May 27, 2015 at 2:30 PM, Derek Poh d...@globalsources.com wrote:


Hi Ahmet

The sortMissingLast and sortMissingFirst attributes are defined at the
field or fieldType level?

<field name="P_TSRank" type="int" indexed="true" stored="true"
multiValued="false"/>

<fieldType name="int" class="solr.TrieIntField" precisionStep="0"
positionIncrementGap="0"/>


On 5/27/2015 4:43 PM, Ahmet Arslan wrote:


Hi,
I think you are looking for sortMissing* attributes:

The sortMissingLast and sortMissingFirst attributes are optional attributes
that are currently supported on types that are sorted internally as strings,
and on numeric types.

Ahmet

On Wednesday, May 27, 2015 11:36 AM, Derek Poh d...@globalsources.com
wrote:
Hi

I am trying to sort on multiple fields. These fields do not necessarily
exist in every document.
sort=sppddrank asc, ddrank asc

From the sorted result, it seems that documents which do not have the
sppddrank field are at the top.

How can I make the documents that have the sppddrank field appear on top,
sorted by it, with the documents that do not have the field below them?

-Derek







Re: sort on fields that are not mandatory in each document

2015-05-27 Thread Derek Poh

Oh ok. Thank you, Alessandro.

On 5/27/2015 6:07 PM, Alessandro Benedetti wrote:

Actually it is both field level and field type level.
You decide based on your use case (it can happen that for the same field
type, you want sortMissingFirst for one field and sortMissingLast for
another).

I want to add a bonus note, related to the (empty) versus null concept.

Be very careful that you don't index empty values for your fields, or this
will mess up the sorting.
Solr manages missing values (that is, null values), but does not manage
empty values.

To a human those values look identical to null values, but not to Solr,
so you can end up with very weird situations for your users.
So, to be sure everything works nicely with the sortMissing* attributes, be
sure not to index empty values.
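
One way to guard against this at index time, sketched with the stock
RemoveBlankFieldUpdateProcessorFactory (the chain name here is just an
example), is an update processor chain that drops empty string values
before they are indexed:

<updateRequestProcessorChain name="skip-empty">
  <processor class="solr.RemoveBlankFieldUpdateProcessorFactory"/>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>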

Cheers

2015-05-27 10:34 GMT+01:00 Rajani Maski rajani.ma...@lucidworks.com:


Hi Derek,

They are at the fieldType Level. You might find some reference examples in
schema.xml using them.


https://cwiki.apache.org/confluence/display/solr/Field+Type+Definitions+and+Properties

On Wed, May 27, 2015 at 2:30 PM, Derek Poh d...@globalsources.com wrote:


Hi Ahmet

The sortMissingLast and sortMissingFirst attributes are defined at the
field or fieldType level?

<field name="P_TSRank" type="int" indexed="true" stored="true"
multiValued="false"/>

<fieldType name="int" class="solr.TrieIntField" precisionStep="0"
positionIncrementGap="0"/>


On 5/27/2015 4:43 PM, Ahmet Arslan wrote:


Hi,
I think you are looking for sortMissing* attributes:

The sortMissingLast and sortMissingFirst attributes are optional attributes
that are currently supported on types that are sorted internally as strings,
and on numeric types.

Ahmet

On Wednesday, May 27, 2015 11:36 AM, Derek Poh d...@globalsources.com
wrote:
Hi

I am trying to sort on multiple fields. These fields do not necessarily
exist in every document.
sort=sppddrank asc, ddrank asc

From the sorted result, it seems that documents which do not have the
sppddrank field are at the top.

How can I make the documents that have the sppddrank field appear on top,
sorted by it, with the documents that do not have the field below them?

-Derek










Re: search or filter by a list of document ids and return them in the same order.

2015-05-03 Thread Derek Poh

Hi Erick

Sorry I missed your reply.

Yes, that is the alternative solution I am thinking of, if it's not 
possible through Solr.
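
A query-side workaround sometimes used for short id lists (a sketch built on
the ids below; it assumes the id field holds unique values, so the per-term
statistics are identical and the boosts alone decide the score order):

q=P_ProductId:(1083342171^5 OR 1079463095^4 OR 1078278592^3 OR 1085253674^2 OR 1076558399^1)

With the default sort by score, the documents come back in the boosted
order. Beyond a handful of ids, reordering in the application as discussed
below is simpler.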


-Derek

On 4/24/2015 12:01 AM, Erick Erickson wrote:

Not that I know of. But your application gets the original params back,
so you can order the display based on the params that are echoed back.

Best,
Erick

On Thu, Apr 23, 2015 at 2:17 AM, Derek Poh d...@globalsources.com wrote:

Hi

I am trying to search or filter a list of documents by their ids (product
id field). The requirement is that the returned documents must be in the
same order as the list they were searched or filtered by.
E.g. if I search or filter on the below list of ids, the documents must be
returned in the same order too:
1083342171 1079463095 1078278592 1085253674 1076558399

Is this possible?

Thanks,
Derek








Re: search or filter by a list of document ids and return them in the same order.

2015-05-03 Thread Derek Poh

Hi

Any advice on this?

Thanks,
Derek

On 4/23/2015 5:17 PM, Derek Poh wrote:

Hi

I am trying to search or filter a list of documents by their ids 
(product id field). The requirement is that the returned documents must be 
in the same order as the list they were searched or filtered by.
E.g. if I search or filter on the below list of ids, the documents must 
be returned in the same order too:

1083342171 1079463095 1078278592 1085253674 1076558399

Is this possible?

Thanks,
Derek







search or filter by a list of document ids and return them in the same order.

2015-04-23 Thread Derek Poh

Hi

I am trying to search or filter a list of documents by their ids 
(product id field). The requirement is that the returned documents must be 
in the same order as the list they were searched or filtered by.
E.g. if I search or filter on the below list of ids, the documents must be 
returned in the same order too:

1083342171 1079463095 1078278592 1085253674 1076558399

Is this possible?

Thanks,
Derek




spellcheck enabled but not getting any suggestions.

2015-04-17 Thread Derek Poh

Hi

I have enabled spellcheck but am not getting any suggestions with 
incorrectly spelled keywords.

I added the spellcheck component to the /select request handler.

What steps did I miss out?

spellcheck list in the returned result:
<lst name="spellcheck">
  <lst name="suggestions"/>
</lst>


solrconfig.xml:

<requestHandler name="/select" class="solr.SearchHandler">
  <!-- default values for query parameters can be specified, these
       will be overridden by parameters in the request -->
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <int name="rows">10</int>
    <str name="df">text</str>
    <!-- Spell checking defaults -->
    <str name="spellcheck">on</str>
    <str name="spellcheck.extendedResults">false</str>
    <str name="spellcheck.count">5</str>
    <str name="spellcheck.alternativeTermCount">2</str>
    <str name="spellcheck.maxResultsForSuggest">5</str>
    <str name="spellcheck.collate">true</str>
    <str name="spellcheck.collateExtendedResults">true</str>
    <str name="spellcheck.maxCollationTries">5</str>
    <str name="spellcheck.maxCollations">3</str>
  </lst>

  <!-- append spellchecking to our list of components -->
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>
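
One thing the snippet above does not show is the spellcheck search component
that last-components refers to; it must be defined and pointed at a field to
build suggestions from. A sketch along the lines of the stock sample config
(the field and analyzer names are assumptions):

<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <str name="queryAnalyzerFieldType">text_general</str>
  <lst name="spellchecker">
    <str name="name">default</str>
    <str name="field">text</str>
    <str name="classname">solr.DirectSolrSpellChecker</str>
  </lst>
</searchComponent>

It also usually helps to add <str name="spellcheck.dictionary">default</str>
to the handler defaults so requests know which spellchecker to use.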




Re: Collapse and Expand behaviour on result with 1 document.

2015-04-07 Thread Derek Poh

Hi Joel

Is the total number of documents available when using the collapse and 
expand parameters?


I can't seem to find it in the returned XML.
I know the numFound in the main result set (<result maxScore="6.470696" 
name="response" numFound="27" start="0">) refers to the number of collapse 
groups.


Do I need to issue another query without the collapse and expand parameters 
to get the total number of documents?
Or is there any field or parameter indicating the number of documents that 
can be returned through the 'fl' parameter?


I am trying to display such info on the front-end,

571 led results from 240 suppliers.
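
One approach, sketched here, is to keep the collapsed query for the supplier
count and issue a second cheap query without the collapse fq and with rows=0;
its numFound gives the total number of product documents:

q=led&rows=0

numFound from this query would be the "571 led results", and numFound from
the collapsed query the "240 suppliers".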


On 4/1/2015 7:05 PM, Joel Bernstein wrote:

Exactly correct.

Joel Bernstein
http://joelsolr.blogspot.com/

On Wed, Apr 1, 2015 at 5:44 AM, Derek Poh d...@globalsources.com wrote:


Hi Joel

Correct me if my understanding is wrong.
Using supplier id as the field to collapse on.

- If the collapse group heads in the main result set have only 1 document in
each group, the expanded section will be empty, since there are no documents
to expand for each collapse group.
- To render the page, I need to iterate the main result set. For each
document I have to check if there is an expanded group with the same
supplier id.
- The facet counts are based on the number of collapse groups in the main
result set (<result maxScore="6.470696" name="response" numFound="27"
start="0">).

-Derek


On 3/31/2015 7:43 PM, Joel Bernstein wrote:


The way that collapse/expand is designed to be used is as follows:

The main result set will contain the collapsed group heads.

The expanded section will contain the expanded groups for the page of
results.

To render the page you iterate the main result set. For each document,
check to see if there is an expanded group.




Joel Bernstein
http://joelsolr.blogspot.com/

On Tue, Mar 31, 2015 at 7:37 AM, Joel Bernstein joels...@gmail.com
wrote:

  You should be able to use collapse/expand with one result.

Does the document in the main result set have group members that aren't
being expanded?



Joel Bernstein
http://joelsolr.blogspot.com/

On Tue, Mar 31, 2015 at 2:00 AM, Derek Poh d...@globalsources.com
wrote:

  If I want to group the results (by a certain field) even if there is only
1 document, I should use the group parameter instead?
The requirement is to group the result of product documents by their
supplier id.
group=true&group.field=P_SupplierId&group.limit=5

Is it true that the performance of collapse is better than group
parameter on large data set, say 10-20 million documents?

-Derek


On 3/31/2015 10:03 AM, Joel Bernstein wrote:

  The expanded section will only include groups that have expanded
documents.

So, if the document in the main result set has no documents to expand,
then this is working as expected.



Joel Bernstein
http://joelsolr.blogspot.com/

On Mon, Mar 30, 2015 at 8:43 PM, Derek Poh d...@globalsources.com
wrote:

   Hi


I have a query which returns 1 document.
When I add the collapse and expand parameters to it,
expand=true&expand.rows=5&fq={!collapse field=P_SupplierId}, the
expanded section is empty (<lst name="expanded"/>).

Is this the behaviour of collapse and expand parameters on result
which
contain only 1 document?

-Derek









Re: sort on facet.index?

2015-04-05 Thread Derek Poh

Yonik

I see. Thank you for the updates.

On 4/3/2015 12:28 AM, Yonik Seeley wrote:

On Thu, Apr 2, 2015 at 10:25 AM, Ryan Josal rjo...@gmail.com wrote:

Sorting the result set or the facets?  For the facets there is
facet.sort=index (lexicographically) and facet.sort=count.  So maybe you
are asking if you can sort by index, but reversed?  I don't think this is
possible, and it's a good question.

The new facet module that will be in Solr 5.1 supports sorting both
directions on both count and index order (as well as by statistics /
bucket aggregations).
http://yonik.com/json-facet-api/

-Yonik
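
For reference, a sketch of a reversed index sort with that API (Solr 5.1+),
using the facet field from my earlier mail:

json.facet={ranking:{type:terms,field:P_SupplierRanking,sort:"index desc"}}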






sort on facet.index?

2015-04-02 Thread Derek Poh

Is sorting on facet index supported?

I would like to sort on the below facet index

<lst name="P_SupplierRanking">
  <int name="0">14</int>
  <int name="1">8</int>
  <int name="2">12</int>
  <int name="3">349</int>
  <int name="4">81</int>
  <int name="5">8</int>
  <int name="6">12</int>
</lst>

to

<lst name="P_SupplierRanking">
  <int name="6">12</int>
  <int name="5">8</int>
  <int name="4">81</int>
  <int name="3">349</int>
  ...
</lst>

-Derek


Re: Collapse and Expand behaviour on result with 1 document.

2015-04-01 Thread Derek Poh

Hi Joel

Correct me if my understanding is wrong.
Using supplier id as the field to collapse on.

- If the collapse group heads in the main result set have only 1 document in 
each group, the expanded section will be empty, since there are no 
documents to expand for each collapse group.
- To render the page, I need to iterate the main result set. For each 
document I have to check if there is an expanded group with the same 
supplier id.
- The facet counts are based on the number of collapse groups in the main 
result set (<result maxScore="6.470696" name="response" numFound="27" 
start="0">).


-Derek

On 3/31/2015 7:43 PM, Joel Bernstein wrote:

The way that collapse/expand is designed to be used is as follows:

The main result set will contain the collapsed group heads.

The expanded section will contain the expanded groups for the page of
results.

To render the page you iterate the main result set. For each document check
to see if there is an expanded group.




Joel Bernstein
http://joelsolr.blogspot.com/

On Tue, Mar 31, 2015 at 7:37 AM, Joel Bernstein joels...@gmail.com wrote:


You should be able to use collapse/expand with one result.

Does the document in the main result set have group members that aren't
being expanded?



Joel Bernstein
http://joelsolr.blogspot.com/

On Tue, Mar 31, 2015 at 2:00 AM, Derek Poh d...@globalsources.com wrote:


If I want to group the results (by a certain field) even if there is only
1 document, I should use the group parameter instead?
The requirement is to group the result of product documents by their
supplier id.
group=true&group.field=P_SupplierId&group.limit=5

Is it true that the performance of collapse is better than group
parameter on large data set, say 10-20 million documents?

-Derek


On 3/31/2015 10:03 AM, Joel Bernstein wrote:


The expanded section will only include groups that have expanded
documents.

So, if the document in the main result set has no documents to expand,
then this is working as expected.



Joel Bernstein
http://joelsolr.blogspot.com/

On Mon, Mar 30, 2015 at 8:43 PM, Derek Poh d...@globalsources.com
wrote:

  Hi

I have a query which returns 1 document.
When I add the collapse and expand parameters to it,
expand=true&expand.rows=5&fq={!collapse field=P_SupplierId}, the
expanded section is empty (<lst name="expanded"/>).

Is this the behaviour of collapse and expand parameters on result which
contain only 1 document?

-Derek







