shard query with duplicated documents causes inaccurate pagination

2014-04-29 Thread Jie Sun
When we have duplicate documents (same uniqueID) across the shards, the query results can be non-deterministic; this is a known issue. The consequence when we display the search results on our UI page with pagination is: if the user clicks the 'last page', it could display an empty page since the
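
One possible UI-side guard for the empty 'last page' symptom described above, as a minimal sketch. It assumes the reported hit count may be larger than the number of documents actually returned, and simply walks back until a page has results. `LastPageGuard`, `lastNonEmptyPage`, and the `fetchPage` callback are hypothetical names for illustration, not part of Solr or SolrJ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntFunction;

/** Hypothetical UI-side helper; not part of Solr or SolrJ. */
final class LastPageGuard {

    /** Zero-based index of the last page implied by the reported hit count. */
    static int lastPageIndex(long numFound, int rows) {
        if (rows <= 0) {
            throw new IllegalArgumentException("rows must be positive");
        }
        if (numFound <= 0) {
            return 0;
        }
        return (int) ((numFound - 1) / rows);
    }

    /**
     * Starts at the page implied by numFound and steps back while the page is empty.
     * fetchPage stands in for whatever call the UI makes to run the sharded query.
     */
    static int lastNonEmptyPage(long numFound, int rows, IntFunction<List<String>> fetchPage) {
        int page = lastPageIndex(numFound, rows);
        while (page > 0 && fetchPage.apply(page).isEmpty()) {
            page--;
        }
        return page;
    }

    public static void main(String[] args) {
        final int rows = 10;
        final int uniqueDocs = 18; // documents actually returned after duplicates collapse
        IntFunction<List<String>> fetchPage = page -> {
            int from = Math.min(page * rows, uniqueDocs);
            int to = Math.min(from + rows, uniqueDocs);
            List<String> docs = new ArrayList<>();
            for (int i = from; i < to; i++) {
                docs.add("doc" + i);
            }
            return docs;
        };
        // The reported count (25) implies page index 2, which is empty here.
        System.out.println(lastNonEmptyPage(25, rows, fetchPage)); // prints 1
    }
}
```

This costs extra requests in the worst case and does not make the count itself accurate; it only avoids landing the user on an empty page.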

Re: SolrCloud: how to index documents into a specific core and how to search against that core?

2013-07-15 Thread Jie Sun
Yandong, have you figured out whether using one collection per customer works for you? We have a similar use case to yours: customer IDs are used as core names. That was the reason our company did not upgrade to SolrCloud ... I might remember it wrong but I vaguely remember I looked into

Re: rename a core to same name of existing core

2013-05-13 Thread Jie Sun
Has anyone verified that the following is true? The description on http://wiki.apache.org/solr/CoreAdmin#CREATE is: *quote* If a core with the same name exists, while the new created core is initializing, the old one will continue to accept requests. Once it has finished, all new request will go

Re: rename a core to same name of existing core

2013-05-13 Thread Jie Sun
Thanks for the information, you are right, I was using the same instance dir. I agree with you: I would like to see an error if I create a core with the name of an existing core. Right now I have to do a ping first and check whether the returned code is 404 or not. Jie -- View this

RE: numFound changes on changing start and rows

2013-05-08 Thread Jie Sun
Any update on this? Will this be addressed/fixed? In our system, the UI allows users to paginate through search results. As my in-depth tests found, if rows=0, the result size is consistently the total sum of the documents on all shards regardless of whether there are any duplicates; if the rows

RE: numFound changes on changing start and rows

2013-05-08 Thread Jie Sun
OK, now that my head has cooled down, I remember this old-school issue... I have been dealing with it myself, so I do not expect this can be straightened out or fixed in any way. Basically, when you have two sorted result sets that you need to merge and paginate through, it is never an easy job (if all

Re: solr 3.5 core rename issue

2013-04-18 Thread Jie Sun
yeah I realize using ${solr.core.name} for dataDir must be the cause for the issue we see... it is fair to say the SWAP and RENAME just create an alias that still points to the old datadir. if they can not fix it then it is not a bug :-) at least we understand exactly what is going on there.

shard query return 500 on large data set

2013-04-18 Thread Jie Sun
Hi - when I execute a shard query like: [myhost]:8080/solr/mycore/select?q=type:message&rows=14...&qt=standard&wt=standard&explainOther=&hl.fl=&shards=solrserver1:8080/solr/mycore,solrserver2:8080/solr/mycore,solrserver3:8080/solr/mycore everything works fine until I query against a large set

Re: solr 3.5 core rename issue

2013-04-17 Thread Jie Sun
Thanks, Shawn, for filing the issue. By the way, my solrconfig.xml has: <dataDir>${MYSOLRROOT:/mysolrroot}/messages/solr/data/${solr.core.name}</dataDir> For now I will have to shut down Solr and write a script to modify solr.xml manually and rename the core data directory to the new one. by the way

solr 3.5 core rename issue

2013-04-16 Thread Jie Sun
We just tried to use .../solr/admin/cores?action=RENAME&core=core0&other=core5 to rename a core 'old' to 'new'. After the request is done, solr.xml has the new core name, and the Solr admin shows the new core name in the list. But the index dir still has the old name as the directory name. I

Re: solr 3.5 core rename issue

2013-04-16 Thread Jie Sun
Hi Shawn, I do have persistent=true in my solr.xml: <?xml version="1.0" encoding="UTF-8" ?> <solr persistent="true"> <cores adminPath="/admin/cores"> <core name="default" instanceDir="./"/> <core name="413a" instanceDir="./"/> <core name="blah" instanceDir="./"/> ... </cores> </solr> the command I ran was to rename

Re: POST query with non-ASCII to solr using httpclient wont work

2013-01-14 Thread Jie Sun
Unfortunately SolrJ is not an option here... we will have to make a quick fix with a patch out in production. I am still unable to make Solr (3.5) take a URL-encoded query. Again, passing a non-URL-encoded query string works with non-ASCII (Chinese), but it fails to return anything when sending request

POST query with non-ASCII to solr using httpclient wont work

2013-01-12 Thread Jie Sun
When I use HttpClient and its PostMethod to post a query with some Chinese, Solr either fails to return any record or returns everything. ... ... method = new PostMethod(solrReq); method.getParams().setContentCharset("UTF-8");
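
For reference, a minimal stdlib-only sketch of what a UTF-8 form-encoded query parameter looks like before it goes into a POST body. The rest of the original code is elided in the snippet, so this is not the poster's code or a confirmed fix: `SolrFormBody` and `encodeParam` are hypothetical names, the query value is made up, and whether the servlet container decodes the body as UTF-8 is a separate question this sketch does not cover.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

/** Hypothetical helper that builds application/x-www-form-urlencoded pairs. */
final class SolrFormBody {

    /** Percent-encodes one name=value pair using UTF-8 bytes. */
    static String encodeParam(String name, String value) {
        try {
            return URLEncoder.encode(name, "UTF-8") + "=" + URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is a charset every Java runtime is required to support.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // "\u4e2d\u6587" is the two-character word for "Chinese"; escaped to keep the source ASCII.
        String body = encodeParam("q", "text:\u4e2d\u6587") + "&" + encodeParam("rows", "10");
        System.out.println(body); // prints q=text%3A%E4%B8%AD%E6%96%87&rows=10
    }
}
```

Comparing a string like this with what actually goes over the wire (for example in a request log) is one way to check whether the client or the server side is mangling the characters.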

Re: POST query with non-ASCII to solr using httpclient wont work

2013-01-12 Thread Jie Sun
:-) Otis, I also looked at solrJ source code, seems exactly what I am doing here... but I probably will do what you suggested ... thanks Jie -- View this message in context: http://lucene.472066.n3.nabble.com/POST-query-with-non-ASCII-to-solr-using-httpclient-wont-work-tp4032957p4032973.html

Re: if I only need exact search, does frequency/score matter?

2012-12-19 Thread Jie Sun
Hi Otis, I customized the Similarity class and added it at the end of schema.xml: ... ... <solrQueryParser defaultOperator="OR"/> <similarity class="mypackage.NoTfSimilarity"/> </schema> and mypackage.NoTfSimilarity.java is like: public class NoTfSimilarity extends DefaultSimilarity { public

Re: if I only need exact search, does frequency/score matter?

2012-12-19 Thread Jie Sun
Hi Otis, here is the debug output on the query... it seems all tf and idf indeed return 1.0f as I customized... I did not override queryNorm or weight etc... see below. But the bottom line is that if my purpose is to reduce the .frq file size, customizing similarity does not seem to help with that. I guess

Re: how to understand this benchmark test results (compare index size after schema change)

2012-12-17 Thread Jie Sun
Thanks Erik ... I did run optimize on both indices to get rid of the deleted data when comparing them to each other. (And my benchmark tests were just indexing 5000 new documents without duplicates... into a new core... but I did optimize just to make sure.) I think one result is consistent, that the

Re: if I only need exact search, does frequency/score matter?

2012-12-17 Thread Jie Sun
thanks, this is very helpful

Re: if I only need exact search, does frequency/score matter?

2012-12-17 Thread Jie Sun
Hi Otis, do you think I should customize both tf and idf to disable the term frequency? i.e. something like: public float tf(float freq) { return freq > 0 ? 1.0f : 0.0f; } public float idf(int docFreq, int numDocs) { return docFreq > 0 ? 1.0f : 0.0f; }

Re: if I only need exact search, does frequency/score matter?

2012-12-15 Thread Jie Sun
thanks for the information... I did come across that discussion, I guess I will try to write a customized Similarity class and disable tf. I hope this is not totally odd to do ... I do notice about 10GB .frq file size in cores that have total 10-30GB .fdt files. I wish the benchmark will show me

if I only need exact search, does frequency/score matter?

2012-12-13 Thread Jie Sun
This is related to my previous post, on which I have not gotten any feedback yet... I am going through an exercise to reduce the disk usage of the Solr index files. The first step I took was to move some fields from stored to not stored. This reduced the size of .fdt by 30-60%. Very promising... however I

how to understand this benchmark test results (compare index size after schema change)

2012-12-12 Thread Jie Sun
I cleaned up the Solr schema by changing a small portion of the stored fields to stored=false. For 5000 documents (about 500M total size of the original documents), I ran a benchmark comparing the Solr index size with the schema before/after the cleanup. The first run showed about 40%

suggestion howto handle highly repetitive valued field

2012-12-11 Thread Jie Sun
Hi - our indexed documents currently store Solr fields like 'digest' or 'type', for which most of our documents end up with the same value (such as 'sha1' for field 'digest', or 'message' for field 'type', etc.). On each Solr server, we usually have hundreds of millions of documents indexed and with the

Re: suggestion howto handle highly repetitive valued field

2012-12-11 Thread Jie Sun
thank you David!

programmatically get dataDir setting from solrconfig.xml

2012-11-28 Thread Jie Sun
I am trying to get the value of 'dataDir' that was set in solrconfig.xml. Other than querying Solr with http://[host]:8080/solr/default/admin/file/?contentType=text/xml;charset=utf-8&file=solrconfig.xml and parsing the dataDir element using some XML parser, then resolving all possible environment
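
The "resolve" step mentioned above could look roughly like the following. This is only a sketch of expanding `${name:default}` placeholders in a raw dataDir string; `PlaceholderResolver` is a hypothetical name, and where the lookup values come from (system properties, core properties, something else) is an assumption left to the caller. The sample value mirrors the dataDir pattern quoted in the core rename thread.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper; expands ${name} and ${name:default} placeholders. */
final class PlaceholderResolver {

    // group(1) = property name, group(2) = optional default after the colon
    private static final Pattern PLACEHOLDER =
            Pattern.compile("\\$\\{([^}:]+)(?::([^}]*))?\\}");

    static String resolve(String raw, Map<String, String> props) {
        Matcher m = PLACEHOLDER.matcher(raw);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = props.get(m.group(1));
            if (value == null) {
                value = m.group(2); // fall back to the inline default, if any
            }
            if (value == null) {
                value = m.group(0); // nothing known: leave the placeholder untouched
            }
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> props = new HashMap<>();
        props.put("solr.core.name", "core0"); // assumed value for illustration
        String raw = "${MYSOLRROOT:/mysolrroot}/messages/solr/data/${solr.core.name}";
        System.out.println(resolve(raw, props)); // prints /mysolrroot/messages/solr/data/core0
    }
}
```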

Re: load balance with SolrCloud

2012-11-06 Thread Jie Sun
Thanks for your feedback, Erick. I am also aware of the current limitation that the number of shards in a collection is fixed; changing the number requires re-configuring and re-indexing. If that limitation gets lifted in a near-future release, I would then consider setting up a collection for each customer,

load balance with SolrCloud

2012-11-05 Thread Jie Sun
We are using Solr 3.5 in production and we deal with terabytes of customer data. We use shards for large customers and wrote our own replica management in our software. Now, with the rapid growth of data, we are looking into SolrCloud for its robust sharding and replication. I

solr replication against active indexing on master

2012-11-01 Thread Jie Sun
I have a question about Solr replication (master/slaves). When indexing activity is ongoing on the master and a slave sends the file list command to get a version (to my understanding, actually a snapshot at that point in time) of all files and their size/timestamp etc., the slave will then decide which files need

Re: solr replication against active indexing on master

2012-11-01 Thread Jie Sun
Thanks ... could you please point me to a more detailed explanation online, or will I have to read the code to find out? I would like to understand a little more about how this is achieved. thanks! Jie

Re: solr replication against active indexing on master

2012-11-01 Thread Jie Sun
Thanks... I just read the related code ... now I understand: it seems the master keeps replicable snapshots (versions), so they should be static. thank you Otis!

Re: [/solr] memory leak prevent tomcat shutdown

2012-10-22 Thread Jie Sun
any input on this? thanks Jie

[/solr] memory leak prevent tomcat shutdown

2012-10-19 Thread Jie Sun
Very often when we try to shut down Tomcat, we get the following error in catalina.out indicating that a Solr thread cannot be stopped. Tomcat ends up hanging and we have to kill -9, which we think leads to some core corruption in our production environment. Please help ... catalina.out: ... ... Oct

Re: [/solr] memory leak prevent tomcat shutdown

2012-10-19 Thread Jie Sun
by the way, I am running tomcat 6, solr 3.5 on redhat 2.6.18-274.el5 #1 SMP Fri Jul 8 17:36:59 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux

CheckIndex question

2012-10-17 Thread Jie Sun
Hi - with a corrupted core, 1. if I run CheckIndex with -fix, it will drop the hook to the corrupted segment, but the segment files are still there, when we have a lot of corrupted segments, we have to manually pick them out and remove them, is there a way the tool can suffix them or prefix

Re: queryResultWindowSize vs rows

2012-10-07 Thread Jie Sun
any suggestions?

Re: queryResultWindowSize vs rows

2012-10-07 Thread Jie Sun
Hi Erik, no, I don't have any evidence; it was just a precautionary question. So according to your explanation, this cache only keeps the document IDs, so if the client pages to the next group of documents in the window, there will be another query to the Solr server to retrieve these docs, correct? OK, that is good to

queryResultWindowSize vs rows

2012-10-05 Thread Jie Sun
What will happen if in my query I specify a greater number for rows than the queryResultWindowSize in my solrconfig.xml? For example, if queryResultWindowSize=100, but I need to process a batch query from Solr with rows=1000 each time and vary the start as I move on... what will happen? if I do not turn
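
To make the batch pattern in the question concrete, here is a small sketch of the `start` values such a client would step through with a fixed `rows`. It only illustrates the request sequence and says nothing about how Solr caches the results; `BatchOffsets` and `startOffsets` are hypothetical names, and the total of 2500 is made up.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper listing the start offsets for fixed-size batches. */
final class BatchOffsets {

    /** Returns every start value needed to cover total hits in batches of rows. */
    static List<Long> startOffsets(long total, int rows) {
        if (rows <= 0) {
            throw new IllegalArgumentException("rows must be positive");
        }
        List<Long> starts = new ArrayList<>();
        for (long start = 0; start < total; start += rows) {
            starts.add(start);
        }
        return starts;
    }

    public static void main(String[] args) {
        // Each value would be sent as start=<value>&rows=1000 in its own request.
        System.out.println(startOffsets(2500, 1000)); // prints [0, 1000, 2000]
    }
}
```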