Re: docValue vs. analyzer
Hi Erick,

Thank you for the hint about SortableTextField. This really seems to be the type I was looking for. UpdateProcessors could be a workaround, but I don't like them. For me they are neither fish nor fowl (neither internal nor external).

Uwe

On 19.04.2018 at 18:38, Erick Erickson wrote:
I haven't poked into the details, but there's (recently, very recently, 7.3) a SortableTextField that may be useful in this situation. Otherwise you could use a FieldMutatingUpdateProcessorFactory or perhaps a ScriptUpdateProcessor to manipulate the fields on the way in. Not quite sure how you could get synonyms to work in those situations, though.
Best,
Erick
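For reference, a SortableTextField declaration might look like the following sketch (the type/field names and the synonyms file are assumptions, not from the thread). Note that the docValues of a SortableTextField are derived from the original input string, not from the analyzed tokens, which is why Erick's caveat about synonyms applies:

```xml
<!-- Hypothetical sketch: names and synonyms file are assumptions -->
<fieldType name="sortable_group" class="solr.SortableTextField"
           docValues="true" maxCharsForDocValues="1024">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.SynonymGraphFilterFactory" synonyms="group-synonyms.txt"/>
    <!-- required at index time after a graph filter -->
    <filter class="solr.FlattenGraphFilterFactory"/>
  </analyzer>
</fieldType>
<field name="groupId" type="sortable_group" indexed="true" stored="true"/>
```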
Weird behavior of own tokenizer
Hi,

I'm trying to write my own tokenizer for Solr 7. So far, everything seems to be fine:
- the tokenizer compiles
- the tokenizer is instantiated fine by its factory
- the tokenizer seems to do its work when tested with the GUI ("../solr/#/collection/analysis")

BUT: the expected result isn't visible in the document. Surely I got something wrong, but I have no idea what. Any hints are appreciated.

Uwe

###
# Snippet schema.xml
###

###
# Minimized example:
# just replace everything with the constant string "substitute"
###
public class MyTokenizer extends Tokenizer {
    private static final Logger LOG = LoggerFactory.getLogger(MyTokenizer.class);
    protected CharTermAttribute charTermAttribute = addAttribute(CharTermAttribute.class);
    private boolean done = false;

    public MyTokenizer() {
        super();
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (done) return false;
        charTermAttribute.setEmpty();
        String toReplace = getStartOFChallange();
        LOG.info("Input: " + toReplace + " replaced.");
        charTermAttribute.append("substitute");
        done = true;
        return true;
    }

    @Override
    public void reset() throws IOException {
        super.reset();
        done = false;
    }

    /* Read some chars from 'input' */
    private String getStartOFChallange() {
        char[] buffer = new char[200];
        int inputLength = -1;
        try {
            inputLength = input.read(buffer, 0, 200);
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        if (inputLength == -1) {
            LOG.warn("No input");
            return null;
        }
        return new String(buffer, 0, inputLength);
    }
}

###
# Snippet solr.log
# The input was "ReplaceMe"
###
de.hebis.solr.analysis.MyTokenizer.incrementToken(): Input: ReplaceMe replaced.
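One thing worth noting here: a tokenizer only changes what gets indexed; the stored value returned in query responses is always the raw input. So a stored field will still display "ReplaceMe" even when the indexed term is "substitute"; the tokenizer's effect is only visible via searching, faceting, or the terms component. The schema wiring for such a tokenizer might look like this sketch (the factory class name and field names are assumptions, since the original schema snippet is not available):

```xml
<!-- Hypothetical sketch: factory class and field names are assumptions -->
<fieldType name="text_substitute" class="solr.TextField">
  <analyzer>
    <tokenizer class="de.hebis.solr.analysis.MyTokenizerFactory"/>
  </analyzer>
</fieldType>
<field name="challenge" type="text_substitute" indexed="true" stored="true"/>
```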
docValue vs. analyzer
Hi,

I'm stuck in a dead end. My task is to map individual ids in order to group them. So far, so simple:
- copyField 'id' -> 'groupId'
- use a SynonymFilter on 'groupId'

Now I had the idea to improve the performance of grouping with docValues. Unfortunately, this leads to a contradiction:
- docValues are not allowed on TextFields
- analyzers are not allowed on StrFields

Is there a way to resolve this contradiction within Solr (without the need for external preprocessing)?

Regards
Uwe

PS: Yes, a token stream for a StrField isn't a great idea. But having CharFilters would be nice.
Re: CVE-2017-12629 which versions are vulnerable?
Sorry, I missed the post from Florian Gleixner:
"Re: Several critical vulnerabilities discovered in Apache Solr (XXE & RCE)"

On 16.10.2017 at 16:52, Uwe Reh wrote:
Hi, I'm still using V4.10. Is this version also affected by http://openwall.com/lists/oss-security/2017/10/13/1 ?
Uwe
CVE-2017-12629 which versions are vulnerable?
Hi,

I'm still using V4.10. Is this version also affected by http://openwall.com/lists/oss-security/2017/10/13/1 ?

Uwe
Re: CDCR (Solr6.x) does not start
Hi Renaud,

thank you for your response. You asked for some further information:

1. Log messages at the source cluster: As mentioned in my addendum "CDCR (Solr6.x) does not start (logfile)", I changed the log level for all handlers to TRACE and got three messages for each shard, caused by "Action LASTPROCESSEDVERSION sent to non-leader replica ...". For me this looks like the blocker.

2. "Replication should start even if no commit has been sent to the source cluster." Thanks for the clarification, it helps me to understand.

3. "The empty queue seems to indicate there is an issue, and that cdcr was unable to instantiate the replicator for the target cluster. Just to be sure, your source cluster has 4 shards, but no replicas? If it has replicas, can you ensure that you execute these commands on the shard leader." At the beginning I tried to replicate 4 shards with a replication factor of 3. Later on I simplified the environment by omitting the replicas (replication factor = 1). Do you think having no replicas could be the reason for the log messages above?

Regards
Uwe

On 05.07.2016 at 14:55, Renaud Delbru wrote:
Hi Uwe, at first look your configuration seems correct, see my comments below.

On 28/06/16 15:36, Uwe Reh wrote:
9. Start CDCR
http://SOURCE:s_port/solr/scoll/cdcr?action=start&wt=json
{"responseHeader":{"status":0,"QTime":13},"status":["process","started","buffer","enabled"]}
! (not even a single query to the target's zookeeper ??)

Indeed, you should have observed a communication between the source cluster and the target zookeeper. Do you see any errors in the log of the source cluster? Or a log message such as: "Unable to instantiate the log reader for target collection ..."

10. Enter some test data into the SOURCE
11. Explicit commit in SOURCE
http://SOURCE:s_port/solr/scoll/update?commit=true=true
!! (at least now there should be some traffic, or?)

Replication should start even if no commit has been sent to the source cluster.

12.
Check errors and queues:
http://SOURCE:s_port/solr/scoll_shard1_replica1/cdcr?action=queues&wt=json
{"responseHeader":{"status":0,"QTime":0},"queues":[],"tlogTotalSize":135,"tlogTotalCount":1,"updateLogSynchronizer":"stopped"}
http://SOURCE:s_port/solr/scoll_shard1_replica1/cdcr?action=errors&wt=json
{"responseHeader":{"status":0,"QTime":0},"errors":[]}
! Why is the element 'queues' empty?

The empty queue seems to indicate there is an issue, and that cdcr was unable to instantiate the replicator for the target cluster. Just to be sure, your source cluster has 4 shards, but no replicas? If it has replicas, can you ensure that you execute these commands on the shard leader.

Kind Regards
Re: CDCR (Solr6.x) does not start (logfile)
Hi,

trying to get more information, I restarted the SOURCE node and watched the log. For each shard I got the following triple:

WARN org.apache.solr.handler.CdcrRequestHandler - Action LASTPROCESSEDVERSION sent to non-leader replica @ scoll:shard1
ERROR org.apache.solr.handler.RequestHandlerBase - org.apache.solr.common.SolrException: Action LASTPROCESSEDVERSION sent to non-leader replica
WARN org.apache.solr.handler.CdcrUpdateLogSynchronizer - Caught unexpected exception org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://SOURCE:s_port/solr/scoll_shard1_replica1: Action LASTPROCESSEDVERSION sent to non-leader replica

Could this be the reason why there is no further action? The SOURCE cloud has just a replication factor of '1'; 'scoll_shard1_replica1' should always be the leader, or?

Regards
Uwe
CDCR (Solr6.x) does not start
Hi,

I'm trying to get CDCR to run, but I can't even trigger any communication between SOURCE and TARGET. It seems to be a small but grave misunderstanding. I've tested a lot of variants, but now I'm blind on this point. If anyone could give me a hint, I would appreciate it.

Uwe

Test setting:
Two nearly identical hosts (Open Solaris) with:
- a minimal zookeeper ensemble (one local installation (not embedded), listening on port 2181)
- a minimal cloud (one node, one empty collection, 4 shards)
Initially, both installations differ only in solrconfig.xml (snippets below). The tcp traffic was observed with 'snoop' (tcpdump). There are no packet filters or other firewalls between both machines.

Test process:
1. Start node for TARGET
2. Create TARGET collection 'tcoll'
http://TARGET:t_port/solr/admin/collections?action=CREATE&name=tcoll&numShards=4&replicationFactor=1&maxShardsPerNode=4&collection.configName=cdcr
3. Get status
http://TARGET:t_port/solr/tcoll/cdcr?action=status&wt=json
{"responseHeader":{"status":0,"QTime":0},"status":["process","stopped","buffer","enabled"]}
4. Disable buffer
http://TARGET:t_port/solr/tcoll/cdcr?action=disablebuffer&wt=json
{"responseHeader":{"status":0,"QTime":12},"status":["process","stopped","buffer","disabled"]}
6. Start node for SOURCE
(like expected, no tcp between both hosts)
7. Create SOURCE collection 'scoll'
http://SOURCE:s_port/solr/admin/collections?action=CREATE&name=scoll&numShards=4&replicationFactor=1&maxShardsPerNode=4&collection.configName=cdcr
(no tcp between both hosts)
8. Get status
http://SOURCE:s_port/solr/scoll/cdcr?action=status&wt=json
{"responseHeader":{"status":0,"QTime":13},"status":["process","stopped","buffer","enabled"]}
(like expected, no tcp between both hosts)
9. Start CDCR
http://SOURCE:s_port/solr/scoll/cdcr?action=start&wt=json
{"responseHeader":{"status":0,"QTime":13},"status":["process","started","buffer","enabled"]}
! (not even a single query to the target's zookeeper ??)
10. Enter some test data into the SOURCE
11. Explicit commit in SOURCE
http://SOURCE:s_port/solr/scoll/update?commit=true=true
!! (at least now there should be some traffic, or?)
12.
Check errors and queues:
http://SOURCE:s_port/solr/scoll_shard1_replica1/cdcr?action=queues&wt=json
{"responseHeader":{"status":0,"QTime":0},"queues":[],"tlogTotalSize":135,"tlogTotalCount":1,"updateLogSynchronizer":"stopped"}
http://SOURCE:s_port/solr/scoll_shard1_replica1/cdcr?action=errors&wt=json
{"responseHeader":{"status":0,"QTime":0},"errors":[]}
! Why is the element 'queues' empty? Where is my stupid bug?

# solrconfig Source
${solr.ulog.dir:}
TARGET:2181
scoll
tcoll
1

# solrconfig Target
${solr.ulog.dir:}
${solr.autoCommit.maxdocs:1000}
${solr.autoCommit.maxTime:300}
true
${solr.autoSoftCommit.maxTime:60}
disabled
cdcr-processor-chain

##
# EOF
#
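The XML markup of the solrconfig snippets above was stripped in transit, leaving only the values. Based on the Solr 6 CDCR reference configuration, source and target configs with these values might look roughly like the following sketch (element and class names are from the CDCR documentation; the mapping of the surviving values, especially the "1" as threadPoolSize, is an assumption):

```xml
<!-- Source cluster (sketch; values from the message, structure from the CDCR docs) -->
<updateLog class="solr.CdcrUpdateLog">
  <str name="dir">${solr.ulog.dir:}</str>
</updateLog>

<requestHandler name="/cdcr" class="solr.CdcrRequestHandler">
  <lst name="replica">
    <str name="zkHost">TARGET:2181</str>
    <str name="source">scoll</str>
    <str name="target">tcoll</str>
  </lst>
  <lst name="replicator">
    <str name="threadPoolSize">1</str>
  </lst>
</requestHandler>

<!-- Target cluster (sketch) -->
<requestHandler name="/cdcr" class="solr.CdcrRequestHandler">
  <lst name="buffer">
    <str name="defaultState">disabled</str>
  </lst>
</requestHandler>

<updateRequestProcessorChain name="cdcr-processor-chain">
  <processor class="solr.CdcrUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```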
Re: Solr 6 CDCR does not work
Hi Adam,

maybe it's my poor English, but I'm confused. I've taken Renaud's quote as a hint to activate autocommit on the target cluster, or at least to do frequent manual commits in order to see the replicated documents. Now you wrote that disabling autocommit helps. Could you please clarify this point?

Regards
Uwe

On 01.06.2016 at 12:28, Adam Majid Sanjaya wrote:
"disable autocommit on the target" It worked! thanks

2016-05-30 15:40 GMT+07:00 Renaud Delbru:
Hi Adam, ... Also, do you have an autocommit configured on the target? CDCR does not replicate commits, and therefore you have to send a commit command on the target to ensure that the latest replicated documents are visible. ...
--
Renaud Delbru
relaxed vs. improved validation in solr.TrieDateField
Hi,

doing some migration tests (4.10 to 6.0) I recognized an improved validation in TrieDateField. Syntactically correct but impossible days are rejected now (stack trace at the end of the mail). Examples:
- '1997-02-29T00:00:00Z'
- '2006-06-31T00:00:00Z'
- '2000-00-00T00:00:00Z'

The first two dates are formally OK, but the days do not exist. The third date is more suspicious, but was also accepted by Solr 4.10. I appreciate this improvement in principle, but I have to respect the original data. The dates might be intentionally wrong. Is there an easy way to get the weaker validation back?

Regards
Uwe

Invalid Date in Date Math String:'1997-02-29T00:00:00Z'
at org.apache.solr.util.DateMathParser.parseMath(DateMathParser.java:254)
at org.apache.solr.schema.TrieField.createField(TrieField.java:726)
at org.apache.solr.schema.TrieField.createFields(TrieField.java:763)
at org.apache.solr.update.DocumentBuilder.addField(DocumentBuilder.java:47)
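The behavior can be reproduced outside Solr: strict calendar resolution rejects exactly these dates, while lenient parsers (presumably what older Solr versions used) silently rolled them over. A minimal java.time sketch of the strict check, not Solr's actual DateMathParser code:

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.time.format.ResolverStyle;

public class DateLeniency {
    // Strict ISO-like parser: rejects calendar-impossible dates.
    // Note: 'uuuu' (proleptic year) is required for STRICT resolution.
    static final DateTimeFormatter STRICT =
            DateTimeFormatter.ofPattern("uuuu-MM-dd").withResolverStyle(ResolverStyle.STRICT);

    static boolean isValidDate(String date) {
        try {
            LocalDate.parse(date, STRICT);
            return true;
        } catch (DateTimeParseException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValidDate("1997-02-29")); // false: 1997 is not a leap year
        System.out.println(isValidDate("2006-06-31")); // false: June has 30 days
        System.out.println(isValidDate("2000-00-00")); // false: month and day 0
        System.out.println(isValidDate("2000-02-29")); // true: 2000 is a leap year
    }
}
```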
Re: faceting is unusable slow since upgrade to 5.3.0
Sorry for the delay, I had an ugly flu.

SOLR-7730 seems to work fine. Using docValues with Solr 5.4.0-2015-09-29_08-29-55 1705813 makes my faceted queries fast again (90 ms vs. 2 ms). :-)

Thanks
Uwe

On 27.09.2015 at 20:32, Mikhail Khludnev wrote:
On Sun, Sep 27, 2015 at 2:00 PM, Uwe Reh <r...@hebis.uni-frankfurt.de> wrote:
"When 5.4 with SOLR-7730 is released, I will start to use docValues. Going this way seems more straightforward to me."
Sure. Given your answers, docValues facets have a really good chance to perform well in your index after SOLR-7730. It's really interesting to see performance numbers on early 5.4 builds: https://builds.apache.org/view/All/job/Solr-Artifacts-5.x/lastSuccessfulBuild/artifact/solr/package/
Re: Scramble data
Hi,

my suggestions are probably too simple, because they are not a real protection of privacy. But maybe one fits your needs.

Most simple:
Declare your 'hidden' fields just as indexed="true" stored="false". The data will be used for searching, but the fields are not listed in the query response.
Cons: The terms of the fields can still be examined by advanced users. For example, they could use the field as a facet.

Very simple:
Use a PhoneticFilter for indexing and searching. The encoder "ColognePhonetic" generates a numeric hash for each term; the name "Breschnew" will be saved as "17863".
Cons: Phonetic similarities will lead to false hits. This hashing is really only scrambling and not appropriate as a security feature.

Simple:
Declare a special SearchHandler in your solrconfig.xml and define an invariant fieldList (fl) parameter. This should contain just the public subset of your fields.
Cons: I'm not really sure about this.

Still quite simple:
Write your own filter which generates real cryptographic hashes.
Cons: If the entropy of your data is poor, you may need additional tricks like padding the data. This filter may slow down your system.

Last but not least, be aware that searching can be a way to restore hidden information. If a query for "billionaire" gets just one hit, it's obvious that "billionaire" is an attribute of the document, even if it is not listed in the result.

Uwe
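The "very simple" option above can be sketched as a field type in schema.xml (the type name is made up; inject="false" keeps only the phonetic hash and drops the original token):

```xml
<!-- Hypothetical sketch: type name is an assumption -->
<fieldType name="text_scrambled" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PhoneticFilterFactory" encoder="ColognePhonetic" inject="false"/>
  </analyzer>
</fieldType>
```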
Re: faceting is unusable slow since upgrade to 5.3.0
Hi Mikhail,

is this what you've requested?

lookups: 34084
hits: 34067
hitratio: 1
inserts: 34
evictions: 0
...
item_author_facet: {field=author_facet,memSize=104189615,tindexSize=789195,time=16901,phase1=16534,nTerms=3989851,bigTerms=0,termInstances=16214154,uses=4065}
item_topic_facet: {field=topic_facet,memSize=103817915,tindexSize=112199,time=8912,phase1=8496,nTerms=525261,bigTerms=0,termInstances=11050466,uses=1510}
item_material_access: {field=material_access,memSize=4532,tindexSize=46,time=1820,phase1=1820,nTerms=2,bigTerms=2,termInstances=0,uses=3406}

(The fields 'author_facet' and 'topic_facet' have a lot of unique entries; 'material_access' has only two values ('online' vs. 'print').)

Apart from "*:*", queries with more than maxdoc/2 hits happen very rarely. A typical request yields less than 1% of maxdoc. Here is a typical example, searching for "Goethe" in the portfolio of the University Library Frankfurt/Main:
> https://hds.hebis.de/ubffm/Search/Results?lookfor=goethe=new
The request yields over 31,000 results (~0.2% of maxdoc). The majority are books about Goethe; 'just' 5700 books are by him. The facet helps to detect the professionals. Like Walter Underwood wrote, in a technical sense faceting on authors isn't a good idea. In the worst case, the relation of book to author is n:n. Nevertheless, thanks to authority files (which are used intensively in Germany), the 'author' facet is often helpful.

Uwe

On 26.09.2015 at 14:08, Mikhail Khludnev wrote:
Uwe, would you mind providing a few details about your case? I wonder about the number of bigTerms and other stats as well for the 'author' field (and the other most expensive facets). It looks like log rows:
Sep 13, 2011 2:51:53 PM org.apache.solr.request.UnInvertedField uninvert
INFO: UnInverted multi-valued field {*field=nomejornal*,memSize=827108,tindexSize=40,time=16,phase1=4,*nTerms=15,bigTerms=0*,termInstances=750,uses=0}
Those heavy requests, do they find more than half of the docs, e.g. hits > maxdoc/2?
Thanks for your input!

On Thu, Sep 24, 2015 at 11:38 AM, Uwe Reh <r...@hebis.uni-frankfurt.de> wrote:
On 22.09.2015 at 18:10, Walter Underwood wrote:
"Faceting on an author field is almost always a bad idea. Or at least a slow, expensive idea."

Hi Wunder,
in a technical context, the 'author' facet may be suboptimal. In our business (library services) it's a core feature. Yes, the facet is expensive, but thanks to the fieldValueCache (4.10) it is sufficiently fast.
Uwe
Re: faceting is unusable slow since upgrade to 5.3.0
Hi Mikhail,

thanks for the hint, and "no", it wasn't obvious to me. :-)
But I think for us it's better to remain on 4.10.3 and observe the evolution of SOLR-8096. When 5.4 with SOLR-7730 is released, I will start to use docValues. Going this way seems more straightforward to me.

Uwe

On 27.09.2015 at 00:20, Mikhail Khludnev wrote:
Uwe, as a workaround, can you add facet.threads=Ncores to count fields in parallel? Also, setting the fcs method for single-value fields runs per-segment faceting in parallel. Of course, fields which have a small number of terms benefit from the enum method. Excuse me if it's obvious. https://cwiki.apache.org/confluence/display/solr/Faceting
Re: Different ports for search and upload request
On 25.09.2015 at 00:05, Siddhartha Singh Sandhu wrote:
"*Never did this.* But how about this crazy idea: Take an Amazon EFS and share it between 2 EC2."

I think you are on the right way. Imho this requirement should be solved externally.

Option 1: Hide your Solr node behind an http proxy which publishes the APIs/handlers on different ports. Or publish only request handlers like 'select' and 'get', and let your update process use the full API.

Option 2: Use replication. Update the master and send your queries to the slave.

Uwe
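Option 2 boils down to two ReplicationHandler declarations in solrconfig.xml; a sketch following the master/slave replication documentation (host, port, core name and poll interval are assumptions):

```xml
<!-- Master (indexing node) -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
  </lst>
</requestHandler>

<!-- Slave (query node); URL and interval are hypothetical -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://master-host:8983/solr/corename</str>
    <str name="pollInterval">00:00:60</str>
  </lst>
</requestHandler>
```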
Re: faceting is unusable slow since upgrade to 5.3.0
On 25.09.2015 at 05:16, Yonik Seeley wrote:
"I did some performance benchmarks and opened an issue. It's bad. https://issues.apache.org/jira/browse/SOLR-8096"

Hi Yonik,

thanks a lot for your investigation. Using the JSON Facet API is fast and seems to be a usable workaround for new applications, but not really a fast patch for our production environment. What's your assessment of Bill's question? Is there a chance to get the fieldValueCache back? I would like to have it back in 5.x, even marked as deprecated. This would help to migrate.

Uwe
Re: faceting is unusable slow since upgrade to 5.3.0
On 23.09.2015 at 10:02, Mikhail Khludnev wrote:
"... Accelerating non-DV facets is not so clear so far. Please show a profiler snapshot for non-DV facets if you wish to go this way."

Hi,

attached is a VisualVM profile of a simplified query (just one facet), run several times:
http://xyz/solr/hebis/select/?q=*:*=true=1=30=author_facet=true

The average "QTime" for the query is ~5 seconds:
5254.0 0.0 5253.0 0.0 0.0 0.0 0.0

The profile was made with Solr 5.3 running a 4.10 index with no 'docValues' at all in the schema. (A native 5.3 index with docValues is still building.) For me it's surprising that a lot of "docValues" methods can be found in the profile.

Uwe

PS: Meanwhile I tried 5.1 and got the same behavior.

"Hot Spots - Method";"Self Time [%]";"Self Time";"Self Time (CPU)";"Total Time";"Total Time (CPU)";"Samples"
"sun.nio.ch.ServerSocketChannelImpl.accept()";"29.757507";"911411.696 ms";"227852.922 ms";"911411.696 ms";"227852.922 ms";"4"
"sun.nio.ch.SelectorImpl.select()";"29.751842";"911238.171 ms";"911238.171 ms";"911238.171 ms";"911238.171 ms";"6"
"java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos()";"12.056136";"369254.84 ms";"0.0 ms";"369254.84 ms";"0.0 ms";"73"
"java.lang.Object.wait()";"7.439377";"227852.924 ms";"0.0 ms";"227852.924 ms";"0.0 ms";"1"
"java.net.ServerSocket.accept()";"7.439377";"227852.924 ms";"0.0 ms";"227852.924 ms";"0.0 ms";"1"
"java.util.HashMap.put()";"3.8150647";"116847.636 ms";"116847.636 ms";"116847.636 ms";"116847.636 ms";"873"
"java.util.TreeMap.put()";"2.946289";"90238.818 ms";"90238.818 ms";"90238.818 ms";"90238.818 ms";"180"
"org.apache.lucene.index.FieldInfos$Builder.addOrUpdateInternal()";"2.1034875";"64425.528 ms";"64425.528 ms";"183450.033 ms";"183450.033 ms";"113"
"java.util.Collections$UnmodifiableCollection$1.next()";"0.8864094";"27148.909 ms";"27148.909 ms";"27148.909 ms";"27148.909 ms";"41"
"java.util.TreeMap$EntryIterator.next()";"0.81940365";"25096.661 ms";"25096.661 ms";"25096.661 ms";"25096.661 ms";"26"
"java.util.HashMap.get()";"0.66768044";"20449.689 ms";"20449.689 ms";"20449.689 ms";"20449.689 ms";"159"
"org.apache.solr.request.DocValuesFacets.accumMultiSeg()";"0.42119572";"12900.365 ms";"12900.365 ms";"32423.444 ms";"32423.444 ms";"23"
"org.apache.lucene.util.packed.MonotonicLongValues.get()";"0.37381834";"11449.292 ms";"11449.292 ms";"11449.292 ms";"11449.292 ms";"73"
"java.util.AbstractCollection.toArray()";"0.3550354";"10874.009 ms";"10874.009 ms";"10874.009 ms";"10874.009 ms";"63"
"org.apache.lucene.index.FieldInfos.<init>()";"0.319384";"9782.08 ms";"9782.08 ms";"150232.207 ms";"150232.207 ms";"69"
"org.apache.lucene.uninverting.DocTermOrds$Iterator.read()";"0.26374063";"8077.837 ms";"8077.837 ms";"8077.837 ms";"8077.837 ms";"64"
"java.util.Collections.max()";"0.21143816";"6475.919 ms";"6475.919 ms";"6475.919 ms";"6475.919 ms";"46"
"org.apache.solr.request.DocValuesFacets.getCounts()";"0.090463296";"2770.706 ms";"2770.706 ms";"410211.805 ms";"410211.805 ms";"60"
"org.apache.lucene.search.Weight$DefaultBulkScorer.scoreAll()";"0.05521328";"1691.07 ms";"1691.07 ms";"1691.07 ms";"1691.07 ms";"2"
"java.lang.System.identityHashCode[native]()";"0.031022375";"950.152 ms";"950.152 ms";"950.152 ms";"950.152 ms";"3"
"org.apache.solr.util.LongPriorityQueue.downHeap()";"0.026107715";"799.626 ms";"799.626 ms";"799.626 ms";"799.626 ms";"9"
"java.util.Collections$UnmodifiableCollection$1.hasNext()";"0.020632554";"631.933 ms";"631.933 ms";"631.933 ms";"631.933 ms";"6"
"org.apache.lucene.index.FieldInfo.<init>()";"0.011944577";"365.838 ms";"365.838 ms";"365.838 ms";"365.838 ms";"4"
"java.util.WeakHashMap.put()";"0.011552288";"353.823 ms";"353.823 ms";"353.823 ms";"353.823 ms";"2"
"org.apache.lucene.index.FieldInfos$Builder.add()";"0.010934878";"334.913 ms";"334.913 ms";"211565.8 ms";"211565.8 ms";"181"
"org.eclipse.jetty.server.HttpOutput.write()";"0.010440102";"319.759 ms";"319.759 ms";"482.602 ms";"482.602 ms";"9"
"java.util.WeakHashMap.get()";"0.010077655";"308.658 ms";"308.658 ms";"308.658 ms";"308.658 ms";"2"
"org.apache.lucene.util.LongValues.get()";"0.010070211";"308.43 ms";"308.43 ms";"11757.722 ms";"11757.722 ms";"74"
"org.apache.lucene.util.fst.FST.findTargetArc()";"0.00995512";"304.905 ms";"304.905 ms";"304.905 ms";"304.905 ms";"2"
"org.apache.lucene.uninverting.DocTermOrds.uninvert()";"0.008576673";"262.686 ms";"262.686 ms";"262.686 ms";"262.686 ms";"1"
"org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter()";"0.0037003446";"113.334 ms";"113.334 ms";"341247.51 ms";"341247.51 ms";"58"
"java.lang.String.getBytes()";"0.0032916984";"100.818 ms";"100.818 ms";"100.818 ms";"100.818 ms";"1"
"java.nio.DirectByteBuffer.get()";"0.0032914046";"100.809 ms";"100.809 ms";"100.809 ms";"100.809 ms";"1"
"org.eclipse.jetty.http.DateGenerator.doFormatDate()";"0.003270182";"100.159 ms";"100.159 ms";"100.159 ms";"100.159 ms";"1"
Re: faceting is unusable slow since upgrade to 5.3.0
On 22.09.2015 at 18:10, Walter Underwood wrote:
"Faceting on an author field is almost always a bad idea. Or at least a slow, expensive idea."

Hi Wunder,

in a technical context, the 'author' facet may be suboptimal. In our business (library services) it's a core feature. Yes, the facet is expensive, but thanks to the fieldValueCache (4.10) it is sufficiently fast.

Uwe
Re: faceting is unusable slow since upgrade to 5.3.0
On 22.09.2015 at 02:12, Joel Bernstein wrote:
"Have you looked at your Solr instance with a cpu profiler like YourKit? It would be useful to see the hotspots, which should be really obvious with 20 second response times."

No, until now I have done no profiling. I thought the unused fieldValueCache was a clear indicator of my faulty operation. Because we are a public service, I cannot use YourKit (not because of the license itself; the local expense of licensing is the blocker). I will try to detect the hotspot with VisualVM.

"Also, are you running in distributed mode or on a single Solr instance?"

Just a single instance.

Thanks for the attention
Uwe
Re: faceting is unusable slow since upgrade to 5.3.0
The exact version as shown by the UI is:
- solr-impl 5.3.0 1696229 - noble - 2015-08-17 17:10:43
- lucene-impl 5.3.0 1696229 - noble - 2015-08-17 16:59:03

Unfortunately my skills in debugging are limited, so I'm not sure about a 'deeper caller stack'. Did you mean the attached snapshot from VisualVM, a stack trace like the one below, or something else? Please give me a hint.

Uwe

"qtp1734853116-68" #68 prio=5 os_prio=64 tid=0x117fd800 nid=0x77 runnable [0xfd7f991fc000]
java.lang.Thread.State: RUNNABLE
at java.util.HashMap.resize(HashMap.java:734)
at java.util.HashMap.putVal(HashMap.java:662)
at java.util.HashMap.put(HashMap.java:611)
at org.apache.lucene.index.FieldInfos$Builder.addOrUpdateInternal(FieldInfos.java:344)
at org.apache.lucene.index.FieldInfos$Builder.add(FieldInfos.java:366)
at org.apache.lucene.index.FieldInfos$Builder.add(FieldInfos.java:304)
at org.apache.lucene.index.MultiFields.getMergedFieldInfos(MultiFields.java:245)
at org.apache.lucene.index.SlowCompositeReaderWrapper.getFieldInfos(SlowCompositeReaderWrapper.java:237)
at org.apache.lucene.index.SlowCompositeReaderWrapper.getSortedSetDocValues(SlowCompositeReaderWrapper.java:174)
at org.apache.solr.request.DocValuesFacets.getCounts(DocValuesFacets.java:72)
at org.apache.solr.request.SimpleFacets.getTermCounts(SimpleFacets.java:492)
at org.apache.solr.request.SimpleFacets.getTermCounts(SimpleFacets.java:385)
at org.apache.solr.request.SimpleFacets$3.call(SimpleFacets.java:628)
at org.apache.solr.request.SimpleFacets$3.call(SimpleFacets.java:619)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at org.apache.solr.request.SimpleFacets$2.execute(SimpleFacets.java:573)
at org.apache.solr.request.SimpleFacets.getFacetFieldCounts(SimpleFacets.java:644)
at org.apache.solr.handler.component.FacetComponent.getFacetCounts(FacetComponent.java:294)
at org.apache.solr.handler.component.FacetComponent.process(FacetComponent.java:256)
at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:285)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:143)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:2068)
at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:669)
at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:462)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:210)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:179)
at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)
at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)
at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:110)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
at org.eclipse.jetty.server.Server.handle(Server.java:499)
at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:310)
at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
at org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
at java.lang.Thread.run(Thread.java:745)

On 22.09.2015 at 12:56, Mikhail Khludnev wrote:
It's quite strange. https://issues.apache.org/jira/browse/SOLR-7730 significantly optimized DV facets at 5.3.0 exactly by avoiding the FieldInfos merge. Would you mind providing a deeper caller stack for org.apache.lucene.index.MultiFields.getMergedFieldInfos()? Or the time spent in SlowCompositeReaderWrapper, DocValuesFacets, MultiDocValues and their hot methods. Which version are you exactly on, and how do you know that?
Thanks
Re: faceting is unusable slow since upgrade to 5.3.0
Here is my attempt to detect some hot spots with VisualVM.

Environment: a newly started node with ~15 repetitions of the query:
http://yxz/solr/hebis/select/?q=darwin=true=1=30=material_access=department_3=rvk_facet=author_facet=material_brief=language==count=all=true

Ordered by self time, the top methods are:
org.eclipse.jetty.util.BlockingArrayQueue.poll(): 260s (self), 260s (total)
org.apache.lucene.index.FieldInfos.<init>(): 90s (self), 90s (total)
org.apache.lucene.index.FieldInfos$FieldNumbers.addOrGet(): 60s (self), 60s (total)
org.apache.lucene.index.FieldInfos$Builder.addOrUpdateInternal(): 51s (self), 121s (total)
org.apache.lucene.index.FieldInfos$Builder.finish(): 13s (self), 102s (total)
org.apache.lucene.index.FieldInfos$Builder.fieldInfo(): 9s (self), 9s (total)
org.apache.lucene.index.FieldInfos$Builder.add(): 4s (self), 126s (total)
org.apache.lucene.index.MultiFields.getMergedFieldInfos(): 1s (self), 229s (total)
... less than 1000 ms

Ordered by total time, the top (non-http/jetty) methods are:
jetty ...: 231s (total)
org.apache.solr.handler.component.SearchHandler.handleRequestBody(): 231s (total)
org.apache.solr.request.SimpleFacets.*: 230s (total)
org.apache.solr.handler.component.FacetComponent.*: 230s (total)
org.apache.lucene.index.*: 125s (total)
org.apache.lucene.search.*: .3s (total)
... less than 300 ms
Re: faceting is unusable slow since upgrade to 5.3.0 (missing attachment)
virtualvm_snapshot_solr5.3_facetting.csv Description: MS-Excel spreadsheet
faceting is unusable slow since upgrade to 5.3.0
Hi,

our bibliographic index (~20M entries) runs fine with Solr 4.10.3. With Solr 5.3, faceted searching is constantly, incredibly slow (~20 seconds).

Output of 'debugQuery': 17705.0 2.0 17590.0 !! 111.0

The 'fieldValueCache' seems to be unused (no inserts nor lookups) in Solr 5.3. In Solr 4.10 the 'fieldValueCache' is in heavy use, with a cumulative_hitratio of 1.
- The behavior is the same whether Solr 5.3 runs on a copy of the old index (luceneMatch=4.6) or on a newly built index.
- Using 'facet.method=enum' makes no remarkable difference.
- Declaring 'docValues' (with reindexing) makes no remarkable difference.
- 'softCommit' isn't used.

My environment:
OS: Solaris 5.11 on AMD64
JDK: 1.8.0_25 and 1.8.0_60 (same behavior)
JavaOpts: -Xmx10g -XX:+UseG1GC -XX:+AggressiveOpts -XX:+UseLargePages -XX:LargePageSizeInBytes=2m

Any help/advice is welcome
Uwe
Re: faceting is unusable slow since upgrade to 5.3.0
On 21.09.2015 at 15:16, Shalin Shekhar Mangar wrote:
"Can you post your complete facet request as well as the schema definition of the field on which you are faceting?"

Query:
http://yxz/solr/hebis/select/?q=darwin=true=1=30=material_access=department_3=rvk_facet=author_facet=material_brief=language==count=all=true

Schema (with docValues): ... ... ...
Schema (w/o docValues): ... ... ...

solrconfig: ... ... 10 allfields none query facet stats debug elevator
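For comparison, a facet-only field backed by docValues would typically be declared as a plain string field, roughly like this sketch (the field name is borrowed from the query above, everything else is an assumption):

```xml
<!-- Hypothetical sketch -->
<fieldType name="string" class="solr.StrField" sortMissingLast="true"/>
<field name="author_facet" type="string" indexed="true" stored="false"
       multiValued="true" docValues="true"/>
```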
Re: Understanding the Debug explanations for Query Result Scoring/Ranking
Hi, to get an idea of the meaning of all this numbers, have a look on http://explain.solr.pl. I like this tool, it's great. Uwe Am 25.07.2014 00:45, schrieb O. Olson: Hi, If you add /*debug=true*/ to the Solr request /(and wt=xml if your current output is not XML)/, you would get a node in the resulting XML that is named debug. There is a child node to this called explain to this which has a list showing why the results are ranked in a particular order. I'm curious if there is some documentation on understanding these numbers/results. I am new to Solr, so I apologize that I may be using the wrong terms to describe my problem. I also aware of http://lucene.apache.org/core/4_9_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html though I have not completely understood it. My problem is trying to understand something like this: 1.5797625 = (MATCH) sum of: 0.4717142 = (MATCH) weight(text:televis in 44109) [DefaultSimilarity], result of: 0.4717142 = score(doc=44109,freq=1.0 = termFreq=1.0 ), product of: 0.71447384 = queryWeight, product of: 7.0424104 = idf(docFreq=896, maxDocs=377553) 0.10145303 = queryNorm 0.660226 = fieldWeight in 44109, product of: 1.0 = tf(freq=1.0), with freq of: 1.0 = termFreq=1.0 7.0424104 = idf(docFreq=896, maxDocs=377553) 0.09375 = fieldNorm(doc=44109) 1.1080483 = (MATCH) weight(text:tv in 44109) [DefaultSimilarity], result of: 1.1080483 = score(doc=44109,freq=6.0 = termFreq=6.0 ), product of: 0.6996622 = queryWeight, product of: 6.896415 = idf(docFreq=1037, maxDocs=377553) 0.10145303 = queryNorm 1.5836904 = fieldWeight in 44109, product of: 2.4494898 = tf(freq=6.0), with freq of: 6.0 = termFreq=6.0 6.896415 = idf(docFreq=1037, maxDocs=377553) 0.09375 = fieldNorm(doc=44109) *Note:* I have searched for televisions. My search field is a single catch-all field. The Edismax parser seems to break up my search term into televis and tv Is there some documentation on how to understand these numbers. 
They do not seem to be properly delimited. At the minimum, I can understand something like 1.5797625 = 0.4717142 + 1.1080483 and 0.71447384 = 7.0424104 * 0.10145303. But I cannot understand whether something like 0.10145303 = queryNorm or 0.660226 = fieldWeight in 44109 is used in the calculation anywhere. Also, since there were only two terms (televis and tv), I could use subtraction to find out that 1.1080483 was the start of a new result. I'd also appreciate it if someone could tell me which class dumps out the above data. If I know it, I can edit that class to make the output a bit more understandable for me. Thank you, O. O.
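The arithmetic behind such a DefaultSimilarity explain tree can be checked by hand. A minimal sketch (plain Python, reproducing the numbers quoted above under the classic Lucene TF-IDF formulas: tf = sqrt(freq), idf = ln(maxDocs/(docFreq+1)) + 1, queryWeight = idf * queryNorm, fieldWeight = tf * idf * fieldNorm, and the per-term score = queryWeight * fieldWeight):

```python
import math

def term_score(freq, doc_freq, max_docs, query_norm, field_norm):
    """Recompute one '(MATCH) weight(...)' branch of a DefaultSimilarity explain."""
    tf = math.sqrt(freq)                            # 2.4494898 for freq=6
    idf = math.log(max_docs / (doc_freq + 1)) + 1   # 6.896415 for docFreq=1037
    query_weight = idf * query_norm                 # idf * queryNorm
    field_weight = tf * idf * field_norm            # tf * idf * fieldNorm
    return query_weight * field_weight

televis = term_score(1.0, 896, 377553, 0.10145303, 0.09375)
tv = term_score(6.0, 1037, 377553, 0.10145303, 0.09375)
total = televis + tv  # the top-level "sum of:" line
```

Running this reproduces 0.4717142, 1.1080483 and the total 1.5797625 from the quoted output, which confirms that queryNorm and fieldNorm do enter the calculation, inside queryWeight and fieldWeight respectively.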
Re: Solr Fields Multilingue
On 30.06.2014 16:57, benjelloun wrote: AllChamp doesn't apply any analyzer or filter. Any idea? Example: I search for AllChamp:presenton -- num results=0; AllChamp:présenton -- num results=1.

Hi Anass, no analyzer means no modification (no ICU normalisation). copyField copies just the raw input, not the processed tokens from the source field(s). Maybe that's your misconception. Uwe
Re: Two solr instances access common index
Hi, with the lock type 'simple' I have three instances (different JREs, GC problem) running on the same files. You should use this option only for a read-only system; otherwise it's easy to corrupt the index. Maybe you should have a look at replication or SolrCloud. Uwe

On 26.06.2014 11:25, Prasi S wrote: Hi, Is it possible to point two solr instances to a common index directory? Will this work with changing the lock type? Thanks, Prasi
Re: Distributed search with Terms Component and Solr Cloud.
Hi Ryan, just take a look at the thread "TermsComponent/SolrCloud". Setting your parameters as defaults in solrconfig.xml should help. Uwe

On 13.01.2014 20:24, Ryan Fox wrote: Hello, I am running Solr 4.6.0. I am experiencing some difficulties using the terms component across multiple shards. I see according to the documentation it should work, but I am unable to do so with SolrCloud. When I have one shard, queries using the terms component respond as I would expect. However, when I split my index across two shards, I get empty results for the same query. I am querying Solr with a CloudSolrServer object. When I manually add the query params shards and shards.qt to my SolrQuery, I get the expected response. It's not ideal, but if there's a way to get a list of all shards programmatically, I could set that parameter. From the documentation, it appears to me the terms component should be supported by SolrCloud, but I can't find anything that explicitly says one way or the other. If there is a better way to do it, or perhaps something I have misconfigured, any advice would be much appreciated. If it's just not possible, I will manage. I can provide more configuration or specifically how I am running the query if that would help. Ryan Fox
Re: Error when creating collection in Solr 4.6
Hi, I had the same problem. In my case the error was a copy/paste typo in my solr.xml:

<str name="genericCoreNodeNames">${genericCoreNodeNames:true}</str>

Ouch! With the type 'bool' instead of 'str' it works definitely better. ;-) Uwe

On 28.11.2013 08:53, lansing wrote: Thank you for your replies. I am using the new-style discovery. It worked after adding this setting:

<bool name="genericCoreNodeNames">${genericCoreNodeNames:true}</bool>
Re: How to index X™ as &#8482; (HTML decimal entity)
What about having a simple charfilter in the analyzer chain for indexing *and* searching? E.g.

<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="™" replacement="&#8482;"/>

or

<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-specials.txt"/>

Uwe

On 19.11.2013 23:46, Developer wrote: I have data coming into SOLR as below:

<field name="displayName">X™ - Black</field>

I need to store the HTML entity (decimal) equivalent value (i.e. &#8482;) in SOLR rather than storing the original value. Is there a way to do this?
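For the MappingCharFilterFactory variant, the referenced mapping file holds one "source" => "target" rule per line. A sketch of what the assumed mapping-specials.txt could contain (the filename is just the one used in the example above; \u2122 is the Unicode escape for the ™ sign):

```text
# mapping-specials.txt: map the trademark sign to its decimal entity
"\u2122" => "&#8482;"
```

Further special characters can be added as extra lines in the same form.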
Re: [ANNOUNCE] Apache Solr Reference Guide 4.5 Available
On 18.11.2013 14:39, Furkan KAMACI wrote: Atlassian Jira has two options by default: exporting to PDF and exporting to Word.

I see, 'Word' isn't optimal for a reference guide. But OO can handle 'doc' and has epub plugins. Would it be possible to offer the documentation also as 'doc(x)'? barefaced Uwe
Re: [ANNOUNCE] Apache Solr Reference Guide 4.5 Available
Thank you for opening the issue. I'm not sure that my case is representative. I spend three hours every day on the train (commuting to work). I like to use this time to have a closer look into manuals. Printouts and laptops are horrible in this situation, so the only alternative is between my 10" tablet and my 6" e-reader. I prefer the more handy reader. No, I can't afford a nice Nexus 7. Not now ;-) Uwe

On 19.11.2013 17:08, Cassandra Targett wrote: I've often thought of possibly providing the reference guide in .epub format, but wasn't sure of general interest. I also once tried to convert the PDF version with calibre and it was a total mess - but PDF is probably the least-flexible starting point for conversion. Unfortunately, the Word export is only available on a per-page basis, which would make it really tedious to try to make a .doc version of the entire guide (there are ~150 pages). There are, however, options for HTML export, which I believe could be converted to .epub - but might take some fiddling. I created an issue for this - for now just to track that it's something that might be of interest - but not sure if/when I'd personally be able to work on it: https://issues.apache.org/jira/browse/SOLR-5467. On Tue, Nov 19, 2013 at 6:34 AM, Uwe Reh r...@hebis.uni-frankfurt.de wrote: On 18.11.2013 14:39, Furkan KAMACI wrote: Atlassian Jira has two options by default: exporting to PDF and exporting to Word. I see, 'Word' isn't optimal for a reference guide. But OO can handle 'doc' and has epub plugins. Would it be possible to offer the documentation also as 'doc(x)'? barefaced Uwe
Re: [ANNOUNCE] Apache Solr Reference Guide 4.5 Available
I'd like to read the guide as e-paper. Is there a way to obtain the document in epub or odt format? Trying to convert the PDF with Calibre wasn't very satisfying. :-( Uwe

On 05.10.2013 14:19, Steve Rowe wrote: The Lucene PMC is pleased to announce the release of the Apache Solr Reference Guide for Solr 4.5. This 338-page PDF serves as the definitive user's manual for Solr 4.5. The Solr Reference Guide is available for download from the Apache mirror network: https://www.apache.org/dyn/closer.cgi/lucene/solr/ref-guide/ Steve
SolrCloud: read only node
Hi, as a service provider for libraries we run a small cloud (1 collection, 1 shard, 3 replicas). To improve local reliability we want to offer the possibility to set up additional local replicas. As far as I know, this can easily be done just by adding a new node to the cloud. But the external node shouldn't be able to make any changes to the index. Is there a cheap way to restrict a node of a SolrCloud to a read-only mode? Or is it a better idea to do legacy replication from one node (master) to an external slave? Uwe
Re: SolrCloud: read only node
F***, this is the answer I was afraid of. ;-) I had hoped there could be something similar to http://zookeeper.apache.org/doc/trunk/zookeeperObservers.html. Nevertheless, thank you. Uwe

On 04.11.2013 14:14, Erick Erickson wrote: In this situation, I'd consider going with the older master/slave setup. The problem is that in SolrCloud, you have a lot of chatter back and forth. Presumably the connection to your local instances is rather slow, so if you're adding data to your index, each and every add has to be communicated individually to the remote node. But no, there's no good way in SolrCloud to make a node read-only. Actually, that doesn't really make sense in the SolrCloud world, since each node maintains its own index, does its own indexing, etc. So each node _must_ be able to change the Solr index it uses. FWIW, Erick
SOLR-3076 for beginners?
Hi, block join seems to be a really cool feature. Unfortunately I'm too dumb to get the patch running. I don't even know where to start :-( Is there anywhere an example, a howto or a cookbook, other than using Elasticsearch or bare Lucene? Uwe
Re: Nested function query must use ....
Hi Jack, thanks a lot for the hint.

On 02.02.2013 00:46, Jack Krupansky wrote: I've updated the example on the Function Query wiki that you may have copied: http://wiki.apache.org/solr/FunctionQuery#exists

Thanks again, because the wiki page was really my starting point. Uwe
Nested function query must use ....
Hi, this should be easy, but I'm too blind to find the correct syntax (Solr 4.1). Problem: I have some documents in the index that, because of their structure, tend to get too-high scores. These documents are easy to identify, and I want to boost the others to get a fair ranking. Could anyone give me the correct syntax to accomplish this simplified query?

...q=*:*&fl=foo:exists(query(id:3))

Uwe

Example:

<response>
  <lst name="responseHeader">
    <int name="status">400</int>
    <int name="QTime">2</int>
    <lst name="params">
      <str name="fl">foo:exists(query(id:3))</str>
      <str name="q">*:*</str>
    </lst>
  </lst>
  <lst name="error">
    <str name="msg">Error parsing fieldname: Nested function query must use $param or {!v=value} forms. got 'exists(query(id:3))'</str>
    <int name="code">400</int>
  </lst>
</response>

But ...q=*:*&fl=foo:exists(id) works:

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">1</int>
    <lst name="params">
      <str name="fl">foo:exists(id)</str>
      <str name="q">*:*</str>
      <str name="rows">1</str>
    </lst>
  </lst>
  <result name="response" numFound="1735" start="0">
    <doc><bool name="foo">true</bool></doc>
  </result>
</response>
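The error message itself names the two accepted forms. An untested sketch of how the request could look with the nested query moved into a dereferenced parameter (the parameter name idq is arbitrary, chosen for illustration):

```text
...q=*:*&fl=foo:exists(query($idq))&idq=id:3
```

The alternative {!v=value} form embeds the subquery inline instead of referencing a separate request parameter.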
Re: Tokenized keywords
Hi, probably my note is nonsense, but sometimes one is blind and not able to see simple things anymore. Is this query what you are looking for?

q=modified:(search+for+Laptops)&fl=original,modified

Sorry if my suggestion is too trivial. Uwe

On 21.01.2013 09:17, Romita Saha wrote: Hi, I have a field defined in schema.xml named 'original'. I first copy this field to 'modified' and apply filters on the field 'modified'.

<field name="original" type="string" indexed="true" stored="true"/>
<field name="modified" type="text_general" indexed="true" stored="true"/>
<copyField source="original" dest="modified"/>

I want to display in my response as follows: original: Search for all the Laptops; modified: search laptop. Thanks and regards, Romita Saha
Re: Missing documents with ConcurrentUpdateSolrServer (vs. HttpSolrServer) ?
Hi Mark, one entry in my long list of self-made problems is: doing the commit before the ConcurrentUpdateSolrServer has finished. Since the ConcurrentUpdateSolrServer is asynchronous, it's very easy to create a race condition. Make sure that your program waits (blockUntilFinished()) before it does the commit.

if (solrserver instanceof ConcurrentUpdateSolrServer) {
    ((ConcurrentUpdateSolrServer) solrserver).blockUntilFinished();
}

Uwe
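The race Uwe describes is generic to any asynchronous writer: add() only hands the document to a background queue, so a commit issued immediately afterwards can run before the queue is drained. A minimal language-neutral sketch of the pattern (plain Python with one worker thread standing in for ConcurrentUpdateSolrServer's internals; the class and method names are illustrative, not solrj API):

```python
import queue
import threading

class AsyncIndexer:
    """Toy stand-in for an asynchronous update client."""
    def __init__(self):
        self.indexed = []
        self._q = queue.Queue()
        threading.Thread(target=self._worker, daemon=True).start()

    def _worker(self):
        while True:
            doc = self._q.get()      # background thread drains the queue
            self.indexed.append(doc)
            self._q.task_done()

    def add(self, doc):
        self._q.put(doc)             # returns immediately, like an async add()

    def block_until_finished(self):
        self._q.join()               # wait until every queued doc was processed

indexer = AsyncIndexer()
for i in range(1000):
    indexer.add({"id": i})
indexer.block_until_finished()       # without this, a commit may miss documents
```

Skipping the block_until_finished() call is exactly the race: the commit would see however many documents the worker happened to have processed so far.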
Re: Missing documents with ConcurrentUpdateSolrServer (vs. HttpSolrServer) ?
Hi Shawn, don't panic. Due to 'historical' reasons, like comparing the different subclasses of SolrServer, I have an HttpSolrServer for queries and commits. I've never tried to use the CUSS for anything other than adding documents. As I wrote, it was a home-made problem and not a bug. Sometimes I hope not to be the only dumbass, and that others may be caught in the same trap. Uwe

On 17.01.2013 15:52, Shawn Heisey wrote: If you are using the same ConcurrentUpdateSolrServer object for all update interaction with Solr (including commits) and you still have to do the blockUntilFinished() in your own code before you issue an explicit commit, that sounds like a bug, and you should put all the details in a Jira issue.
Re: Results in same or different fields
Hi, maybe it helps to have a closer look at the other params of edismax: http://wiki.apache.org/solr/ExtendedDisMax#pf_.28Phrase_Fields.29. 'mm=2' would be too strong, but the usage of pf, pf2, and pf3 is likely your solution. Uwe

On 15.01.2013 10:15, Gastone Penzo wrote: Hi, I'm using Solr 4.0 with the edismax search handler. I'm searching inside 3 fields with the same boost. I'd like to have a higher score for results in the same field instead of results spread over different fields, e.g. with qf=title,description: if "white house" is found in title, it must have a higher score than "white" in the title field and "house" in the description field. How is this possible? P.S. I set omitTermFreqAndPositions=true for all fields. Thanks, Gastone Penzo
Re: theory of sets
On 08.01.2013 10:26, Uwe Reh wrote: OK, OK, I will try it again with dynamic fields.

NO! Dynamic fields are nice, but not for my problem. :-( I got more than *52* new fields. I was wrong, the impact on searching is really reasonable. But have you ever used the Admin's Schema Browser with that many fields? I suppose never; my installation (4.1) freezes Firefox while the JS job runs into a timeout. Most of all, I don't like it because having that many fields 'smells' to me. Before anyone asks the XY question: the index is intended for a library's catalog, and the quest is "Find all members of a series (e.g. Penguin Books, paperbacks) and order them by their sortkey." Unfortunately, titles may belong to several (sub)series with different sortkeys. Still seeking better approaches. Uwe
Re: POST query with non-ASCII to solr using httpclient wont work
Hi Jie, maybe there is a simple solution. When we used Tomcat as the servlet container for Solr, I noticed similar problems. Even with the hints from the Solr wiki about Unicode and Tomcat, I wasn't able to fix this. So we switched back to Jetty; queries like q=allfields2%3A能力 are reliable now. Uwe

BTW: I have no idea at all what these Japanese characters mean. So just let me append two of the 31 hits in our bibliographic catalog:

<doc>
  <str name="id">HEB052032124</str>
  <str name="raw_fullrecord">alg: 5203212 001@ $0205 001A $4:13-05-97 001B $t13:12:07.000$01999:10-06-10 001D $0:99-99-99 001U $0utf8 001X $00 002@ $0Aau 003@ $0052032124 007I $0NacsisBN09679884 010@ $ajpn 011@ $a1993 013H $0z 019@ $ajp 021A $ULatn$T01$aNōryoku kaihatsu no shisutemu$hYaguchi Hajime 021A $UJpan$T01$a@能力開発のシステム$h矢口新著 028A $ULatn$T01$9165745363$8Yaguchi, Hajime 028A $UJpan$T01$d新$a矢口 033A $ULatn$T01$pTokyo$nNōryoku Kaihatsu Kōgaku Sentaa 033A $UJpan$T01$p東久留米$n能力開発工学センター 034D $a274 S. 034M $aIll. 036E $aYaguchi Hajime senshū$l2 036F $l2$9052031527$8Yaguchi Hajime senshū$x12 037B $aSysteme zur Entwicklung der Fähigkeiten 046L $aIn japan. Schr. ... 247C/01 $9102595631$8351457-2 4/457Marburg, Universität Marburg, Bibliothek des Japan-Zentrums (BJZ)</str>
</doc>
<doc>
  <str name="id">HEB286840723</str>
  <str name="raw_fullrecord">alg: 28684072 001@ $03 001A $00030:04-01-12 001B $t22:29:11.000$01999:04-01-12 001C $t10:48:47.000$00030:04-01-12 001D $00030:04-01-12 001U $0utf8 001X $00 002@ $0Aau 003@ $0286840723 004A $A978-4-88319-546-6 007A $0286840723$aHEB 010@ $ajpn 011@ $a2010 021A $ULatn$T01$aShin kanzen masutā kanji nihongo nōryoku shiken ; N1$hIshii Reiko ... 021A $UJpan$T01$a新完全マスター漢字日本語能力試験 ; N1$h石井怜子 [ほか] 著 027A $ULatn$T01$aShin kanzen masutā kanji : nihongo nōryoku shiken ; enu ichi / Ishii Reiko ... 027A $UJpan$T01$a新完全マスター漢字 : 日本語能力試験 ; N1 / 石井怜子 [ほか] 著 028C $9230917593$8Ishii, Reiko 033A $ULatn$T01$pTōkyō$nSurīē nettowāku 033A $UJpan$T01$p東京$nスリーエーネットワーク 034D $aviii, 197, 21S. 034I $a26cm 044A $S4$aNihongokyōiku(Taigaikokujin) 045Z $aEI 4650 ... 247C/01 $9102599157$8601220-6 30/220Frankfurt, Universität Frankfurt, Institut für Orientalische und Ostasiatische Philologien, Japanologie</str>
</doc>
Re: retrieving latest document **only**
On 10.01.2013 11:54, jmozah wrote: I need a query that matches only the most recent ones... Because my stats depend on it. But I have a requirement to show **only** the latest documents and the stats along with them.

What do you want, 'the most recent ones' or '**only** the latest'? Perhaps a range query q=timestamp:[refdate TO NOW] will match your needs. Uwe
Re: Hotel Searches
Hi, maybe I'm thinking too simply again. Nevertheless, here is an idea to solve the question. The basic thought is to get rid of the range query. Have:
- a text field 'vacant_days': instead of ISO dates, just simple dates in the form MMDD
- a dynamic field 'price_*': you can add the tariff for Jan. 31st into 'price_0131'

To get the total, e.g. for Feb. 1st to Feb. 3rd, you could query for the days 0201, 0202 and 0203 and calculate the sum of the corresponding price fields:

q=vacant_days:0201 AND vacant_days:0202 AND vacant_days:0203&fl=_val_:sum(price_0201,price_0202,price_0203)

(not tested) Uwe

On 09.01.2013 07:08, Harshvardhan Ojha wrote: Hi Alex, Thanks for your reply. I saw prices based on date range using multipoints, but this is not my problem. Instead, the problem statement for me is pretty simple. Say I have 100 documents, each having tariff as a field.

Doc1: <doc><double name="tariff">2400.0</double></doc>
Doc2: <doc><double name="tariff">2500.0</double></doc>

Now a user's search should give me a total tariff. Desired result: <doc><double name="tariff">4900.0</double></doc>. And this could be any combination; for 100 docs it is (100*101)/2, i.e. N*(N+1)/2. How can I get these combinations of documents already indexed? Or is there any way to do calculations at runtime? How can I place the constraint that if any one doc is missing in a range, don't give me any result? (If a user asked for the hotel tariff from the 11th to the 13th, and I don't have a tariff for the 12th, I shouldn't add the 11th and 13th only.) Hope I made my problem very simple. Regards, Harshvardhan Ojha

-----Original Message----- From: Alexandre Rafalovitch [mailto:arafa...@gmail.com] Sent: Tuesday, January 08, 2013 6:12 PM To: solr-user@lucene.apache.org Subject: Re: Hotel Searches

Did you look at a conversation thread from 12 Dec 2012 on this list? Just go to the archives and search for 'hotel'. Hopefully that will give you something to work with. If you have any questions after that, come back with more specifics. Regards, Alex.
Personal blog: http://blog.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)

On Tue, Jan 8, 2013 at 7:18 AM, Harshvardhan Ojha harshvardhan.o...@makemytrip.com wrote: Sorry for that, we just spoiled that thread, so I posted my question in a fresh thread. The problem is indeed very simple. I have solr documents, which have all the required fields (from db), say DOC1, DOC2, DOC3 ... DOCn. Every document has a 1-night tariff, and I have 180 nights of tariffs. So a person can search for any combination of these 180 nights. Say a request came to me to give the total tariff for the 10th to the 15th of Jan 2013. Now I need to get a sum of the tariff field of 6 docs. So how can I keep this data indexed to avoid search-time calculation? And there are other dimensions of this data besides tariff. Hope this makes sense. Regards, Harshvardhan Ojha

-----Original Message----- From: Gora Mohanty [mailto:g...@mimirtech.com] Sent: Tuesday, January 08, 2013 5:37 PM To: solr-user@lucene.apache.org Subject: Re: Hotel Searches

On 8 January 2013 17:10, Harshvardhan Ojha harshvardhan.o...@makemytrip.com wrote: Hi All, Looking into finding a solution for hotel searches based on the below criteria [...]

Didn't you just post this on a separate thread, complete with some nonsensical follow-up from a colleague of yours? Please do not repost the same message over and over again. It is not clear what you are trying to achieve. What is the difference between a city and a hotel in your data? How is a person represented in your documents? Is it by the ID field? Are you looking to cache all possible combinations of ID, city, and startdate? If so, to what end? This smells like an XY problem: http://people.apache.org/~hossman/#xyproblem Regards, Gora
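Uwe's day-wise scheme, one token per vacant day plus one price field per day, can be sketched outside Solr to show the intended semantics (plain Python; the field names vacant_days/price_* follow the example above and the prices are made up):

```python
# One hotel "document": a set of vacant days (MMDD tokens) and per-day prices.
doc = {
    "vacant_days": {"0201", "0202", "0203"},
    "price": {"0201": 2400.0, "0202": 2500.0, "0203": 2600.0},
}

def total_tariff(doc, days):
    """Return the summed tariff, or None if any requested day is unavailable
    (the 'missing 12th' constraint from the thread)."""
    if not all(d in doc["vacant_days"] for d in days):
        return None
    return sum(doc["price"][d] for d in days)

print(total_tariff(doc, ["0201", "0202", "0203"]))  # 7500.0
print(total_tariff(doc, ["0203", "0204"]))          # None: 0204 is not vacant
```

The Solr query in Uwe's mail expresses exactly this: the AND over vacant_days enforces the all-days-available constraint, and sum() over the price_* fields produces the total.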
Re: theory of sets
OK, OK, I will try it again with dynamic fields. Maybe the problem has been something else; all statements sound reasonable. Even Lisheng's thoughts about the impact of too many fields on memory consumption should not be the problem for a JVM with 32G RAM and almost no GC. Please give me some time. Thanks, Uwe

On 08.01.2013 00:27, Zhang, Lisheng wrote: Hi, Just thought of this possibility: I think dynamic field is a solr concept; on the lucene level all fields are the same, but at initial startup lucene should load all field information into memory (not field data, but schema). If we have too many fields (like *_my_fields, * = a1, a2, ...), does this take too much memory and slow down performance (even if very few fields are really used)? Best regards, Lisheng

-----Original Message----- From: Upayavira [mailto:u...@odoko.co.uk] Sent: Monday, January 07, 2013 2:57 PM To: solr-user@lucene.apache.org Subject: Re: theory of sets

Dynamic fields resulted in poor response times? How many fields did each document have? I can't see how a dynamic field should have any difference from any other field in terms of response time. Or are you querying across a large number of dynamic fields concurrently? I can imagine that slowing things down. Upayavira

On 07.01.2013 17:40, Petersen, Robert wrote: Hi Uwe, We have hundreds of dynamic fields, but since most of our docs only use some of them it doesn't seem to be a performance drag. They can be viewed as a sparse matrix of fields in your indexed docs. Then if you make the sortinfo_for_groupx an int, that could be used in a function query to perform your sorting. See http://wiki.apache.org/solr/FunctionQuery
Re: fieldtype for name
Hi Michael, in our index of bibliographic metadata we see the need for at least three fields:
- name_facet: String as type, because the facet should represent the original inverted format from our data.
- name: TextField for searching. This field is heavily analyzed to match different orders, synonyms, phonetic similarity, German umlauts and other European stuff.
- name_lc: TextField. This field is just mapped to lower case. It's used to boost docs with the same style of writing as the user's input.
Uwe

On 08.01.2013 15:30, Michael Jones wrote: Hi, What would be the best fieldtype for a person's name? At the moment I'm using text_general, but if I search for "bob smith", some results I get back might be "rob thomas", in that it's matched 'ob'. But I only really want results that are either 'bob smith', 'bob, smith', 'smith, bob' or 'smith bob'. Thanks
Re: theory of sets (first solution)
Hi, I found my own hack. It's based on a free interpretation of the function strdist(). Have:
- one multivalued field 'part_of'
- one single-valued field 'groupsort'

Index each item:
  For each group membership:
    add the groupid to 'part_of'
    concat groupid and sortstring to a new string
    add this string to a CSV list
  End
  add the CSV list to 'groupsort'
End

Also have an own class that implements org.apache.lucene.search.spell.StringDistance, to generate a custom distance value. This class should:
- split the CSV list
- find the element/string that starts with the given group id
- translate the rest (sortstring) to a float value

.../select?q=part_of:X&sort=strdist(X, groupsort, FQN) asc

FQN is the fully qualified name of the own class. (Remember to place the jar in a 'lib' defined in solrconfig.xml, or add an own 'lib' entry.) Uwe (still looking for a smarter solution)
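The core of that custom StringDistance is just a lookup-and-decode step. A minimal sketch of the logic in plain Python (the CSV layout groupid:sortvalue and the numeric encoding of the sortstring are assumptions for illustration; the real class would implement Lucene's StringDistance interface in Java):

```python
def group_distance(group_id, groupsort_value):
    """Decode the per-group sort value from a CSV list like 'g1:0.25,g2:0.75'.

    The returned float serves as the 'distance', so sorting by it ascending
    orders the members of one group by their group-specific sortstring.
    """
    for entry in groupsort_value.split(","):
        gid, _, sortstring = entry.partition(":")
        if gid == group_id:
            return float(sortstring)  # assumed: sortstring pre-encoded as a number
    return float("inf")               # not a member of this group: sort last

print(group_distance("g2", "g1:0.25,g2:0.75"))  # 0.75
```

In the sort expression strdist(X, groupsort, FQN), the group id X arrives as the first argument and the stored CSV list as the second, which is exactly the pair this function consumes.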
Re: custom solr sort
On 06.01.2013 02:32, andy wrote: I want to customize solr sort and pass a solr param from the client to the solr server.

Hi Andy, not an answer to your question, but maybe another approach to solve your initial question. Instead of writing a new SearchComponent, I decided to (mis)use the function http://wiki.apache.org/solr/FunctionQuery#strdist. 'strdist' seems to have everything you need:
- a parameter 's1'
- a fieldname 's2'
- a slot to plug in your own algorithm
How to use this to sort on multivalued attributes, I've described on this list in the thread "theory of sets". Uwe
Re: Sorting on mutivalued fields still impossible?
Hi Jack, thank you for the hint. Since I already have a solrj client to do the preprocessing, mapping to sort fields isn't my problem. I will try to explain better in my reply to Erick. Uwe (sorry for the late reaction)

On 30.08.2012 16:04, Jack Krupansky wrote: You can also use a Field Mutating Update Processor to do a smart copy of a multi-valued field to a sortable single-valued field. See: http://wiki.apache.org/solr/UpdateRequestProcessor#Field_Mutating_Update_Processors Such as using the maximum value via MaxFieldValueUpdateProcessorFactory. See: http://lucene.apache.org/solr/api-4_0_0-BETA/org/apache/solr/update/processor/MaxFieldValueUpdateProcessorFactory.html Which value of a multi-valued field do you wish to sort by? -- Jack Krupansky
Re: Sorting on mutivalued fields still impossible?
On 31.08.2012 13:35, Erick Erickson wrote: ... what would the correct behavior be for sorting on a multivalued field?

Hi Erick, in general you are right: with multivalued fields the question is which value is the reference. But there are thousands of cases where this question is implicitly answered. See my example ...sort=max(datefield) desc. It is obvious that the newest date should win. I see no reason why simple functions like max can't handle multivalued fields. Now, four months later, I still wonder why there is no pluggable function to map multivalued fields onto a single value, e.g. ...sort=sqrt(mapMultipleToOne(FQN, fieldname)) asc. Uwe (sorry for the late reaction)
Re: Sorting on mutivalued fields still impossible?
Hi, as I just wrote in my reply to the similar suggestion from Jack: I'm not looking for a way to preprocess my data. My question is why I need two redundant fields to sort a multivalued field ('date_max' and 'date_min' for 'date'). For me it's just a waste of space, poisoning the fieldcache. There is also another class of problems where a filter function like 'mapMultipleToOne' may be helpful. In the thread 'theory of sets' (this list) I described a hack with the function strdist, an own class, and the mapping of multiple values as a CSV list in a single-valued field. Uwe

On 07.01.2013 14:54, Alexandre Rafalovitch wrote: If the multiple-to-one mapping would be stable (e.g. independent of a query), why not implement it as a custom update.chain processor with a copy to a separate field? There are already a couple of implementations under FieldValueMutatingUpdateProcessor (first, last, max, min). Regards, Alex.
Re: theory of sets
Hi Robi, thank you for the contribution. It's exciting to read that your index isn't contaminated by the number of fields. I can't exclude other mistakes, but my first experience with extensive use of dynamic fields was very poor response times. Even though I found another solution, I should give the straightforward solution a second chance. Uwe

On 07.01.2013 17:40, Petersen, Robert wrote: Hi Uwe, We have hundreds of dynamic fields, but since most of our docs only use some of them it doesn't seem to be a performance drag. They can be viewed as a sparse matrix of fields in your indexed docs. Then if you make the sortinfo_for_groupx an int, that could be used in a function query to perform your sorting. See http://wiki.apache.org/solr/FunctionQuery
Re: indexing cpu utilization
Hi Mark, SOLR-3929 rocks! A nigthly build of 4.1 with maxIndexingThreads configured to 24, takes 80% to 100% of the cpu resources :-) Thank you, Otis and Gora mpstat 10 CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl 00 0 13 607 241 234 78 10021 258 87 2 0 11 10 0 24 240 23 293 94 10931 286 86 1 0 13 20 0 12 367 181 268 83 10241 338 89 1 0 10 30 0 18 188 20 226 67 8651 243 87 1 0 13 40 05 205 22 255 74 10041 310 87 1 0 12 50 05 192 22 228 68 8840 260 89 1 0 10 60 0 15 223 23 278 86 10451 319 87 1 0 12 70 0 18 215 23 267 75 10451 321 85 1 0 14 80 04 253 21 272 64 11240 284 77 1 0 22 90 04 243 20 281 61 10830 300 79 1 0 20 100 02 234 22 272 56 11140 376 78 1 0 21 110 02 205 18 237 57 9640 297 82 1 0 17 120 03 251 24 273 59 11340 323 72 1 0 27 130 04 203 19 236 54 9120 294 82 1 0 17 140 04 245 21 288 54 11130 309 77 1 0 22 150 04 233 21 258 58 10630 280 80 1 0 19 160 05 286 19 346 60 13340 425 73 1 0 26 170 06 340 23 414 67 15140 500 67 1 0 31 180 07 343 23 435 67 15050 482 66 2 0 32 190 08 294 19 348 53 12850 444 70 1 0 29 200 06 309 21 385 64 13940 514 68 1 0 31 210 07 279 20 378 58 13330 471 69 1 0 30 220 06 249 18 329 50 12040 469 72 1 0 27 230 06 258 20 338 54 12730 388 70 1 0 28 240 06 400 20 608 146 18740 1071 75 3 0 22 250 04 375 20 550 134 17350 891 73 2 0 25 260 08 329 19 490 103 15250 856 75 2 0 23 270 07 341 22 489 107 16140 793 72 2 0 26 280 05 321 18 478 98 16230 793 75 2 0 23 290 04 283 18 399 84 13640 744 76 2 0 22 300 05 252 16 378 86 12730 620 79 2 0 20 310 05 277 16 447 96 14440 715 76 2 0 22
Re: indexing cpu utilization
Hi, thank you for the hints.

On 3 January 2013 05:55, Mark Miller markrmil...@gmail.com wrote: 32 cores eh? You probably have to raise some limits to take advantage of that.

32 cores isn't that much anymore. You can buy AMD servers from Supermicro with two sockets and 32G of RAM for less than 2500$. Systems with four sockets (64 cores) aren't unaffordable either. With some more money, one can think about the four-socket Oracle T4-4 system (4 * 8 cores * 8 vcores = 256).

You might always want to experiment with using more merge threads? I think the default may be 3.

I will try this. But I think Otis is right: it's rather SOLR-3929 than SOLR-4078.

Mark wanted to point to this other issue: https://issues.apache.org/jira/browse/SOLR-3929 though, so try that...

On 03.01.2013 05:20, Otis Gospodnetic wrote: I, too, was going to point out the number of threads, but was going to suggest using fewer of them, because the server has 32 cores and there was a mention of 100 threads being used from the client. Thus, my guess was that the machine is busy juggling threads and context switching (how's vmstat 2 output, Uwe?) instead of doing the real work.

"Use more threads" vs. "use fewer threads" is a bit confusing. I made some tests with 50 to 200 threads; within this range I noticed no real difference. 50 threads on the client seem to trigger enough threads on the server to saturate the bottleneck, and 200 client threads seem not to be destructive.
'vmstat 5' on the server, with 100 threads on the client:

 kthr      memory            page            disk          faults      cpu
 r b w   swap    free     re mf pi po fr de sr cd cd s0 s4   in   sy   cs us sy id
 0 0 0 13605380 17791928  0  7  0  0  0  0  0  0  0  0  0  3791 1638 1666 26  0 73
 1 0 0 13641072 17826368  0  8  0  0  0  0  0  0  0  0  0  3540 1305 1527 25  0 74
 0 0 0 13691908 17876364  0  8  0  0  0  0  0  0  0  0 48  3935 1453 1919 26  0 73
 0 0 0 13720208 17904652  0  4  0  0  0  0  0  0  0  0  0  3964 1342 1645 25  0 74
 0 0 0 13792440 17976868  0  9  0  0  0  0  0  0  0  0  0  3891 1551 1757 26  0 74
 1 0 0 13867128 18051532  0  4  0  0  0  0  0  0  0  0  0  3871 1430 1584 26  0 74
 1 0 0 13948796 18133184  0  6  0  0  0  0  0  0  0  0  0  3079 1218 1435 25  0 74

To see what's going on, I prefer 'mpstat 10' (100 client threads):

[mpstat output for CPUs 0-31: the column alignment was lost in the archive; the header was "CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl"]

No minor faults, no major faults, few cross-calls, a reasonable number of interrupts, only some migrations... This looks quite good to me. Do you see a pitfall?

The theory of sets
Hi,

I'm looking for a tricky solution to a common problem. I have to handle a lot of items, and each can be a member of several groups.

- OK, just add a field called 'member_of'.

No, that's not enough, because each group is sorted and each member has a sort string for each group it belongs to.

- OK, still easy: add a dynamic field 'sortinfo_for_*' and fill it for each group membership.

Yes, this works, but there are thousands of different groups, and that many dynamic fields are probably a serious performance issue.

- Well ...

I'm looking for a smart way to answer the question: find the members of group X and sort them by the sort string for this group. One idea I had was to fill the 'member_of' field with composed entries (groupname + '_' + sortstring). Finding the members is easy with wildcards, but there seems to be no way to use the sort string as a boost factor. Has anybody solved this problem? Any hints are welcome.

Uwe
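[Editorial sketch] The composed-entry idea from the mail above can be illustrated in plain Java. This is not Solr code, just the encoding scheme and the ordering it implies; the group and sort-string values are made up for illustration:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

public class GroupMembership {

    // Encode one membership as "<group>_<sortstring>", as proposed in the mail.
    static String encode(String group, String sortString) {
        return group + "_" + sortString;
    }

    // Given the composed 'member_of' entries of a result set, keep only those
    // of one group and order them by the embedded sort string.
    static List<String> membersSortedFor(String group, List<String> entries) {
        String prefix = group + "_";
        return entries.stream()
                .filter(e -> e.startsWith(prefix))
                .sorted(Comparator.comparing(e -> e.substring(prefix.length())))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> entries = new ArrayList<>(Arrays.asList(
                encode("groupX", "bravo"),
                encode("groupY", "alpha"),
                encode("groupX", "alpha")));
        System.out.println(membersSortedFor("groupX", entries));
        // prints [groupX_alpha, groupX_bravo]
    }
}
```

This only demonstrates the encoding and the desired order; doing that sort inside Solr instead of client-side is exactly the open question of the mail.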
Re: indexing cpu utilization
Hi,

while trying to optimize our indexing workflow I reached the same dead end that Gabriel Shen described in his mail: my Solr server won't utilize more than 40% of the computing power. I made some tests, but I'm not able to find the bottleneck. Could anybody help me solve this puzzle?

First, let me describe the environment.

Server:
- two-socket Opteron (Interlagos) = 32 cores
- 64 GB RAM (1600 MHz)
- SATA disks: spindle and SSD
- Solaris 5.11
- JRE 1.7.0
- Solr 4.0
- application server: Jetty
- 1 Gb network interface

Client:
- same hardware as the server
- either a multi-threaded SolrJ client using multiple instances of HttpSolrServer
- or a multi-threaded SolrJ client using a ConcurrentUpdateSolrServer with 100 threads

Problem:
- 10,000,000 docs of bibliographic data (~4k each)
- with a simplified schema definition it takes 10 hours to index = ~250 docs/second
- with the real schema.xml it takes 50 hours to index = ~50 docs/second

In both cases the client takes just 2% of the CPU resources and the server 35%. Obviously there is some optimization potential in the schema definition, but why does the server never use more than 40% of the CPU power?

Possible bottlenecks I have ruled out:
- RAM for the JVM: Solr takes only up to 12 GB of heap and there is just negligible GC activity, so increasing the possible heap from 16 GB to 32 GB made no difference.
- Bandwidth of the network: the transmitted data is identical in both cases, somewhat below 50 GB in size. Since both machines have a dedicated 1 Gb line to the switch, the raw transmission should not take much more than 10 minutes.
- Performance of the client: as above, the client is fast enough for the simplified case (10h). A dry run (just preprocessing, no indexing) finishes after 75 minutes.
- Server's disk I/O: the size of the simpler index is ~100 GB, the size of the other ~150 GB. That's a factor of 1.5, not 5. The difference between an SSD and a spinning disk is not noticeable.
The output of 'iostat' and 'zpool iostat' is unsuspicious.
- Bad thread distribution: 'mpstat' shows a well-distributed load over all CPUs and a sensible number of cross-calls (fewer than ten per CPU).
- Solr update parameters (solrconfig.xml): inspired by http://www.hathitrust.org/blogs/large-scale-search/forty-days-and-forty-nights-re-indexing-7-million-books-part-1 I'm using:

<ramBufferSizeMB>256</ramBufferSizeMB>
<mergeFactor>40</mergeFactor>
<termIndexInterval>1024</termIndexInterval>
<lockType>native</lockType>
<unlockOnStartup>true</unlockOnStartup>

Any changes to these parameters made it worse.

To get an idea of what's going on, I've done some profiling with VisualVM (see attachment). The distribution of real time and CPU time looks significant, but I'm not smart enough to interpret the results. The method org.apache.lucene.index.ThreadAffinityDocumentsWriterThreadPool.getAndLock() is active 80% of the time but takes only 1% of the CPU time. On the other hand, the method org.apache.commons.codec.language.bm.PhoneticEngine$PhonemeBuilder.append() is active 12% of the time and is always running on a CPU.

So, again, the question: when there are free resources in all dimensions, why does Solr not utilize more than 40% of the computing power? Bandwidth of the RAM?? I can't believe that. How could I verify it?

Any hints are welcome.
Uwe
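[Editorial sketch] The multi-threaded client pattern described in the mail above can be sketched generically with an ExecutorService. This is only the threading shape, not SolrJ code; the thread count and the dummy "add" stand in for the real HttpSolrServer/ConcurrentUpdateSolrServer calls:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class ParallelIndexer {

    /** Submit totalDocs dummy "add" calls to a fixed pool, wait for them
     *  to finish, and return how many were processed. */
    static long indexAll(int threads, int totalDocs) {
        AtomicLong indexed = new AtomicLong();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < totalDocs; i++) {
            // a real client would call solrServer.add(doc) here
            pool.submit(indexed::incrementAndGet);
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return indexed.get();
    }

    public static void main(String[] args) {
        // 100 threads, like the SolrJ client in the mail
        System.out.println("indexed " + indexAll(100, 10_000) + " docs");
    }
}
```

With a client like this, the symptom in the mail (client at 2% CPU, server capped at 35-40%) points at server-side contention rather than client throughput, which is consistent with the getAndLock() profile shown later in the thread.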
Re: indexing cpu utilization (attachment)
On 02.01.2013 22:39, Uwe Reh wrote:
> To get an idea of what's going on, I've done some statistics with VisualVM. (see attachment)

Damn, the list server strips attachments. You'll find the screenshot at http://fantasio.rz.uni-frankfurt.de/solrtest/HotSpot.gif

Uwe
Re: Where is ISOLatin1AccentFilterFactory (Solr4)?
Hi,

I like the best of both worlds:

<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-specials.txt"/>
  (masks some specials like "C++" to "cplusplus" or "C#" to "csharp" ...)
<tokenizer class="solr.ICUTokenizerFactory"/>
  (tokenizes and identifies on Unicode whitespace and character sets)
<filter class="solr.WordDelimiterFilterFactory"/>
  (the well-known splitter for compound words)
<filter class="solr.ICUFoldingFilterFactory"/>
  (a perfect superset of the ISOLatin1Accent.txt mapping or the ISOLatin1AccentFilterFactory, because it can handle composed and decomposed accents and umlauts)
<filter class="solr.CJKBigramFilterFactory"/>
  (a nice workaround for the missing whitespace as word separator in these languages)

On 01.01.2013 17:48, Jack Krupansky wrote:
> Hmmm... quite some time ago I switched from ASCIIFoldingFilterFactory to MappingCharFilterFactory, because I was told (by whom, I can't recall) that the latter was better/preferred. Is there any particular reason to favor one over the other?
>
> -Original Message- From: Erick Erickson
>> ASCIIFoldingFilterFactory is preferred, does that suit your needs?
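[Editorial sketch] The point about composed vs. decomposed accents can be illustrated with the JDK's own java.text.Normalizer. This is only a rough approximation of what ICUFoldingFilterFactory does (real ICU folding covers far more than diacritics), but it shows why a filter must handle both Unicode forms:

```java
import java.text.Normalizer;

public class AccentFolding {

    /** Decompose to NFD, then drop the combining diacritical marks.
     *  A rough stand-in for accent folding; not the actual filter code. */
    static String fold(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD)
                         .replaceAll("\\p{M}", "");
    }

    public static void main(String[] args) {
        String composed = "\u00e9cole";       // "école" with a precomposed é
        String decomposed = "e\u0301cole";    // "école" with e + combining acute
        System.out.println(fold(composed));   // prints "ecole"
        System.out.println(fold(decomposed)); // prints "ecole"
    }
}
```

Both spellings of "école" fold to the same token, which is exactly the property the mail credits ICUFoldingFilterFactory with, and the reason a pure mapping file of precomposed characters can miss the decomposed variants.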
Sorting on multivalued fields still impossible?
Hi,

just to be sure: there is still no way to sort by multivalued fields, e.g. ...sort=max(datefield) desc? Is there no smarter option than creating additional single-valued fields just for sorting, e.g. datefield_max and datefield_min?

Uwe
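[Editorial sketch] The workaround the mail mentions, deriving single-valued companion fields before indexing, can be sketched in plain Java. The field names are the ones from the mail; the date values are made up, and ISO-8601 strings are assumed so that lexicographic min/max equals chronological min/max:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class SortCompanions {

    /** Derive single-valued min/max companion fields from a multivalued
     *  date field before indexing, so the documents stay sortable. */
    static Map<String, String> companions(List<String> dateValues) {
        Map<String, String> fields = new TreeMap<>();
        fields.put("datefield_min", Collections.min(dateValues));
        fields.put("datefield_max", Collections.max(dateValues));
        return fields;
    }

    public static void main(String[] args) {
        List<String> dates = Arrays.asList("2008-01-01", "2012-06-15", "2010-03-03");
        System.out.println(companions(dates));
        // prints {datefield_max=2012-06-15, datefield_min=2008-01-01}
    }
}
```

The client then adds these two extra fields to each document at index time; sorting happens on the single-valued companions instead of the multivalued original.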
Re: Paoding analyzer with solr for chinese
Hi Rajani,

I'm not really familiar with this Paoding tokenizer, but it seems a bit old. We are using the CJKBigramFilter (as in the example of Solr 4.0 alpha), which should be equivalent or even better, and it works:

<analyzer>
  <tokenizer class="solr.ICUTokenizerFactory"/>
  <filter class="solr.WordDelimiterFilterFactory"/>
  <filter class="solr.ICUFoldingFilterFactory"/>
  <filter class="solr.CJKBigramFilterFactory"/>
</analyzer>

Uwe

On 09.08.2012 06:47, Rajani Maski wrote:
> Hi All, Any reply on this?
>
> On Wed, Aug 8, 2012 at 3:23 PM, Rajani Maski rajinima...@gmail.com wrote:
>> Hi All,
>>
>> As said in this blog post http://java.dzone.com/articles/indexing-chinese-solr, the Paoding analyzer is much better for Chinese text, so I was trying to use it to get accurate results for Chinese text. I followed the instructions in the sites below:
>> Site 1: http://androidyou.blogspot.hk/2010/05/chinese-tokenizerlibrary-paoding-with.html
>> Site 2: http://www.opensourceconnections.com/2011/12/23/indexing-chinese-in-solr/
>>
>> After indexing, when I search on the same field with the same text, there are no search results (numFound=0), and the Luke tool is not showing any terms for the field indexed with the field type below. Can anyone comment on what is going wrong?
>>
>> Schema field types for Paoding:
>>
>> 1) <fieldType name="paoding" class="solr.TextField" positionIncrementGap="100">
>>      <analyzer>
>>        <tokenizer class="test.solr.PaodingTokerFactory.PaoDingTokenizerFactory"/>
>>      </analyzer>
>>    </fieldType>
>>
>> And the analysis page result is: [inline image lost]
>>
>> 2) <fieldType name="paoding_chinese" class="solr.TextField">
>>      <analyzer class="net.paoding.analysis.analyzer.PaodingAnalyzer"/>
>>    </fieldType>
>>
>> Analysis on the field paoding_chinese throws this error: [inline image lost]
>>
>> Thanks & Regards
>> Rajani
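[Editorial sketch] The bigram approach recommended above works because overlapping two-character tokens need no dictionary at all. The core idea, not the Lucene implementation, looks roughly like this in plain Java:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class CjkBigrams {

    /** Emit overlapping 2-char tokens from a run of CJK text,
     *  e.g. "ABCD" -> [AB, BC, CD]. A single char passes through unchanged. */
    static List<String> bigrams(String run) {
        if (run.length() < 2) {
            return Collections.singletonList(run);
        }
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < run.length(); i++) {
            out.add(run.substring(i, i + 2));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(bigrams("\u4e2d\u534e\u4eba\u6c11"));
        // the four CJK chars of "中华人民" yield three overlapping bigrams
    }
}
```

Because every adjacent character pair becomes a token, any two-character word in the query matches some indexed bigram, which is why this is a workable substitute for dictionary-based segmenters like Paoding.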
Two questions on spellchecking
Hi,

even though I have read a lot, none of my spellchecker configurations works really well. I've reached a dead end. Maybe someone could help me solve these challenges:

- How can I get case-sensitive suggestions, independent of the case given in the query?
- How do I configure 'did you mean' spellchecking, as discussed in https://issues.apache.org/jira/browse/SOLR-2585 (Context-Sensitive Spelling Suggestions & Collations)?

I'm using the following environment:

- Solr 4.0-alpha (downloaded 25 June)
- Java 7

- schema.xml:

<fieldType name="textSuggest" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
...
<field name="suggest" type="textSuggest" indexed="true" stored="true" required="false" multiValued="true"/>

- solrconfig.xml (suggester):

<requestHandler name="/hint" class="org.apache.solr.handler.component.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">all</str>
    <str name="spellcheck">true</str>
    <str name="spellcheck.dictionary">suggester</str>
    <str name="spellcheck.extendedResults">true</str>
    <str name="spellcheck.onlyMorePopular">false</str>
    <str name="spellcheck.count">20</str>
  </lst>
  <arr name="components">
    <str>suggester</str>
  </arr>
</requestHandler>

<searchComponent name="suggester" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">suggester</str>
    <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
    <str name="lookupImpl">org.apache.solr.spelling.suggest.tst.TSTLookup</str>
    <str name="field">suggest</str>
  </lst>
</searchComponent>

- solrconfig.xml (spellcheck):

<requestHandler name="standard" class="solr.StandardRequestHandler" default="true">
  <lst name="defaults">
    <str name="echoParams">all</str>
    <int name="rows">10</int>
    <str name="df">allfields</str>
    <str name="spellcheck.extendedResults">true</str>
    <str name="spellcheck.onlyMorePopular">false</str>
    <str name="spellcheck.count">20</str>
  </lst>
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>

<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <str name="queryAnalyzerFieldType">textSpell</str>
  <lst name="spellchecker">
    <str name="name">default</str>
    <str name="field">suggest</str>
    <str name="classname">solr.DirectSolrSpellChecker</str>
    <str name="distanceMeasure">internal</str>
    <float name="accuracy">0.1</float>
    <int name="maxEdits">2</int>
    <int name="minPrefix">1</int>
    <int name="maxInspections">5</int>
    <int name="minQueryLength">1</int>
    <float name="maxQueryFrequency">0.1</float>
    <float name="thresholdTokenFrequency">0.001</float>
  </lst>
</searchComponent>

*Suggester problem*

With this configuration the suggester works case-insensitively, but the hints are all lower case. Example:

.../hint?q=da&wt=xml&spellcheck=true&spellcheck.build=true

<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">173</int>
    <lst name="params">
      <str name="spellcheck">true</str>
      <str name="echoParams">all</str>
      <str name="spellcheck.extendedResults">true</str>
      <str name="spellcheck.dictionary">suggester</str>
      <str name="spellcheck.count">20</str>
      <str name="spellcheck.onlyMorePopular">false</str>
      <str name="spellcheck">true</str>
      <str name="q">da</str>
      <str name="wt">xml</str>
      <str name="spellcheck.build">true</str>
    </lst>
  </lst>
  <str name="command">build</str>
  <lst name="spellcheck">
    <lst name="suggestions">
      <lst name="da">
        <int name="numFound">20</int>
        <int name="startOffset">0</int>
        <int name="endOffset">2</int>
        <arr name="suggestion">
          <str>dat-marktspiegel spezial</str>
          <str>data structures with c++ using stl</str>
          <str>data warehouse</str>
          <str>datan, ingeborg</str>
          <str>datenbanken mit delphi</str>
          <str>datenverschlüsselung</str>
          <str>dauner, gabriele</str>
          <str>dautermann, margit</str>
          <str>david copperfield</str>
          <str>david, horst</str>
          <str>david, leo</str>
          <str>david, nicholas</str>
          <str>davis, charles t.</str>
          <str>davis, edward l</str>
          <str>davis, leslie dorfman</str>
          <str>davis, stanley m.</str>
          <str>davor kommt noch</str>
          <str>davydova, irina n.</str>
          <str>dawidowski, bernd</str>
          <str>dayan, daniel</str>
        </arr>
      </lst>
      <bool name="correctlySpelled">false</bool>
    </lst>
  </lst>
</response>

Using just solr.StrField as the field type, the suggestions keep their original capitalization, but then I get no suggestions if the query starts with a lower-case character.
*Spelling problem*

One of the indexed entries in the field 'suggest' is "David Copperfield", and I want this string as an alternative suggestion to the query "David opperfield". Example:

.../select?q=david+opperfield&rows=0&wt=xml&spellcheck=true

<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">15</int>
    <lst name="params">
      <str name="df">allfields</str>
      <str name="echoParams">all</str>
      <str name="spellcheck.extendedResults">true</str>
      <str name="spellcheck.count">20</str>
      <str name="spellcheck.onlyMorePopular">false</str>
      <str name="rows">0</str>
      <str name="spellcheck">true</str>
      <str name="q">david opperfield</str>
      ...
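[Editorial sketch] The case problem in this thread comes from suggesting the analyzed (lower-cased) token instead of the stored original. The usual remedy is to match on a lower-cased copy but return the original string, which in Solr terms means suggesting from a stored/string field while matching case-insensitively. A minimal plain-Java sketch of that idea, not Solr's Suggester API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

public class CasePreservingSuggester {

    // lower-cased key -> original-cased entry
    private final TreeMap<String, String> index = new TreeMap<>();

    void add(String entry) {
        index.put(entry.toLowerCase(Locale.ROOT), entry);
    }

    /** Case-insensitive prefix match that returns the original capitalization. */
    List<String> suggest(String prefix, int count) {
        String p = prefix.toLowerCase(Locale.ROOT);
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, String> e : index.tailMap(p).entrySet()) {
            if (!e.getKey().startsWith(p) || out.size() == count) {
                break;
            }
            out.add(e.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        CasePreservingSuggester s = new CasePreservingSuggester();
        s.add("David Copperfield");
        s.add("Datenverschlüsselung");
        System.out.println(s.suggest("da", 10));
        // prints [Datenverschlüsselung, David Copperfield]
    }
}
```

The lookup key is always lower-cased (so "da", "Da" and "DA" behave identically), while the value keeps the original capitalization, which is exactly the combination the mail's textSuggest/StrField experiments each achieved only half of.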
Re: Can't find org.apache.solr.client.solrj.embedded
Sorry, I had inspected the ...core.jar three times without recognizing the package. I was really blind. =8-)

Thanks,
Uwe

On 26.07.2010 20:48, Chris Hostetter wrote:
> : where is a Jar, containing org.apache.solr.client.solrj.embedded?
>
> Classes in the embedded package are useless w/o the rest of the Solr internal core classes, so they are included directly in the apache-solr-core-1.4.1.jar.
>
> -Hoss
Can't find org.apache.solr.client.solrj.embedded
Hello experts,

where is a jar containing org.apache.solr.client.solrj.embedded? I can't find this package in 'apache-solr-solrj-1.4.[01].jar'. I also can't find any sources other than http://svn.apache.org/repos/asf/lucene/dev/trunk/solr/src/webapp/src/org/apache/solr/client/solrj/embedded/ , which do not fit Solr 1.4.

Any tips for a blind newbie?
Uwe