HOWTO get a working copy of SOLR?
Dear list, this sounds stupid, but how do I get a fully working copy of SOLR? What I have tried so far:
- Started with LucidWorks SOLR. Installs fine, runs fine, but has an old Tika version and can only handle some PDFs.
- Changed to SOLR trunk. Installs fine, runs fine, but Luke 1.0.1 complains about "Unknown format version: -10". I guess because Luke 1.0.1 is compiled against lucene-core-3.0.1.jar but trunk has lucene-core-4.0-dev.jar??? Anyway, no luck with this version.
- Changed to SOLR branch_3x. Installs fine, runs fine, Luke works fine, but extraction with /update/extract (ExtractingRequestHandler) only returns the metadata, not the content. No luck with this version either.
Is there any fully working recent copy at all? Or a Luke that works with SOLR trunk? Regards, Bernd
Re: HOWTO get a working copy of SOLR?
Sixten Otto wrote:
> On Tue, Jun 15, 2010 at 12:58 AM, Bernd Fehling bernd.fehl...@uni-bielefeld.de wrote:
>> - changed to SOLR branch_3x. Installs fine, runs fine, Luke works fine, but extraction with /update/extract (ExtractingRequestHandler) only returns the metadata, not the content.
> Sounds like https://issues.apache.org/jira/browse/SOLR-1902

Thanks for the hint.
- The trunk runs fine (at least on my system) but has no working Luke.
- The branch has a working Luke but doesn't extract the text.
What a pity. So is SOLR a serious development or just a playground or test case for Lucene? Why is Luke a separate tool and not combined / integrated with SOLR? Very strange... Bernd
Re: How to Debug Solr-Code in Eclipse?!
can nobody help me or want :D

As someone already said:
- install Eclipse
- add the Jetty Webapp plugin to Eclipse
- add an svn plugin to Eclipse
- check out the repository from trunk with svn
- change to the lucene dir and run "ant package"
- change to the solr dir and run "ant dist"
- set up a Jetty Webapp for solr with "Run configure..."
- start debugging :-)
If debugging below the solr level into the lucene level, just add the lucene src path to the debugging sources. Maybe you should read: http://www.lucidimagination.com/developers/articles/setting-up-apache-solr-in-eclipse Regards, Bernd
Re: Different analyzers for different documents in different languages?
Actually, this is one of the biggest disadvantages of Solr for multilingual content. Solr is field based, which means you have to know the language _before_ you feed the content to a specific field and process the content for that field. This results in having separate fields for each language. E.g. for Europe this will be 24 to 26 languages for each title, keyword, description, ... I guess when they started with Lucene/Solr they never had multilingual content in mind. The alternative is to have a separate index for each language. There you also have to know the language of the content _before_ feeding it to the core; e.g. again for Europe you end up with 24 to 26 cores. Another option is to treat the multilingual fields (title, keywords, description, ...) as a subdocument: write a filter class as a subpipeline, use language and encoding detection as the first step in that pipeline, then do all other linguistic processing within that pipeline and return the processed content back to the field for further filtering and storing. Many solutions, but nothing out of the box :-) Bernd

On 22.09.2010 12:01, Andy wrote: I have documents that are in different languages. There's a field in the documents specifying what language it's in. Is it possible to index the documents such that, based on what language a document is in, a different analyzer will be used on that document? What is the normal way to handle documents in different languages? Thanks Andy
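For illustration, a minimal schema.xml sketch of the field-per-language layout described above (all type and field names are made up; only the English type is spelled out, text_de/text_fr would be analogous with their own stemmers):

  <fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.SnowballPorterFilterFactory" language="English"/>
    </analyzer>
  </fieldType>
  <!-- one field per language, repeated for title, keywords, description, ... -->
  <field name="title_en" type="text_en" indexed="true" stored="true"/>
  <field name="title_de" type="text_de" indexed="true" stored="true"/>
  <field name="title_fr" type="text_fr" indexed="true" stored="true"/>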
Re: Migrating to Solr
Hi list, is this true — is there no downloadable copy of the documentprocessor available anywhere? Regards, Bernd

Bernd Fehling wrote: Was anyone able to get a copy of http://sesat.no/svn/sesat-documentprocessor/ ? Unfortunately it is offline. Would be pleased to get a copy. Regards, Bernd
Re: Query regarding solr custom sort order
Hi, I suggest using the following fieldType for your field:

  <fieldType name="sint" class="solr.SortableIntField" sortMissingLast="true" omitNorms="true"/>

Regards Bernd

On 04.01.2012 14:40, umaswayam wrote: Hi, we want to sort our records based on a sequence like 1 2 3 4 5 6 7 8 9 10 11 12 13 14. I am using WebSphere Commerce to retrieve data using solr. When we customize the sort order/option in the wc-search.xml file we get the sort order 1 10 11 12 13 14 2 3 4 5 6 7 8 9. I guess the sort compares the first character of each value and, if they are equal, moves on to the next character, and so on, which results in wrong (lexicographic instead of numeric) sort output. Can anyone put some thoughts on this or help me out if I am doing something wrong here? Thanks in advance, Uma Shankar
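With that type applied to a field (hypothetically named "sequence" here), the sort should come back numeric rather than lexicographic — a sketch:

  <field name="sequence" type="sint" indexed="true" stored="true"/>

  http://localhost:8983/solr/select?q=*:*&sort=sequence+asc
  returns 1 2 3 4 5 6 7 8 9 10 11 12 13 14 (not 1 10 11 ... 2 3 ...)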
Re: Query regarding solr custom sort order
Hi Uma, I don't understand what you're looking for. Do you need to sort on fields of type double with precision 2, or what? In your example you were talking about 1 2 3 4 5 6 7 8 9 10 11 12 13 14. Regards, Bernd

On 06.01.2012 07:11, umaswayam wrote: Hi Bernd, the column which comes from the database is a string, which is populated by default. How do I convert it to double, as the format in the database is 1.00, 2.00, 3.00? So I need it to be converted to double. Thanks, Uma Shankar
exception while loading with DIH multi-threaded
Hi list, after changing DIH to multi-threaded (4 threads) I sometimes get an exception. This is not always the case, and I never had any problems single-threaded at all. I'm using Solr 3.5 but also tried branch_3x (3.6) and could see this with both versions. Don't know why this comes up after changing to multi-threaded; no other errors at all. This happens when LogUpdateProcessor finishes and is going to create the log message. What's wrong with this code?

  public String getName(int idx) {
    return (String)nvPairs.get(idx << 1);
  }

Any idea how to trace this down?

...
11.01.2012 11:25:52 org.apache.solr.handler.dataimport.SolrWriter persist
INFO: Wrote last indexed time to /srv/www/solr/solr/solrserver/solr/./conf/dataimport.properties
11.01.2012 11:25:52 org.apache.solr.common.SolrException log
SEVERE: Full Import failed:java.lang.ClassCastException: java.util.ArrayList cannot be cast to java.lang.String
  at org.apache.solr.common.util.NamedList.getName(NamedList.java:127)
  at org.apache.solr.common.util.NamedList.toString(NamedList.java:253)
  at java.lang.String.valueOf(String.java:2826)
  at java.lang.StringBuilder.append(StringBuilder.java:115)
  at org.apache.solr.update.processor.LogUpdateProcessor.finish(LogUpdateProcessorFactory.java:188)
  at org.apache.solr.update.processor.UpdateRequestProcessor.finish(UpdateRequestProcessor.java:78)
  at org.apache.solr.handler.dataimport.SolrWriter.finish(SolrWriter.java:133)
  at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:213)
  at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:359)
  at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:427)
  at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:408)
11.01.2012 11:25:52 org.apache.solr.update.DirectUpdateHandler2 rollback
INFO: start rollback
11.01.2012 11:25:52 org.apache.solr.update.DirectUpdateHandler2 rollback
INFO: end_rollback
11.01.2012 11:26:07 org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/select params={command=status&qt=/dataimport} status=0 QTime=0
11.01.2012 11:26:08 org.apache.solr.update.DirectUpdateHandler2 commit
INFO: start commit(optimize=true,waitFlush=false,waitSearcher=true,expungeDeletes=false)
...

Regards Bernd
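For readers wondering about the quoted getName(): NamedList stores names and values interleaved in one flat list, so idx << 1 is just idx * 2. A small self-contained Java sketch of that layout (the concurrent-corruption explanation is an assumption based on the trace, not confirmed):

  import java.util.ArrayList;
  import java.util.List;

  public class FlatPairs {
      public static void main(String[] args) {
          // NamedList-style flat storage: [name0, value0, name1, value1, ...]
          List<Object> nvPairs = new ArrayList<Object>();
          nvPairs.add("status"); nvPairs.add("idle");   // pair 0
          nvPairs.add("docs");   nvPairs.add(42);       // pair 1
          // getName(idx) reads the even slot idx << 1
          String name = (String) nvPairs.get(1 << 1);   // prints "docs"
          System.out.println(name);
          // If another thread shifts entries mid-update, a value (e.g. an
          // ArrayList) can land on an even slot, and the (String) cast then
          // throws exactly the ClassCastException seen in the log above.
      }
  }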
Re: exception while loading with DIH multi-threaded
After browsing through the issues it looks like something belonging to https://issues.apache.org/jira/browse/SOLR-2694

On 11.01.2012 14:08, Bernd Fehling wrote: Hi list, after changing DIH to multi-threaded (4 threads) I sometimes get an exception. This is not always the case, and I never had any problems single-threaded at all. I'm using Solr 3.5 but also tried branch_3x (3.6) and could see this with both versions. Don't know why this comes up after changing to multi-threaded; no other errors at all. This happens when LogUpdateProcessor finishes and is going to create the log message. What's wrong with this code?

  public String getName(int idx) {
    return (String)nvPairs.get(idx << 1);
  }

Any idea how to trace this down?

...
11.01.2012 11:25:52 org.apache.solr.handler.dataimport.SolrWriter persist
INFO: Wrote last indexed time to /srv/www/solr/solr/solrserver/solr/./conf/dataimport.properties
11.01.2012 11:25:52 org.apache.solr.common.SolrException log
SEVERE: Full Import failed:java.lang.ClassCastException: java.util.ArrayList cannot be cast to java.lang.String
  at org.apache.solr.common.util.NamedList.getName(NamedList.java:127)
  at org.apache.solr.common.util.NamedList.toString(NamedList.java:253)
  at java.lang.String.valueOf(String.java:2826)
  at java.lang.StringBuilder.append(StringBuilder.java:115)
  at org.apache.solr.update.processor.LogUpdateProcessor.finish(LogUpdateProcessorFactory.java:188)
  at org.apache.solr.update.processor.UpdateRequestProcessor.finish(UpdateRequestProcessor.java:78)
  at org.apache.solr.handler.dataimport.SolrWriter.finish(SolrWriter.java:133)
  at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:213)
  at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:359)
  at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:427)
  at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:408)
11.01.2012 11:25:52 org.apache.solr.update.DirectUpdateHandler2 rollback
INFO: start rollback
11.01.2012 11:25:52 org.apache.solr.update.DirectUpdateHandler2 rollback
INFO: end_rollback
11.01.2012 11:26:07 org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/select params={command=status&qt=/dataimport} status=0 QTime=0
11.01.2012 11:26:08 org.apache.solr.update.DirectUpdateHandler2 commit
INFO: start commit(optimize=true,waitFlush=false,waitSearcher=true,expungeDeletes=false)
...

Regards Bernd

--
*
Bernd Fehling                     Universitätsbibliothek Bielefeld
Dipl.-Inform. (FH)                Universitätsstr. 25
Tel. +49 521 106-4060             33615 Bielefeld
Fax. +49 521 106-4052
bernd.fehl...@uni-bielefeld.de

BASE - Bielefeld Academic Search Engine - www.base-search.net
*
Re: exception while loading with DIH multi-threaded
Hi Mikhail, thanks for pointing me to the issue. Regards, Bernd

On 11.01.2012 21:47, Mikhail Khludnev wrote: FYI, it's https://issues.apache.org/jira/browse/SOLR-2804 — I'm trying to address it.

On Wed, Jan 11, 2012 at 5:49 PM, Bernd Fehling bernd.fehl...@uni-bielefeld.de wrote: After browsing through the issues it looks like something belonging to https://issues.apache.org/jira/browse/SOLR-2694

On 11.01.2012 14:08, Bernd Fehling wrote: Hi list, after changing DIH to multi-threaded (4 threads) I sometimes get an exception. This is not always the case, and I never had any problems single-threaded at all. I'm using Solr 3.5 but also tried branch_3x (3.6) and could see this with both versions. Don't know why this comes up after changing to multi-threaded; no other errors at all. This happens when LogUpdateProcessor finishes and is going to create the log message. What's wrong with this code?

  public String getName(int idx) {
    return (String)nvPairs.get(idx << 1);
  }

Any idea how to trace this down?

...
11.01.2012 11:25:52 org.apache.solr.handler.dataimport.SolrWriter persist
INFO: Wrote last indexed time to /srv/www/solr/solr/solrserver/solr/./conf/dataimport.properties
11.01.2012 11:25:52 org.apache.solr.common.SolrException log
SEVERE: Full Import failed:java.lang.ClassCastException: java.util.ArrayList cannot be cast to java.lang.String
  at org.apache.solr.common.util.NamedList.getName(NamedList.java:127)
  at org.apache.solr.common.util.NamedList.toString(NamedList.java:253)
  at java.lang.String.valueOf(String.java:2826)
  at java.lang.StringBuilder.append(StringBuilder.java:115)
  at org.apache.solr.update.processor.LogUpdateProcessor.finish(LogUpdateProcessorFactory.java:188)
  at org.apache.solr.update.processor.UpdateRequestProcessor.finish(UpdateRequestProcessor.java:78)
  at org.apache.solr.handler.dataimport.SolrWriter.finish(SolrWriter.java:133)
  at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:213)
  at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:359)
  at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:427)
  at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:408)
11.01.2012 11:25:52 org.apache.solr.update.DirectUpdateHandler2 rollback
INFO: start rollback
11.01.2012 11:25:52 org.apache.solr.update.DirectUpdateHandler2 rollback
INFO: end_rollback
11.01.2012 11:26:07 org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/select params={command=status&qt=/dataimport} status=0 QTime=0
11.01.2012 11:26:08 org.apache.solr.update.DirectUpdateHandler2 commit
INFO: start commit(optimize=true,waitFlush=false,waitSearcher=true,expungeDeletes=false)
...

Regards Bernd
Re: Synonym configuration not working?
Yes and no. If you use the synonyms functionality out of the box you have to do it at index time. But if you use it at query time, like we do, you have to do some programming. We have connected a thesaurus which actually uses the synonyms functionality at query time. There are some pitfalls to take care of. Bernd

On 15.01.2012 07:07, Michael Lissner wrote: Just replying for others in the future. The answer to this is to do synonyms at index time, not at query time. Mike

On Fri 06 Jan 2012 02:35:23 PM PST, Michael Lissner wrote: I'm trying to set up some basic synonyms. The one I've been working on is: us, usa, united states. My understanding is that adding that to the synonym file will allow users to search for US and get back documents containing usa or united states. Ditto if a user puts in usa or united states. Unfortunately, with this in place, when I do a search I get the results for items that contain all three of the words — it's doing an AND of the synonyms rather than an OR. If I turn on debugging, this is indeed what I see (plus some stemming):

(+DisjunctionMaxQuery(((westCite:us westCite:usa westCite:unit) | (text:us text:usa text:unit) | (docketNumber:us docketNumber:usa docketNumber:unit) | ((status:us status:usa status:unit)^1.25) | (court:us court:usa court:unit) | (lexisCite:us lexisCite:usa lexisCite:unit) | ((caseNumber:us caseNumber:usa caseNumber:unit)^1.25) | ((caseName:us caseName:usa caseName:unit)^1.5/no_coord

Am I doing something wrong to cause this? My defaultOperator is set to AND, but I'd expect the synonym filter to understand that. Any help? Thanks, Mike
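For reference, a minimal index-time synonym setup as recommended above (a sketch; the field type name is made up):

  <!-- synonyms.txt contains e.g.:  us, usa, united states -->
  <fieldType name="text_syn" class="solr.TextField">
    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
              ignoreCase="true" expand="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <!-- no synonym filter at query time, so the query stays a single term -->
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>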
SolrException with branch_3x
On January 11th I checked out branch_3x with svn into Eclipse (Indigo). Compiled and tested it without problems. Today I updated my branch_3x from the repository. It compiles fine but now throws a SolrException on startup.

Jan 31, 2012 1:50:15 PM org.apache.solr.core.SolrCore initListeners
INFO: [] Added SolrEventListener for firstSearcher: org.apache.solr.core.QuerySenderListener{queries=[{q=*:*,start=0,rows=10,spellcheck.build=true}, {q=(text:(*:*).
Jan 31, 2012 2:00:10 PM org.apache.solr.common.SolrException log
SEVERE: org.apache.solr.common.SolrException: QueryResponseWriter init failure
  at org.apache.solr.core.SolrCore.initWriters(SolrCore.java:1499)
  at org.apache.solr.core.SolrCore.<init>(SolrCore.java:557)
  at org.apache.solr.core.CoreContainer.create(CoreContainer.java:466)
  at org.apache.solr.core.CoreContainer.load(CoreContainer.java:319)
...

It isn't able to init the QueryResponseWriter on startup :-( My config hasn't changed in 3 weeks, and I can't find any issue in CHANGES.txt belonging to this. And something else to mention, in SolrCore.java initWriters at lines 1491 to 1495:

  if (info.isDefault()) {
    defaultResponseWriter = writer;
    if (defaultResponseWriter != null)
      log.warn("Multiple default queryResponseWriter registered ignoring: " + old.getClass().getName());
  }

This will also log.warn for the first defaultResponseWriter. I would place "defaultResponseWriter = writer;" _AFTER_ the if/log.warn. Regards, Bernd
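The suggested reordering, spelled out as a sketch (it reuses the variables of the quoted method and is not a tested patch):

  if (info.isDefault()) {
    // warn only when a default writer was already set ...
    if (defaultResponseWriter != null)
      log.warn("Multiple default queryResponseWriter registered ignoring: "
               + old.getClass().getName());
    // ... and assign AFTER the check, so the first default never warns
    defaultResponseWriter = writer;
  }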
SOLVED: SolrException with branch_3x
After changing the lines as suggested below and recompiling, branch_3x now runs fine. The SolrException is gone. Regards, Bernd

On 31.01.2012 14:21, Bernd Fehling wrote: On January 11th I checked out branch_3x with svn into Eclipse (Indigo). Compiled and tested it without problems. Today I updated my branch_3x from the repository. It compiles fine but now throws a SolrException on startup.

Jan 31, 2012 1:50:15 PM org.apache.solr.core.SolrCore initListeners
INFO: [] Added SolrEventListener for firstSearcher: org.apache.solr.core.QuerySenderListener{queries=[{q=*:*,start=0,rows=10,spellcheck.build=true}, {q=(text:(*:*).
Jan 31, 2012 2:00:10 PM org.apache.solr.common.SolrException log
SEVERE: org.apache.solr.common.SolrException: QueryResponseWriter init failure
  at org.apache.solr.core.SolrCore.initWriters(SolrCore.java:1499)
  at org.apache.solr.core.SolrCore.<init>(SolrCore.java:557)
  at org.apache.solr.core.CoreContainer.create(CoreContainer.java:466)
  at org.apache.solr.core.CoreContainer.load(CoreContainer.java:319)
...

It isn't able to init the QueryResponseWriter on startup :-( My config hasn't changed in 3 weeks, and I can't find any issue in CHANGES.txt belonging to this. And something else to mention, in SolrCore.java initWriters at lines 1491 to 1495:

  if (info.isDefault()) {
    defaultResponseWriter = writer;
    if (defaultResponseWriter != null)
      log.warn("Multiple default queryResponseWriter registered ignoring: " + old.getClass().getName());
  }

This will also log.warn for the first defaultResponseWriter. I would place "defaultResponseWriter = writer;" _AFTER_ the if/log.warn. Regards, Bernd
Re: usage of /etc/jetty.xml when debugging Solr in Eclipse
Hi, from run-jetty-run issue #9: "... In the VM Arguments of your launch configuration set -Drjrxml=./jetty.xml. If jetty.xml is in the root of your project it will be used (you can also use a fully qualified path name). The UI port, context and WebApp dir are ignored, since you can define them in jetty.xml. Note: you still have to specify a valid WebApp dir because there are other checks that the plugin performs. ..." Or you can start solr with jetty as usual and then connect Eclipse to the running process. Regards

On 08.02.2012 12:24, jmlucjav wrote: Hi, I am following http://www.lucidimagination.com/devzone/technical-articles/setting-apache-solr-eclipse in order to be able to debug Solr in Eclipse. I got it working fine. Now, I usually use ./etc/jetty.xml to set the logging configuration. When starting jetty in Eclipse I don't see any log files created, so I guessed jetty.xml is not being used. So I added it to the RunJetty Advanced configuration (Additional jetty.xml), but in that case something goes wrong, as I get a 'java.net.BindException: Address already in use: JVM_Bind' error, as if something is started twice. So my question is: can jetty.xml be used while debugging in Eclipse? If so, how? I would like to use the same configuration I use when I am just changing xml stuff in Solr and starting with 'java -jar start.jar'. Thanks in advance
Re: need to support bi-directional synonyms
Use a single comma-separated line: sprayer, washer
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory
Regards Bernd

On 23.02.2012 07:03, remi tassing wrote: Same question here... On Wednesday, February 22, 2012, geeky2 gee...@hotmail.com wrote: hello all, i need to support the following: if the user enters sprayer in the desc field, then they get results for BOTH sprayer and washer. And in the other direction: if the user enters washer in the desc field, then they get results for BOTH washer and sprayer. Would I set up my synonym file like this (assuming expand = true)? sprayer => washer, washer => sprayer. thank you, mark
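A minimal synonyms.txt sketch of that suggestion — with expand="true" in the filter config, a comma-separated group expands every member to all members, so it covers both directions without two "=>" rules:

  # bidirectional: sprayer matches washer and vice versa
  sprayer, washer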
Re: [SolrCloud] leaking file descriptors
What is netstat telling you about the connections on the servers? Any connections hanging in CLOSE_WAIT (passive close)? I saw this on my servers last week. I used a little program to spoof a local connection on those server ports and was able to trick the TCP stack into closing those connections. It also immediately released all open fds marked DEL and cleaned everything up without restarting. Regards Bernd

On 01.03.2012 11:36, Markus Jelsma wrote: Hi, yesterday we had an issue with too many open files, which was solved because a username was misspelled. But there is still a problem with open files. We cannot successfully index a few million documents from MapReduce to a 5-node Solr cloud cluster. One of the problems is that after a while ClassNotFoundErrors and other similar weirdness begin to appear. This will not solve itself if indexing is stopped. With lsof I found that Solr keeps roughly 9k files open 8 hours after indexing failed. Of the 9k, roughly 7.5k are deleted files that still have a file descriptor open for the tomcat6 user; these are all segment files:

/opt/solr/openindex_a/data/index.20120228101550/_34s.tvd
java 10049 tomcat6 DEL REG 9,0 515607
/opt/solr/openindex_a/data/index.20120228101550/_34s.tvx
java 10049 tomcat6 DEL REG 9,0 515504
/opt/solr/openindex_a/data/index.20120228101550/_34s.fdx
java 10049 tomcat6 DEL REG 9,0 515735
/opt/solr/openindex_a/data/index.20120228101550/_34s_nrm.cfs
java 10049 tomcat6 DEL REG 9,0 515595
/opt/solr/openindex_a/data/index.20120228101550/_34v_nrm.cfs
java 10049 tomcat6 DEL REG 9,0 515592
/opt/solr/openindex_a/data/index.20120228101550/_34v_0.tim
java 10049 tomcat6 DEL REG 9,0 515591
/opt/solr/openindex_a/data/index.20120228101550/_34v_0.prx
java 10049 tomcat6 DEL REG 9,0 515590
/opt/solr/openindex_a/data/index.20120228101550/_34v_0.frq
and many more

Did I misconfigure anything? This is a pretty standard (no changes to the IndexDefaults section) and recent Solr trunk revision. Is there a bug somewhere? Thanks, Markus
CLOSE_WAIT connections
Hi list, I have looked into the CLOSE_WAIT problem and created an issue with a patch to fix it. A search for CLOSE_WAIT shows that many Apache projects are hit by this problem. https://issues.apache.org/jira/browse/SOLR-3280 Can someone recheck the patch (it belongs to SnapPuller) and give the OK for release? The patch is against branch_3x (3.6). Regards Bernd
Re: [Announce] Solr 4.0 with RankingAlgorithm 1.4.1, NRT now supports both RankingAlgorithm and Lucene
Nothing against RankingAlgorithm and your work, which sounds great, but I think that YOUR "Solr 4.0" might confuse some Solr users and/or newbies. As far as I know, the next official release will be 3.6. So is your Solr 4.0 a trunk snapshot or what? If so, which revision number? Or have you done a fork and produced a stable Solr 4.0 of your own? Regards Bernd

On 29.03.2012 15:49, Nagendra Nagarajayya wrote: I am very excited to announce the availability of Solr 4.0 with RankingAlgorithm 1.4.1 (NRT support) (build 2012-03-19). The NRT implementation now supports both RankingAlgorithm and Lucene. RankingAlgorithm 1.4.1 has improved performance over the earlier release (1.4), supports the entire Lucene query syntax, +/- and/or boolean queries, and is compatible with the new Lucene 4.0 API. You can get more information about NRT performance from here: http://solr-ra.tgels.org/wiki/en/Near_Real_Time_Search_ver_4.x You can download Solr 4.0 with RankingAlgorithm 1.4.1 from here: http://solr-ra.tgels.org Please download and give the new version a try. Regards, Nagendra Nagarajayya http://solr-ra.tgels.org http://rankingalgorithm.tgels.org
Re: solr 3.5 taking long to index
There were some changes in solrconfig.xml between Solr 3.1 and Solr 3.5. Always read CHANGES.txt when switching to a new version. Also helpful is comparing both versions of solrconfig.xml from the examples. Are you sure you need a MaxPermSize of 5g? Use jvisualvm to see what you really need. The same goes for all other JAVA_OPTS.

On 11.04.2012 19:42, Rohit wrote: We recently migrated from Solr 3.1 to Solr 3.5; we have one master and one slave configured. The master has two cores: 1) Core1 - 44555972 documents, 2) Core2 - 29419244 documents. We commit every 5000 documents, but lately the commit is taking very long, 15 minutes plus in some cases. What could have caused this? I have checked the logs and the only warning I can see is:

WARNING: Use of deprecated update request parameter update.processor detected. Please use the new parameter update.chain instead, as support for update.processor will be removed in a later version.

Memory details:

export JAVA_OPTS="$JAVA_OPTS -Xms6g -Xmx36g -XX:MaxPermSize=5g"

Solr config:

  <useCompoundFile>false</useCompoundFile>
  <mergeFactor>10</mergeFactor>
  <ramBufferSizeMB>32</ramBufferSizeMB>
  <!-- <maxBufferedDocs>1000</maxBufferedDocs> -->
  <maxFieldLength>1</maxFieldLength>
  <writeLockTimeout>1000</writeLockTimeout>
  <commitLockTimeout>1</commitLockTimeout>

What could be causing this, as everything was running fine a few days back? Regards, Rohit. Mobile: +91-9901768202. About Me: http://about.me/rohitg
Re: Lexical analysis tools for German language data
You might have a look at: http://www.basistech.com/lucene/

On 12.04.2012 11:52, Michael Ludwig wrote: Given an input of "Windjacke" (probably "wind jacket" in English), I'd like the code that prepares the data for the index (tokenizer etc.) to understand that this is a "Jacke" (jacket), so that a query for "Jacke" would include the "Windjacke" document in its result set. It appears to me that such an analysis requires a dictionary-backed approach, which doesn't have to be perfect at all; a list of the most common 2000 words would probably do the job and fulfil a criterion of reasonable usefulness. Do you know of any implementation techniques or working implementations for this kind of lexical analysis of German language data? (Or other languages, for that matter?) What are they, and where can I find them? I'm sure there is something out there (commercial or free) because I've seen lots of engines grokking German and the way it builds words. Failing that, what are the proper terms to refer to these techniques so one can search more successfully? Michael
Re: Lexical analysis tools for German language data
Paul, nearly two years ago I requested an evaluation license and tested BASIS Tech Rosette for Lucene/Solr. It worked excellently, but the price was much too high. Yes, they also have compound analysis for several languages, including German. Just configure your pipeline in solr and set up the processing pipeline in Rosette Language Processing (RLP), and that's it. Example from my very old schema.xml config:

  <fieldtype name="text_rlp" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="com.basistech.rlp.solr.RLPTokenizerFactory"
                 rlpContext="solr/conf/rlp-index-context.xml"
                 postPartOfSpeech="false"
                 postLemma="true"
                 postStem="true"
                 postCompoundComponents="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="com.basistech.rlp.solr.RLPTokenizerFactory"
                 rlpContext="solr/conf/rlp-query-context.xml"
                 postPartOfSpeech="false"
                 postLemma="true"
                 postCompoundComponents="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
  </fieldtype>

So you just point the tokenizer to RLP and have two RLP pipelines configured, one for indexing (rlp-index-context.xml) and one for querying (rlp-query-context.xml). Example from my rlp-index-context.xml config:

  <contextconfig>
    <properties>
      <property name="com.basistech.rex.optimize" value="false"/>
      <property name="com.basistech.ela.retokenize_for_rex" value="true"/>
    </properties>
    <languageprocessors>
      <languageprocessor>Unicode Converter</languageprocessor>
      <languageprocessor>Language Identifier</languageprocessor>
      <languageprocessor>Encoding and Character Normalizer</languageprocessor>
      <languageprocessor>European Language Analyzer</languageprocessor>
      <!--
      <languageprocessor>Script Region Locator</languageprocessor>
      <languageprocessor>Japanese Language Analyzer</languageprocessor>
      <languageprocessor>Chinese Language Analyzer</languageprocessor>
      <languageprocessor>Korean Language Analyzer</languageprocessor>
      <languageprocessor>Sentence Breaker</languageprocessor>
      <languageprocessor>Word Breaker</languageprocessor>
      <languageprocessor>Arabic Language Analyzer</languageprocessor>
      <languageprocessor>Persian Language Analyzer</languageprocessor>
      <languageprocessor>Urdu Language Analyzer</languageprocessor>
      -->
      <languageprocessor>Stopword Locator</languageprocessor>
      <languageprocessor>Base Noun Phrase Locator</languageprocessor>
      <!--
      <languageprocessor>Statistical Entity Extractor</languageprocessor>
      -->
      <languageprocessor>Exact Match Entity Extractor</languageprocessor>
      <languageprocessor>Pattern Match Entity Extractor</languageprocessor>
      <languageprocessor>Entity Redactor</languageprocessor>
      <languageprocessor>REXML Writer</languageprocessor>
    </languageprocessors>
  </contextconfig>

As you can see, I used the European Language Analyzer. Bernd

On 12.04.2012 12:58, Paul Libbrecht wrote: Bernd, can you please say a little more? I think it is OK for this list to contain some description of commercial solutions that satisfy a request formulated on the list. Is there any product at BASIS Tech that provides a compound analyzer with a big dictionary of decomposed compounds for German? If yes, for which domain? The Google search result (I wonder if it is politically correct not to have yours ;-)) shows me that there's an amount of work done in this direction (e.g. "Gärten" to match "Garten"), but being precise on this question would be more helpful! paul
HowTo getDefaultOperator with solr3.6?
I'm trying to get the default operator of a schema in Solr 3.6, but unfortunately everything is deprecated. The Solr 3.6 API says:

  getQueryParserDefaultOperator() - Method in class org.apache.solr.schema.IndexSchema
      Deprecated. use getSolrQueryParser().getDefaultOperator()
  getSolrQueryParser(String) - Method in class org.apache.solr.schema.IndexSchema
      Deprecated.

Now what? How can I continue if I start with: QueryParser.Operator operator = getReq().getSchema(). Regards Bernd
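A minimal sketch of the chain the quoted javadoc suggests (both calls are deprecated in 3.6 but still work; passing null for the default field is an assumption):

  import org.apache.lucene.queryParser.QueryParser;
  import org.apache.solr.request.SolrQueryRequest;

  // chain the two deprecated calls; null means "use the schema default field"
  static QueryParser.Operator defaultOperator(SolrQueryRequest req) {
      return req.getSchema().getSolrQueryParser(null).getDefaultOperator();
  }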
Problems with edismax parser and solr3.6
I just looked through my Solr 3.6 logs and saw several 0-hit queries which were not seen with Solr 3.5. Tracing this down, it turned out that edismax no longer likes queries of the form ...q=(text:ide)... If there are parentheses around the query term, edismax fails with Solr 3.6. Can anyone confirm this and give me feedback? Bernd
debugging junit test with eclipse
I have tried all the hints from the internet for debugging a junit test of Solr 3.6 under Eclipse, but didn't succeed. Eclipse and everything else is running, compiling, debugging with runjettyrun. The tests have no errors. Ant from the command line is also running with ivy, e.g.

  ant -Dtestmethod=testUserFields -Dtestcase=TestExtendedDismaxParser test-solr-core

But I can't get a single test running with junit from Eclipse so that I can jump into it for debugging. Any idea what's going wrong? Regards Bernd
Re: Multi-words synonyms matching
The request parameter fq is the Filter Query, generally used to restrict the superset of documents without influencing the score (more info: http://wiki.apache.org/solr/CommonQueryParameters#q ). For example:

  q="hotel de ville" ===> returns 100 documents
  q="hotel de ville"&fq=price:[100 TO *]&fq=roomType:"King size Bed" ===> returns 40 documents from the superset of 100 documents

hope this helps! - Jeevanandam

On 24-04-2012 3:08 pm, elisabeth benoit wrote: Hello, I'd like to resume this post. The only way I found to not split synonyms into words in synonyms.txt is to use the line

  <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" tokenizerFactory="solr.KeywordTokenizerFactory"/>

in schema.xml, where tokenizerFactory="solr.KeywordTokenizerFactory" instructs SynonymFilterFactory not to break synonyms into words on white space when parsing the synonyms file. So now it works fine: mairie is mapped onto "hotel de ville", and when I send the request q="hotel de ville" (the quotes are mandatory to prevent the analyzer from splitting "hotel de ville" on white space), I get answers containing the word mairie. But when I use the fq parameter (fq=CATEGORY_ANALYZED:"hotel de ville"), it doesn't work!!! CATEGORY_ANALYZED is the same field type as the default search field. This means that when I send q="hotel de ville" and fq=CATEGORY_ANALYZED:"hotel de ville", solr uses the same analyzer, the one with the SynonymFilterFactory line above. Anyone has a clue what is different between q analysis behaviour and fq analysis behaviour? Thanks a lot, Elisabeth

2012/4/12 elisabeth benoit elisaelisael...@gmail.com: oh, that's right. thanks a lot, Elisabeth

2012/4/11 Jeevanandam Madanagopal je...@myjeeva.com: Elisabeth - as you described, the below mapping might suit your need:

  mairie => hotel de ville, mairie

mairie gets expanded to "hotel de ville" and mairie at index time, so both mairie and "hotel de ville" are searchable on the document. However, the white space tokenizer splitting at query time will still be a problem, as described by Markus. --Jeevanandam

On Apr 11, 2012, at 12:30 PM, elisabeth benoit wrote:
> Have you tried the => mapping instead? Something like hotel de ville => mairie might work for you.
Yes, thanks, I've tried it, but from what I understand it doesn't solve my problem, since it means "hotel de ville" will be replaced by mairie at index time (I use synonyms only at index time). So when a user asks for "hôtel de ville", it won't match. In fact, at index time I have mairie in my data, but I want the user to be able to request mairie or "hôtel de ville" and get mairie as an answer, and not get mairie as an answer when requesting hôtel.
> To map `mairie` to `hotel de ville` as a single token you must escape your white space: mairie, hotel\ de\ ville — this results in a problem if your tokenizer splits on white space at query time.
OK, I guess this means I have a problem. No simple solution, since at query time my tokenizer does split on white space. I guess my problem is more or less one of the problems discussed in http://lucene.472066.n3.nabble.com/Multi-word-synonyms-td3716292.html#a3717215 Thanks a lot for your answers, Elisabeth

2012/4/10 Erick Erickson erickerick...@gmail.com: Have you tried the => mapping instead? Something like hotel de ville => mairie might work for you. Best, Erick

On Tue, Apr 10, 2012 at 1:41 AM, elisabeth benoit elisaelisael...@gmail.com wrote: Hello, I've read several posts on this issue, but can't find a real solution to my multi-word synonym matching problem. I have in my synonyms.txt an entry like

  mairie, hotel de ville

and my index-time analyzer is configured as follows for synonyms:

  <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>

The problem I have is that now mairie matches hotel, and I would only want mairie to match "hotel de ville" and mairie. When I look into the analyzer, I see that mairie is mapped onto hotel, and the words de and ville are added in second and third position. To change that, I tried

  <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" tokenizerFactory="solr.KeywordTokenizerFactory"/>

(as I read in one post), and I can now see in the analyzer that mairie is mapped to "hotel de ville", but now when I have the query "hotel de ville", it doesn't match mairie at all. Anyone has a clue what I'm doing wrong? I'm using Solr 3.4. Thanks, Elisabeth
Re: Out Of Memory =( Too many cores on one server?
I guess you should give the JVM more memory. When starting to find a good value for -Xmx I oversized and set it to -Xmx20G and -Xms20G. Then I monitored the system and saw that the JVM uses between 5G and 10G (Java 7 with the G1 GC). Now it is finally set to -Xmx11G and -Xms11G for my system with 1 core and 38 million docs. But JVM memory depends pretty much on the number of fields in schema.xml and the fieldCache (sortable fields). Regards Bernd

On 16.11.2012 09:29, stockii wrote: Hello, if my server runs for a while I get some OOM problems. I think the problem is that I am running too many cores on one server with too many documents. This is my server concept: 14 cores. 1 with 30 million docs, 1 with 22 million docs, 1 growing at 25 million docs, 1 with 67 million docs, and the other cores are under 1 million docs. All these cores run fine in one jetty, searching is very fast and we are satisfied with this. Yesterday we got an OOM. Do you think that we should move the big cores into another virtual instance of the server, so that the JVMs do not share the memory and go OOM? Starting with: MEMORY_OPTIONS="-Xmx6g -Xms2G -Xmn1G"
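The probe-then-shrink approach described above, as a JAVA_OPTS sketch (values taken from the message; the G1 flag is the standard switch for the Java 7 collector mentioned):

  # step 1: oversize and monitor the real heap usage (e.g. with jvisualvm)
  JAVA_OPTS="-Xms20g -Xmx20g -XX:+UseG1GC"
  # step 2: after observing 5-10G of actual usage, pin a safe final size
  JAVA_OPTS="-Xms11g -Xmx11g -XX:+UseG1GC"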
Re: error opening index solr 4.0 with lukeall-4.0.0-ALPHA.jar
I think there is already a BETA available: http://luke.googlecode.com/svn/trunk/ Changes in the unreleased version: * Update to 4.0.0_BETA. * Issue 22: term vectors could not be accessed if a field was not stored. Fixed also several other wrong assumptions about field flags. You might try that one. Regards Bernd

On 16.11.2012 17:16, Miguel Ángel Martín wrote: hi all: I can't open an index created with Solr 4.0 with Luke version lukeall-4.0.0-ALPHA.jar. I get the error:

Format version is not supported (resource: NIOFSIndexInput(path="/Users/desa/data/index/_2.tvx")): 1 (needs to be between 0 and 0)
  at org.apache.lucene.codecs.CodecUtil.checkHeaderNoMagic(CodecUtil.java:148)
  at org.apache.lucene.codecs.CodecUtil.checkHeader(CodecUtil.java:130)
  at org.apache.lucene.codecs.lucene40.Lucene40TermVectorsReader.<init>(Lucene40TermVectorsReader.java:108)
  at org.apache.lucene.codecs.lucene40.Lucene40TermVectorsFormat.vectorsReader(Lucene40TermVectorsFormat.java:107)
  at org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:118)
  at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:55)
  at org.apache.lucene.index.StandardDirectoryReader$1.doBody(StandardDirectoryReader.java:62)
  at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:752)
  at org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:52)
  at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:63)
  at org.getopt.luke.Luke.openIndex(Luke.java:967)
  at org.getopt.luke.Luke.openOk(Luke.java:696)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
  at java.lang.reflect.Method.invoke(Method.java:597)
  at thinlet.Thinlet.invokeImpl(Thinlet.java:4579)
  at thinlet.Thinlet.invoke(Thinlet.java:4546)
  at thinlet.Thinlet.handleMouseEvent(Thinlet.java:3937)
  at thinlet.Thinlet.processEvent(Thinlet.java:2917)
  at java.awt.Component.dispatchEventImpl(Component.java:4744)
  at java.awt.Container.dispatchEventImpl(Container.java:2141)
  at java.awt.Component.dispatchEvent(Component.java:4572)
  at java.awt.LightweightDispatcher.retargetMouseEvent(Container.java:4619)
  at java.awt.LightweightDispatcher.processMouseEvent(Container.java:4280)
  at java.awt.LightweightDispatcher.dispatchEvent(Container.java:4210)
  at java.awt.Container.dispatchEventImpl(Container.java:2127)
  at java.awt.Window.dispatchEventImpl(Window.java:2489)
  at java.awt.Component.dispatchEvent(Component.java:4572)
  at java.awt.EventQueue.dispatchEventImpl(EventQueue.java:704)
  at java.awt.EventQueue.access$400(EventQueue.java:82)
  at java.awt.EventQueue$2.run(EventQueue.java:663)
  at java.awt.EventQueue$2.run(EventQueue.java:661)
  at java.security.AccessController.doPrivileged(Native Method)
  at java.security.AccessControlContext$1.doIntersectionPrivilege(AccessControlContext.java:87)
  at java.security.AccessControlContext$1.doIntersectionPrivilege(AccessControlContext.java:98)
  at java.awt.EventQueue$3.run(EventQueue.java:677)
  at java.awt.EventQueue$3.run(EventQueue.java:675)
  at java.security.AccessController.doPrivileged(Native Method)
  at java.security.AccessControlContext$1.doIntersectionPrivilege(AccessControlContext.java:87)
  at java.awt.EventQueue.dispatchEvent(EventQueue.java:674)
  at java.awt.EventDispatchThread.pumpOneEventForFilters(EventDispatchThread.java:296)
  at java.awt.EventDispatchThread.pumpEventsForFilter(EventDispatchThread.java:211)
  at java.awt.EventDispatchThread.pumpEventsForHierarchy(EventDispatchThread.java:201)
  at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:196)
  at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:188)
  at java.awt.EventDispatchThread.run(EventDispatchThread.java:122)

Any ideas? I've created another index with Lucene 4.0 and this Luke opens that index fine. Thanks in advance

--
*
Bernd Fehling                     Bielefeld University Library
Dipl.-Inform. (FH)                LibTec - Library Technology
Universitätsstr. 25               and Knowledge Management
33615 Bielefeld
Tel. +49 521 106-4060
bernd.fehling(at)uni-bielefeld.de

BASE - Bielefeld Academic Search Engine - www.base-search.net
*
Re: error opening index solr 4.0 with lukeall-4.0.0-ALPHA.jar
I just downloaded, compiled and opened an optimized Solr 4.0 index read-only without problems. Could browse through the docs, search with different analyzers, ... Looks good.

On 19.11.2012 08:49, Toke Eskildsen wrote: On Mon, 2012-11-19 at 08:10 +0100, Bernd Fehling wrote:
> I think there is already a BETA available: http://luke.googlecode.com/svn/trunk/ You might try that one.
That doesn't work either for Lucene 4.0.0 indexes, same for source trunk. I did have some luck with downloading the source and changing the dependencies to Lucene 4.0.0 final (4 or 5 JARs, AFAIR). It threw a non-fatal exception upon index open, something about subReaders not being accessible through the method it used (sorry for being vague, it was on my home machine some days ago), so I'm guessing that not all functionality works. It was possible to inspect some documents, and that was what I needed at the time.
Re: Multi word synonyms
There are also other solutions. Multi-word synonym filter (synonym expansion): https://issues.apache.org/jira/browse/LUCENE-4499 Since Solr 3.4 I have my own solution, which might become obsolete if LUCENE-4499 makes it into a released version: http://www.ub.uni-bielefeld.de/~befehl/base/solr/eurovoc.html

On 29.11.2012 13:44, O. Klein wrote: Found an article about the issue of multi-word synonyms: http://nolanlawson.com/2012/10/31/better-synonym-handling-in-solr/ . Not sure it's the solution I'm looking for, but it may be for someone else.
DefaultSolrParams ?
Dear list, after going from 3.6 to 4.0 I see exceptions in my logs. It turned out that somehow the q parameter was empty. With 3.6 the q.alt in solrconfig.xml worked as a fallback, but now with 4.0 I get exceptions. I use it like this:

  SolrParams params = req.getParams();
  String q = params.get(CommonParams.Q).trim();

The exception comes from the second line if q is empty. I can see q.alt=*:* in my defaults within params. So why is it not picking up q.alt if q is empty? Regards Bernd
Re: DefaultSolrParams ?
Hi Hoss, my config has definitely not changed, and it worked with 3.6 and 3.6.1. Yes, I have a custom plugin, and with 3.6, if q was empty, it automatically picked up q.alt from solrconfig.xml. This was all done with params.get(). With 4.x this is gone, due to some changes in DefaultSolrParams(?). What is now the method to get q from params and have an automatic fallback to q.alt? Bernd

: I use it like this:
: SolrParams params = req.getParams();
: String q = params.get(CommonParams.Q).trim();
:
: The exception is from the second line if q is empty.
: I can see q.alt=*:* in my defaults within params.
:
: So why is it not picking up q.alt if q is empty?

You're talking about some sort of custom solr plugin that you have, correct? When you access a SolrParams object, there is nothing magic about q and q.alt -- params.get() will only return the value specified for the param name you ask about. The logic for using q.alt (aka DisMaxParams.ALTQ) if q doesn't exist in the params (or is blank) has always been a specific feature of the DisMaxQParser. So if you are suddenly getting an NPE when q is missing, perhaps the problem is that in your old configs there was a default q containing the empty string, and now that's gone? -Hoss
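Following Hoss's explanation, a custom plugin has to replicate the dismax fallback itself; a minimal sketch (assuming the plugin has the SolrQueryRequest in scope):

  import org.apache.solr.common.params.CommonParams;
  import org.apache.solr.common.params.DisMaxParams;
  import org.apache.solr.common.params.SolrParams;
  import org.apache.solr.request.SolrQueryRequest;

  // emulate the DisMaxQParser behaviour: if q is missing or blank,
  // fall back to q.alt (DisMaxParams.ALTQ) from the request defaults
  static String queryOrAlt(SolrQueryRequest req) {
      SolrParams params = req.getParams();
      String q = params.get(CommonParams.Q);
      if (q == null || q.trim().length() == 0) {
          q = params.get(DisMaxParams.ALTQ);
      }
      return q;
  }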
Re: OutOfMemoryError | While Faceting Query
Hi Uwe, sorting needs to be well prepared. A first rough check is the fieldCache: you can see it with SolrAdmin Stats, and the insanity_count there should be 0 (zero). Only sort on fields which are prepared for sorting and where sorting makes sense. Do faceting only on fields where it makes sense; I've seen systems faceting on id, which is a no-go and doesn't make sense. Such queries pull a lot of data from the index into memory, which can lead to OOME. How to figure out what is killing your system? There is no general rule, but what you can do is:
- Start your test system and make sure no one else is using it.
- Start a monitor for the JVM running SOLR (e.g. jvisualvm).
- Use your search frontend and do searches, sorting and faceting in any combination possible, and watch whether the heap memory makes big jumps.
- Analyze your search log files and look for searches which have a very high QTime. Repeat the searches with high QTime and see if you get insanity counts or heap memory jumps in the JVM.
Regards Bernd

On 06.12.2012 23:27, uwe72 wrote: Hi there, since I use a lot of sorting and faceting I am getting an OutOfMemoryError very often. I have around 6 million documents, an index size of around 18GB, and use tomcat with a 1.8 GB max heap size. What can I do? What heap size is recommended in our case? Can I do other things in order to prevent an OutOfMemoryError while using a lot of facets? Urgent, please help. Thanks, Uwe
Re: jconsole over jmx - should threads be visible?
Hi Shawn, actually I use munin for monitoring but just checked with jvisualvm, which also works fine for remote monitoring. You might try the following: http://www.codefactorycr.com/java-visualvm-to-profile-a-remote-server.html You have to:
- generate a policy file on the server to be monitored
- start jstatd on the server to be monitored
- have JMX enabled for jetty or tomcat or ...
- and you should have JMX protected with a password.

Password protection for jetty:
  $JETTY_HOME/etc/jmxremote.access
  $JETTY_HOME/etc/jmxremote.password

jmxremote.access looks like:
  monitorRole readonly
  controlRole readwrite

jmxremote.password looks like:
  monitorRole solr4monitor
  controlRole solr4control

If everything is set correctly, start jvisualvm, right-click on Remote and add a remote host. Enter the IP address into Hostname and click OK. Now you have the connection to jstatd on the remote host, which will show jstatd and start.jar of the remote host. Then click on start.jar and you will be asked for username and password. Enter controlRole as username and solr4control as password. Regards Bernd

On 18.12.2012 18:21, Shawn Heisey wrote: If I connect jconsole to a remote Solr installation (or any app) using JMX, all the graphs are populated except 'threads' ... is this expected, or have I done something wrong? I can't seem to locate the answer with google. Thanks, Shawn
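For completeness, the standard JVM flags that enable the password-protected JMX described above (the port number is an arbitrary example; file locations match the jetty layout mentioned):

  -Dcom.sun.management.jmxremote
  -Dcom.sun.management.jmxremote.port=18983
  -Dcom.sun.management.jmxremote.ssl=false
  -Dcom.sun.management.jmxremote.authenticate=true
  -Dcom.sun.management.jmxremote.access.file=$JETTY_HOME/etc/jmxremote.access
  -Dcom.sun.management.jmxremote.password.file=$JETTY_HOME/etc/jmxremote.password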
thanks for solr 4.1
Now this must be said: thanks for Solr 4.1 (and Lucene 4.1)! Great improvements compared to 4.0. After building the first 4.1 index I thought the index was broken, but had no error messages anywhere. Why did I think it was damaged? The index size went down from 167 GB (Solr 4.0) to 115 GB (Solr 4.1)!!! Will now move the new 4.1 index to the testing stage, and after it passes all tests it goes online. Can't wait to see the new stats. Regards, Bernd
Solr4.1 changing result order FIFO to LIFO
Hi list, I noticed that the result order is FIFO if documents have the same score. I think this is because documents which are indexed later get a higher internal document ID, and the output for documents with the same score starts with the lowest internal document ID and rises. Is this right so far? I would prefer LIFO output: documents with the same score but indexed later are newer (at least for my data) and should be displayed first. Sure, I could use sorting, but sorting is always time consuming, whereas LIFO output would just start with the highest internal document ID for documents with the same score. Is there anything like this already available? If not, any hint where to look (Lucene or Solr)? Regards Bernd
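One possibly relevant knob, untested for this exact case and therefore only an assumption: Solr accepts the pseudo-field _docid_ in the sort parameter, which orders by internal document ID without a fieldCache entry, so as a secondary sort it might give the LIFO behaviour asked for:

  sort=score desc,_docid_ desc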
expert question about SolrReplication
A question to the experts: why is the replicated index copied from its temporary location (index.x) to the real index directory and NOT moved? Copying over 100s of gigs takes some time; moving is just changing the file system link. Also, instead of first deleting the old index, why not:
- move the file links of the old index to index.x.old
- move the file links of the new index to index
- and finally, after the new searcher is up, delete index.x.old
Any answers? Regards Bernd
Re: expert question about SolrReplication
On 02.02.2013 03:48, Yonik Seeley wrote:
> On Fri, Feb 1, 2013 at 4:13 AM, Bernd Fehling bernd.fehl...@uni-bielefeld.de wrote:
>> A question to the experts, why is the replicated index copied from its temporary location (index.x) to the real index directory and NOT moved?
> The intent is certainly to move and not copy (provided the Directory supports it). See StandardDirectoryFactory.move()

Because I run Solr/Lucene on Linux I suppose it should really move, but I will step through it with a debugger and see what happens.

>> Copying over 100s of gigs takes some time, moving is just changing the file system link. Also, instead of first deleting the old index, why not - moving the file links of old index to index.x.old
> You can't do this in Windows?

Solr/Lucene is optimized for Windows??? Who is on the MS payroll?

>> - moving the file links of new index to index - and finally after new searcher is up, deleting index.x.old
> -Yonik http://lucidworks.com
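For illustration, a rename-first move with a copy fallback as a plain Java 7 sketch — this is not the actual StandardDirectoryFactory.move() code, just the pattern under discussion:

  import java.io.IOException;
  import java.nio.file.*;

  public class MoveSketch {
      // On the same filesystem an atomic move only rewires the directory
      // entry; only when that is unsupported (e.g. across filesystems)
      // do we pay for a full per-file copy plus delete.
      static void move(Path src, Path dest) throws IOException {
          try {
              Files.move(src, dest, StandardCopyOption.ATOMIC_MOVE);
          } catch (AtomicMoveNotSupportedException e) {
              Files.copy(src, dest, StandardCopyOption.REPLACE_EXISTING);
              Files.delete(src);
          }
      }
  }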
replication problems with solr4.1
Hi list, after upgrading from Solr 4.0 to Solr 4.1 and running it for two weeks, it turns out that replication has problems and unpredictable results. My installation is a single index, 41 mio. docs / 115 GB index size / 1 master / 3 slaves.
- the master builds a new index from scratch once a week
- a replication is started manually with the Solr admin GUI
What I see is one of these cases:
- after a replication, a new searcher is opened on an index.xxx directory, the old data/index/ directory is never deleted, and besides the file replication.properties there is also a file index.properties
OR
- the replication takes place and everything looks fine, but when opening the admin GUI the statistics report:

  Last Modified: a day ago
  Num Docs: 42262349
  Max Doc: 42262349
  Deleted Docs: 0
  Version: 45174
  Segment Count: 1

          Version        Gen   Size
  Master: 1360483635404  112   116.5 GB
  Slave:  1360483806741  113   116.5 GB

In the first case, why is the replication doing that??? It is an offline slave — no search activity, it's just there for backup! In the second case, why are the version and generation different right after a full replication? Any thoughts on this? - Bernd
Re: replication problems with solr4.1
Now this is strange: the index generation and index version change with replication. E.g. the master has index generation 118, index version 136059533234, and the slave has index generation 118, index version 136059533234 — both the same. Now add one doc to the master with a commit: the master has index generation 119, index version 1360595446556. Next, replicate master to slave. The result is: master has index generation 119, index version 1360595446556; slave has index generation 120, index version 1360595564333. I have not seen this before. I thought replication just takes over the index from master to slave, more like a sync?

On 11.02.2013 09:29, Bernd Fehling wrote: Hi list, after upgrading from Solr 4.0 to Solr 4.1 and running it for two weeks, it turns out that replication has problems and unpredictable results. My installation is a single index, 41 mio. docs / 115 GB index size / 1 master / 3 slaves.
- the master builds a new index from scratch once a week
- a replication is started manually with the Solr admin GUI
What I see is one of these cases:
- after a replication, a new searcher is opened on an index.xxx directory, the old data/index/ directory is never deleted, and besides the file replication.properties there is also a file index.properties
OR
- the replication takes place and everything looks fine, but when opening the admin GUI the statistics report:

  Last Modified: a day ago
  Num Docs: 42262349
  Max Doc: 42262349
  Deleted Docs: 0
  Version: 45174
  Segment Count: 1

          Version        Gen   Size
  Master: 1360483635404  112   116.5 GB
  Slave:  1360483806741  113   116.5 GB

In the first case, why is the replication doing that??? It is an offline slave — no search activity, it's just there for backup! In the second case, why are the version and generation different right after a full replication? Any thoughts on this? - Bernd

--
*
Bernd Fehling                     Bielefeld University Library
Dipl.-Inform. (FH)                LibTec - Library Technology
Universitätsstr. 25               and Knowledge Management
33615 Bielefeld
Tel. +49 521 106-4060
bernd.fehling(at)uni-bielefeld.de

BASE - Bielefeld Academic Search Engine - www.base-search.net
*
Re: replication problems with solr4.1
OK, then index generation and index version are useless when it comes to verifying that master and slave indexes are in sync. What else is possible? The strange thing is: if the master is 2 or more generations ahead of the slave, then it works! With your logic the slave must _always_ be one generation ahead of the master, because the slave replicates from the master and then does an additional commit to recognize the changes on the slave. This implies that the slave acts as follows:
- if the master is one generation ahead, do an additional commit
- if the master is 2 or more generations ahead, do _no_ commit
OR
- if the master is 2 or more generations ahead, do a commit but don't change the generation and version of the index
Can this be true? I would say not really. Regards Bernd

On 13.02.2013 20:38, Amit Nithian wrote: Okay, so then that should explain the generation difference of 1 between the master and slave. On Wed, Feb 13, 2013 at 10:26 AM, Mark Miller markrmil...@gmail.com wrote: On Feb 13, 2013, at 1:17 PM, Amit Nithian anith...@gmail.com wrote: doesn't it do a commit to force solr to recognize the changes? — yes. - Mark
Re: Slaves always replicate entire index & index versions
Maybe the info about the index version is pulled from the repeater's data/replication.properties file and the content of that file is wrong. I had something similar, and the only solution for me was deleting the replication.properties file. But no guarantee on this. Actually, replication is pretty messed up in Solr 4.1. I have seen about 6 or 7 erroneous combinations with replication that led to some kind of problem. My problem is I can't reproduce it continuously so that I could use a debugger :-( A positive point is: if something goes wrong, it goes wrong on all slaves. Some kind of continuity :-) While writing this I just found a new combination:
- master had a clean index (everything committed and optimized) and was successfully replicated to all slaves
- master had a few docs added and committed
- master was replicated to all slaves
- all slaves have the same generation and version as the master, but:
- all slaves now have no index directory anymore. They only have an index.x directory and an additional index.properties file.
I already knew that something would go wrong when I started the replication and saw that the slaves pulled the whole index (again) from the master and not only the files with the added docs. Under these circumstances I would not even dream of using SolrCloud. Regards Bernd

On 27.02.2013 08:50, raulgrande83 wrote: I'm now having a different problem. In my master-repeater-2slaves architecture I have these generations/versions: Master: 29147, Repeater: 29147, Slaves: 29037. When I go to the slaves' logs they show "Slave in sync with master". That is apparently because if I request http://localhost:17045/solr/replication?command=indexversion (my repeater's replication URL), the response is:

  <long name="generation">29037</long>

Why is this URL returning an old index version? Any solutions to this?
Re: how often do you boys restart your tomcat?
Until now I used Jetty, and two weeks was the longest uptime before OOM. I just switched to tomcat6 and will see how that one behaves, but I think it's not a problem of the servlet container. Solr is pretty unstable when having a huge index. Actually this can't be blamed directly on Solr; it is a problem of Lucene and its fieldCache. Somehow, during two weeks of runtime with searching and replication, the fieldCache gets doubled until OOM. Currently there is no other solution to this than restarting your tomcat or jetty regularly :-(

On 27.07.2011 03:42, Bing Yu wrote: I find that if I do not restart the master's tomcat for some days, the load average will keep rising to a high level, solr becomes slow and unstable, so I added a crontab to restart the tomcat every day. Do you boys restart your tomcat? And is there any way to avoid restarting tomcat?
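If you want to watch the fieldCache grow between restarts, the 3.x admin stats page exposes the Lucene FieldCache entries (URL assumes the default example setup):

  curl "http://localhost:8983/solr/admin/stats.jsp" | grep -i fieldcache

The fieldCache section lists entries_count and an insanity_count from the FieldCacheSanityChecker; an insanity_count above 0 can mean the same field is cached more than once (e.g. per reader after replication), which would match the doubling described above.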
Re: how often do you boys restart your tomcat?
It is definitely Lucene's fieldCache making the trouble. Restart your solr and monitor it with jvisualvm, especially the OldGen heap. When it gets filled to 100 percent, use jmap to dump the heap of your system. Then use the Eclipse Memory Analyzer http://www.eclipse.org/mat/ and open the heap dump. You will see a pie chart and can easily identify the largest consumer of your heap space.

On 27.07.2011 09:02, Paul Libbrecht wrote: On curriki.org, our solr's Tomcat saturates memory after 2-4 weeks. I am still investigating if I am accumulating something or something else is. To check it, I am running a query-all returning the num results every minute to measure the time it takes. It's generally when it meets a big GC that gives a timeout that I start to worry. Memory then starts to be hogged but things get back to normal as soon as the GC is out. I had other tomcat servers with very long uptimes (more than 6 months) so I do not think tomcat is guilty. Currently I can only show the free memory of the system and what's in solr-stats, but I do not know what to look at really... paul

On 27 Jul 2011, at 03:42, Bing Yu wrote: I find that if I do not restart the master's tomcat for some days, the load average will keep rising to a high level [...]
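For reference, the commands involved (the pid is the Tomcat/Jetty process id):

  jmap -dump:format=b,file=/tmp/solr-heap.hprof <pid>
  jstat -gcutil <pid> 5000

jstat prints the heap generation usage every 5 seconds, which is handy for watching OldGen fill up before taking the dump; the .hprof file can then be opened directly in the Eclipse Memory Analyzer.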
segment.gen file is not replicated
Dear list, is there a deeper logic behind why the segment.gen file is not replicated with solr 3.2? Is it obsolete because I have a single segment? Regards, Bernd
Re: Solr 3.3 crashes after ~18 hours?
Any JAVA_OPTS set? Do not use the -XX:+OptimizeStringConcat or -XX:+AggressiveOpts flags.

On 02.08.2011 12:01, alexander sulz wrote: Hello folks, I'm using the latest stable Solr release - 3.3 - and I encounter a strange phenomenon with it. After about 19 hours it just crashes, but I can't find anything in the logs: no exceptions, no warnings, no suspicious info entries. I have an index-job running from 6am to 8pm every 10 minutes. After each job there is a commit. An optimize-job is done twice a day at 12:15pm and 9:15pm. Does anyone have an idea what could possibly be wrong or where to look for further debug info? regards and thank you alex
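For comparison, a conservative set of options that avoids the two flags above might look like this (heap sizes are placeholders; boolean HotSpot flags can also be switched off explicitly with the -XX:- form):

  JAVA_OPTS="-server -Xms2g -Xmx2g -XX:-OptimizeStringConcat -XX:-AggressiveOpts -verbose:gc -Xloggc:/var/log/solr-gc.log"

The GC log at least gives you something to look at after a crash that leaves nothing in the Solr logs.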
performance crossover between single index and sharding
Is there any knowledge on this list about the performance crossover between a single index and sharding, and when to change from a single index to sharding? E.g. if the index size is larger than 150GB and the number of docs is more than 25 mio., then it is better to change from a single index to sharding and have two shards. Or something like this... Sure, solr might even handle 50 mio. docs, but performance is going down, and a sharded system with distributed search will be faster than a single index, or not? Is a single index always faster than sharding? Regards Bernd
Re: performance crossover between single index and sharding
On 02.08.2011 21:00, Shawn Heisey wrote: ... I did try some early tests with a single large index. Performance was pretty decent once it got warmed up, but I was worried about how it would perform under a heavy load, and how it would cope with frequent updates. I never really got very far with testing those fears, because the full rebuild time was unacceptable - at least 8 hours. The source database can keep up with six DIH instances reindexing at once, which completes much quicker than a single machine grabbing the entire database. I may increase the number of shards after I remove virtualization, but I'll need to fix a few limitations in my build system. ...

At first, thanks a lot for all the answers, and here is my setup. I know that it is very difficult to give specific recommendations about this. Because of changing from FAST Search to Solr I can state that Solr performs very well, if not excellently. To show that I compare apples and oranges, here is my previous FAST Search setup:
- one master server (controlling, logging, search dispatcher)
- six index servers (4.25 mio. docs per server, 5 slices per index) (searching and indexing at the same time, indexing once per week during the weekend)
- each server has 4GB RAM, all servers are physical, on separate machines
- RAM usage controlled by the processes
- total of 25.5 mio. docs (mainly metadata) from 1500 databases worldwide
- index size is about 67GB per indexer -- about 402GB total
- about 3 qps at peak times
- with an average search time of 0.05 seconds at peak times

And here is now my current Solr setup:
- one master server (indexing only)
- two slave servers (search only), but only one is online, the second is fallback
- each server has 32GB RAM, all servers are virtual (master on a separate physical machine, both slaves together on a physical machine)
- RAM usage is currently 20GB for the java heap
- total of 31 mio. docs (all metadata) from 2000 databases worldwide
- index size is 156GB total
- the search handler statistics report 0.6 average requests per second
- average time per request 39.5 (is that seconds?)
- building the index from scratch takes about 20 hours

The good thing is I have the ability to compare a commercial product and enterprise system to open source. I started with my simple Solr setup because of KISS (keep it simple, stupid). Actually it is doing excellently as a single index on a single virtual server. But the average time per request should be reduced now, that's why I started this discussion. While searches with a smaller Solr index size (3 mio. docs) showed that it can stand with FAST Search, it now shows that it's time to go with sharding. I think we are already far beyond the point of the search performance crossover.

What I hope to get with sharding:
- reduce the time for building the index
- reduce the average time per request

What I fear with sharding:
- I currently have master/slave, do I then have e.g. 3 masters and 3 slaves?
- the query changes because of sharding (is there a search distributor?)
- how to distribute the content to the indexers with DIH on 3 servers?
- anything else to think about while changing to sharding?

Conclusion:
- Solr can handle much more than 30 mio. docs of metadata in a single index if the java heap size is large enough. Have an eye on Lucene's fieldCache and sorted fields, especially title (string) fields.
- The crossover in my case is somewhere between 3 mio. and 10 mio. docs per index for Solr (compared to FAST Search). FAST recommends about 3 to 6 mio. docs per 4GB RAM server for their system.
Anyone able to reduce my fears about sharding? Thanks again for all your answers. Regards Bernd -- * BASE - Bielefeld Academic Search Engine - www.base-search.net *
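On the query side of sharding, no extra dispatcher component is needed; any Solr instance can merge results from all shards via the shards parameter (hostnames below are placeholders):

  http://slave1:8983/solr/select?q=some+query&shards=idx1:8983/solr,idx2:8983/solr,idx3:8983/solr

The instance receiving the request queries all listed shards and merges the results, so an existing slave can play the role the FAST search dispatcher had.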
Re: performance crossover between single index and sharding
Hi Shawn, the 0.05 seconds search time at peak times (3 qps) is my target for Solr. The numbers for solr are from Solr's statistics report page. So 39.5 seconds average per request is definitely too long and I have to change to sharding. For the FAST system the numbers for the search dispatcher are:
0.042 sec elapsed per normal search, on avg.
0.053 sec average uncached normal search time (last 100 queries)
99.898% of searches using <= 1 sec
99.999% of searches using <= 3 sec
0.000% of all requests timed out
22454567.577 sec time up (that is 259 days)
Is there a report page for those numbers for Solr? About the RAM: the 32GB RAM are physical for each VM and the 20GB RAM are -Xmx for Java. Yesterday I noticed that we are running out of heap during replication, so I have to increase -Xmx to about 22g. The reported 0.6 average requests per second seems right to me because the Solr system isn't under full load yet. The FAST system is still taking most of the load. I plan to switch completely to Solr after sharding is up and running stable. So there will be an additional 3 qps for Solr at peak times. I don't know if a controlling master like FAST has makes any sense for Solr. The small VMs with heartbeat and haproxy sound great, that must go on my todo list. But the biggest problem currently is how to configure DIH to split up the content to several indexers. Is there an indexing distributor? Regards, Bernd

On 03.08.2011 16:33, Shawn Heisey wrote: Replies inline. On 8/3/2011 2:24 AM, Bernd Fehling wrote: To show that I compare apples and oranges here are my previous FAST Search setup: [...] An average query time of 50 milliseconds isn't too bad. If the number from your Solr setup below (39.5) is the QTime, then Solr thinks it is performing better, but Solr's QTime does not include absolutely everything that has to happen. Do you by chance have 95th and 99th percentile query times for either system? And here is now my current Solr setup: [...] I can't tell whether you mean that each physical host has 32GB or each VM has 32GB. You want to be sure that you are not oversubscribing your memory. If you can get more memory in your machines, you really should. Do you know whether that 0.6 seconds is most of the delay that a user sees when making a search request, or are there other things going on that contribute more delay?
In our webapp, the Solr request time is usually small compared with everything else the server and the user's browser are doing to render the results page. As much as I hate being the tall pole in the tent, I look forward to the day when the developers can change that balance.
The good thing is I have the ability to compare a commercial product and enterprise system to open source. I started with my simple Solr setup because of KISS (keep it simple, stupid). Actually it is doing excellently as a single index on a single virtual server. But the average time per request should be reduced now, that's why I started this discussion. While searches with a smaller Solr index size (3 mio. docs) showed that it can stand with FAST Search, it now shows that it's time to go with sharding. I think we are already far beyond the point of the search performance crossover. What I hope to get with sharding: - reduce time for building the index - reduce average time per request
You will probably achieve both of these things by sharding, especially if you have a lot of CPU cores available. Like mine, your query volume is very low, so the CPU cores are better utilized distributing the search.
What I fear with sharding: - I currently have master/slave, do I then have e.g. 3 masters and 3 slaves? - the query changes because of sharding (is there a search distributor?) - how to distribute the content to the indexers with DIH on 3 servers? - anything else to think about while changing to sharding? [...]
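On the "indexing distributor" question: there is no built-in one for DIH. A common trick, assuming a numeric primary key in a SQL source, is to give each shard's DIH config the same query with a different modulo (table and column names here are made up for illustration):

  <entity name="doc" query="select id, title from docs where MOD(id, 3) = 0">

with MOD(id, 3) = 1 and = 2 on the other two shards. For a file-based setup like XPathEntityProcessor over XML files, the same idea applies by partitioning the file list, e.g. by a hash of the file name.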
Re: segment.gen file is not replicated
I have now updated to solr 3.3 but segment.gen is still not replicated. Any idea why, is it a bug or a feature? Should I write a jira issue for it? Regards Bernd

On 29.07.2011 14:10, Bernd Fehling wrote: Dear list, is there a deeper logic behind why the segment.gen file is not replicated with solr 3.2? Is it obsolete because I have a single segment? Regards, Bernd
Re: segment.gen file is not replicated
On 04.08.2011 12:52, Michael McCandless wrote: This file is actually optional; it's there for redundancy in case the filesystem is not reliable when listing a directory. I.e., normally we list the directory to find the latest segments_N file; but if this is wrong (e.g. the file system might have a stale cache) then we fall back to reading the segments.gen file. For example this is sometimes needed for NFS. Likely replication is just skipping it?

That was my first idea. If not changed and touched then it will be skipped. Trying to be smart, I deleted it from the index dir on the slave and then replicated, but segment.gen was not replicated. Following your explanation, NFS could then no longer be reliable. So my guess: either a bug or a feature, and the experts will know :-) Regards Bernd

Mike McCandless http://blog.mikemccandless.com On Thu, Aug 4, 2011 at 3:38 AM, Bernd Fehling bernd.fehl...@uni-bielefeld.de wrote: I have now updated to solr 3.3 but segment.gen is still not replicated. [...]
Re: performance crossover between single index and sharding
java version 1.6.0_21
Java(TM) SE Runtime Environment (build 1.6.0_21-b06)
Java HotSpot(TM) 64-Bit Server VM (build 17.0-b16, mixed mode)
java: file format elf64-x86-64
Including the -d64 switch.

On 04.08.2011 14:40, Bob Sandiford wrote: Dumb question time - you are using a 64 bit Java, and not a 32 bit Java? Bob Sandiford | Lead Software Engineer | SirsiDynix P: 800.288.8020 X6943 | bob.sandif...@sirsidynix.com www.sirsidynix.com

-----Original Message----- From: Bernd Fehling [mailto:bernd.fehl...@uni-bielefeld.de] Sent: Thursday, August 04, 2011 2:39 AM To: solr-user@lucene.apache.org Subject: Re: performance crossover between single index and sharding

Hi Shawn, the 0.05 seconds search time at peak times (3 qps) is my target for Solr. [...]
string cut-off filter?
Hi list, is there a string cut-off filter to limit the length of a KeywordTokenized string? So the string should not be dropped, only limited to a certain length. Regards Bernd
Re: string cut-off filter?
Yes indeed, I currently use a workaround with a regex filter. Example for limiting to 30 characters:

<filter class="solr.PatternReplaceFilterFactory" pattern="(.{1,30})(.{31,})" replacement="$1" replace="all"/>

Just thought there might already be such a filter. But as Karsten showed it is pretty easy to implement. Maybe Karsten can open an issue and add his code? Regards Bernd

On 08.08.2011 22:56, Markus Jelsma wrote: There is none indeed, except using copyField and maxChars. Could you perhaps come up with some regex that replaces the group of chars beyond the desired limit with ''? That would fit in a pattern replace char filter.

Hi Bernd, I also searched for such a filter but did not find it. Best regards Karsten

P.S. I am using now this filter:

public class CutMaxLengthFilter extends TokenFilter {

  public CutMaxLengthFilter(TokenStream in) {
    this(in, DEFAULT_MAXLENGTH);
  }

  public CutMaxLengthFilter(TokenStream in, int maxLength) {
    super(in);
    this.maxLength = maxLength;
  }

  public static final int DEFAULT_MAXLENGTH = 15;

  private final int maxLength;
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

  @Override
  public final boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    int length = termAtt.length();
    if (maxLength > 0 && length > maxLength) {
      termAtt.setLength(maxLength);
    }
    return true;
  }
}

with this factory

public class CutMaxLengthFilterFactory extends BaseTokenFilterFactory {

  private int maxLength;

  @Override
  public void init(Map<String, String> args) {
    super.init(args);
    maxLength = getInt("maxLength", CutMaxLengthFilter.DEFAULT_MAXLENGTH);
  }

  public TokenStream create(TokenStream input) {
    return new CutMaxLengthFilter(input, maxLength);
  }
}

-------- Original message -------- Date: Mon, 08 Aug 2011 10:15:45 +0200 From: Bernd Fehling bernd.fehl...@uni-bielefeld.de To: solr-user@lucene.apache.org Subject: string cut-off filter?

Hi list, is there a string cut-off filter to limit the length of a KeywordTokenized string? So the string should not be dropped, only limited to a certain length. Regards Bernd
--
Bernd Fehling, Dipl.-Inform. (FH) | Universitätsbibliothek Bielefeld
Universitätsstr. 25, 33615 Bielefeld | Tel. +49 521 106-4060 | Fax. +49 521 106-4052 | bernd.fehl...@uni-bielefeld.de
BASE - Bielefeld Academic Search Engine - www.base-search.net
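If the code above gets packaged, usage in schema.xml would presumably look like this (the package name is made up; the maxLength attribute matches what the factory's init() reads):

<filter class="my.package.CutMaxLengthFilterFactory" maxLength="30"/>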
question about query parsing
Hi list, while searching with debug on I see strange query parsing:

<str name="rawquerystring">identifier:ub.uni-bielefeld.de</str>
<str name="querystring">identifier:ub.uni-bielefeld.de</str>
<str name="parsedquery">+MultiPhraseQuery(identifier:"(ub.uni-bielefeld.de ub) uni bielefeld de")</str>
<str name="parsedquery_toString">+identifier:"(ub.uni-bielefeld.de ub) uni bielefeld de"</str>

It is a PhraseQuery, but:
- why is the string split apart?
- why is it grouped this way?
Default is edismax.

FIELD:
<field name="identifier" type="text_url" indexed="true" stored="false" multiValued="true"/>

FIELDTYPE:
<fieldType name="text_url" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

Regards Bernd
Re: Is optimize needed on slaves if it replicates from optimized master?
From what I see on my slaves: yes. After replication has finished, the new index is in place and a new reader has started, I always have a write.lock file in my index directory on the slaves, even though the index on the master is optimized. Regards Bernd

On 10.08.2011 09:12, Pranav Prakash wrote: Do slaves need a separate optimize command if they replicate from an optimized master? *Pranav Prakash* temet nosce Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com | Google http://www.google.com/profiles/pranny
Re: Is optimize needed on slaves if it replicates from optimized master?
Sure, there is actually no optimize needed on the slave, but after calling optimize on the slave the write.lock will be removed. So why is the replication process not doing this? Regards Bernd

On 10.08.2011 10:57, Shalin Shekhar Mangar wrote: On Wed, Aug 10, 2011 at 1:11 PM, Bernd Fehling bernd.fehl...@uni-bielefeld.de wrote: From what I see on my slaves, yes. After replication has finished and the new index is in place and a new reader has started, I always have a write.lock file in my index directory on the slaves, even though the index on the master is optimized.
That is not true. Replication is roughly a copy of the diff between the master's and the slave's index. An optimized index is a merged and re-written index, so replication from an optimized master will give an optimized copy on the slave. The write lock is due to the fact that an IndexWriter is always open in Solr, even on the slaves.
Re: Solr 3.3 crashes after ~18 hours?
Hi, googling for hotspot server 19.1-b02 shows that you are not alone with hanging threads and crashes. And not only with solr. Maybe try another Java version? Bernd

On 10.08.2011 17:00, alexander sulz wrote: Okay, with this command it hangs. Also: I managed to get a thread dump (attached). regards

On 05.08.2011 15:08, Yonik Seeley wrote: On Fri, Aug 5, 2011 at 7:33 AM, alexander sulz a.s...@digiconcept.net wrote: Usually you get an XML response when doing commits or optimize; in this case I get nothing in return, but the site ( http://[...]/solr/update?optimize=true ) DOESN'T load forever or anything. It doesn't hang! I just get a blank page / empty response.
Sounds like you are doing it from a browser? Can you try it from the command line? It should give back some sort of response (or hang waiting for a response).
curl "http://localhost:8983/solr/update?commit=true"
-Yonik http://www.lucidimagination.com

I use the stuff in the example folder, the only changes I made were enabling logging and changing the port to 8985. I'll try getting a thread dump if it happens again! So far it's looking good with having allocated more memory to it.

On 04.08.2011 16:08, Yonik Seeley wrote: On Thu, Aug 4, 2011 at 8:09 AM, alexander sulz a.s...@digiconcept.net wrote: Thank you for the many replies! Like I said, I couldn't find anything in the logs created by solr. I just had a look at /var/logs/messages and there wasn't anything either. What I mean by crash is that the process is still there and http GET pings would return 200, but when I try visiting /solr/admin I'd get a blank page! The server ignores any incoming updates or commits,
ignores means what? The request hangs? If so, could you get a thread dump? Do queries work (like /solr/select?q=*:*)?
though throwing no errors, no 503's.. It's like the server has a blackout and stares blankly into space.
Are you using a different servlet container than what is shipped with solr? If you did start with the solr example server, what jetty configuration changes have you made? -Yonik http://www.lucidimagination.com

--
Bernd Fehling, Dipl.-Inform. (FH) | Universitätsbibliothek Bielefeld
Universitätsstr. 25, 33615 Bielefeld | Tel. +49 521 106-4060 | Fax. +49 521 106-4052 | bernd.fehl...@uni-bielefeld.de
BASE - Bielefeld Academic Search Engine - www.base-search.net
sorting issue with solr 3.3
It turned out that there is a sorting issue with solr 3.3. As far as I could trace it down so far: 4 docs in the index and a search for *:*, sorting on field dccreator_sort in descending order:

http://localhost:8983/solr/select?fsv=true&sort=dccreator_sort%20desc&indent=on&version=2.2&q=*%3A*&start=0&rows=10&fl=dccreator_sort

result is:
<lst name="sort_values">
  <arr name="dccreator_sort">
    <str>convertitovistitutonazionaled</str>
    <str>莊國鴻chuangkuohung</str>
    <str>zyywwwxxx</str>
    <str>abdelhadiyasserabdelfattah</str>
  </arr>
</lst>

fieldType:
<fieldType name="alphaOnlySortLim" class="solr.TextField" sortMissingLast="true" omitNorms="true">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.TrimFilterFactory" />
    <filter class="solr.PatternReplaceFilterFactory" pattern="([\x20-\x2F\x3A-\x40\x5B-\x60\x7B-\x7E])" replacement="" replace="all"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="(.{1,30})(.{31,})" replacement="$1" replace="all"/>
  </analyzer>
</fieldType>

field:
<field name="dccreator_sort" type="alphaOnlySortLim" indexed="true" stored="true" />

According to the documentation the sorting is UTF8, but _why_ is the first string at position 1 and _not_ at position 3 as it should be? Following the sorting through the code is somewhat difficult. Any hint where to look or where to start debugging? Regards Bernd
Re: sorting issue with solr 3.3
The issue was located in a 31 million docs index and I have already reduced it to a reproducible 4 document index. It is stock solr 3.3.0. Yes, the documents are also in the wrong order, not just the field sort values; I only added the field sort values to the email to keep it short. I will produce a test on Monday when I'm back in my office. Hang on... Regards Bernd http://www.base-search.net/

I've checked in an improved TestSort that adds deleted docs and randomizes things a lot more (and fixes the previous reliance on doc ids not being reordered). I still can't reproduce this error though. Is this stock solr? Can you verify that the documents are in the wrong order also (and not just the field sort values)? -Yonik http://www.lucidimagination.com
Re: sorting issue with solr 3.3
I have created an issue with a test attached: https://issues.apache.org/jira/browse/SOLR-2713 Will try to figure out what's going wrong. Regards Bernd http://www.base-search.net/

On 13.08.2011 16:20, Bernd Fehling wrote: The issue was located in a 31 million docs index and I have already reduced it to a reproducible 4 document index. [...]
commit to jira and change Status and Resolution
Hi list, I have fixed an issue and created a patch (SOLR-2726) but how to change Status and Resolution in jira? And how to commit this, any idea? Regards, Bernd
Re: Unable to generate trace
How about using jmap or jvisualvm? Or even connecting with eclipse to the process for live analysis? Am 08.09.2011 11:07, schrieb Rohit: Nope not getting anything here also. Regards, Rohit -Original Message- From: Jerry Li [mailto:zongjie...@gmail.com] Sent: 08 September 2011 08:09 To: solr-user@lucene.apache.org Subject: Re: Unable to generate trace what about kill -3 PID command? On Thu, Sep 8, 2011 at 4:06 PM, Rohitro...@in-rev.com wrote: Hi, I am running solr in tomcat on a linux machine, my solr hangs after about 40 hrs, I wanted to generate the dump and analyse the logs. But the command kill -QUIT PID doesn't seem to be doing anything. How can I generate a dump otherwise to see, why solr hangs?
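If kill -QUIT writes nothing, a thread dump can usually still be taken with jstack (part of the JDK):

  jstack <pid> > /tmp/solr-threads.txt
  jstack -F <pid>

The -F variant forces a dump when the JVM no longer responds to the normal signal, which sounds like this case.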
skipping parts of query analysis for some queries
I'm in the need of skipping some query analysis steps for some queries. Or more precisely, making it switchable with a query parameter. Use case:

<fieldType name="text_spec" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-FoldToASCII.txt"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0" splitOnNumerics="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-FoldToASCII.txt"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0" splitOnNumerics="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="3" outputUnigrams="false" outputUnigramsIfNoShingles="true"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" format="solr" tokenizerFactory="solr.KeywordTokenizerFactory" ignoreCase="true" expand="true"/>
  </analyzer>
</fieldType>

For some queries I want to skip the SynonymFilterFactory, with or without the ShingleFilterFactory. First I thought of a second field with a separate fieldType, but why stuff content twice into the index? So I had the idea to make things switchable with a query parameter. E.g. the SynonymFilterFactory class would get two optional attributes: querycontrol="true/false" (default=false) and queryparam="sff" (default=sff). With query ...&sff=true... it will use the SynonymFilterFactory, with query ...&sff=false... it will do nothing in the SynonymFilterFactory. Easy to implement, but this is only for the SynonymFilterFactory. What if I want to switch off other filters with my query? Should I patch all FilterFactories? Next idea: how about modifying the analyzer?

<analyzer type="query">
  <charFilter .../>
  <tokenizer .../>
  <filter .../>
  <optional switch="foo">
    <filter .../>
    <filter .../>
  </optional>
</analyzer>

Now with query ...&foo=true... it will use the filters enclosed by the optional tag, with query ...&foo=false... they are skipped. Advantages:
- more flexibility
- no need to index content twice or more times if only changes in the query analysis make the difference
Any opinions? Regards, Bernd
accessing the query string from inside TokenFilter
Dear list, while writing some TokenFilter for my analyzer chain I need access to the query string from inside my TokenFilter for some comparisons, but the filters work on a TokenStream and get separate tokens. So far I couldn't get any access to the query string. Any idea how to get this done? Is there an attribute for the query or qstr? Regards Bernd
Report about Solr and multilingual Thesaurus
Dear list, just in case you are planning to integrate or combine a thesaurus with Solr the following report might help you. BASE - Solr and the multilingual EuroVoc Thesaurus http://www.ub.uni-bielefeld.de/~befehl/base/solr/eurovoc.html In brief: It explains how a working solution is possible to integrate/combine the multilingual EuroVoc Thesaurus with Solr. It is used as query time search term expansion. Covering over 22 languages this gives you the ability to also find documents in other languages than the original query and also expand the query with synonyms. Best regards Bernd
Re: cache monitoring tools?
Hi Otis, I can't find the download for the free SPM. What hardware and OS do I need for installing SPM to monitor my servers? Regards Bernd

On 07.12.2011 18:47, Otis Gospodnetic wrote: Hi Dmitry, You should use SPM for Solr - it exposes all Solr metrics and more (JVM, system info, etc.) PLUS it's currently 100% free. http://sematext.com/spm/solr-performance-monitoring/index.html We use it with our clients on a regular basis and it helps us a TON - we just helped a very popular mobile app company improve Solr performance by a few orders of magnitude (including filter tuning) with the help of SPM. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/

From: Dmitry Kan dmitry@gmail.com To: solr-user@lucene.apache.org Sent: Wednesday, December 7, 2011 2:13 AM Subject: cache monitoring tools?

Hello list, We've noticed quite a huge strain on the filterCache in facet queries against trigram fields (see the schema at the end of this e-mail). The typical query contains some keywords in the q parameter and a boolean filter query on other solr fields. It is also a facet query, the facet field is of type shingle_text_trigram (see schema) and facet.limit=50. Questions: are there some tools (except for solrmeter) and/or approaches to monitor / profile the load on caches, which would help to derive better tuning parameters? Can you recommend checking config parameters of other components besides caches? BTW, this has become much faster compared to solr 1.4, where we had to do a lot of optimizations on the schema level (e.g. by making a number of stored fields non-stored). Here are the relevant stats from admin (SOLR 3.4):

description: Concurrent LRU Cache(maxSize=1, initialSize=10, minSize=9000, acceptableSize=9500, cleanupThread=false)
stats:
lookups : 93
hits : 90
hitratio : 0.96
inserts : 1
evictions : 0
size : 1
warmupTime : 0
cumulative_lookups : 93
cumulative_hits : 90
cumulative_hitratio : 0.96
cumulative_inserts : 1
cumulative_evictions : 0
item_shingleContent_trigram : {field=shingleContent_trigram,memSize=326924381,tindexSize=4765394,time=222924,phase1=221106,nTerms=14827061,bigTerms=35,termInstances=114359167,uses=91}

name: filterCache
class: org.apache.solr.search.FastLRUCache
version: 1.0
description: Concurrent LRU Cache(maxSize=512, initialSize=512, minSize=460, acceptableSize=486, cleanupThread=false)
stats:
lookups : 1003486
hits : 2809
hitratio : 0.00
inserts : 1000694
evictions : 1000221
size : 473
warmupTime : 0
cumulative_lookups : 1003486
cumulative_hits : 2809
cumulative_hitratio : 0.00
cumulative_inserts : 1000694
cumulative_evictions : 1000221

schema excerpt:
<fieldType name="shingle_text_trigram" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="3" outputUnigrams="true"/>
  </analyzer>
</fieldType>

-- Regards, Dmitry Kan
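Independent of the monitoring tool, the quoted filterCache stats (hitratio 0.00, evictions almost equal to inserts at size 512) suggest the cache is far too small for this kind of faceting. Two things worth trying, as a sketch against the stats above rather than tested advice: enlarge the cache in solrconfig.xml,

<filterCache class="solr.FastLRUCache" size="16384" initialSize="4096" autowarmCount="512"/>

and check which facet.method the requests end up using. facet.method=enum creates one filterCache entry per term, which cannot work well for a trigram field with ~14.8 million terms; facet.method=fc avoids the filterCache for this.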
KStemmer for Solr
Because I'm using solr from trunk and not from Lucid Imagination I was missing the KStemmer. So I decided to add this stemmer to my installation. After some modifications KStemmer is now working fine stand-alone. Now I have a KStemmerFilter. Next will be writing the KStemmerFilterFactory. I would place the Factory in lucene-solr/solr/src/java/org/apache/solr/analysis/ next to the other Factories, but where to place the Filter? Does it make sense to place the Filter somewhere under lucene-solr/modules/analysis/common/src/java/org/apache/lucene/analysis/ ? But this is for Lucene and not Solr... Or should I place the Filter in a subdirectory of the Factories? Any suggestions for me? Regards, Bernd
DIH delta-import question
Dear list, I'm trying a delta-import with datasource FileDataSource and processor FileListEntityProcessor. I want to load only files which are newer than the last_index_time from dataimport.properties. It looks like newerThan=${dataimport.last_index_time} has no effect at all. Can it be that newerThan is configured on the FileListEntityProcessor but applied to the next entity processor in line, and not to the FileListEntityProcessor itself? In my case that is the XPathEntityProcessor, which doesn't support newerThan. Version is solr 4.0 from trunk. Regards, Bernd
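For reference, the relevant part of a data-config.xml for this setup might look like the following sketch (paths, forEach and field mappings are made up; only newerThan matters here):

<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8"/>
  <document>
    <entity name="files" processor="FileListEntityProcessor" dataSource="null"
            baseDir="/path/to/xml" fileName=".*\.xml" recursive="true"
            newerThan="${dataimport.last_index_time}" rootEntity="false">
      <entity name="doc" processor="XPathEntityProcessor"
              url="${files.fileAbsolutePath}" forEach="/record">
        <field column="title" xpath="/record/title"/>
      </entity>
    </entity>
  </document>
</dataConfig>

newerThan is documented as an attribute of FileListEntityProcessor itself (the wiki examples use a quoted date expression like newerThan="'NOW-3DAYS'"), so if the ${dataimport.last_index_time} variable is not resolved there in trunk, that points to a bug in the variable resolution rather than in the config.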
Re: How to use polish stemmer - Stempel - in schema.xml?
Hi Jakub, I have ported the KStemmer for use in the most recent Solr trunk version. My stemmer is located in the lib directory of Solr (solr/lib/KStemmer-2.00.jar) because it belongs to Solr. Write it as a FilterFactory and use it as a filter like:

<filter class="de.ubbielefeld.solr.analysis.KStemFilterFactory" protected="protwords.txt" />

This is how my fieldType looks:

<fieldType name="text_kstem" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="false" />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="de.ubbielefeld.solr.analysis.KStemFilterFactory" protected="protwords.txt" />
    <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="de.ubbielefeld.solr.analysis.KStemFilterFactory" protected="protwords.txt" />
    <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
  </analyzer>
</fieldType>

Regards, Bernd

On 28.10.2010 14:56, Jakub Godawa wrote: Hi! There is a polish stemmer http://www.getopt.org/stempel/ and I have problems connecting it with solr 1.4.1. Questions:
1. Where EXACTLY do I put the stempel-1.0.jar file?
2. How do I register the file, so I can build a fieldType like:
<fieldType name="text_pl" class="solr.TextField">
  <analyzer class="org.geoopt.solr.analysis.StempelTokenFilterFactory"/>
</fieldType>
3. Is that the right approach to make it work?
Thanks for a verbose explanation, Jakub.
Re: How to use polish stemmer - Stempel - in schema.xml?
Hi Jakub, if you unzip your stempel-1.0.jar, do you have the required directory structure and file in there? org/getopt/stempel/lucene/StempelFilter.class Regards, Bernd

On 02.11.2010 13:54, Jakub Godawa wrote: Erick, I've put the jar files like that before. I also added the directive and put the file in instanceDir/lib. What is still a problem is that even though the files are loaded:
2010-11-02 13:20:48 org.apache.solr.core.SolrResourceLoader replaceClassLoader INFO: Adding 'file:/home/jgodawa/apache-solr-1.4.1/ifaq/lib/stempel-1.0.jar' to classloader
I am not able to use the FilterFactory... maybe I am attempting it in a wrong way? Cheers, Jakub Godawa.

2010/11/2 Erick Erickson erickerick...@gmail.com: The Polish stemmer jar file needs to be findable by Solr; if you copy it to solr_home/lib and restart solr you should be set. Alternatively, you can add another lib directive to the solrconfig.xml file (there are several examples in that file already). I'm a little confused about not being able to find TokenFilter, is that still a problem? HTH Erick

On Tue, Nov 2, 2010 at 8:07 AM, Jakub Godawa jakub.god...@gmail.com wrote: Thank you Bernd! I couldn't make it run though. Here is my problem:
1. There is a file ~/apache-solr-1.4.1/ifaq/lib/stempel-1.0.jar
2. In ~/apache-solr-1.4.1/ifaq/solr/conf/solrconfig.xml there is a directive:
<lib path="../lib/stempel-1.0.jar" />
3. In ~/apache-solr-1.4.1/ifaq/solr/conf/schema.xml there is a fieldType:
(...)
<!-- Polish -->
<fieldType name="text_pl" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="org.getopt.stempel.lucene.StempelFilter" />
    <!-- <filter class="org.getopt.solr.analysis.StempelTokenFilterFactory" protected="protwords.txt" /> -->
  </analyzer>
</fieldType>
(...)
4. The jar file is loaded, but I got an error:
SEVERE: Could not start SOLR. Check solr/home property java.lang.NoClassDefFoundError: org/apache/lucene/analysis/TokenFilter at java.lang.ClassLoader.defineClass1(Native Method) at java.lang.ClassLoader.defineClass(ClassLoader.java:634) at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142) (...)
5. A different class gave me that one:
SEVERE: org.apache.solr.common.SolrException: Error loading class 'org.getopt.solr.analysis.StempelTokenFilterFactory' at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:375) at org.apache.solr.core.SolrResourceLoader.newInstance(SolrResourceLoader.java:390) (...)
Question is: How to make <fieldType/> and <filter/> work with that Stempel? :) Cheers, Jakub Godawa.

2010/10/29 Bernd Fehling bernd.fehl...@uni-bielefeld.de: Hi Jakub, I have ported the KStemmer for use in the most recent Solr trunk version. My stemmer is located in the lib directory of Solr (solr/lib/KStemmer-2.00.jar) because it belongs to Solr.
[... rest of Bernd's earlier KStemmer mail quoted in full, trimmed ...]
Re: How to use polish stemmer - Stempel - in schema.xml?
So you call org.getopt.solr.analysis.StempelTokenFilterFactory. In this case I would assume a file StempelTokenFilterFactory.class in your directory org/getopt/solr/analysis/. And a class which extends BaseTokenFilterFactory, right?

...
public class StempelTokenFilterFactory extends BaseTokenFilterFactory implements ResourceLoaderAware {
...

On 02.11.2010 14:20, Jakub Godawa wrote: This is what stempel-1.0.jar consists of after jar -xf:
jgod...@ubuntu:~/apache-solr-1.4.1/ifaq/lib$ ls -R org/
org/: egothor getopt
org/egothor: stemmer
org/egothor/stemmer: Cell.class Diff.class Gener.class MultiTrie2.class Optimizer2.class Reduce.class Row.class TestAll.class TestLoad.class Trie$StrEnum.class Compile.class DiffIt.class Lift.class MultiTrie.class Optimizer.class Reduce$Remap.class Stock.class Test.class Trie.class
org/getopt: stempel
org/getopt/stempel: Benchmark.class lucene Stemmer.class
org/getopt/stempel/lucene: StempelAnalyzer.class StempelFilter.class
jgod...@ubuntu:~/apache-solr-1.4.1/ifaq/lib$ ls -R META-INF/
META-INF/: MANIFEST.MF
jgod...@ubuntu:~/apache-solr-1.4.1/ifaq/lib$ ls -R res
res: tables
res/tables: readme.txt stemmer_1000.out stemmer_100.out stemmer_2000.out stemmer_200.out stemmer_500.out stemmer_700.out

2010/11/2 Bernd Fehling bernd.fehl...@uni-bielefeld.de: Hi Jakub, if you unzip your stempel-1.0.jar, do you have the required directory structure and file in there? org/getopt/stempel/lucene/StempelFilter.class [... rest of the quote chain trimmed ...]
result of filtered field not indexed
Dear list, solr/lucene has a strange problem. I'm currently using apache-solr-4.0-2010-10-12_08-05-48. I have written a MessageDigest filter for fields which generally works. Part of my schema.xml:

...
<fieldType name="text_md" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory" />
    <filter class="de.ubbielefeld.solr.analysis.TextMessageDigestFilterFactory" mdAlgorithm="MD5" />
  </analyzer>
</fieldType>
...
<!-- UNIQUE ID -->
<field name="id" type="string" indexed="true" stored="true" required="true" />
...
<field name="docid" type="text_md" indexed="true" stored="true" omitNorms="true" />
...
<copyField source="id" dest="docid" />
...

I have a field type text_md which uses the KeywordTokenizerFactory and then my TextMessageDigestFilterFactory. As an example I take the MD5 of id and store it in docid. The Field Analysis page runs fine:

Index Analyzer
org.apache.solr.analysis.KeywordTokenizerFactory {luceneMatchVersion=LUCENE_40}
term position: 1
term text: foo
term type: word
source start,end: 0,3

de.ubbielefeld.solr.analysis.TextMessageDigestFilterFactory {mdAlgorithm=MD5, luceneMatchVersion=LUCENE_40}
term position: 1
term text: acbd18db4cc2f85cedef654fccc4a4d8
term type: word
source start,end: 0,3

The problem is that while loading via DIH the debugger shows that the TextMessageDigestFilterFactory is called and running without problems, and the result of my filter is properly returned, but somehow the result never reaches the IndexWriter and gets stored to the index. Any idea where to look? Maybe a class at a higher level doesn't recognize the change? The above source start,end still has 0,3 even after the term text has changed from foo to the MD5 string. Should it then be 0,32? Regards Bernd
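For context, a minimal sketch of what such a digest filter can look like - this is not the author's actual class, just the idea, with the algorithm hard-coded:

import java.io.IOException;
import java.nio.charset.Charset;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public final class Md5DigestFilter extends TokenFilter {

  private static final Charset UTF8 = Charset.forName("UTF-8");
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final MessageDigest md;

  public Md5DigestFilter(TokenStream in) {
    super(in);
    try {
      md = MessageDigest.getInstance("MD5");
    } catch (NoSuchAlgorithmException e) {
      throw new RuntimeException(e);  // MD5 is always available in the JRE
    }
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    // digest the current term text and replace it with the hex string
    byte[] hash = md.digest(new String(termAtt.buffer(), 0, termAtt.length()).getBytes(UTF8));
    StringBuilder hex = new StringBuilder(hash.length * 2);
    for (byte b : hash) {
      hex.append(String.format("%02x", b));
    }
    termAtt.setEmpty().append(hex);
    return true;
  }
}

The offset attribute is left alone in this sketch: offsets normally refer to positions in the original input text rather than to the produced term, which is why the analysis page still shows 0,3 - whether it should be changed to 0,32 is exactly the question raised above.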
Re: result of filtered field not indexed
Hi Rita, thanks for the advice, one problem solved. The source start,end is now set to the correct value by the filter. After further debugging it looks like this is a bug in the Lucene indexer. I wonder that no one ever noticed this... Kind regards, Bernd

On 23.11.2010 09:07, Bernd Fehling wrote: Dear list, solr/lucene has a strange problem. I'm currently using apache-solr-4.0-2010-10-12_08-05-48. I have written a MessageDigest filter for fields which generally works. [...]
Re: question about Solr SignatureUpdateProcessorFactory
Dear list, another suggestion about SignatureUpdateProcessorFactory. Why can I make signatures of several fields and place the result in one field, but _not_ make a signature of one field and place the result in several fields? Could this be realized without huge programming? Best regards, Bernd

On 29.11.2010 14:30, Bernd Fehling wrote: Dear list, a question about the Solr SignatureUpdateProcessorFactory:

for (String field : sigFields) {
  SolrInputField f = doc.getField(field);
  if (f != null) {
*   sig.add(field);
    Object o = f.getValue();
    if (o instanceof String) {
      sig.add((String)o);
    } else if (o instanceof Collection) {
      for (Object oo : (Collection)o) {
        if (oo instanceof String) {
          sig.add((String)oo);
        }
      }
    }
  }
}

Why is the field name (* above) also added to the signature and not only the content of the field? By purpose or by accident? I would like to suggest removing the field name from the signature and not mixing it up. Best regards, Bernd
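The processor is wired up in solrconfig.xml roughly like this (a sketch: the signatureField/fields values just mirror the earlier mails, the chain name is arbitrary):

<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">docid</str>
    <bool name="overwriteDupes">false</bool>
    <str name="fields">id</str>
    <str name="signatureClass">solr.processor.MD5Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

With a single entry in fields this is exactly the one-field-to-one-field case; fanning the result out to several fields is then a copyField job, as the follow-up points out.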
Re: question about Solr SignatureUpdateProcessorFactory
On 29.11.2010 14:55, Markus Jelsma wrote:
On Monday 29 November 2010 14:51:33 Bernd Fehling wrote: Dear list, another suggestion about SignatureUpdateProcessorFactory. Why can I make signatures of several fields and place the result in one field but _not_ make a signature of one field and place the result in several fields?
Use copyField

Ooooh yes, you are right.

Could it be realized without huge programming? Best regards, Bernd [... earlier mail with the code snippet trimmed, see above ...]
Re: question about Solr SignatureUpdateProcessorFactory
As mentioned, in the typical case it's important that the field names be included in the signature, but I imagine there would be cases where you wouldn't want them included (like a simple concat Signature for building basic composite keys). I think the Signature API could definitely be enhanced to have additional methods for adding field names vs adding field values. Wanna open an issue in Jira with some suggestions and use cases? -Hoss

Done. Issue SOLR-2258 with SOLR-2258.patch as a suggestion. Best regards, Bernd
Re: Creating Email Token Filter
Am 30.11.2010 10:56, schrieb Greg Smith:
Hi, I have written a plugin to filter on email types and keep those tokens. When I run it in the analysis page of the admin it all works fine, but when I use the data import handler to import the data and set the field type, it doesn't remove the other tokens and keeps the field in its original form. I have set the query and index analyzers to use the standard tokenizer factory and my custom email filter only. What could be causing this issue?

This sounds like the misunderstanding I had until the end of last week about indexing and storing in Solr/Lucene. I also had several tokenizers and filters and thought they only worked in the admin analysis page. As a matter of fact, if they work in the admin analysis page then they work :-) But you can't see it on the search result page, because the search result page always displays the original stored value, _not_ the tokenized or filtered indexed value. The tokenized/filtered content is what gets indexed, and that is not shown on the result page. Check with the Schema Browser in the admin what the indexed content of your tokenized/filtered field is.
Best regards Bernd
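As a quick cross-check beyond the Schema Browser: faceting on a field enumerates its indexed terms, so a query like the following shows what the tokenizer/filter chain actually produced rather than the stored value (a sketch; the field name email is hypothetical):

  http://localhost:8983/solr/select?q=*:*&rows=0&facet=true&facet.field=email

Each facet value in the response is an indexed term, so if your email filter works, only the kept email tokens should appear here.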
Re: Dataimport performance
We are currently running Solr 4.x from trunk with -d64 -Xms10240M -Xmx10240M.

  Total Rows Fetched: 24935988
  Total Documents Skipped: 0
  Total Documents Processed: 24568997
  Time Taken: 5:55:19.104

24.5 million docs as XML from the filesystem in less than 6 hours. Maybe your MySQL is the bottleneck?
Regards Bernd

Am 15.12.2010 14:40, schrieb Robert Gründler:
Hi, we're looking for some comparison benchmarks for importing large tables from a MySQL database (full import). Currently, a full import of ~8 million rows from a MySQL database takes around 3 hours, on a quad-core machine with 16 GB of RAM and a RAID 10 storage setup. Solr is running on an Apache Tomcat instance, where it is the only app. The Tomcat instance has the following memory-related JAVA_OPTS: -Xms4096M -Xmx5120M. The data-config.xml looks like this (only 1 entity):

  <entity name="track"
          query="select t.id as id, t.title as title, l.title as label
                 from track t left join label l on (l.id = t.label_id)
                 where t.deleted = 0"
          transformer="TemplateTransformer">
    <field column="title" name="title_t"/>
    <field column="label" name="label_t"/>
    <field column="id" name="sf_meta_id"/>
    <field column="metaclass" template="Track" name="sf_meta_class"/>
    <field column="metaid" template="${track.id}" name="sf_meta_id"/>
    <field column="uniqueid" template="Track_${track.id}" name="sf_unique_id"/>
    <entity name="artists"
            query="select a.name as artist from artist a
                   left join track_artist ta on (ta.artist_id = a.id)
                   where ta.track_id=${track.id}">
      <field column="artist" name="artists_t"/>
    </entity>
  </entity>

We have the feeling that 3 hours for this import is quite long, given the performance of the server running Solr/MySQL. Are we wrong with that assumption, or do people experience similar import times with this amount of data? Thanks! -robert
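If MySQL is the suspect, one thing worth checking (a sketch, assuming the stock JdbcDataSource; URL and credentials are placeholders) is whether the JDBC driver buffers the whole result set in memory. With the MySQL driver, batchSize="-1" makes DIH stream rows instead:

  <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/mydb" user="solr" password="***"
              batchSize="-1"/>

The nested artists entity also issues one extra query per track row, which for 8 million rows is often the real cost; CachedSqlEntityProcessor, or folding the join into the parent query, can avoid that.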
names of index files
Dear list,
some questions about the names of the index files. With an older Solr 4.x version from trunk my index looks like:

  _2t1.fdt _2t1.fdx _2t1.fnm _2t1.frq _2t1.nrm _2t1.prx _2t1.tii _2t1.tis segments_2 segments.gen

With a most recent version from trunk it looks like:

  _3a9.fdt _3a9.fdx _3a9.fnm _3a9_0.frq _3a9.nrm _3a9_0.prx _3a9_0.tii _3a9_0.tis segments_4 segments.gen

Why is there an _0 in some file names? Is it from Lucene or from Solr, or a fault in my system? Both indexes are optimized. Any idea?
Regards, Bernd
Re: WARNING: re-index all Lucene trunk indices
Because this was also posted to solr-user, and from some earlier experiences with Solr from trunk, I think this is also recommended for Solr users living on trunk, right? So Solr trunk builds directly against Lucene trunk? Bernd

Am 05.01.2011 11:55, schrieb Michael McCandless:
If you are using Lucene's trunk (to be 4.0) builds, read on... I just committed LUCENE-2843, which is a hard break on the index file format. If you are living on Lucene's trunk then you have to remove any previously created indices and re-index after updating. The change cuts over to a more RAM-efficient and faster terms index implementation, using FSTs (finite state transducers) to hold the term index data. Mike
DIH load only selected documents with XPathEntityProcessor
Hello list,
is it possible to load only selected documents with XPathEntityProcessor? While loading docs I want to drop/skip/ignore documents with a missing URL. Example:

  <documents>
    <document>
      <title>first title</title>
      <id>identifier_01</id>
      <link>http://www.foo.com/path/bar.html</link>
    </document>
    <document>
      <title>second title</title>
      <id>identifier_02</id>
      <link></link>
    </document>
  </documents>

The first document should be loaded; the second document should be ignored because it has an empty link (this should also work for a missing link field).
Best regards Bernd
DIH Transformer
Hi list,
currently a Transformer returns the row, but can I skip or drop a row from within the Transformer? If so, what should I return in that case, an empty row?
Regards, Bernd
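For what it's worth, the $skipRow flag that DIH understands can also be set from a custom Java transformer, so no special return value is needed. A sketch, untested; DropEmptyLinkTransformer is a made-up name:

  import java.util.Map;
  import org.apache.solr.handler.dataimport.Context;
  import org.apache.solr.handler.dataimport.Transformer;

  public class DropEmptyLinkTransformer extends Transformer {
    @Override
    public Object transformRow(Map<String, Object> row, Context context) {
      Object link = row.get("link");
      if (link == null || "".equals(link)) {
        // Mark the row so DIH skips it; returning null is also commonly
        // used for this, but the flag is the documented route.
        row.put("$skipRow", "true");
      }
      return row;
    }
  }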
Re: DIH load only selected documents with XPathEntityProcessor
Hi Gora, thanks a lot, very nice solution, works perfectly. I will dig more into ScriptTransformer, it seems to be very powerful. Regards, Bernd

Am 08.01.2011 14:38, schrieb Gora Mohanty:
On Fri, Jan 7, 2011 at 12:30 PM, Bernd Fehling wrote: is it possible to load only selected documents with XPathEntityProcessor? While loading docs I want to drop/skip/ignore documents with missing URL. [...]
You can use a ScriptTransformer, along with $skipRow/$skipDoc. E.g., something like this for your data import configuration file:

  <dataConfig>
    <script><![CDATA[
      function skipRow(row) {
        var link = row.get('link');
        if (link == null || link == '') {
          row.put('$skipRow', 'true');
        }
        return row;
      }
    ]]></script>
    <dataSource type="FileDataSource"/>
    <document>
      <entity name="f" processor="FileListEntityProcessor"
              baseDir="/home/gora/test" fileName=".*xml"
              newerThan="'NOW-3DAYS'" recursive="true"
              rootEntity="false" dataSource="null">
        <entity name="top" processor="XPathEntityProcessor"
                forEach="/documents/document"
                url="${f.fileAbsolutePath}"
                transformer="script:skipRow">
          <field column="link" xpath="/documents/document/link"/>
          <field column="title" xpath="/documents/document/title"/>
          <field column="id" xpath="/documents/document/id"/>
        </entity>
      </entity>
    </document>
  </dataConfig>

Regards, Gora
strange SOLR behavior with required field attribute
Dear list,
while trying different options with DIH and ScriptTransformer I also tried using the required="true" option for a field. I have 3 records:

  <documents>
    <document>
      <title>first title</title>
      <id>identifier_01</id>
      <link>http://www.foo.com/path/bar.html</link>
    </document>
    <document>
      <title>second title</title>
      <id>identifier_02</id>
      <link></link>
    </document>
    <document>
      <title>third title</title>
      <id>identifier_03</id>
    </document>
  </documents>

schema.xml snippet:

  <field name="title" type="string" indexed="true" stored="true"/>
  <field name="id" type="string" indexed="true" stored="true" required="true"/>
  <field name="link" type="string" indexed="true" stored="true" required="true"/>

After loading I have 2 records in the index:

  <str name="title">first title</str>
  <str name="id">identifier_01</str>
  <str name="link">http://www.foo.com/path/bar.html</str>

  <str name="title">second title</str>
  <str name="id">identifier_02</str>
  <str name="link"/>

Sure, I get a SolrException in the logs saying missing required field: link, but that is for the third record, whereas the second record gets loaded even though link is empty. So I guess this is a feature of Solr? And the required attribute means the presence of the tag and not the presence of content for the tag, right?
Regards Bernd
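One way to make required="true" catch the empty-string case as well (a sketch along the lines of the skipRow function above, hypothetical and untested) is to remove empty fields from the row in a ScriptTransformer, so that the second record arrives with no link field at all and trips the required check:

  <script><![CDATA[
    function dropEmptyLink(row) {
      var link = row.get('link');
      if (link != null && link == '') {
        row.remove('link');  // now required="true" sees a missing field
      }
      return row;
    }
  ]]></script>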
Re: strange SOLR behavior with required field attribute
Hi Koji, I'm using apache-solr-4.0-2010-11-24_09-25-17 from trunk. A grep for SOLR-1973 in CHANGES.txt says that it should have been fixed. Strange... Regards, Bernd

Am 10.01.2011 16:14, schrieb Koji Sekiguchi:
(11/01/10 23:26), Bernd Fehling wrote: [the required="true" question quoted above] [...]
Bernd, seems like the same problem as SOLR-1973 that I've recently fixed in trunk and 3x, but I'm not sure. Which version are you using? Can you try trunk or 3x? If you still get the same error with trunk/3x, please open a jira issue. Koji
LukeRequestHandler histogram?
Dear list, what is the LukeRequestHandler histogram telling me? Couldn't find any explanation and would be pleased to have it explained. Many thanks in advance, Bernd
Re: LukeRequestHandler histogram?
Hi Stefan, thanks a lot. Regards, Bernd Am 14.01.2011 15:25, schrieb Stefan Matheis: Hi Bernd, there is an explanation from Hoss: http://search.lucidimagination.com/search/document/149e7d25415c0a36/some_kind_of_crazy_histogram#b22563120f1ec32b HTH Stefan On Fri, Jan 14, 2011 at 3:15 PM, Bernd Fehling bernd.fehl...@uni-bielefeld.de wrote: Dear list, what is the LukeRequestHandler histogram telling me? Couldn't find any explanation and would be pleased to have it explained. Many thanks in advance, Bernd
Re: DIH with full-import and cleaning still keeps old index
Looks like this is a bug and I should write a jira issue for it? Regards Bernd

Am 20.01.2011 11:30, schrieb Bernd Fehling:
Hi list, after sending full-import=true&clean=true&commit=true, Solr 4.x (apache-solr-4.0-2010-11-24_09-25-17) responds with:

  - DataImporter doFullImport
  - DirectUpdateHandler2 deleteAll
  ...
  - DocBuilder finish
  - SolrDeletionPolicy.onCommit: commits:num=2
  - SolrDeletionPolicy updateCommits
  - SolrIndexSearcher init
  - INFO: end_commit_flush
  - SolrIndexSearcher warm
  ...
  - QuerySenderListener newSearcher
  - SolrCore registerSearcher
  - SolrIndexSearcher close
  ...

This all looks good to me, but why is the old index not deleted? Am I missing a parameter? Regards, Bernd
Re: DIH with full-import and cleaning still keeps old index
Is there a difference between sending optimize=true with the full-import command and sending optimize=true as a separate command after the full-import has finished? Regards, Bernd

Am 23.01.2011 02:18, schrieb Espen Amble Kolstad:
You're not doing an optimize; I think optimize would delete your old index. Try it out with the additional parameter optimize=true - Espen
On Thu, Jan 20, 2011 at 11:30 AM, Bernd Fehling wrote: [the full-import=true&clean=true&commit=true log quoted above] [...]
Re: DIH with full-import and cleaning still keeps old index
I sent commit=true&optimize=true as a separate command but nothing happened. Will try with the additional options waitFlush=false&waitSearcher=false&expungeDeletes=true. I wonder why the DIH admin GUI (debug.jsp) is not sending optimize=true together with full-import? Regards, Bernd

Am 24.01.2011 08:12, schrieb Espen Amble Kolstad:
I think optimize only ever gets done when either a full-import or delta-import is done. You could optimize the normal way though, see: http://wiki.apache.org/solr/UpdateXmlMessages - Espen
[earlier messages of this thread quoted above] [...]
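The "normal way" from the wiki page Espen links is to POST an optimize message to the update handler; something like this should work (a sketch; host, port and core path are placeholders for your setup):

  curl 'http://localhost:8983/solr/update' -H 'Content-Type: text/xml' \
       --data-binary '<optimize waitFlush="false" waitSearcher="false"/>'

An optimize merges all segments into one, and only then can the files of the old, superseded segments actually be removed from disk, which would explain why the old index files linger after a plain clean+commit.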
solr admin result page error
Dear list,
after loading some documents via DIH which also include URLs I get this yellow XML error page as search result from the Solr admin GUI after a search. It says XML processing error: not well-formed. The code it argues about is:

  <arr name="dcurls">
    <str>http://eprints.soton.ac.uk/43350/</str>
    <str>http://dx.doi.org/doi:10.1112/S0024610706023143</str>
    <str>Martinez-Perez, Conchita and Nucinkis, Brita E.A. (2006) Cohomological dimension of
         Mackey functors for infinite groups. Journal of the London Mathematical Society, 74,
         (2), 379-396. (doi:10.1112/S0024610706023143
         &lt;http://dx.doi.org/10.1112/S002461070602314\u&gt;)</str>
  </arr>

See the \u utf8-code in the last line.
1. the loaded data is valid, well-formed and checked with xmllint. No errors.
2. there is no \u utf8-code in the source data.
3. the data is loaded via DIH without any errors.
4. when opening the source view of the result page with Firefox there is also no \u utf8-code.
The only idea I have is Solr itself or the result page generation. How to proceed, what else to check?
Regards, Bernd
Re: solr admin result page error
Results so far: I could locate and isolate the document causing the trouble. I've checked the document with xmllint again; it is valid, well-formed UTF-8. I've loaded the single document and get the XML error when displaying the search result. This happens through the Solr admin search and also the JSON interface, probably other interfaces as well. The next step is to use the debugger and see what goes wrong. One thing I can already say: the problem is caused by the UTF-8 sequence F0 9D 94 90 (U+1D510, Mathematical Fraktur Capital M). Any known issues about that? Regards, Bernd

Am 11.02.2011 08:59, schrieb Bernd Fehling: [the XML processing error question quoted above] [...]
Re: solr admin result page error
Hi Markus, yes it looks like the same issue. There is also a \u utf8-code in your dump. So far I have followed it into the XMLResponseWriter; a few steps earlier the result in a buffer looks good and the utf8-code is correct. Really hard to debug this freaky problem. Have you looked deeper into this and located the bug? It is definitely a bug and has nothing to do with Firefox. Regards, Bernd

Am 11.02.2011 13:48, schrieb Markus Jelsma:
It looks like you hit the same issue as I did a while ago: http://www.mail-archive.com/solr-user@lucene.apache.org/msg46510.html
On Friday 11 February 2011 08:59:27 Bernd Fehling wrote: [the XML processing error question quoted above] [...]
Re: solr admin result page error
Hi Markus, the result of my investigation is that Lucene currently can only handle UTF-8 code within the BMP (Basic Multilingual Plane, plane 0, i.e. up to U+FFFF). Any code point above the BMP might end in unpredictable results, which is bad. If you get invalid UTF-8 from the index and use wt=xml, you get the error page; this is due to the encoding text/xml and charset=utf-8 in the header. If you use wt=json then the encoding is text/plain and charset=utf-8. Because of text/plain you don't get an error page, but the content is nevertheless invalid; I guess it replaces all invalid code with the UTF-8 BOM. So currently no solution, not even with JSON. This should (hopefully) be fixed with Lucene 3.1. Regards, Bernd

Am 11.02.2011 15:50, schrieb Markus Jelsma:
No, I haven't located the issue. It might be Solr but it could also be Xerces having trouble with it. You can possibly work around the problem by using the JSONResponseWriter.
[earlier messages of this thread quoted above] [...]
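To check whether a field value contains characters outside the BMP before feeding it to Solr, something like this can be used (a plain-Java sketch; nothing Solr-specific is assumed):

  public class BmpCheck {
    /** Returns true if s contains a code point above U+FFFF (outside the BMP). */
    static boolean hasSupplementary(String s) {
      for (int i = 0; i < s.length(); ) {
        int cp = s.codePointAt(i);
        if (Character.isSupplementaryCodePoint(cp)) return true;
        i += Character.charCount(cp);
      }
      return false;
    }

    public static void main(String[] args) {
      // U+1D510, Mathematical Fraktur Capital M, the character from this thread
      String m = new String(Character.toChars(0x1D510));
      System.out.println(hasSupplementary(m)); // prints true
    }
  }

Filtering or replacing such code points on the DIH side would at least keep the index clean until the underlying Lucene issue is fixed.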
Content-Type of XMLResponseWriter / QueryResponseWriter
Dear list,
is there any deeper logic behind the fact that the XMLResponseWriter sends CONTENT_TYPE_XML_UTF8 = "application/xml; charset=UTF-8"? I would assume (and so do most browsers) that XML output is delivered as text/xml and not application/xml. Or do you want the browser to launch an XML editor with the result?
Best regards, Bernd
Re: Content-Type of XMLResponseWriter / QueryResponseWriter
Hi Walter, many thanks! Bernd

Am 03.03.2011 17:01, schrieb Walter Underwood:
Never use text/xml; that overrides any encoding declaration inside the XML file. http://ln.hixie.ch/?start=1037398795&count=1 http://www.grauw.nl/blog/entry/489 wunder (Lead Engineer, MarkLogic)
On Mar 3, 2011, at 7:30 AM, Bernd Fehling wrote: [the Content-Type question quoted above] [...]
from multiValued field to non-multiValued field with copyField?
Is there a way to have a kind of casting for copyField? I have author names in a multiValued string field and need sorting on it, but sort on a field only works for multiValued="false". I'm trying to get multiValued content from one field into a non-multiValued text or string field for sorting, and this, if possible, during loading with copyField. Or any other solution? I need this because of patch SOLR-2339, which is now more strict. Maybe anyone else does too. Regards, Bernd
Re: from multiValued field to non-multiValued field with copyField?
Good idea. I was also just looking into this area. Assuming my input record looks like this:

  <documents>
    <document id="foobar">
      <element name="author"><value>author_1 ; author_2 ; author_3</value></element>
    </document>
  </documents>

Do you know if I can use something like this:

  <entity name="records" processor="XPathEntityProcessor" transformer="RegexTransformer" ...>
    ...
    <field column="author" xpath="/documents/document/element[@name='author']/value"/>
    <field column="author_sort" xpath="/documents/document/element[@name='author']/value"/>
    <field column="author" splitBy=" ; "/>
    ...
  </entity>

to just double the input and make author multiValued and author_sort a string field?
Regards Bernd

Am 17.03.2011 15:39, schrieb Gora Mohanty:
On Thu, Mar 17, 2011 at 8:04 PM, Bernd Fehling wrote: [the copyField casting question quoted above] [...]
Not sure about copyField, but you could use a transformer to extract values from a multiValued field and stick them into a single-valued field. Regards, Gora
Re: from multiValued field to non-multiValued field with copyField?
Hi Yonik, actually some applications misused sorting on a multiValued field, like VuFind. And as a matter of fact FAST doesn't support this either, because it doesn't make sense. FAST distinguishes between multiValued and singleValued by just adding the separator field attribute to the field. So I moved this from the FAST index profile to the Solr DIH and placed the separator there. But now I'm looking for a solution for VuFind. The easiest thing would be to have a kind of casting, maybe for copyField. Regards, Bernd

Am 17.03.2011 15:58, schrieb Yonik Seeley:
On Thu, Mar 17, 2011 at 10:34 AM, Bernd Fehling wrote: [the copyField casting question quoted above] [...]
Hmmm, you're the second person that's relied on that (sorting on a multiValued field working). Was SOLR-2339 a mistake? -Yonik http://lucidimagination.com
Re: from multiValued field to non-multiValued field with copyField?
Hi Bill, yes DIH is in use. Thanks, Bernd

Am 17.03.2011 16:09, schrieb Bill Bell:
Do you use the DIH handler? A script can do this easily. Bill Bell, sent from mobile
On Mar 17, 2011, at 9:02 AM, Bernd Fehling wrote: [the XPathEntityProcessor sketch for author/author_sort quoted above] [...]
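The script Bill alludes to could look something like this in the data-config (a sketch, hypothetical and untested; it keeps the multiValued author field and fills author_sort with just the first name so that field stays single-valued):

  <script><![CDATA[
    function firstAuthor(row) {
      var authors = row.get('author');   // e.g. "author_1 ; author_2 ; author_3"
      if (authors != null) {
        // take everything before the first " ; " separator for sorting
        row.put('author_sort', ('' + authors).split(' ; ')[0]);
      }
      return row;
    }
  ]]></script>

with transformer="script:firstAuthor,RegexTransformer" on the entity, so the RegexTransformer split into multiple author values still happens afterwards.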