Re: Is this a bug of the ResourceLoader?
On Mon, Apr 5, 2010 at 2:28 PM, Chris Hostetter hossman_luc...@fucit.org wrote:
> Robert: BOMs are one of those things that strike me as being abhorrent and inherently evil because they seem to cause nothing but problems
Yes.
> If text files that start with a BOM aren't properly being dealt with by Solr right now, should we consider that a bug?
No.
> Is there something we can/should be doing in SolrResourceLoader to make Solr handle this situation better?
Yes, we can ignore them for the first line of the file to be more user-friendly. I'll open an issue.
-- Robert Muir rcm...@gmail.com
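In the spirit of Robert's suggestion, a minimal Java sketch of skipping a UTF-8 BOM at the start of a stream (an illustration only, not the eventual Solr patch):

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.PushbackInputStream;

    public class BomUtil {
        // Returns a stream positioned past a leading UTF-8 BOM (EF BB BF), if any.
        public static InputStream skipUtf8Bom(InputStream in) throws IOException {
            PushbackInputStream pin = new PushbackInputStream(in, 3);
            byte[] head = new byte[3];
            int n = pin.read(head, 0, 3);
            boolean isBom = n == 3
                && head[0] == (byte) 0xEF
                && head[1] == (byte) 0xBB
                && head[2] == (byte) 0xBF;
            if (!isBom && n > 0) {
                pin.unread(head, 0, n); // not a BOM: push the bytes back
            }
            return pin;
        }
    }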
Re: Read Time Out Exception while trying to upload a huge SOLR input xml
Solr also has a feature to stream from a local file rather than over the network. The parameter stream.file=/full/local/file/name.txt means 'read this file from the local disk instead of the POST upload'. Of course, you have to get the entire file onto the Solr indexer machine (or a common file server). http://wiki.apache.org/solr/UpdateRichDocuments#Parameters

On Thu, Apr 1, 2010 at 9:27 PM, Mark Fletcher mark.fletcher2...@gmail.com wrote:
Hi Erick, Shawn, Thank you for your reply. Luckily, just on the second attempt my 13GB SOLR XML (more than a million docs) went into SOLR fine without any problem, and I uploaded another two sets of 1.2 million+ docs without any hassle. I will try smaller XMLs next time, as well as the autocommit suggestion. Best Rgds, Mark.

On Thu, Apr 1, 2010 at 6:18 PM, Shawn Smith sh...@thena.net wrote:
The error might be that your http client doesn't handle really large files (32-bit overflow in the Content-Length header?) or something in your network is killing your long-lived socket. Solr can definitely accept a 13GB xml document. I've uploaded large files into Solr successfully, including recently a 12GB XML input file with ~4 million documents. My Solr instance had 2GB of memory and it took about 2 hours. Solr streamed the XML in nicely. I had to jump through a couple of hoops, but in my case it was easier than writing a tool to split up my 12GB XML file...

1. I tried to use curl to do the upload, but it didn't handle files that large. For my quick and dirty testing, netcat (nc) did the trick--it doesn't buffer the file in memory and it doesn't overflow the Content-Length header. Plus I could pipe the data through pv to get a progress bar and estimated time of completion. Not recommended for production!

    FILE=documents.xml
    SIZE=$(stat --format %s $FILE)
    (echo "POST /solr/update HTTP/1.1
    Host: localhost:8983
    Content-Type: text/xml
    Content-Length: $SIZE
    "; cat $FILE) | pv -s $SIZE | nc localhost 8983

2. Indexing seemed to use less memory if I configured Solr to auto commit periodically in solrconfig.xml. This is what I used:

    <updateHandler class="solr.DirectUpdateHandler2">
      <autoCommit>
        <maxDocs>25000</maxDocs>   <!-- maximum uncommitted docs before autocommit triggered -->
        <maxTime>300000</maxTime>  <!-- 5 minutes, maximum time (in ms) after adding a doc before an autocommit is triggered -->
      </autoCommit>
    </updateHandler>

Shawn

On Thu, Apr 1, 2010 at 10:10 AM, Erick Erickson erickerick...@gmail.com wrote:
Don't do that. For many reasons <G>. By trying to batch so many docs together, you're just *asking* for trouble. Quite apart from whether it'll work once, having *any* HTTP-based protocol work reliably with 13G is fragile... For instance, I don't want to have to know whether the XML parsing in SOLR parses the entire document into memory before processing or not. But I sure don't want my application to change behavior if SOLR changes its mind and wants to process the other way. My perfectly working application (assuming an event-driven parser) could suddenly start requiring over 13G of memory... Oh my aching head! Your specific error might even be dependent upon GCing, which will cause it to break differently, sometimes, maybe... So do break things up and transmit multiple documents. It'll save you a world of hurt. HTH Erick

On Thu, Apr 1, 2010 at 4:34 AM, Mark Fletcher mark.fletcher2...@gmail.com wrote:
Hi, For the first time I tried uploading a huge input SOLR xml having about 1.2 million *docs* (13GB in size).
After some time I get the following exception:

The server encountered an internal error ([was class java.net.SocketTimeoutException] Read timed out
java.lang.RuntimeException: [was class java.net.SocketTimeoutException] Read timed out
    at com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)
    at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731)
    at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3657)
    at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809)
    at org.apache.solr.handler.XMLLoader.readDoc(XMLLoader.java:279)
    at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:138)
    at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:69)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
    at
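For comparison, a hedged sketch of the stream.file approach Lance describes above (host, port, and file path are placeholders; remote streaming must be enabled via enableRemoteStreaming in solrconfig.xml):

    import java.io.InputStream;
    import java.net.URL;

    public class StreamFileUpdate {
        public static void main(String[] args) throws Exception {
            // Solr opens the file from its own disk; nothing is POSTed over the wire.
            URL url = new URL("http://localhost:8983/solr/update"
                + "?stream.file=/full/local/file/name.xml"
                + "&stream.contentType=text/xml;charset=utf-8"
                + "&commit=true");
            InputStream response = url.openStream();
            response.close();
        }
    }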
Re: Unable to load MailEntityProcessor or org.apache.solr.handler.dataimport.MailEntityProcessor
Hi,

Can no-one help me with this?

Andrew

On 2 April 2010 22:24, Andrew McCombe eupe...@gmail.com wrote:
Hi, I am experimenting with Solr to index my gmail and am experiencing an error: 'Unable to load MailEntityProcessor or org.apache.solr.handler.dataimport.MailEntityProcessor'. I downloaded a fresh 1.4 tgz, extracted it and added the following to example/solr/config/solrconfig.xml:

    <requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
      <lst name="defaults">
        <str name="config">/home/andrew/bin/apache-solr-1.5-dev/example/solr/conf/email-data-config.xml</str>
      </lst>
    </requestHandler>

email-data-config.xml contained the following:

    <dataConfig>
      <document name="mailindex">
        <entity processor="MailEntityProcessor" user="eupe...@gmail.com" password="xx" host="imap.gmail.com" protocol="imaps" folders="inbox"/>
      </document>
    </dataConfig>

Whenever I try to import data using /dataimport?command=full-import I am seeing the error below:

    Apr 2, 2010 10:14:51 PM org.apache.solr.handler.dataimport.DataImporter doFullImport
    SEVERE: Full Import failed
    org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to load EntityProcessor implementation for entity:11418758786959 Processing Document # 1
        at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:72)
        at org.apache.solr.handler.dataimport.DocBuilder.getEntityProcessor(DocBuilder.java:805)
        at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:536)
        at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:261)
        at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:185)
        at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:333)
        at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:391)
        at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:372)
    Caused by: java.lang.ClassNotFoundException: Unable to load MailEntityProcessor or org.apache.solr.handler.dataimport.MailEntityProcessor
        at org.apache.solr.handler.dataimport.DocBuilder.loadClass(DocBuilder.java:966)
        at org.apache.solr.handler.dataimport.DocBuilder.getEntityProcessor(DocBuilder.java:802)
        ... 6 more
    Caused by: org.apache.solr.common.SolrException: Error loading class 'MailEntityProcessor'
        at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:373)
        at org.apache.solr.handler.dataimport.DocBuilder.loadClass(DocBuilder.java:956)
        ... 7 more
    Caused by: java.lang.ClassNotFoundException: MailEntityProcessor
        at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
        at java.net.FactoryURLClassLoader.loadClass(URLClassLoader.java:592)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:252)
        at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320)
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:247)
        at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:357)
        ... 8 more
    Apr 2, 2010 10:14:51 PM org.apache.solr.update.DirectUpdateHandler2 rollback
    INFO: start rollback
    Apr 2, 2010 10:14:51 PM org.apache.solr.update.DirectUpdateHandler2 rollback
    INFO: end_rollback

Am I missing a step somewhere? I have tried this with the standard apache 1.4, a nightly of 1.5 and also the LucidWorks release and get the same issue with each. The wiki isn't very detailed either.
My background isn't in Java, so a lot of this is new to me. Regards, Andrew McCombe
Re: Experience with indexing billions of documents?
The 2B limitation is within one shard, due to using a signed 32-bit integer. There is no limit in that regard in sharding: Distributed Search uses the stored unique document id rather than the internal docid.

On Fri, Apr 2, 2010 at 10:31 AM, Rich Cariens richcari...@gmail.com wrote:
A colleague of mine is using native Lucene + some home-grown patches/optimizations to index over 13B small documents in a 32-shard environment, which is around 406M docs per shard. If there's a 2B doc id limitation in Lucene then I assume he's patched it himself.

On Fri, Apr 2, 2010 at 1:17 PM, dar...@ontrenet.com wrote:
My guess is that you will need to take advantage of Solr 1.5's upcoming cloud/cluster renovations and use multiple indexes to comfortably achieve those numbers. Hypothetically, in that case, you won't be limited by the single-index docid limitations of Lucene.

We are currently indexing 5 million books in Solr, scaling up over the next few years to 20 million. However, we are using the entire book as a Solr document. We are evaluating the possibility of indexing individual pages, as there are some use cases where users want the most relevant pages regardless of what book they occur in. However, we estimate that we are talking about somewhere between 1 and 6 billion pages and have concerns over whether Solr will scale to this level. Does anyone have experience using Solr with 1-6 billion Solr documents? The Lucene file format document (http://lucene.apache.org/java/3_0_1/fileformats.html#Limitations) mentions a limit of about 2 billion document ids. I assume this is the Lucene internal document id and would therefore be a per-index/per-shard limit. Is this correct? Tom Burton-West.

-- Lance Norskog goks...@gmail.com
Re: Index db data
It seems to work ;). However, trueman, you should subscribe to solr-user@lucene.apache.org, since not everybody looks up Nabble for mailing-list postings. - Mitch
Re: Solr caches and nearly static indexes
In a word: no.

What you can do instead of deleting them is to add them to a growing list of "don't search for these" documents. This could be listed in a filter query. We had exactly this problem in a consumer app: we had a small but continuously growing list of obscene documents in the index, and did not want to display these. So we had a filter query with all of the obscene words, and used this with every query.

Lance

On Fri, Apr 2, 2010 at 6:34 PM, Shawn Heisey s...@elyograg.org wrote:
My index has a number of shards that are nearly static, each with about 7 million documents. By nearly static, I mean that the only changes that normally happen to them are document deletions, done with the xml update handler. The process that does these deletions runs once every two minutes, and does them with a query on a field other than the one that's used for uniqueKey. Once a day, I will be adding data to these indexes with the DIH delta-import. One of my shards gets all new data once every two minutes, but it is less than 5% the size of the others.

The problem that I'm running into is that every time a delete is committed, my caches are suddenly invalid and I seem to have two options: spend a lot of time and I/O rewarming them, or suffer with slow (3 seconds or longer) search times. Is there any way to have the index keep its caches when the only thing that happens is deletions, then invalidate them when it's time to actually add data? It would have to be something I can dynamically change when switching between deletions and the daily import.

Thanks,
Shawn

-- Lance Norskog goks...@gmail.com
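Along the lines of Lance's suggestion, a minimal SolrJ sketch (server URL, field name, and ids are placeholders): the "deleted" documents stay in the index, so no commit happens and the caches stay warm.

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

    public class SoftDeleteFilter {
        public static void main(String[] args) throws Exception {
            CommonsHttpSolrServer server =
                new CommonsHttpSolrServer("http://localhost:8983/solr");
            SolrQuery q = new SolrQuery("some user query");
            // Hide "deleted" docs at query time instead of deleting them:
            q.addFilterQuery("-id:(123 OR 456 OR 789)");
            System.out.println(server.query(q).getResults().getNumFound());
        }
    }

The caveat, discussed later in this thread, is that the filter string changes every time the list grows, so this trades index-cache stability for filterCache/queryResultCache churn.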
Some help for folks trying to get new Solr/Lucene up in Eclipse
Hey All, Just to save some folks some time in case you are trying to get new Lucene/Solr up and running in Eclipse. If you continue to get weird errors, e.g., in solr/src/test/TestConfig.java regarding org.w3c.dom.Node#getTextContent(), I found that, for me, this error was caused by including Tidy.jar (which includes its own version of the Node API) in the build path. If you take that out, you should be good. Wanted to pass that along. Cheers, Chris ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.mattm...@jpl.nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++
Re: Obtaining SOLR index size on disk
This information is not available via the API. If you would like this information added to the statistics request, please file a JIRA requesting it. Without knowing the size of the index files to be transferred, the client cannot monitor its own disk space. This would be useful for the cloud management features.

On Mon, Apr 5, 2010 at 5:35 AM, Na_D nabam...@zaloni.com wrote:
hi, I am using the piece of code given below:

    ReplicationHandler handler2 = new ReplicationHandler();
    System.out.println(handler2.getDescription());
    NamedList statistics = handler2.getStatistics();
    System.out.println("Statistics " + statistics);

The result that I am getting (i.e., the printed statement) is:

    Statistics {handlerStart=1270469530218,requests=0,errors=0,timeouts=0,totalTime=0,avgTimePerRequest=NaN,avgRequestsPerSecond=NaN}

But the statistics consist of the other info too:

    <entry>
      <class>org.apache.solr.handler.ReplicationHandler</class>
      <version>$Revision: 829682 $</version>
      <description>ReplicationHandler provides replication of index and configuration files from Master to Slaves</description>
      <stats>
        <stat name="handlerStart">1270463612968</stat>
        <stat name="requests">0</stat>
        <stat name="errors">0</stat>
        <stat name="timeouts">0</stat>
        <stat name="totalTime">0</stat>
        <stat name="avgTimePerRequest">NaN</stat>
        <stat name="avgRequestsPerSecond">0.0</stat>
        <stat name="indexSize">19.29 KB</stat>
        <stat name="indexVersion">1266984293131</stat>
        <stat name="generation">3</stat>
        <stat name="indexPath">C:\solr\apache-solr-1.4.0\example\example-DIH\solr\db\data\index</stat>
        <stat name="isMaster">true</stat>
        <stat name="isSlave">false</stat>
        <stat name="confFilesToReplicate">schema.xml,stopwords.txt,elevate.xml</stat>
        <stat name="replicateAfter">[commit, startup]</stat>
        <stat name="replicationEnabled">true</stat>
      </stats>
    </entry>

This is where the problem lies: I need the size of the index. I'm not finding it in the API, nor is the statistics printout (sysout) the same. How do I get the size of the index?

-- Lance Norskog goks...@gmail.com
Re: Minimum Should Match the other way round
Sorry for double-posting, but to avoid any misunderstanding: accessing instantiated filters is not a really good idea, since a new Filter must be instantiated all the time. However, what I meant was: if I create a WordDelimiterFilter or a StopFilter and I have set a param for a file like stopwords.txt or protwords.txt, I want to access those (as I understood, cached) resources. - Mitch
one particular doc in results should always come first for a particular query
Hi, Suppose I search for the word *international*. A particular record (say *recordX*) I am looking for is coming as the Nth result now. I have a requirement that when a user queries for *international* I need recordX to always be the first result. How can I achieve this?

Note: when a user searches with a *different* keyword, *recordX* need not be the expected first result record; it may be a different record that has to be made to come first in the result for that keyword. Is there a way to achieve this requirement? I am using dismax. Thanks in advance. BR, Mark
Re: Unable to load MailEntityProcessor or org.apache.solr.handler.dataimport.MailEntityProcessor
The MailEntityProcessor is an extra and does not come normally with the DataImportHandler. The wiki page should mention this. In the Solr distribution it should be in the dist/ directory as dist/apache-solr-dataimporthandler-extras-1.4.jar. The class it wants is in this jar. (Do 'unzip -l <jar>' to find the classes inside a jar.) You have to make a lib/ directory in the Solr core you are using, and copy this jar into there.

On Mon, Apr 5, 2010 at 1:15 PM, Andrew McCombe eupe...@gmail.com wrote:
Hi, Can no-one help me with this? Andrew

On 2 April 2010 22:24, Andrew McCombe eupe...@gmail.com wrote:
Hi, I am experimenting with Solr to index my gmail and am experiencing an error: 'Unable to load MailEntityProcessor or org.apache.solr.handler.dataimport.MailEntityProcessor'. I downloaded a fresh 1.4 tgz, extracted it and added the following to example/solr/config/solrconfig.xml:

    <requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
      <lst name="defaults">
        <str name="config">/home/andrew/bin/apache-solr-1.5-dev/example/solr/conf/email-data-config.xml</str>
      </lst>
    </requestHandler>

email-data-config.xml contained the following:

    <dataConfig>
      <document name="mailindex">
        <entity processor="MailEntityProcessor" user="eupe...@gmail.com" password="xx" host="imap.gmail.com" protocol="imaps" folders="inbox"/>
      </document>
    </dataConfig>

Whenever I try to import data using /dataimport?command=full-import, the full-import fails with a root cause of:

    Caused by: java.lang.ClassNotFoundException: MailEntityProcessor
        at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:357)
        ... 8 more

[full stack trace snipped; see the original post above] Am I missing a step somewhere? I have tried this with the standard apache 1.4, a nightly of 1.5 and also the LucidWorks release and get the same issue with each. The wiki isn't very detailed either. My background isn't in Java, so a lot of this is new to me. Regards, Andrew McCombe

-- Lance Norskog goks...@gmail.com
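A hedged sketch of the steps Lance describes, assuming the stock 1.4 example-core layout (adjust the paths to your own core):

    cd apache-solr-1.4.0/example/solr
    mkdir lib
    cp ../../dist/apache-solr-dataimporthandler-extras-1.4.jar lib/
    # confirm the class really is inside the jar:
    unzip -l lib/apache-solr-dataimporthandler-extras-1.4.jar | grep MailEntityProcessor

Restart Solr afterwards so the core's classloader picks up the new lib/ directory.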
Re: including external files in config by corename
Making snippets is part of highlighting. http://www.lucidimagination.com/search/s:lucid/li:cdrg?q=snippet On Mon, Apr 5, 2010 at 10:53 AM, Shawn Heisey s...@elyograg.org wrote: Is it possible to access the core name in a config file (such as solrconfig.xml) so I can include core-specific configlets into a common config file? I would like to pull in different configurations for things like shards and replication, but have all the cores otherwise use an identical config file. Also, I have been looking for the syntax to include a snippet and haven't turned anything up yet. Thanks, Shawn -- Lance Norskog goks...@gmail.com
Re: no of cfs files are more that the mergeFactor
mergeFactor=5 means that if there are 42 documents, there can be 5 index files: 1 with 25 documents, 3 with 5 documents, and 1 with 2 documents. Imagine making change with coins of 1 document, 5 documents, 5^2 documents, 5^3 documents, etc.

On Mon, Apr 5, 2010 at 10:59 AM, Chris Hostetter hossman_luc...@fucit.org wrote:
This sounds completely normal from what I remember about mergeFactor. Segments are merged by level, meaning that with a mergeFactor of 5, once 5 level-1 segments are formed they are merged into a single level-2 segment. Then 5 more level-1 segments are allowed to form before the next merge (resulting in 2 level-2 segments). Once you have 5 level-2 segments, they are all merged into a single level-3 segment, etc...

: I had my mergeFactor as 5,
: but when i load a data with some 1,00,000 i got some 12 .cfs files in my
: data/index folder.
:
: How come this is possible.
: in what context we can have more no of .cfs files

-Hoss

-- Lance Norskog goks...@gmail.com
exact match coming as second record
Hi, I am using the dismax handler. I have a field named *myfield* which has a value, say XXX.YYY.ZZZ. I have boosted myfield^20.0. Even with such a high boost (in fact, among the qf fields specified, this field has the max boost given), when I search for XXX.YYY.ZZZ I see my record as the second one in the results, and a record of the form XXX.YYY.ZZZ.AAA.BBB appears as the first one. Can anyone help me understand why this is so, as I thought an exact match on a heavily boosted field would give the exact-match record first in dismax. Thanks and Rgds, Mark
Re: one particular doc in results should always come first for a particular query
Hmmm, how do you know which particular record corresponds to which keyword? Is this a list known at index time, as in "this record should come up first whenever bonkers is the keyword"? If that's the case, you could copy the magic keyword to a different field (say magic_keyword) and boost it right into orbit as an OR clause (magic_keyword:bonkers ^1). This kind of assumes that a magic keyword corresponds to one and only one document.

If this is way off base, perhaps you could characterize how keywords map to specific documents you want at the top.

Best
Erick

P.S. It threw me for a minute when you used asterisks (*) for emphasis; they're easily confused with wildcards.

On Mon, Apr 5, 2010 at 5:30 PM, Mark Fletcher mark.fletcher2...@gmail.com wrote:
Hi, Suppose I search for the word *international*. A particular record (say *recordX*) I am looking for is coming as the Nth result now. I have a requirement that when a user queries for *international* I need recordX to always be the first result. How can I achieve this? Note: when a user searches with a *different* keyword, *recordX* need not be the expected first result record; it may be a different record that has to be made to come first in the result for that keyword. Is there a way to achieve this requirement? I am using dismax. Thanks in advance. BR, Mark
Re: exact match coming as second record
What do you get back when you specify debugQuery=on?

Best
Erick

On Mon, Apr 5, 2010 at 7:31 PM, Mark Fletcher mark.fletcher2...@gmail.com wrote:
Hi, I am using the dismax handler. I have a field named *myfield* which has a value, say XXX.YYY.ZZZ. I have boosted myfield^20.0. Even with such a high boost (in fact, among the qf fields specified, this field has the max boost given), when I search for XXX.YYY.ZZZ I see my record as the second one in the results, and a record of the form XXX.YYY.ZZZ.AAA.BBB appears as the first one. Can anyone help me understand why this is so, as I thought an exact match on a heavily boosted field would give the exact-match record first in dismax. Thanks and Rgds, Mark
Re: one particular doc in results should always come first for a particular query
: If that's the case, you could copy the magic keyword to a different field : (say magic_keyword) and boost it right into orbit as an OR clause : (magic_keyword:bonkers ^1). This kind of assumes that a magic keyword : corresponds to one and only one document : : If this is way off base, perhaps you could characterize how keywords map to : specific documents you want at the top. This smells like... http://wiki.apache.org/solr/QueryElevationComponent -Hoss
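To make Hoss's pointer concrete, a hedged sketch of what the elevate.xml entry for Mark's case could look like (the doc id is whatever recordX's uniqueKey value actually is):

    <elevate>
      <query text="international">
        <doc id="recordX" />
      </query>
    </elevate>

With the QueryElevationComponent enabled in solrconfig.xml, the listed document is pinned to the top for exactly that query text, independent of its score; other queries are unaffected, which matches the "different keyword, different record" requirement.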
Re: Multicore and TermVectors
: Subject: Multicore and TermVectors

It doesn't sound like Multicore is your issue ... it seems like what you mean is that you are using distributed search with TermVectors, and that is causing a problem. Can you please clarify exactly what you mean ... describe your exact setup (ie: how many machines, how many solr ports running on each of those machines, what the solr.xml looks like on each of those ports, how many SolrCores running in each of those ports, what the solrconfig.xml looks like for each of those instances, which instances coordinate distributed searches of which shards, what urls your client hits, what URLs get hit on each of your shards (according to the logs) as a result, etc... details, details, details.

-Hoss
Re: Solr caches and nearly static indexes
: times. Is there any way to have the index keep its caches when the only thing
: that happens is deletions, then invalidate them when it's time to actually add
: data? It would have to be something I can dynamically change when switching
: between deletions and the daily import.

The problem is a delete is a genuine change that invalidates the cache objects. The worst case is the QueryResultCache, where a deleted doc would require shifting all of the other docs up in any result set that it matched on -- even if that doc isn't in the actual DocSlice that's cached (ie: the cached version of results 50-100 is affected by deleting a doc from 1-50).

In theory something like the filterCache could be warmed by copying entries from the old cache and just unsetting the bits corresponding to the deleted docs -- except that I'm pretty sure even if all you do is delete some docs, a MergePolicy *could* decide to merge segments and collapse away the docids of the deleted docs.

-Hoss
Re: Solr caches and nearly static indexes
: We had exactly this problem in a consumer app; we had a small but
: continuously growing list of obscene documents in the index, and did
: not want to display these. So, we had a filter query with all of the
: obscene words, and used this with every query.

That doesn't seem like it would really help with the caching issue ... reusing the FieldCache seems like the only thing that would be advantageous in that case; the filterCache and queryResultCache are going to have a low cache hit rate, as the filter queries involved keep changing as new doc keys get added to the filter query. Or am I completely misunderstanding how you had this working?

-Hoss
Re: Solr caches and nearly static indexes
On Mon, Apr 5, 2010 at 9:04 PM, Chris Hostetter hossman_luc...@fucit.org wrote:
: ... reusing the FieldCache seems like the only thing that would be
: advantageous in that case

And FieldCache entries are currently reused when there have only been deletions on a segment (since Solr 1.4).

-Yonik
http://www.lucidimagination.com
Re: Solr caches and nearly static indexes
: ... reusing the FieldCache seems like the only thing that would be
: advantageous in that case
:
: And FieldCache entries are currently reused when there have only been
: deletions on a segment (since Solr 1.4).

But that's kind of orthogonal to (what I think) Lance's point was: that instead of deleting docs and opening a new searcher, you could instead just add the doc keys to a (negated) filter query (and never open a new searcher at all).

-Hoss
Re: Solr caches and nearly static indexes
On Mon, Apr 5, 2010 at 9:10 PM, Chris Hostetter hossman_luc...@fucit.org wrote:
: ... reusing the FieldCache seems like the only thing that would be
: advantageous in that case
:
: And FieldCache entries are currently reused when there have only been
: deletions on a segment (since Solr 1.4).
:
: But that's kind of orthogonal

Yeah - just coming into the middle and pointing out the FieldCache reuse thing (which is new for 1.4).

: to (what I think) Lance's point was: that instead of deleting docs and
: opening a new searcher, you could instead just add the doc keys to a
: (negated) filter query (and never open a new searcher at all)

I guess as long as you versioned the filter that could work. It would have the effect of invalidating all of the query cache, but wouldn't affect the filter cache.

-Yonik
http://www.lucidimagination.com
Re: exact match coming as second record
Hi Erick,

Thanks many for your mail! Please find attached the debugQuery results.

Thanks!
Mark

On Mon, Apr 5, 2010 at 7:38 PM, Erick Erickson erickerick...@gmail.com wrote:
What do you get back when you specify debugQuery=on? Best, Erick

On Mon, Apr 5, 2010 at 7:31 PM, Mark Fletcher mark.fletcher2...@gmail.com wrote:
Hi, I am using the dismax handler. I have a field named *myfield* which has a value, say XXX.YYY.ZZZ. I have boosted myfield^20.0. Even with such a high boost (in fact, among the qf fields specified, this field has the max boost given), when I search for XXX.YYY.ZZZ I see my record as the second one in the results, and a record of the form XXX.YYY.ZZZ.AAA.BBB appears as the first one. Can anyone help me understand why this is so, as I thought an exact match on a heavily boosted field would give the exact-match record first in dismax. Thanks and Rgds, Mark

A personal note: I have boosted the id field to the highest among my qf values specified in my dismax. Even then, when I search for an id, say XX.YYY.ZZZ, instead of pushing the record with id=XX.YYY.ZZZ to the first place, it displays another record, XX.YYY.ZZZ.ME.PK, as the first one. There are four results in total, but I have included details of only the first and second. I am surprised why XX.YYY.ZZZ doesn't come as the first record even after an exact match is found in it.

My qf fields in dismax:

    <str name="qf">name^10.0 id^20.0 subtopic1^1.0 indicator_value^1.0 country_name^1.0 country_code^1.0 source^0.8 database^1.4 definition^1.2 dr_report_name^1.0 dr_header^1.0 dr_footer^1.0 dr_mdx_query^1.0 dr_reportmetadata^1.0 content^1.0 aag_indicators^1.0 type^1.0 text^.3</str>
    <str name="pf">id^6.0</str>
    <str name="bq">type:Timeseries^1000.0</str>

Debug report:

    <lst name="debug">
    <str name="rawquerystring">xx.yyy.</str>
    <str name="querystring">xx.yyy.</str>
    <str name="parsedquery">+DisjunctionMaxQuery((text:(xx.yyy.zzz xx) yyy ^0.3 | definition:(xx.yyy.zzz xx) yyy ^0.2 | indicator_value:(xx.yyy.zzz xx) yyy | subtopic1:(xx.yyy.zzz xx) yyy | dr_report_name:(xx.yyy.zzz xx) yyy | dr_reportmetadata:(xx.yyy.zzz xx) yyy | dr_footer:(xx.yyy.zzz xx) yyy | type:(xx.yyy.zzz xx) yyy | country_code:(xx.yyy.zzz xx) yyy ^2.0 | country_name:(xx.yyy.zzz xx) yyy ^2.0 | database:(xx.yyy.zzz xx) yyy ^1.4 | aag_indicators:(xx.yyy.zzz xx) yyy | content:(xx.yyy.zzz xx) yyy | id:xx.yyy.^1000.0 | dr_mdx_query:(xx.yyy.zzz xx) yyy | source:(xx.yyy.zzz xx) yyy ^0.2 | name:(xx.yyy.zzz xx) yyy ^10.0 | dr_header:(xx.yyy.zzz xx) yyy )~0.01) DisjunctionMaxQuery((id:xx.yyy.^6.0)~0.01) type:timeseries^1000.0</str>
    <str name="parsedquery_toString">+(text:(xx.yyy.zzz xx) yyy ^0.3 | definition:(xx.yyy.zzz xx) yyy ^0.2 | indicator_value:(xx.yyy.zzz xx) yyy | subtopic1:(xx.yyy.zzz xx) yyy | dr_report_name:(xx.yyy.zzz xx) yyy | dr_reportmetadata:(xx.yyy.zzz xx) yyy | dr_footer:(xx.yyy.zzz xx) yyy | type:(xx.yyy.zzz xx) yyy | country_code:(xx.yyy.zzz xx) yyy ^2.0 | country_name:(xx.yyy.zzz xx) yyy ^2.0 | database:(xx.yyy.zzz xx) yyy ^1.4 | aag_indicators:(xx.yyy.zzz xx) yyy | content:(xx.yyy.zzz xx) yyy | id:xx.yyy.^1000.0 | dr_mdx_query:(xx.yyy.zzz xx) yyy | source:(xx.yyy.zzz xx) yyy ^0.2 | name:(xx.yyy.zzz xx) yyy ^10.0 | dr_header:(xx.yyy.zzz xx) yyy )~0.01 (id:xx.yyy.^6.0)~0.01 type:timeseries^1000.0</str>
    <lst name="explain">
    <str name="XX.YYY..ME.PK">
    0.15786289 = (MATCH) sum of:
      6.086512E-4 = (MATCH) max plus 0.01 times others of:
        6.086512E-4 = (MATCH) weight(text:(xx.yyy. sp) yyy ^0.3 in 1004), product of:
          7.562088E-4 = queryWeight(text:(xx.yyy. xx) yyy ^0.3), product of:
            0.3 = boost
            20.604721 = idf(text:(xx.yyy. xx) yyy ^0.3)
            1.2233584E-4 = queryNorm
          0.8048719 = (MATCH) fieldWeight(text:(xx.yyy. xx) yyy ^0.3 in 1004), product of:
            1.0 = tf(phraseFreq=1.0)
            20.604721 = idf(text:(xx.yyy. xx) yyy ^0.3)
            0.0390625 = fieldNorm(field=text, doc=1004)
      0.15725423 = (MATCH) weight(type:timeseries^1000.0 in 1004), product of:
        0.1387005 = queryWeight(type:timeseries^1000.0), product of:
          1000.0 = boost
          1.1337683 = idf(docFreq=1054, maxDocs=1206)
          1.2233584E-4 = queryNorm
        1.1337683 = (MATCH) fieldWeight(type:timeseries in 1004), product of:
          1.0 = tf(termFreq(type:timeseries)=1)
          1.1337683 = idf(docFreq=1054, maxDocs=1206)
          1.0 = fieldNorm(field=type, doc=1004)
    </str>
    <str name="XX.YYY.">
    0.15774116 = (MATCH) sum of:
      4.8692097E-4 = (MATCH) max plus 0.01 times others of:
        4.8692097E-4 = (MATCH) weight(text:(xx.yyy. xx) yyy ^0.3 in 1003), product of:
          7.562088E-4 = queryWeight(text:(xx.yyy. xx) yyy ^0.3), product of:
            0.3 = boost
            20.604721 =
Re: including external files in config by corename
On 04/05/2010 01:53 PM, Shawn Heisey wrote:
> Is it possible to access the core name in a config file (such as solrconfig.xml) so I can include core-specific configlets into a common config file? I would like to pull in different configurations for things like shards and replication, but have all the cores otherwise use an identical config file. Also, I have been looking for the syntax to include a snippet and haven't turned anything up yet. Thanks, Shawn

The best you have to work with at the moment is XIncludes:

http://wiki.apache.org/solr/SolrConfigXml#XInclude

and system property substitution:

http://wiki.apache.org/solr/SolrConfigXml#System_property_substitution

-- Mark
http://www.lucidimagination.com
Re: Need info on CachedSQLentity processor
On 04/05/2010 02:28 PM, bbarani wrote:
> Hi, I am using CachedSqlEntityProcessor in DIH to index the data. Please find below my dataconfig structure:
>
>     <entity name="x" query="select * from x">                <!-- object -->
>       <entity name="y" query="select * from y"
>               processor="CachedSqlEntityProcessor"
>               cachekey="y.id" cachevalue="x.id"/>            <!-- object properties -->
>     </entity>
>
> For each and every object I would be retrieving corresponding object properties (in my subqueries). I get into OOM very often, and I think that's a trade-off if I use CachedSqlEntityProcessor. My assumption is that when I use CachedSqlEntityProcessor the indexing happens as follows: first, entity x gets executed and the entire table gets stored in cache; next, entity y gets executed and the entire table gets stored in cache; finally, the comparison happens through a hash map. So I always need the memory allocated to the SOLR JVM to be more than or equal to the data present in the tables?
>
> Now my final question is that even after SOLR completes indexing, the memory used previously is not getting released. I could still see the JVM consuming 1.5 GB after the indexing completes. I tried to use Java HotSpot options but didn't see any difference. Any thoughts / confirmation on my assumptions above would be of great help to me in deciding whether to choose CachedSqlEntityProcessor or not. Thanks, BB

You are right - with CachedSqlEntityProcessor the cache is an unbounded HashMap, with no option to bound it. IMO this should be fixed - want to make a JIRA issue? I've brought it up on the list before, but I don't think I ever got around to making an issue.

As to why it's not getting released - that is odd. Perhaps a GC has just not been triggered yet and it will be released? If not, that's a pretty nasty bug. Can you try forcing a GC to see? (Say, with jconsole?)

-- Mark
http://www.lucidimagination.com
Re: including external files in config by corename
: The best you have to work with at the moment is XIncludes:
:
: http://wiki.apache.org/solr/SolrConfigXml#XInclude
:
: and system property substitution:
:
: http://wiki.apache.org/solr/SolrConfigXml#System_property_substitution

Except that XInclude is a feature of the XML parser, while property substitution is something Solr does after the XML has been parsed into a DOM -- so you can't have an XInclude of a file whose name is determined by a property (like the core name).

What you can do, however, is have a distinct solrconfig.xml for each core, which is just a thin shell that uses XInclude to include big chunks of frequently reused declarations, and some cores can exclude some of these includes. (ie: turn the problem inside out)

-Hoss
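To illustrate Hoss's inside-out arrangement, a hedged sketch of a per-core thin-shell solrconfig.xml (file names are placeholders):

    <?xml version="1.0" encoding="UTF-8"?>
    <config xmlns:xi="http://www.w3.org/2001/XInclude">
      <!-- declarations shared by every core -->
      <xi:include href="common-config-chunk.xml"/>
      <!-- only cores that act as replication masters include this chunk -->
      <xi:include href="replication-master-chunk.xml"/>
    </config>

Each included chunk must itself be well-formed XML, and the hrefs resolve relative to the including file, so the shared chunks can sit in a common conf directory.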
Re: Some help for folks trying to get new Solr/Lucene up in Eclipse
I had a slight hiccup that I just ignored. Even when I used Java 1.6 JDK mode, Eclipse did not know this method. I had to comment out the three places that use this method:

    javax.xml.parsers.DocumentBuilderFactory.setXIncludeAware(true)

Lance Norskog

On Mon, Apr 5, 2010 at 1:49 PM, Mattmann, Chris A (388J) chris.a.mattm...@jpl.nasa.gov wrote:
Hey All, Just to save some folks some time in case you are trying to get new Lucene/Solr up and running in Eclipse. If you continue to get weird errors, e.g., in solr/src/test/TestConfig.java regarding org.w3c.dom.Node#getTextContent(), I found that, for me, this error was caused by including Tidy.jar (which includes its own version of the Node API) in the build path. If you take that out, you should be good. Wanted to pass that along. Cheers, Chris ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.mattm...@jpl.nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++

-- Lance Norskog goks...@gmail.com
Re: Need info on CachedSQLentity processor
Mark, I have opened a JIRA issue - https://issues.apache.org/jira/browse/SOLR-1867 Thanks, Barani
Re: Multicore and TermVectors
There is no query parameter. The query parser throws an NPE if there is no query parameter: http://issues.apache.org/jira/browse/SOLR-435

It does not look like term vectors are processed in distributed search anyway.

On Mon, Apr 5, 2010 at 4:45 PM, Chris Hostetter hossman_luc...@fucit.org wrote:
: Subject: Multicore and TermVectors
It doesn't sound like Multicore is your issue ... it seems like what you mean is that you are using distributed search with TermVectors, and that is causing a problem. Can you please clarify exactly what you mean ... describe your exact setup (ie: how many machines, how many solr ports running on each of those machines, what the solr.xml looks like on each of those ports, how many SolrCores running in each of those ports, what the solrconfig.xml looks like for each of those instances, which instances coordinate distributed searches of which shards, what urls your client hits, what URLs get hit on each of your shards (according to the logs) as a result, etc... details, details, details.
-Hoss

-- Lance Norskog goks...@gmail.com
Re: including external files in config by corename
On 04/05/2010 10:12 PM, Chris Hostetter wrote:
> Except that XInclude is a feature of the XML parser, while property
> substitution is something Solr does after the XML has been parsed into a
> DOM -- so you can't have an XInclude of a file whose name is determined by
> a property (like the core name)

Didn't suggest he could - just giving him the features he has to work with.

-- Mark
http://www.lucidimagination.com
What does it mean when you see a plus sign in between two words inside synonyms.txt?
Hi, I'm new to this group. I would like to ask a question: what does it mean when you see a plus sign in between two words inside synonyms.txt? e.g.

    macbookair => macbook+air

Thanks, Paulo
Re: What does it mean when you see a plus sign in between two words inside synonyms.txt?
paulosalamat wrote:
> Hi, I'm new to this group. I would like to ask a question: what does it mean
> when you see a plus sign in between two words inside synonyms.txt?
> e.g. macbookair => macbook+air

Welcome, Paulo!

It depends on your tokenizer. You can specify a tokenizer via the tokenizerFactory attribute when you use SynonymFilterFactory. That tokenizer is used when SynonymFilterFactory reads synonyms.txt. If you do not specify it, WhitespaceTokenizer will be used by default. In the above example, the term text "macbookair" will be normalized to the term text "macbook+air" if WhitespaceTokenizer is used.

Koji
-- http://www.rondhuit.com/en/
Re: What does it mean when you see a plus sign in between two words inside synonyms.txt?
Hi Koji, Thank you for the reply. I have another question: if WhitespaceTokenizer is used, is the term text "macbook+air" equal to "macbook air"?

Thank you,
Paulo

On Mon, Apr 5, 2010 at 5:50 PM, Koji Sekiguchi wrote:
> It depends on your tokenizer. You can specify a tokenizer via the
> tokenizerFactory attribute when you use SynonymFilterFactory. That
> tokenizer is used when SynonymFilterFactory reads synonyms.txt. If you
> do not specify it, WhitespaceTokenizer will be used by default. In the
> above example, the term text "macbookair" will be normalized to the
> term text "macbook+air" if WhitespaceTokenizer is used.
>
> Koji
> -- http://www.rondhuit.com/en/
Re: What does it mean when you see a plus sign in between two words inside synonyms.txt?
paulosalamat wrote:
> Hi Koji,
> Thank you for the reply.
> I have another question: if WhitespaceTokenizer is used, is the term text
> "macbook+air" equal to "macbook air"?

No. In the field, "macbook air" will be a phrase (not a term). You can define not only terms but phrases in synonyms.txt:

    ex) macbookair => macbook air

Koji
-- http://www.rondhuit.com/en/
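To make Koji's point concrete, a hedged schema.xml sketch (the field type name is made up); the tokenizerFactory attribute is what controls how synonyms.txt itself gets tokenized:

    <fieldType name="text_syn" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
                ignoreCase="true" expand="true"
                tokenizerFactory="solr.KeywordTokenizerFactory"/>
      </analyzer>
    </fieldType>

With the default WhitespaceTokenizer, "macbook+air" (no whitespace) parses as one term, while "macbook air" parses as a two-term phrase -- which is exactly the distinction Koji describes.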
Re: Obtaining SOLR index size on disk
hi, I am using the piece of code given below:

    ReplicationHandler handler2 = new ReplicationHandler();
    System.out.println(handler2.getDescription());
    NamedList statistics = handler2.getStatistics();
    System.out.println("Statistics " + statistics);

The result that I am getting (i.e., the printed statement) is:

    Statistics {handlerStart=1270469530218,requests=0,errors=0,timeouts=0,totalTime=0,avgTimePerRequest=NaN,avgRequestsPerSecond=NaN}

But the statistics consist of the other info too:

    <entry>
      <class>org.apache.solr.handler.ReplicationHandler</class>
      <version>$Revision: 829682 $</version>
      <description>ReplicationHandler provides replication of index and configuration files from Master to Slaves</description>
      <stats>
        <stat name="handlerStart">1270463612968</stat>
        <stat name="requests">0</stat>
        <stat name="errors">0</stat>
        <stat name="timeouts">0</stat>
        <stat name="totalTime">0</stat>
        <stat name="avgTimePerRequest">NaN</stat>
        <stat name="avgRequestsPerSecond">0.0</stat>
        <stat name="indexSize">19.29 KB</stat>
        <stat name="indexVersion">1266984293131</stat>
        <stat name="generation">3</stat>
        <stat name="indexPath">C:\solr\apache-solr-1.4.0\example\example-DIH\solr\db\data\index</stat>
        <stat name="isMaster">true</stat>
        <stat name="isSlave">false</stat>
        <stat name="confFilesToReplicate">schema.xml,stopwords.txt,elevate.xml</stat>
        <stat name="replicateAfter">[commit, startup]</stat>
        <stat name="replicationEnabled">true</stat>
      </stats>
    </entry>

This is where the problem lies: I need the size of the index. I'm not finding it in the API, nor is the statistics printout (sysout) the same. How do I get the size of the index?
Re: checking the size of the index using solrj APIs
hi, I am using the piece of code given below:

    ReplicationHandler handler2 = new ReplicationHandler();
    System.out.println(handler2.getDescription());
    NamedList statistics = handler2.getStatistics();
    System.out.println("Statistics " + statistics);

The result that I am getting (i.e., the printed statement) is:

    Statistics {handlerStart=1270469530218,requests=0,errors=0,timeouts=0,totalTime=0,avgTimePerRequest=NaN,avgRequestsPerSecond=NaN}

But the statistics consist of the other info too:

    <entry>
      <class>org.apache.solr.handler.ReplicationHandler</class>
      <version>$Revision: 829682 $</version>
      <description>ReplicationHandler provides replication of index and configuration files from Master to Slaves</description>
      <stats>
        <stat name="handlerStart">1270463612968</stat>
        <stat name="requests">0</stat>
        <stat name="errors">0</stat>
        <stat name="timeouts">0</stat>
        <stat name="totalTime">0</stat>
        <stat name="avgTimePerRequest">NaN</stat>
        <stat name="avgRequestsPerSecond">0.0</stat>
        <stat name="indexSize">19.29 KB</stat>
        <stat name="indexVersion">1266984293131</stat>
        <stat name="generation">3</stat>
        <stat name="indexPath">C:\solr\apache-solr-1.4.0\example\example-DIH\solr\db\data\index</stat>
        <stat name="isMaster">true</stat>
        <stat name="isSlave">false</stat>
        <stat name="confFilesToReplicate">schema.xml,stopwords.txt,elevate.xml</stat>
        <stat name="replicateAfter">[commit, startup]</stat>
        <stat name="replicationEnabled">true</stat>
      </stats>
    </entry>

This is where the problem lies: I need the size of the index. I'm not finding it in the API, nor is the statistics printout (sysout) the same. How do I get the size of the index?
Re: checking the size of the index using solrj APIs
If you're using ReplicationHandler directly, you already have the xml from which to extract the 'indexSize' attribute. From a client, you can get the indexSize by issuing:

    http://hostname:8983/solr/core/replication?command=details

This will give you an xml response. Use:

    http://hostname:8983/solr/core/replication?command=details&wt=json

to give you a json string that has 'indexSize' within it:

    {"responseHeader":{"status":0,"QTime":0},
     "details":{
       "indexSize":"6.63 KB",
       "indexPath":"usr//bin/solr/core0/index",
       "commits":[
         ["indexVersion",1259974360056,"generation",1572,"filelist",["segments_17o"]],
         ["indexVersion",1259974360057,"generation",1573,"filelist",["segments_17p","_zv.fdx","_zv.fnm","_zv.fdt","_zv.nrm","_zv.tis","_zv.prx","_zv.tii","_zv.frq"]]],
       "isMaster":"true",
       "isSlave":"false",
       "indexVersion":1259974360057,
       "generation":1573,
       "backup":["startTime","Mon Apr 05 14:28:46 BST 2010","fileCount",17,"status","success","snapshotCompletedAt","Mon Apr 05 14:28:47 BST 2010"]},
     "WARNING":"This response format is experimental. It is likely to change in the future."}

Either way, you'll need some sort of parsing logic or formatting to get just the index size bit.
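From SolrJ, a hedged sketch of fetching the same detail programmatically (server URL and core name are placeholders; this simply wraps the HTTP call shown above):

    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.request.QueryRequest;
    import org.apache.solr.common.params.ModifiableSolrParams;
    import org.apache.solr.common.util.NamedList;

    public class IndexSizeCheck {
        public static void main(String[] args) throws Exception {
            CommonsHttpSolrServer server =
                new CommonsHttpSolrServer("http://hostname:8983/solr/core");
            ModifiableSolrParams params = new ModifiableSolrParams();
            params.set("command", "details");
            QueryRequest req = new QueryRequest(params);
            req.setPath("/replication"); // hit the ReplicationHandler, not /select
            NamedList<Object> response = server.request(req);
            NamedList<?> details = (NamedList<?>) response.get("details");
            System.out.println("indexSize: " + details.get("indexSize"));
        }
    }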
Re: Apache Lucene EuroCon Call For Participation: Prague, Czech Republic, May 20-21, 2010
Just a reminder, just over one week left open on the CFP. Some great talks entered already. Keep it up!

On Mar 24, 2010, at 8:03 PM, Grant Ingersoll wrote:

Apache Lucene EuroCon Call For Participation - Prague, Czech Republic, May 20-21, 2010

All submissions must be received by Tuesday, April 13, 2010, 12 Midnight CET / 6 PM US EDT.

The first European conference dedicated to Lucene and Solr is coming to Prague from May 18-21, 2010. Apache Lucene EuroCon is running on a not-for-profit basis, with net proceeds donated back to the Apache Software Foundation. The conference is sponsored by Lucid Imagination with additional support from community and other commercial co-sponsors.

Key Dates:
24 March 2010: Call For Participation Opens
13 April 2010: Call For Participation Closes
16 April 2010: Speaker Acceptance/Rejection Notification
18-19 May 2010: Lucene and Solr Pre-conference Training Sessions
20-21 May 2010: Apache Lucene EuroCon

This conference creates a new opportunity for the Apache Lucene/Solr community and marketplace, providing the chance to gather, learn and collaborate on the latest in Apache Lucene and Solr search technologies and what's happening in the community and ecosystem. There will be two days of Lucene and Solr training offered May 18-19, followed by two days packed with leading-edge Lucene and Solr Open Source Search content and talks by search and open source thought leaders.

We are soliciting 45-minute presentations for the conference, 20-21 May 2010 in Prague. The conference and all presentations will be in English. Topics of interest include:

- Lucene and Solr in the Enterprise (case studies, implementation, return on investment, etc.)
- “How We Did It” Development Case Studies
- Spatial/Geo search
- Lucene and Solr in the Cloud
- Scalability and Performance Tuning
- Large Scale Search
- Real Time Search
- Data Integration/Data Management
- Tika, Nutch and Mahout
- Lucene Connectors Framework
- Faceting and Categorization
- Relevance in Practice
- Lucene and Solr for Mobile Applications
- Multi-language Support
- Indexing and Analysis Techniques
- Advanced Topics in Lucene and Solr Development

All accepted speakers will qualify for discounted conference admission. Financial assistance is available for speakers that qualify.

To submit a 45-minute presentation proposal, please send an email to c...@lucene-eurocon.org containing the following information in plain text:

1. Your full name, title, and organization
2. Contact information, including your address, email, phone number
3. The name of your proposed session (keep your title simple and relevant to the topic)
4. A 75-200 word overview of your presentation (in English); in addition to the topic, describe whether your presentation is intended as a tutorial, description of an implementation, a theoretical/academic discussion, etc.
5. A 100-200-word speaker bio that includes prior conference speaking or related experience (in English)

To be considered, proposals must be received by 12 Midnight CET Tuesday, 13 April 2010 (Tuesday 13 April 6 PM US Eastern time, 3 PM US Pacific time). Please email any questions regarding the conference to i...@lucene-eurocon.org. To be added to the conference mailing list, please email sig...@lucene-eurocon.org.
If your organization is interested in sponsorship opportunities, email spon...@lucene-eurocon.org.

Key Dates:
24 March 2010: Call For Participation Opens
13 April 2010: Call For Participation Closes
16 April 2010: Speaker Acceptance/Rejection Notification
18-19 May 2010: Lucene and Solr Pre-conference Training Sessions
20-21 May 2010: Apache Lucene EuroCon

We look forward to seeing you in Prague!

Grant Ingersoll
Apache Lucene EuroCon Program Chair
www.lucene-eurocon.org
Re: checking the size of the index using solrj APIs
On Fri, Apr 2, 2010 at 7:07 AM, Na_D nabam...@zaloni.com wrote:
> hi,
> I need to monitor the index for the following information:
> 1. Size of the index
> 2. Last time the index was updated.

If by 'size of the index' you mean document count, then check the Luke Request Handler: http://wiki.apache.org/solr/LukeRequestHandler

ryan
Re: add/update document as distinct operations? Is it possible?
Hi, I got the picture now. Not having distinct add/update actions forces me to implement a custom queueing mechanism. Thanks. Cheers.

Erick Erickson wrote:
One of the most requested features in Lucene/SOLR is to be able to update only selected fields rather than the whole document. But that's not how it works at present. An update is really a delete and an add. So, for your second message, you can't do a partial update; you must update the whole document.

I'm a little confused by what you *want* in your first e-mail. But the current way SOLR works, if the SOLR server first received the delete then the update, the index would have the document in it. But the opposite order would delete the document. But this really doesn't sound like a SOLR issue, since SOLR can't magically divine the desired outcome. Somewhere you have to coordinate the requests or your index will not be what you expect. That is, you have to define what rules index modifications follow and enforce them. Perhaps you can consider a queueing mechanism of some sort (that you'd have to implement yourself...)

HTH
Erick

On Thu, Apr 1, 2010 at 1:03 AM, Julian Davchev j...@drun.net wrote:
Hi, I have a distributed messaging solution where I need to distinguish between adding a document and just trying to update it. Scenario:
1. A message is sent for a document to be updated.
2. Meanwhile, another message is sent for the document to be deleted, and is executed before 1.
As a result, when message 1 arrives, instead of ignoring the update (as the document is no more), it will add the document again. From what I see in the manual, I cannot distinguish between those operations. Any pointers? Cheers
Re: add/update document as distinct operations? Is it possible?
Chris, I don't see anything in the headers suggesting that Julian's message was a hijack of another thread.

On Thu, Apr 1, 2010 at 2:17 PM, Chris Hostetter hossman_luc...@fucit.org wrote:
: Subject: add/update document as distinct operations? Is it possible?
: References:
: dc9f7963609bed43b1ab02f3ce52863103dc35f...@bene-exch-01.benetech.local
: In-Reply-To:
: dc9f7963609bed43b1ab02f3ce52863103dc35f...@bene-exch-01.benetech.local

http://people.apache.org/~hossman/#threadhijack

Thread Hijacking on Mailing Lists: When starting a new discussion on a mailing list, please do not reply to an existing message; instead, start a fresh email. Even if you change the subject line of your email, other mail headers still track which thread you replied to, and your question is hidden in that thread and gets less attention. It makes following discussions in the mailing list archives particularly difficult. See also: http://en.wikipedia.org/wiki/User:DonDiego/Thread_hijacking

-Hoss

-- "Good Enough" is not good enough. To give anything less than your best is to sacrifice the gift. Quality First. Measure Twice. Cut Once. http://www.israelekpo.com/
Re: Related terms/combined terms
Not sure of the exact vocabulary I am looking for, so I'll try to explain myself. Given a search term, is there any way to return a list of related/grouped keywords (based on the current state of the index) for that term? For example, say I have a sports catalog and I search for Callaway. Is there anything that could give me back Callaway Driver, Callaway Golf Balls, Callaway Hat, Callaway Glove, since these words are always grouped together/related? Not sure if something like this is even possible. ShingleFilterFactory[1] plus TermsComponent[2] can give you grouped (phrase) keywords. You need to create an extra field (populated via copyField) that constructs shingles (token n-grams). After that you can retrieve the bigram or trigram tokens starting with callaway: solr/terms?terms=true&terms.fl=yourNewField&terms.prefix=Callaway [1] http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ShingleFilterFactory [2] http://wiki.apache.org/solr/TermsComponent
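As an illustration of that suggestion, a minimal schema.xml sketch (the type and field names shingleText, relatedTerms, and productName are invented for this example; pick an analyzer chain that matches your data):

<fieldType name="shingleText" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="3" outputUnigrams="false"/>
  </analyzer>
</fieldType>

<field name="relatedTerms" type="shingleText" indexed="true" stored="false" multiValued="true"/>
<copyField source="productName" dest="relatedTerms"/>

Note that with the lowercase filter in the chain, the indexed shingles are lowercased, so the terms request should use terms.prefix=callaway rather than Callaway.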
Re: add/update document as distinct operations? Is it possible?
I still don't see what the difference is. If there were distinct add/update operations, how would that absolve you from having to implement your own queueing? To have predictable index content, you still must order your operations. Best Erick On Mon, Apr 5, 2010 at 12:45 PM, Julian Davchev j...@drun.net wrote: Hi, I got the picture now. Not having distinct add/update actions forces me to implement a custom queueing mechanism. Thanks. Cheers. Erick Erickson wrote: One of the most requested features in Lucene/SOLR is the ability to update only selected fields rather than the whole document. But that's not how it works at present. An update is really a delete and an add. So for your second message: you can't do a partial update, you must update the whole document. I'm a little confused by what you *want* in your first e-mail. But the way SOLR currently works, if the SOLR server first received the delete and then the update, the index would have the document in it, while the opposite order would delete the document. But this really doesn't sound like a SOLR issue, since SOLR can't magically divine the desired outcome. Somewhere you have to coordinate the requests or your index will not be what you expect. That is, you have to define what rules index modifications follow and enforce them. Perhaps you can consider a queueing mechanism of some sort (that you'd have to implement yourself...) HTH Erick On Thu, Apr 1, 2010 at 1:03 AM, Julian Davchev j...@drun.net wrote: Hi, I have a distributed messaging solution where I need to distinguish between adding a document and just trying to update it. Scenario: 1. a message is sent for a document to be updated 2. meanwhile another message is sent for the document to be deleted, and is executed before 1. As a result, when 1 arrives, instead of ignoring the update (the document is no more) it will add the document again. From what I see in the manual I cannot distinguish between those operations, which is what would help here. Any pointers? Cheers
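To make the delete-plus-add behavior concrete, a small SolrJ sketch (assuming a 1.4-era CommonsHttpSolrServer and a schema whose uniqueKey is id): re-adding a document with an existing key silently replaces the old version, and any field you do not resend is gone.

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class OverwriteDemo {
  public static void main(String[] args) throws Exception {
    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "doc-42");           // same uniqueKey as a document already in the index
    doc.addField("title", "revised title"); // every field must be resent in full
    server.add(doc);   // internally a delete of doc-42 followed by an add; no partial update
    server.commit();
  }
}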
Re: Minimum Should Match the other way round
On Apr 3, 2010, at 10:18 AM, MitchK wrote: Hello, I want to tinker a little bit with Solr, so I need a little feedback: Is it possible to define a Minimum Should Match for the document itself? I mean, it is possible to say that a query this is my query should only match a document if the document matches 3 of the four queried terms. However, I am searching for a solution that does something like: this is my query, and the document has to consist of this query plus at most, for example, two additional terms. Example: Query: this is my query Doc1: this is my favorite query Doc2: I am searching for a lot of stuff, so this is my query Doc3: I'd like to say: this is my query Saying that at most two additional terms may occur in the document, Solr should return only Doc1. If this is not possible out-of-the-box, I think one has to work with TermVectors, am I right? Not quite following. It sounds like you are saying you want to favor docs that are shorter, while still maximizing the number of terms that match, right? You might look at the Similarity class and the SimilarityFactory as well in the Solr/Lucene code. I think it's possible to do this outside of Lucene/Solr by taking the response of the TermVectorsComponent and filtering the result list. But I'd like to integrate this into Lucene/Solr itself. Any ideas which components I have to customize? At the moment I am speculating that I have to customize the class which is collecting the result, before it is passed to the ResponseWriter. Kind regards - Mitch -- View this message in context: http://n3.nabble.com/Minimum-Should-Match-the-other-way-round-tp694867p694867.html Sent from the Solr - User mailing list archive at Nabble.com.
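Along the lines of the Similarity pointer, a hedged sketch of one way to bias toward shorter documents (this is a soft length penalty, not the hard cutoff Mitch described): override lengthNorm so longer fields score lower than the default 1/sqrt(numTerms).

package com.example.solr;

import org.apache.lucene.search.DefaultSimilarity;

public class ShortDocSimilarity extends DefaultSimilarity {
  @Override
  public float lengthNorm(String fieldName, int numTerms) {
    // Default is 1/sqrt(numTerms); 1/numTerms penalizes each extra term more sharply.
    return numTerms > 0 ? 1.0f / numTerms : 0.0f;
  }
}

It could then be registered in schema.xml via <similarity class="com.example.solr.ShortDocSimilarity"/>.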
Re: Does Lucidimagination search use a multi-facet query filter or sessions?
We are using multiselect facets like what you have below (although I haven't verified your syntax). So no, we are not using sessions. See http://www.lucidimagination.com/search/?q=multiselect+faceting#/s:email for help. -Grant http://www.lucidimagination.com On Apr 1, 2010, at 12:35 PM, bbarani wrote: Hi, I am trying to create search functionality the same as that of the Lucidimagination search. As of now I have formed the facet query as below: http://localhost:8080/solr/db/select?q=*:*&fq={!tag=3DotHierarchyFacet}3DotHierarchyFacet:ABC&facet=on&facet.field={!ex=3DotHierarchyFacet}3DotHierarchyFacet&facet.field=ApplicationStatusFacet&facet.mincount=1 Since I am having multiple facets I have planned to form the query based on the user selection. Something like below... if the user selects (multiple facets) application status as 'P', I would form the query as below: http://localhost:8080/solr/db/select?q=*:*&fq={!tag=3DotHierarchyFacet}3DotHierarchyFacet:NTS&fq={!tag=ApplicationStatusFacet}ApplicationStatusFacet:P&facet=on&facet.field={!ex=3DotHierarchyFacet}3DotHierarchyFacet&facet.field={!ex=ApplicationStatusFacet}ApplicationStatusFacet&facet.mincount=1 Can someone let me know if I am forming the correct query to perform multiselect facets? I just want to know if I am doing anything wrong in the query. We are also trying to achieve this using sessions, but if we are able to solve this by query I would prefer using the query over session variables. Thanks, Barani -- View this message in context: http://n3.nabble.com/Does-Lucidimagination-search-uses-Multi-facet-query-filter-or-uses-session-tp691167p691167.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: feature request for invalid data formats
: : I don't know whether this is the right place to ask, or whether there is a special : tool for issue : requests. We use Jira for bug reports and feature requests, but it's always a good idea to start with a solr-user email before filing a new bug/request, to help discuss the behavior you are seeing. : 2010.03.23. 13:27:23 org.apache.solr.common.SolrException log : SEVERE: java.lang.NumberFormatException: For input string: 1595-1600 : at : java.lang.NumberFormatException.forInputString(NumberFormatException.java:48) : at java.lang.Integer.parseInt(Integer.java:456) : : It would be a great help in some cases if I could know which field contained : the data in the wrong format. you are 100% correct ... can you let us know what the rest of the stack trace is (beyond the last line you posted) so we can figure out exactly where the bug is? : SimplePostTool: FATAL: Solr returned an error: For_input_string_15951600 : __javalangNumberFormatException_For_input_string_15951600 : ___at_javalangNumberFormatExceptionforInputStringNumberFormat : : (I added some line breaks for the sake of readability.) : : Could not a string be returned in the same format as in the Solr log? Solr relies on the servlet container to format the error and return it to the user. With Jetty, the error does actually come back in human-readable form as part of the response body -- what the SimplePostTool is printing there is actually the one-line HTTP response message, which Jetty (in its infinite wisdom) sets using the entire response with the whitespace and newlines escaped. If you use something like curl -D - to hit a Solr URL, you'll see what I mean about the response message vs the response body, and if you use a different servlet container (like Tomcat) you'll see what I mean about the servlet container having control over what the error messages look like. -Hoss
Re: dismax multi search?
: I want to be able to direct some search terms to specific fields : : I want to do something like this : : keyword1 should search against book titles / authors : : keyword2 should search against book contents / book info / user reviews your question is a little vague ... will keyword1 and keyword2 be distinct params (ie: will the user tell you when certain words should be queried against titles/authors and when other keywords should be queried against content/info/reviews) ... or are you going to have big giant word lists, and anytime you see a word from one of those lists, you query a specific field for that word? assuming you mean the first (and not the second) situation, you can use nested query parsers with param substitution to get some interesting results... http://www.lucidimagination.com/blog/2009/03/31/nested-queries-in-solr/ http://n3.nabble.com/How-to-compose-a-query-from-multiple-HTTP-URL-parameters-td519441.html#a679489 -Hoss
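As a hedged sketch of that nested-parser approach (the parameter names q1/q2 and the field lists are invented for this example; URL-escaping omitted for readability), the request might look like:

q=_query_:"{!dismax qf='title author' v=$q1}" AND _query_:"{!dismax qf='contents bookinfo reviews' v=$q2}"
&q1=keyword1
&q2=keyword2

Each _query_ clause invokes its own dismax parser against a different qf field list, and v=$q1 / v=$q2 dereference the actual user keywords from separate request parameters.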
including external files in config by core name
Is it possible to access the core name in a config file (such as solrconfig.xml) so I can include core-specific configlets into a common config file? I would like to pull in different configurations for things like shards and replication, but have all the cores otherwise use an identical config file. Also, I have been looking for the syntax to include a snippet and haven't turned anything up yet. Thanks, Shawn
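One building block that may help, independent of the include question: solrconfig.xml supports property substitution, and each core gets implicit properties such as ${solr.core.name}. A sketch of per-core replication config living in a shared file (the master host name is invented for this example):

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://master-host:8983/solr/${solr.core.name}/replication</str>
    <str name="pollInterval">00:05:00</str>
  </lst>
</requestHandler>

Per-core <property name="..." value="..."/> entries declared in solr.xml can be substituted the same way.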
Re: Related terms/combined terms
Thanks for the response Mitch. I'm not too sure how well this will work for my needs, but I'll certainly play around with it. I think something more along the lines of Ahmet's solution is what I was looking for. -- View this message in context: http://n3.nabble.com/Related-terms-combined-terms-tp694083p698327.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: no of cfs files are more than the mergeFactor
This sounds completely normal from what I remember about mergeFactor. Segments are merged by level, meaning that with a mergeFactor of 5, once 5 level-1 segments are formed they are merged into a single level-2 segment. Then 5 more level-1 segments are allowed to form before the next merge (resulting in 2 level-2 segments). Once you have 5 level-2 segments, they are all merged into a single level-3 segment, etc... : I had my mergeFactor as 5 , : but when i load a data with some 1,00,000 i got some 12 .cfs files in my : data/index folder . : : How come this is possible . : in what context we can have more no of .cfs files -Hoss
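To put numbers on that: with mergeFactor=5, up to 4 finished segments can sit at each level while a 5th is still accumulating, so something like 4 level-1 + 4 level-2 + 4 level-3 segments, i.e. 12 .cfs files, is perfectly normal.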
Re: no of cfs files are more than the mergeFactor
I'm guessing the user is expecting there to be one cfs file for the whole index, and does not understand that it's actually per segment. On 04/05/2010 01:59 PM, Chris Hostetter wrote: This sounds completely normal from what I remember about mergeFactor. Segments are merged by level, meaning that with a mergeFactor of 5, once 5 level-1 segments are formed they are merged into a single level-2 segment. Then 5 more level-1 segments are allowed to form before the next merge (resulting in 2 level-2 segments). Once you have 5 level-2 segments, they are all merged into a single level-3 segment, etc... : I had my mergeFactor as 5 , : but when i load a data with some 1,00,000 i got some 12 .cfs files in my : data/index folder . : : How come this is possible . : in what context we can have more no of .cfs files -Hoss -- - Mark http://www.lucidimagination.com
Re: Getting solr response in HTML format : HTMLResponseWriter
: so I have tried to attach the xslt stylesheet to the response of SOLR by : passing these 2 variables: wt=xslt&tr=example.xsl : : while example.xsl is a stylesheet included with SOLR, but the response in : HTML wasn't very perfect. can you elaborate on what you mean by wasn't very perfect? ... what was wrong with it? ... was there an actual bug, or were you just not happy with how it looked? did you try modifying the example.xsl? (it's intended purely as an example ... it's not meant to work for everyone as is) : So i have read on the net that we can write an extension to the : QueryResponseWriter class like XMLResponseWriter (the default) : and i am trying to build that . ... : I am proceeding like XMLResponseWriter to create HTMLResponseWriter and i I would strongly suggest that instead of doing this, you take a look at the velocity response writer (in contrib) or tweak the XSL some more ... writing a custom HTMLResponseWriter isn't nearly as flexible as either of those other two options -- particularly because the ResponseWriter API requires you to deal with the Response objects in the order they are added by the RequestHandler -- which isn't necessarily the same order you want to deal with them in an HTML response. (this isn't typically a problem for most ResponseWriters because they aren't typically intended to be read by humans) : org.apache.solr.common.SolrException: Error loading class : 'org.apache.solr.request.HTMLResponseWriter' 1) if you are writing a custom ResponseWriter, you should be using your own java package name, not org.apache.solr.request : Caused by: java.lang.ClassNotFoundException: : org.apache.solr.request.HTMLResponseWriter 2) it can't find your class. did you compile it? did you put it in a jar? where did you put the jar? what does your solr install look like? ... the details are the key to understanding why it can't find your class. -Hoss
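For anyone determined to write one anyway despite the caveats above, a minimal hedged sketch against the 1.4-era API (note the custom package name, per point 1 above; the class must be compiled, jarred, and dropped into the core's lib directory):

package com.example.solr;

import java.io.IOException;
import java.io.Writer;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.request.QueryResponseWriter;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.request.SolrQueryResponse;

public class HTMLResponseWriter implements QueryResponseWriter {
  public void init(NamedList args) {}

  public String getContentType(SolrQueryRequest request, SolrQueryResponse response) {
    return "text/html;charset=UTF-8";
  }

  public void write(Writer writer, SolrQueryRequest request, SolrQueryResponse response)
      throws IOException {
    // Crude unescaped dump of the response NamedList; real HTML generation goes here.
    writer.write("<html><body><pre>");
    writer.write(String.valueOf(response.getValues()));
    writer.write("</pre></body></html>");
  }
}

It would then be registered in solrconfig.xml with <queryResponseWriter name="html" class="com.example.solr.HTMLResponseWriter"/> and selected with wt=html.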
Re: exception handling / error reporting?
: This client uses a simple user-agent that requires JSON syntax while parsing : search results from solr, but when solr throws an exception, tomcat returns an : error-500 page to the client and it crashes. define crashes? ... presumably you are talking about the client crashing because it can't parse the error response, correct? ... the best suggestion given the current state of Solr is to make the client smart enough to not attempt parsing of the response unless the response code is 200. : I was wondering if there's already a way to prepare exceptions as error reports : and integrate them into the search result as a hint to the user? If it were : just another element of the whole response format, it would potentially be : compatible with any client out there. It's one of the oldest outstanding improvements in the Solr issue tracker, but it hasn't gotten much love over the years... https://issues.apache.org/jira/browse/SOLR-141 One possible workaround, if you are comfortable with Java and if you are willing to always get the errors in a single response format (ie: JSON)... you can customize the solr.war to specify an error jsp that your servlet container will use to format all error responses. you can make that JSP extract the error message from the Exception and output it in JSON format. -Hoss
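A hedged sketch of that error-jsp workaround (file names invented; the exact behavior of the implicit exception object varies by servlet container): map 500 errors to a JSP in web.xml,

<error-page>
  <error-code>500</error-code>
  <location>/error.jsp</location>
</error-page>

and have error.jsp emit JSON instead of HTML:

<%@ page isErrorPage="true" contentType="application/json" %>
<%
  // Quotes swapped for apostrophes as crude JSON escaping; a real page needs full escaping.
  String msg = (exception == null || exception.getMessage() == null)
      ? "unknown error" : exception.getMessage().replace('"', '\'');
%>
{"error": "<%= msg %>"}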
Re: Is this a bug of the RessourceLoader?
: Some applications (such as Windows Notepad) insert a UTF-8 Byte Order Mark : (BOM) as the first character of the file. So, perhaps the first word in your : stopwords list contains a UTF-8 BOM and that's why you are seeing this : behavior. Robert: BOMs are one of those things that strike me as being abhorrent and inherently evil because they seem to cause nothing but problems -- but in truth I understand very little about them and have no idea if/when they actually add value. If text files that start with a BOM aren't properly being dealt with by Solr right now, should we consider that a bug? Is there something we can/should be doing in SolrResourceLoader to make Solr handle this situation better? -Hoss
Re: selecting documents older than 4 hours
: NOW/HOUR-5HOURS evaluates to 2010-03-31T21:00:00 which should not be the : case if the current time is Wed Mar 31 19:50:48 PDT 2010. Is SOLR converting : NOW to GMT time? 1) NOW means Now ... what moment in time is happening right now is independent of what locale you are in and how you want to format that moment to represent it as a string. 2) Solr always parses/formats date-time values in UTC because Solr has no way of knowing what timezone the clients are in (or if some clients are in different timezones from each other, or if the index is being replicated from a server in one timezone to a server in a different timezone, etc...). The documentation for DateField is very explicit about this (it's why the trailing Z is mandatory). 3) Rounding is always done relative to UTC, largely for all of the same reasons listed above. If you want a specific offset you have to add it in using the DateMath syntax, ie... last_update_date:[NOW/DAY-7DAYS+8HOURS TO NOW/HOUR-5HOURS+8HOURS] -Hoss
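Concretely, for the times quoted above: Wed Mar 31 19:50:48 PDT 2010 is 2010-04-01T02:50:48Z in UTC, NOW/HOUR rounds that down to 2010-04-01T02:00:00Z, and subtracting 5HOURS gives 2010-03-31T21:00:00Z, exactly the value observed. Nothing is being converted; the math simply happens in UTC.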
Re: Is this a bug of the RessourceLoader?
On Mon, Apr 5, 2010 at 2:28 PM, Chris Hostetter hossman_luc...@fucit.org wrote: If text files that start with a BOM aren't properly being dealt with by Solr right now, should we consider that a bug? It's a Java bug: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4508058 But we should fix it if it's practical to do so, rather than passing the buck. -Yonik http://www.lucidimagination.com
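For reference, a minimal sketch of the kind of first-line workaround discussed here (a standalone illustration, not the actual SolrResourceLoader patch): the UTF-8 BOM decodes to U+FEFF, which Java's UTF-8 decoder leaves in place, so it can be stripped from the first line read.

import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

public class BomTolerantReader {
  public static List<String> readLines(InputStream in) throws Exception {
    BufferedReader reader = new BufferedReader(new InputStreamReader(in, "UTF-8"));
    List<String> lines = new ArrayList<String>();
    String line;
    boolean first = true;
    while ((line = reader.readLine()) != null) {
      if (first && line.length() > 0 && line.charAt(0) == '\uFEFF') {
        line = line.substring(1);  // drop the BOM the decoder left in place (Sun bug 4508058)
      }
      first = false;
      lines.add(line);
    }
    return lines;
  }
}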