Search Regression Testing
Hey guys, I'm wondering how people are managing regression testing, in particular with things like text-based search. I.e. if you change how fields are indexed or change boosts in dismax, how do you ensure that critical queries don't start showing bad data? The obvious answer to me was using unit tests. These may be brittle, as some index data can change over time, but I couldn't think of a better way. How is everyone else solving this problem? Cheers, Mark
Re: how to start GarbageCollector
Why is Solr copying my complete index somewhere when I start a delta-import? I copy one core, start a full-import of 35 million docs, and then start a delta-import for the last hour (~2000 docs). DIH/Solr then starts to copy the whole index... why? I think it is copying the index, because my HDD usage starts to increase immediately... My live core finishes a delta in 5-10 seconds!? I ran jconsole during this time; what can it tell me?
---
System: one server, 12 GB RAM, 2 Solr instances, 7 cores; 1 core with 31 million documents, the other cores ~100,000.
- Solr1 for search requests - commit every minute - 5GB Xmx
- Solr2 for update requests - delta every minute - 4GB Xmx
very slow commit. copy of index ?
Hello again ;-) After a full-import of 36M docs my delta-import doesn't work well. If I start my delta (which runs very fast on another core), the commit takes very long. I think Solr copies the whole index, commits the new documents into the index, and then reduces the index size after these operations!? I start the delta over DIH with: command=delta-import&optimize=false&commit=true. jconsole is running too, but I don't know in which way jconsole can help me... thx! =)
Re: command is still running ? delta-import?
I have the same problem. Any resolutions?
Re: Search Regression Testing
Hi Mark, What we're doing is using a bunch of acceptance tests with JBehave to drive our testing. We run this in a clean-room environment, clearing out the indexes before a test run and inserting the data we're interested in. As well as tests to ensure things just work, we have a bunch of tests that insert data and check it comes out in the order we're expecting, so unexpected changes to boosts etc. can be caught early. While this doesn't tell us what a certain query will return against our live data set, it does affirm our assertions about the abstract case. You could use a similar technique to insert a bunch of data and then check your critical queries. -- Colin Vipurs, Server Team Lead, Shazam Entertainment Ltd
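To make the clean-room approach concrete, here is a minimal sketch of such a test using SolrJ (3.1-era API) and JUnit. The URL, the field names (id, title) and the expected ordering are hypothetical placeholders; the point is only the shape: wipe, insert known data, assert the ranking.

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrInputDocument;
    import org.junit.Test;
    import static org.junit.Assert.assertEquals;

    public class RelevancyRegressionTest {
        @Test
        public void boostedTitleMatchRanksFirst() throws Exception {
            SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
            server.deleteByQuery("*:*"); // clean room: start from an empty index
            server.add(doc("1", "mothers day cards"));
            server.add(doc("2", "cards for mothers"));
            server.commit();

            QueryResponse rsp = server.query(new SolrQuery("mothers day"));
            // A boost or analysis change that breaks the expected order fails the build.
            assertEquals("1", rsp.getResults().get(0).getFieldValue("id"));
        }

        private SolrInputDocument doc(String id, String title) {
            SolrInputDocument d = new SolrInputDocument();
            d.addField("id", id);
            d.addField("title", title);
            return d;
        }
    }

Run this against a throwaway core so the deleteByQuery never touches real data.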
solr faceted search performance reason
Hello List, Please see my question at http://stackoverflow.com/questions/5552919/how-does-lucene-solr-achieve-high-performance-in-multi-field-faceted-search. I would be interested to know some details. Thank you, Robin
Re: Search Regression Testing
Mark, In one project, with Lucene not Solr, I also use a smallish unit-test sample and apply some queries there. It is very limited but it is automatable. I find a better way is to have precision and recall measures from real users, run release after release, though sadly I have never managed to apply this fully on a recurring basis. My ideal world would be that the search sample is small enough and that users are able to restrict search to it. Then users have the possibility of checking the correctness of each result (say, the first 10) for each query, from which one can then read off results. Often, users provide comments along the way, e.g. missing matches. This is packed into a wiki page. First samples generally do not exercise enough of the features; this is adjusted as a dialogue. As a developer I review the test-suite run and plan the next adjustments. The numeric approach allows easy mean precision and mean recall, which is good for reporting. My best reference for P/R testing and other forms of testing is Kavi Mahesh's Text Retrieval Quality: a primer: http://www.oracle.com/technetwork/database/enterprise-edition/imt-quality-092464.html I would love to hear more of what the users have been doing. paul
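For readers who want to turn such user judgments into numbers, here is a small self-contained sketch of precision@k and recall for one judged query (the data layout is invented for illustration):

    import java.util.*;

    public class PrecisionRecall {
        // precision@k: fraction of the top k returned ids that were judged relevant.
        static double precisionAtK(List<String> returned, Set<String> relevant, int k) {
            int hits = 0;
            for (int i = 0; i < Math.min(k, returned.size()); i++) {
                if (relevant.contains(returned.get(i))) hits++;
            }
            return hits / (double) k;
        }

        // recall: fraction of all relevant ids that were returned at any rank.
        static double recall(List<String> returned, Set<String> relevant) {
            int hits = 0;
            for (String id : returned) if (relevant.contains(id)) hits++;
            return relevant.isEmpty() ? 1.0 : hits / (double) relevant.size();
        }

        public static void main(String[] args) {
            List<String> returned = Arrays.asList("d1", "d7", "d3");
            Set<String> relevant = new HashSet<String>(Arrays.asList("d1", "d3", "d9"));
            System.out.println("P@3 = " + precisionAtK(returned, relevant, 3)); // 0.666...
            System.out.println("R   = " + recall(returned, relevant));          // 0.666...
        }
    }

Averaging these per-query numbers across the sample gives the mean precision and mean recall Paul reports on.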
Re: help with Jetty log message
As far as I am aware, licensing issues make that impossible for us... On 04/05/2011 07:29 PM, Kaufman Ng wrote: Looks like you are using OpenJDK. Can you try using the Sun JDK? On Mon, Apr 4, 2011 at 6:53 AM, Upayavira u...@odoko.co.uk wrote: This is not Solr crashing, per se, it is your JVM. I personally haven't generally had much success debugging these kinds of failure - see whether it happens again, and if it does, try updating your JVM/switching to another/etc. Anyone have better advice? Upayavira On Mon, 04 Apr 2011 11:59 +0200, Matthieu Huin matthieu.h...@wallix.com wrote: Greetings all, I am currently using Solr as the backend behind a log aggregation and search system my team is developing. All was well and good until I noticed a test server crashing quite unexpectedly. We'd like to dig more into the incident but none of us has much experience with Jetty crash logs - not to mention that our Java is very rusty. The crash log is attached. Could anyone help us understand what went wrong there? Also, would it be possible and/or wise to automatically restart the server in case of such a crash? Thanks for your help. If you need any extra info about this case, do not hesitate to ask! Matthieu Huin Email had 1 attachment: + hs_err_pid5033.log 26k (text/x-log) --- Enterprise Search Consultant at Sourcesense UK, Making Sense of Open Source
How to avoid Lock file generation - solr 1.4.1
I am using Solr 1.4.1 (Windows OS) and below are the settings in my solrconfig file:

    <writeLockTimeout>1000</writeLockTimeout>
    <commitLockTimeout>1</commitLockTimeout>
    <ramBufferSizeMB>32</ramBufferSizeMB>
    <maxMergeDocs>1</maxMergeDocs>
    <lockType>native</lockType>

While writing the index I am using the post procedure: posting the XML with a solr/update HTTP request. I am getting the following error:

    SEVERE: Could not start SOLR. Check solr/home property
    java.nio.channels.OverlappingFileLockException
        at sun.nio.ch.FileChannelImpl$SharedFileLockTable.checkList(Unknown Source)
        at sun.nio.ch.FileChannelImpl$SharedFileLockTable.add(Unknown Source)
        at sun.nio.ch.FileChannelImpl.tryLock(Unknown Source)
        at java.nio.channels.FileChannel.tryLock(Unknown Source)
        at org.apache.lucene.store.NativeFSLock.obtain(NativeFSLockFactory.java:233)
        at org.apache.lucene.store.Lock.obtain(Lock.java:73)
        at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:1545)
        at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:1402)
        at org.apache.solr.update.SolrIndexWriter.<init>(SolrIndexWriter.java:190)
        at org.apache.solr.update.UpdateHandler.createMainIndexWriter(UpdateHandler.java:98)
        at org.apache.solr.update.DirectUpdateHandler2.openWriter(DirectUpdateHandler2.java:173)
        at org.apache.solr.update.DirectUpdateHandler2.forceOpenWriter(DirectUpdateHandler2.java:376)
        at org.apache.solr.handler.ReplicationHandler.inform(ReplicationHandler.java:845)
        at org.apache.solr.core.SolrResourceLoader.inform(SolrResourceLoader.java:486)
        at org.apache.solr.core.SolrCore.<init>(SolrCore.java:588)
        at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:137)
        at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:83)
        at org.apache.catalina.core.ApplicationFilterConfig.getFilter(ApplicationFilterConfig.java:295)
        at org.apache.catalina.core.ApplicationFilterConfig.setFilterDef(ApplicationFilterConfig.java:422)
        at org.apache.catalina.core.ApplicationFilterConfig.<init>(ApplicationFilterConfig.java:115)
        at org.apache.catalina.core.StandardContext.filterStart(StandardContext.java:4071)
        at org.apache.catalina.core.StandardContext.start(StandardContext.java:4725)
        at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:799)
        at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:779)
        at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:601)
        at org.apache.catalina.startup.HostConfig.deployDescriptor(HostConfig.java:675)
        at org.apache.catalina.startup.HostConfig.deployDescriptors(HostConfig.java:601)
        at org.apache.catalina.startup.HostConfig.deployApps(HostConfig.java:502)
        at org.apache.catalina.startup.HostConfig.check(HostConfig.java:1383)
        at org.apache.catalina.startup.HostConfig.lifecycleEvent(HostConfig.java:306)
        at org.apache.catalina.util.LifecycleSupport.fireLifecycleEvent(LifecycleSupport.java:142)
        at org.apache.catalina.core.ContainerBase.backgroundProcess(ContainerBase.java:1385)
        at org.apache.catalina.core.ContainerBase$ContainerBackgroundProcessor.processChildren(ContainerBase.java:1649)
        at org.apache.catalina.core.ContainerBase$ContainerBackgroundProcessor.processChildren(ContainerBase.java:1658)
        at org.apache.catalina.core.ContainerBase$ContainerBackgroundProcessor.run(ContainerBase.java:1638)
        at java.lang.Thread.run(Unknown Source)

What are the correct settings for avoiding this lock file?
Re: Script to remove all index.* leftovers
Yes my mistake, you're right about #1. On Wednesday 06 April 2011 05:25:50 William Bell wrote: Thank you for pointing out #2. The commitsToKeep is interesting, but I thought each commit would create a segment (before optimize) and be self-contained in the index.* directory? I would only run this on the slave. Bill On Tue, Apr 5, 2011 at 2:54 PM, Markus Jelsma markus.jel...@openindex.io wrote: Hi, This seems alright as it leaves the current index in place, doesn't mess with the spellchecker and leaves the properties alone. But there are two problems: 1. it doesn't take into account the commitsToKeep value set in the deletion policy, and; 2. it will remove any directory to which a currently downloading replication is targetted. Issue 1 may not be a big issue as most users leave only one commit on disk, but 2 is a real problem in master/slave architectures. Cheers, There is a bug that leaves old index.* directories in the Solr data directory. Here is a script that will clean it up. I wanted to make sure this is okay, without doing a core reload. Thanks.

    #!/bin/bash
    DIR=/mnt/servers/solr/data
    LIST=`ls $DIR`
    INDEX=`cat $DIR/index.properties | grep index\= | awk 'BEGIN { FS = "=" } ; { print $2 }'`
    echo $INDEX
    for file in $LIST
    do
      if [ "$INDEX" == "$file" -o "$file" == "index" -o "$file" == "index.properties" -o "$file" == "replication.properties" -o "$file" == "spellchecker" ]
      then
        echo "skip: $file"
      else
        echo "rm -rf $DIR/$file"
        rm -rf $DIR/$file
      fi
    done

-- Markus Jelsma - CTO - Openindex http://www.linkedin.com/in/markus17 050-8536620 / 06-50258350
Re: how to reset the index in solr
Hi Marcus, Your curl cmds don't work in that format on my unix. I converted them as follows, and they still don't work:

    $ curl --fail $solrIndex/update?commit=true -d '<delete><query>*:*</query></delete>'
    $ curl --fail $solrIndex/update -d '<commit/>'

From the browser: http://localhost:8080/solr/update?commit=true%20-d%20%27%3Cdelete%3E%3Cquery%3E*:*%3C/query%3E%3C/delete%3E%27 This is the response I get:

    <response>
      <lst name="responseHeader"><int name="status">0</int><int name="QTime">18</int></lst>
    </response>

The only thing that works:

    $ rm -r SOLR_HOME/solr
    $ CATALINA_HOME/bin/catalina.sh stop
    $ CATALINA_HOME/bin/catalina.sh start

I'm running a single-core instance. I'm using this nutch script [1] and this [2] hints at my Solr config. [1] http://wiki.apache.org/nutch/Whole-Web%20Crawling%20incremental%20script [2] http://wiki.apache.org/solr/Troubleshooting%20HTTP%20Status%20404%20-%20missing%20core%20name%20in%20path?action=recallrev=1
Solr architecture diagram
Hi, At Cominvent we've often had the need to visualize the internal architecture of Apache Solr in order to explain both the relationships of the components as well as the flow of data and queries. The result is a conceptual architecture diagram, clearly showing how Solr relates to the app-server, how cores relate to a Solr instance, how documents enter through an UpdateRequestHandler, through an UpdateChain and Analysis and into the Lucene index etc. The drawing is created using Google draw, and the original is shared on Google Docs. We have licensed the diagram under the permissive Creative Commons CC-by license which lets you use, modify and re-distribute the diagram, even commercially, as long as you attribute us with a link. Check it out at http://ow.ly/4sOTm We'd love your comments -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com
Re: how to reset the index in solr
Solved. The correct translation of Marcus' cmd:

    $ curl http://localhost:8080/solr/update?commit=true -H "Content-Type: text/xml" --data-binary '<delete><query>*:*</query></delete>'

http://stackoverflow.com/questions/2358476/solr-delete-not-working-for-some-reason NB: the response is still not what I'd expect:

    <?xml version="1.0" encoding="UTF-8"?>
    <response>
      <lst name="responseHeader"><int name="status">0</int><int name="QTime">57</int></lst>
    </response>

-- Regards, K. Gabriele
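The same reset can also be done from Java with SolrJ instead of curl; a minimal sketch (3.1-era API, URL as in Gabriele's setup):

    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

    public class ResetIndex {
        public static void main(String[] args) throws Exception {
            CommonsHttpSolrServer server =
                new CommonsHttpSolrServer("http://localhost:8080/solr");
            server.deleteByQuery("*:*"); // queue deletion of every document
            server.commit();             // make the empty index visible to searchers
        }
    }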
Re: Solr architecture diagram
Nice, thank you! I wish there was something similar, or an extension to this one, depicting where SolrJ's CommonsHttpSolrServer and EmbeddedSolrServer fit in. Regards, Stevo.
sort by function problem
I am trying to use sort by function in the new release, Solr 3.1, but I have some problems. For example:

    http://localhost:8983/new_search/select?q=mothers day&indent=true&fl=templateSetId,score,templateSetPopularity&sort=product(templateSetPopularity,query(mothers day)) desc

templateSetPopularity is my field with the popularity rank; with query(mothers+day,0.0) I try to get the score value. As a result I get the error:

    HTTP Status 400 - Can't determine Sort Order: 'sum(templateSetPopularity,query(mothers day)) desc', pos=3

Where is my error?
RE: FW: Very very large scale Solr Deployment = how to do (Expert Question)?
Hi all, I'd love to share the diagram, I'm just not sure how to do that on the list (it's a Word document I tried to send as an attachment). Jens, to answer your questions:

1. Correct, in our setup the source of the data is a DB from which we pull the data using DIH (search the list for my previous post "DIH - deleting documents, high performance (delta) imports, and passing parameters" if you want info about that). We were lucky enough to have the data sharded at the DB level before we started using Solr, so using the same shards was an easy extension. Note that we're not (yet...) using SolrCloud, it was just something I thought you should consider.

2. I got the idea for the aggregator from the Solr book (PACKT). I don't remember if that term was used in the book or if I made it up (if Google doesn't know it, I probably made it up...), but I think it conveys what this part of the puzzle does. As you said, this is simply a Solr instance which doesn't hold its own index, but shares the same schema as the slaves and masters. I actually defined the default query handler on this instance to include the shards parameter (see below), so the client doesn't have to know anything about the internal workings of the sharded setup; it just hits the aggregator load balancer with a regular query and everything is handled behind the scenes. This simplifies the client and allows me to change the architecture in the future (i.e. change the number of shards or their DNS names) without requiring a client change.

Sharded query handler:

    <requestHandler name="sharded" class="solr.SearchHandler" default="${aggregator:false}">
      <!-- default values for query parameters -->
      <lst name="defaults">
        <str name="echoParams">explicit</str>
        <str name="shards">${slaveUrls:null}</str>
      </lst>
    </requestHandler>

All of our Solr instances share the same configs (solrconfig.xml, schema.xml, etc.) and different instances take different roles according to properties defined in solr.xml, which is generated by a script specifically for each Solr instance (the script has a map of which instances should be on which host, and has to be run once on each host). In this case, this is how the generated solr.xml looks:

    <solr sharedLib="../lib" persistent="true">
      <property name="name" value="aggregator" />
      <!-- just a name that appears in Solr management,
           to make it easier to know which instance you're on -->
      <property name="aggregator" value="true" />
      <!-- this tells the instance it is an aggregator, so it should use the
           sharded request handler by default; masters and slaves have
           master/slave properties accordingly to define replication,
           a regular default search handler for slaves, and DIH on masters -->
      <property name="shardID" value="" />
      <!-- this is used by instances which are shards in order to determine
           which DB they should import from (masters)
           and which master they should replicate from (slaves) -->
      <property name="slaveUrls" value="long,list.of,shard.urls" />
      <!-- used by the sharded request handler -->
      <property name="HealthCheckDir" value="/data/servers/x_solr/aggregator/core0/conf" />
      <!-- used by the load balancer to know if this instance is alive -->
      <cores adminPath="/admin/cores" defaultCoreName="prod">
        <core name="prod" instanceDir="core0"/>
        <!-- just one core for this instance;
             indexers have 2 cores, one prod and one for full reindex -->
      </cores>
    </solr>

Let me know if I can assist any further.
Ephraim Ofir

-----Original Message-----
From: Jonathan DeMello [mailto:demello@googlemail.com] Sent: Wednesday, April 06, 2011 8:58 AM To: solr-user@lucene.apache.org Cc: Isan Fulia; Tirthankar Chatterjee Subject: Re: FW: Very very large scale Solr Deployment = how to do (Expert Question)?

I third that request. Would greatly appreciate taking a look at that diagram! Regards, Jonathan

On Wed, Apr 6, 2011 at 9:12 AM, Isan Fulia isan.fu...@germinait.com wrote: Hi Ephraim/Jens, Can you share that diagram with all? It may really help all of us. Thanks, Isan Fulia.

On 6 April 2011 10:15, Tirthankar Chatterjee tchatter...@commvault.com wrote: Hi Jens, Can you please forward the diagram attachment too that Ephraim sent. :-) Thanks, Tirthankar

-----Original Message-----
From: Jens Mueller [mailto:supidupi...@googlemail.com] Sent: Tuesday, April 05, 2011 10:30 PM To: solr-user@lucene.apache.org Subject: Re: FW: Very very large scale Solr Deployment = how to do (Expert Question)?

Hello Ephraim, thank you so much for the great Document/Scaling-Concept!! First I think you really should publish this on the solr wiki. This approach is nowhere
solr-2351 patch
Hi, Could you tell me which Solr version the patch file for SOLR-2351 (https://issues.apache.org/jira/secure/attachment/12470560/mlt.patch) is meant for? Regards! Isha
RE: Embedded Solr constructor not returning
Hi Greg, I need the servlet API in my app for it to work, despite being command line. So adding this to the Maven POM fixed everything:

    <dependency>
      <groupId>javax.servlet</groupId>
      <artifactId>servlet-api</artifactId>
      <version>2.5</version>
    </dependency>

Perhaps this dependency could be listed on the wiki? Alongside the sample code for using embedded Solr? http://wiki.apache.org/solr/Solrj Sounds good. Please go ahead and make this change yourself. FYI, the Solr 3.1 POM has a servlet-api dependency, but the scope is "provided", because the servlet container includes this dependency. When *you* are the container, you have to provide it. Steve
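For context, the embedded usage that pulls the servlet API in at runtime looks roughly like this sketch (SolrJ wiki style, 1.4/3.1-era API; the solr home path is a placeholder):

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
    import org.apache.solr.core.CoreContainer;

    public class EmbeddedExample {
        public static void main(String[] args) throws Exception {
            System.setProperty("solr.solr.home", "/path/to/solr/home"); // placeholder
            CoreContainer.Initializer initializer = new CoreContainer.Initializer();
            CoreContainer coreContainer = initializer.initialize();
            SolrServer server = new EmbeddedSolrServer(coreContainer, ""); // "" = default core
            System.out.println("ping status: " + server.ping().getStatus());
            coreContainer.shutdown();
        }
    }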
Re: Solrj and display which Solr version is used
The only way I know of (and it's a little, well, a lot arcane) is to ping the admin/system handler. As it happens, I just had to do something like this. This uses Apache Commons HttpClient 3.x, NOT the most recent, FWIW... The URL path can be whatever is registered under admin/, see solrconfig.xml. I'd really like to find out that there's an easier way. This brings back everything on the admin/info page.

    public static void main(String[] args) {
      HttpMethod method = new GetMethod("http://localhost:8983/solr/admin/system");
      try {
        CommonsHttpSolrServer server = new CommonsHttpSolrServer("http://localhost:");
        HttpClient client = server.getHttpClient();
        int statusCode = client.executeMethod(method);
        // Really, you'd want to do something here.
        byte[] responseBody = method.getResponseBody();
        System.out.println(new String(responseBody));
      } catch (Exception e) {
        e.printStackTrace();
      } finally {
        // Release the connection.
        method.releaseConnection();
      }
    }

On Tue, Apr 5, 2011 at 5:46 AM, Marc SCHNEIDER marc.schneide...@gmail.com wrote: Hi, I'm wondering how to find out which version of Solr is currently running using the Solrj library? Thanks, Marc.
Re: what happens to docsPending if stop solr before commit
They're lost, never to be seen again. You'll have to reindex them. Best Erick On Tue, Apr 5, 2011 at 4:25 PM, Robert Petersen rober...@buy.com wrote: Hello fellow enthusiastic solr users, I tried to find the answer to this simple question online, but failed. I was wondering: what happens to uncommitted docsPending if I stop Solr and then restart it? Are they lost? Are they still there but uncommitted? Do they get committed at startup? Noticing that after a restart my 250K pending doc count went to 0 is what got me wondering. TIA! Robi
Re: Synonym-time Reindexing Issues
Hmmm, this should work just fine. Here are my questions:
1) Are you absolutely sure that the new synonym file is available when reindexing?
2) Does the sunspot program do anything wonky with the IDs? The documents will only be replaced if the IDs are identical.
3) Are you sure that a commit is done at the end?
4) What happens if you optimize? At that point, maxDocs and numDocs should be the same, and should be the count of documents. If they differ by a factor of 2, I'd suspect your id field isn't being used correctly.
If the hypothesis is that your id field isn't working correctly, your number of hits should be going up after re-indexing... If none of that is relevant, let us know what you find and we'll try something else. Best Erick On Tue, Apr 5, 2011 at 10:46 PM, Preston Marshall pres...@synergyeoc.com wrote: Hello all, I am having an issue with Solr and the SynonymFilterFactory. I am using a library to interface with Solr called sunspot. I realize that is not what this list is for, but I believe this may be an issue with Solr, not the library (plus the lib author doesn't know the answer). I am using the SynonymFilterFactory in my index-time analyzer, and it works great. My only problem is when it comes to changing the synonyms file. I would expect to be able to edit the file, run a reindex (this is through the library), and have the new synonyms function when the reindex is complete. Unfortunately this is not the case, as changing the synonyms file doesn't actually affect the search results. What DOES work is deleting the existing index and starting from scratch. This is unacceptable for my usage though, because I need the old index to remain online while the new one is being built, so there is no downtime. Here's my schema in case anyone needs it: https://gist.github.com/88f8fb763e99abe4d5b8 Thanks, Preston P.S. Sorry if this dupes, first post and I didn't see it show up in the archives.
Re: solr faceted search performance reason
Please re-post the question here so others can see the discussion without going to another list. Best Erick
Re: sort by function problem
The problem is query(mothers day). See http://wiki.apache.org/solr/FunctionQuery#query You can't directly include query syntax because the function parser wouldn't know how to get to the end of that syntax. You could either do query($qq) and then add qq=mothers day to the request, or, if you really wanted the whole thing inline, you could do query({!v='mothers day'}). But the first form is nicer since you don't have to worry about escaping at all, and I think it's also more readable. -Yonik http://www.lucenerevolution.org -- Lucene/Solr User Conference, May 25-26, San Francisco
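Spelled out with SolrJ, the first (dereferenced) form might look like this sketch, reusing the field names from the original post:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

    public class SortByFunction {
        public static void main(String[] args) throws Exception {
            CommonsHttpSolrServer server =
                new CommonsHttpSolrServer("http://localhost:8983/new_search");
            SolrQuery q = new SolrQuery("mothers day");
            q.set("fl", "templateSetId,score,templateSetPopularity");
            // Dereference the inner query so the function parser never sees raw spaces:
            q.set("qq", "mothers day");
            q.set("sort", "product(templateSetPopularity,query($qq)) desc");
            System.out.println(server.query(q).getResults().getNumFound());
        }
    }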
Migrating from solr 1.4.1 to 3.1.0
Hi all, Solr 3.1.0 uses a different javabin format from 1.4.1. So if I use the Solrj 1.4.1 jar, I get a javabin error while saving to 3.1.0, and if I use the Solrj 3.1.0 jar, I get a javabin error while reading documents from Solr 1.4.1. How should I go about reindexing in this situation? -- Thanks & Regards, Isan Fulia.
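One workaround worth trying (a sketch, not tested against both versions): make SolrJ talk XML instead of javabin, since the XML format is compatible where javabin is not. CommonsHttpSolrServer lets you swap the response parser:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.impl.XMLResponseParser;

    public class CrossVersionClient {
        public static void main(String[] args) throws Exception {
            CommonsHttpSolrServer server =
                new CommonsHttpSolrServer("http://localhost:8983/solr");
            // Responses default to javabin; switch to XML, which both versions speak.
            // (Updates already go as XML unless a BinaryRequestWriter was configured.)
            server.setParser(new XMLResponseParser());
            System.out.println(server.query(new SolrQuery("*:*")).getResults().getNumFound());
        }
    }

With that in place, one client could read from the 1.4.1 index and write to the 3.1.0 one, at some speed cost versus javabin.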
Solr: Images, Docs and Binary data
Hello everyone, I need to know if someone has used Solr for indexing and storing images (up to 16MB) or binary docs. How does Solr behave with this type of docs? How does it affect performance? Thanks everyone -- __ Ezequiel. Http://www.ironicnet.com
dataimporhandler
Hello, I have a problem with DataImportHandler. I want to index many products directly from the DB with this component, indexing them little by little... and every time I finish a piece I want to be sure that the indexes are committed before going on with the next piece. I see that I can ask Solr and it responds with XML which says "committed" inside the text, so I tried to go on, but it was not true... I lose some documents. Do you know why? thanx -- Gastone Penzo *www.solr-italia.it* *The first italian blog about Apache Solr*
Re: Solrj and display which Solr version is used
Ok thanks, that's an idea :-) Maybe we should suggest adding a method to CommonsHttpSolrServer that returns Solr's version... Marc.
Re: solr faceted search performance reason
Carbon copied:
*Context* This is a question mainly about Lucene (or possibly Solr) internals. The main topic is *faceted search*, in which search can happen along multiple independent dimensions (facets) of objects (for example the size, speed, and price of a car). When implemented with a relational database, multi-field indices are not useful for a large number of facets, since facets can be searched in any order: a specific ordered multi-index has a low chance of being used, and creating all possible orderings of indices is unbearable. Solr is advertised to cope well with the faceted search task, which, if I think correctly, has to be connected with Lucene (supposedly) performing well on multi-field queries (where fields of a document relate to facets of an object).
*Question* The *inverted index* of Lucene can be stored in a relational database, and naturally taking the intersections of the matching documents can also be trivially achieved with an RDBMS using single-field indices. Therefore, Lucene supposedly has some advanced technique for multi-field queries other than just taking the intersection of matching documents based on the inverted index. So the question is, what is this technique/trick? More broadly: why can Lucene/Solr achieve better faceted search performance, theoretically, than an RDBMS could (if so)?
*Note: My first guess would be that Lucene uses some space-partitioning method for partitioning a vector space built from the document fields as dimensions, but as I understand it, Lucene is not purely vector-space based.*
Thanks, Robin
Re: dismax boost query not useful?
On 4/5/2011 1:17 PM, Chris Hostetter wrote: the boost param of edismax is probably a lot better choice than either bq/bf -- but it really depends on whether you want an additive boost or a multiplicative one (of course with the function query syntax, add(), product() and query() can be combined in any way you want). In terms of the merits of bq vs bf, if we wanted to get rid of one or the other, I'd argue for eliminating bf since it has *very* brittle parsing rules in place (for historic reasons). While you can use variable dereferencing to get the guts of either a bf=query($a) or a bq={!func v=$a}, promoting the use of bq over bf makes using the param body inline simpler, so people are less likely to run into problems (ie: bq={!func}... doesn't require any special escaping, but bf=query(...) does).
We aren't yet using dismax in production, but I've had it in my config for a while now. I've changed it to edismax in the 3.1 setup I'm putting together now. It has the following in the bf parameter: recip(ms(NOW/DAY,pd),3.16e-11,1,1) Is there a way to do this without bf? I couldn't make heads or tails of what you wrote above. Thanks, Shawn
Re: dismax boost query not useful?
On Wed, Apr 6, 2011 at 12:00 PM, Shawn Heisey s...@elyograg.org wrote: [...] Is there a way to do this without bf? bf parsing is fragile because it is a space-delimited list of functions (meaning no functions may have whitespace in them). bf also adds the function to the query score, but for boosting one is normally better off multiplying. With edismax, you can get a multiplicative boost via boost=recip(ms(NOW/DAY,pd),3.16e-11,1,1) -Yonik http://www.lucenerevolution.org -- Lucene/Solr User Conference, May 25-26, San Francisco
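As a request this might look like the following SolrJ sketch (same pd date field as in Shawn's config; the URL and query string are placeholders):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

    public class EdismaxBoost {
        public static void main(String[] args) throws Exception {
            CommonsHttpSolrServer server =
                new CommonsHttpSolrServer("http://localhost:8983/solr");
            SolrQuery q = new SolrQuery("some user query");
            q.set("defType", "edismax");
            // Multiplicative recency boost: newer docs (date field pd) score higher.
            q.set("boost", "recip(ms(NOW/DAY,pd),3.16e-11,1,1)");
            System.out.println(server.query(q).getResults().getNumFound());
        }
    }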
RE: Using MLT feature
Yes, I had already checked the code for it and used it to write a C# method that returns the same signature. But I have a strange issue. For instance, using minTokenLen=2 and the default QUANT_RATE, passing the text frederico (simple text, no big deal here):
1. using my C# app I get 8b92e01d67591dfc60adf9576f76a055
2. using SOLR, passing a doc with HeadLine frederico, I get 8d9a5c35812ba75b8383d4538b91080f in my signature field.
3. using a Java app (I'm not a Java expert...) built on the code from SOLR's SignatureUpdateProcessorFactory class (please check the code below), I get 8b92e01d67591dfc60adf9576f76a055.

Java app code:

    TextProfileSignature textProfileSignature = new TextProfileSignature();
    NamedList<String> params = new NamedList<String>();
    params.add("", "");
    SolrParams solrParams = SolrParams.toSolrParams(params);
    textProfileSignature.init(solrParams);
    textProfileSignature.add("frederico");
    byte[] signature = textProfileSignature.getSignature();
    char[] arr = new char[signature.length << 1];
    for (int i = 0; i < signature.length; i++) {
      int b = signature[i];
      int idx = i << 1;
      arr[idx] = StrUtils.HEX_DIGITS[(b >> 4) & 0xf];
      arr[idx + 1] = StrUtils.HEX_DIGITS[b & 0xf];
    }
    String sigString = new String(arr);
    System.out.println(sigString);

Here's my processor config:

    <updateRequestProcessorChain name="dedupe">
      <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
        <bool name="enabled">true</bool>
        <str name="signatureField">sig</str>
        <bool name="overwriteDupes">false</bool>
        <str name="fields">HeadLine</str>
        <str name="signatureClass">org.apache.solr.update.processor.TextProfileSignature</str>
        <str name="minTokenLen">2</str>
      </processor>
      <processor class="solr.LogUpdateProcessorFactory" />
      <processor class="solr.RunUpdateProcessorFactory" />
    </updateRequestProcessorChain>

So both my apps (Java and C#) return the same signature, but SOLR returns a different one. Can anyone see what I'm doing wrong? Thank you once again. Frederico

-----Original Message-----
From: Markus Jelsma [mailto:markus.jel...@openindex.io] Sent: terça-feira, 5 de Abril de 2011 15:20 To: solr-user@lucene.apache.org Cc: Frederico Azeiteiro Subject: Re: Using MLT feature

If you check the code for TextProfileSignature [1] you'll notice the init method reading params. You can set those params as you did. Reading the Javadoc [2] might help as well. But what's not documented in the Javadoc is how QUANT is computed; it rounds. [1]: http://svn.apache.org/viewvc/lucene/solr/branches/branch-1.4/src/java/org/apache/solr/update/processor/TextProfileSignature.java?view=markup [2]: http://lucene.apache.org/solr/api/org/apache/solr/update/processor/TextProfileSignature.html

On Tuesday 05 April 2011 16:10:08 Frederico Azeiteiro wrote: Thank you, I'll try to create a C# method to produce the same sig as SOLR, and then compare both sigs before indexing the doc. This way I can avoid indexing existing docs. If anyone needs to use this parameter (as this info is not on the wiki), you can add the option <str name="minTokenLen">5</str> on the processor tag. Best regards, Frederico

-----Original Message-----
From: Markus Jelsma [mailto:markus.jel...@openindex.io] Sent: terça-feira, 5 de Abril de 2011 12:01 To: solr-user@lucene.apache.org Cc: Frederico Azeiteiro Subject: Re: Using MLT feature

On Tuesday 05 April 2011 12:19:33 Frederico Azeiteiro wrote: Sorry, the reply I made yesterday was directed to Markus and not the list... Here are my thoughts on this. At this point I'm a little confused whether SOLR is a good option for finding near-duplicate docs.
Yes there is: try setting overwriteDupes to true, and documents yielding the same signature will be overwritten. The problem is that I don't want to overwrite the doc; I need to keep the original version (because the doc has other fields I need to maintain). If you need both fuzzy and exact matching, then add a second update processor inside the chain and create another signature field. I just need the fuzzy search, but in the quick tests I made, different signatures are returned for what I consider duplicate docs: "Army deploys as clan war kills 11 in Philippine south" and "Army deploys as clan war kills 11 in Philippine south." get the same sig, that's ok. But a different sig was created for: "Army deploys as clan war kills 11 in Philippine south the". Is there a way to set up the TextProfileSignature parameters to adjust the sensitivity in SOLR (QUANT_RATE or MIN_TOKEN_LEN)? Do you think that these parameters can help creating the same sig for the
Re: dataimporhandler
There's not much to go on here, can you provide details on how you check that you've committed? How are you configuring DIH? etc. It might be helpful to review: http://wiki.apache.org/solr/UsingMailingLists Best Erick
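One way to make "wait until the import is done" explicit, rather than trusting the first XML response, is to poll DIH's status command until it reports idle. A rough sketch (the handler path /dataimport and the top-level "status" key in the response are assumptions; check your solrconfig.xml and an actual response):

    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.request.QueryRequest;
    import org.apache.solr.common.params.ModifiableSolrParams;
    import org.apache.solr.common.util.NamedList;

    public class WaitForDih {
        public static void main(String[] args) throws Exception {
            CommonsHttpSolrServer server =
                new CommonsHttpSolrServer("http://localhost:8983/solr");
            while (true) {
                ModifiableSolrParams p = new ModifiableSolrParams();
                p.set("command", "status");
                QueryRequest req = new QueryRequest(p);
                req.setPath("/dataimport"); // wherever DIH is registered
                NamedList<Object> rsp = server.request(req);
                if ("idle".equals(rsp.get("status"))) break; // import finished
                Thread.sleep(1000);
            }
            // with commit=true on the import, it is now safe to start the next batch
        }
    }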
Re: dismax boost query not useful?
On Apr 5, 2011, at 3:17 PM, Chris Hostetter wrote: one of the original use cases for bq was for artificial keyword boosting, in which case it still comes in handy... bq=meta:promote^100 text:new^10 category:featured^100 (*:* -category:accessories)^10
Yeah, I thought of this specific use-case. There are two issues with it though:
1. Each piece is still subject to the IDF component of the score, requiring me to give each individual category a boost factoring that in. For example, if I want meta:promote to be twice as boosted as category:featured, I can't simply boost the first to 2 and the second to 1 (the default); I have to enable debugQuery and carefully skew them toward what I want. And the IDF might change as the data changes.
2. It still adds instead of multiplying, and multiplying is always what I want. (Should I not always want it?)
It's hard to actually avoid the IDF irrespective of which parameter you use. The only way I know to give a fielded query a constant score is a range query, which is a total hack, e.g. meta:[promote TO promote], which you could then boost. Ick! ~ David Smiley Author: http://www.packtpub.com/solr-1-4-enterprise-search-server/
Re: Solr: Images, Docs and Binary data
Another question that is maybe easier to answer: how can I store binary data? Any example schema?
Re: solr faceted search performance reason
On 4/6/2011 10:55 AM, Robin Palotai wrote: Therefore, Lucene supposedly has some advanced technique for multi-field queries other than just taking the intersection of matching documents based on the inverted index.
I don't think so, necessarily. It's just that Lucene's algorithms for doing this are very fast, with some additional optimizations to make them even faster. There may be some edge cases where the optimizations take shortcuts on top of this -- ie, if you ask for only the first ten facet values ordered by number of hits, in some cases Solr/Lucene won't even calculate the hit counts for facet values it already knows aren't going to be in the top 10. The faceting code in 1.4+ is actually kind of tangled, in that several different calculation approaches can be taken depending on the nature of the result set and schema.
But anyway, I think you're right that you could set up an RDBMS schema to _conceptually_ allow very similar operations to a Lucene index. It would be unlikely to perform as well, because the devil is in the details of the storage formats and algorithms, and Lucene has been optimized for these particular cases (at the expense of not covering a great many cases that an RDBMS can cover).
In fact, while I can't find it now on Google, I think someone HAS in the past written an extension to Lucene to have it store its indexes in an RDBMS, using a schema much like you describe, instead of in the file system. I'm not sure why they would want to do this instead of just using the RDBMS -- either Lucene's access algorithms still provide a performance benefit even when using an RDBMS as the underlying 'file system', or Lucene provides convenient functions that you wouldn't want to have to re-implement yourself solely in terms of an RDBMS, or both. Ah, here's a brief reference to that approach in the Lucene FAQ: http://wiki.apache.org/lucene-java/LuceneFAQ#Can_I_store_the_Lucene_index_in_a_relational_database.3F Jonathan
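To make "the devil is in the details of the storage formats" a bit more tangible, here is a toy illustration (emphatically not Solr's real code) of why facet counting over an uninverted field is cheap: one array read per hit, no joins or B-tree lookups.

    // Toy model: docToColor[doc] holds an ordinal into colorValues[].
    public class ToyFacetCount {
        public static void main(String[] args) {
            String[] colorValues = {"red", "green", "blue"};
            int[] docToColor = {0, 2, 2, 1, 0, 2}; // per-document value ordinals
            int[] hits = {1, 2, 4, 5};             // docids matching the query

            int[] counts = new int[colorValues.length];
            for (int doc : hits) counts[docToColor[doc]]++;

            for (int ord = 0; ord < colorValues.length; ord++)
                System.out.println(colorValues[ord] + ": " + counts[ord]);
            // prints: red: 1, green: 0, blue: 3
        }
    }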
Re: solr faceted search performance reason
PS: If you want to see how Solr actually computes faceting (the faceting code lives in the Solr codebase, not in the lower-level Lucene codebase), here's the file to look at. This web snapshot is from 1.4.1; I don't know if it's been changed more recently, but I don't think majorly: http://www.jarvana.com/jarvana/view/org/apache/solr/solr-core/1.4.1/solr-core-1.4.1-sources.jar!/org/apache/solr/request/SimpleFacets.java?format=ok It's kind of confusing, precisely because it takes several different approaches depending on the nature of the result set and schema, trying to pick the most performant approach for the context. I still haven't wrapped my head around it entirely (I am not a Solr/Lucene developer, just a user).
Re: Solr: Images, Docs and Binary data
You can store binary data using a binary field type -- then you need to send the data base64 encoded. I would strongly recommend against storing large binary files in solr -- unless you really don't care about performance -- the file system is a good option that springs to mind. ryan
Re: Solr: Images, Docs and Binary data
I put binary data in an ordinary Solr stored field; you don't need any special schema. I have run into trouble making sure the data is not corrupted on the way in during indexing, depending on exactly what form of communication is used to index (SolrJ, SolrJ with EmbeddedSolr, DIH, etc.), as well as settings in the container (eg jetty or tomcat) used to house Solr. But I think it's possible to get it working no matter what the path; if you run into trouble someone may be able to help you. My binary data is not very large though (generally under 1 meg). However, in general, _indexing_ large data should be fine, although it will create a larger index, which can require more RAM, or be slower, etc. But that's generally just a function of the total size of the index, or really the total number of unique terms; it doesn't matter if the docs they come from are big or small. _Storing_ large fields can sometimes be a problem; Lucene/Solr are really optimized as an index, not a key/value store. Some people choose to _store_ their large objects in some external store (rdbms, nosql key/value, whatever), and have the client application look up the objects themselves by primary-key/unique-id, after the pk/uid's themselves are retrieved from Solr. Use Solr for what it's good at, indexing; use something else good at storing for storing large objects. But other people sometimes store large objects directly in Solr without problems; it can depend on the exact nature of your index and use.
Re: Solr: Images, Docs and Binary data
Ha, there's a binary field type?! I've stored binary data in an ordinary String field type, and it's worked. But there were some headaches to get it to work; it might have been smoother if I had realized there was actually a binary field type. But wait, I'm talking about a Solr 'stored field', not about indexing. I didn't try to index my binary data, just store it for later retrieval (knowing this can sometimes be a performance problem, doing it anyway with relatively small data, got away with it). Does the field type even affect the _stored values_ in a Solr field?
Re: Solr: Images, Docs and Binary data
Hi, your answers were really helpful. I was thinking of putting the base64-encoded file into a string field, but was a little worried about Solr trying to stem it or vectorize it or that kind of stuff. Seen in the example schema.xml:

    <!-- Binary data type. The data should be sent/retrieved in as Base64 encoded Strings -->
    <fieldtype name="binary" class="solr.BinaryField"/>

Does anyone know of any storage for images that performs well, other than the FS? Thanks
Re: Solr: Images, Docs and Binary data
Ha, there's a binary field type?! I've stored binary data in an ordinary String field type, and it's worked. But there were some headaches to get it to work; might have been smoother if I had realized there was actually a binary field type.

How? You can't just embed control characters in an XML body. They need to be at least encoded so as not to write tabs, deletes, backspaces and whatever other garbage -- base64 in Solr's case.

But wait, I'm talking about a Solr 'stored field', not about indexing. Does the field type even affect the _stored values_ in a Solr field?

Solr decodes the data and stores it. It re-encodes the data when writing a response.
Re: Solr: Images, Docs and Binary data
Hi, your answers were really helpful. I was thinking of putting the base64-encoded file into a string field, but was a little worried about Solr trying to stem it or vectorize it.

String field types are not analyzed, so it doesn't brutalize your data. Better to use BinaryField, though.

Anyone know of any storage for images that performs well, other than the FS?

CouchDB can deliver file attachments over HTTP. It needs to be sent encoded (of course).
Re: Solr: Images, Docs and Binary data
On 4/6/2011 2:39 PM, Markus Jelsma wrote: How? You can't just embed control characters in an XML body. They need to be at least encoded so as not to write tabs, deletes, backspaces and whatever garbage -- base64 in Solr's case.

In my case, using SolrJ with BinaryUpdateHandler. I think -- that code was actually written by someone else, a while ago. However, I've managed to do it at indexing time (ultimately getting the data into a String-type stored field), and my binary data comes back not uuencoded but XML-escaped, i.e. as character references like &#30;. This works for me because my binary data is actually MOSTLY ASCII (so this isn't as terribly inefficient as it could be), but it has some control characters in it that need to be preserved. And nearly any library you use for consuming XML responses will properly un-escape things like &#30; when reading.
Re: Solr: Images, Docs and Binary data
Well, by default there is a pretty decent schema that you can use as a template in the example project that builds with Solr. Tika is the library that does the actual content extraction, so it would be a good idea to try the example project out first. Adam
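For the extraction route, the stock example config maps Solr Cell (the ExtractingRequestHandler, which drives Tika) to /update/extract, so a first experiment can be as simple as this (the id value and file name are made up):

    curl 'http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true' -F 'myfile=@some-document.pdf'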
Re: Concatenate multivalued DIH fields
Hi everyone, I am having an identical problem with concatenating an author's first and last names stored in an XML blob. Because this field is multivalued, copyField does not work. Does anyone have a solution? Regards, Alexei
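One possible workaround, sketched here rather than taken from the thread: a custom DIH transformer that zips the two multivalued columns into one author field. The Transformer API below is the real DIH extension point; the column and class names are made up:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import org.apache.solr.handler.dataimport.Context;
    import org.apache.solr.handler.dataimport.Transformer;

    public class ConcatNamesTransformer extends Transformer {
        @Override
        @SuppressWarnings("unchecked")
        public Object transformRow(Map<String, Object> row, Context context) {
            Object first = row.get("firstname"); // illustrative column names
            Object last = row.get("lastname");
            if (first instanceof List && last instanceof List) {
                List<Object> f = (List<Object>) first;
                List<Object> l = (List<Object>) last;
                List<String> full = new ArrayList<String>();
                for (int i = 0; i < Math.min(f.size(), l.size()); i++) {
                    full.add(f.get(i) + " " + l.get(i));
                }
                // Still multivalued, but each entry is now "first last".
                row.put("author", full);
            }
            return row;
        }
    }

You would reference it on the entity with transformer="com.example.ConcatNamesTransformer".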
Re: Solr: Images, Docs and Binary data
Ezequiel, On 06.04.2011 20:38, Ezequiel Calderara wrote: Anyone know of any storage for images that performs well, other than the FS? You may have a look at http://www.danga.com/mogilefs/ ? :) Regards Stefan
Re: DIH: Indexing multiple datasources with the same schema
Sorry about bringing an old thread back; I thought my solution could be useful. I also had to deal with multiple data sources. If the data source number can be queried for in one of your parent entities, then you can get it using a variable as follows:

    <entity name="ChildEntity" dataSource="db${YourParentEntity.DbId}" ...>

For the above to work I had to modify org.apache.solr.handler.dataimport.ContextImpl.getDataSource(). Here is the replacement code for getDataSource:

    public DataSource getDataSource() {
      if (ds != null) return ds;
      if (entity == null) return null;
      String dataSourceResolved = this.getResolvedEntityAttribute("dataSource");
      if (entity.dataSrc == null) {
        entity.dataSrc = dataImporter.getDataSourceInstance(entity, dataSourceResolved, this);
        entity.dataSource = dataSourceResolved;
      } else if (!dataSourceResolved.equals(entity.dataSource)) {
        entity.dataSrc.close();
        entity.dataSrc = dataImporter.getDataSourceInstance(entity, dataSourceResolved, this);
        entity.dataSource = dataSourceResolved;
      }
      if (entity.dataSrc != null && docBuilder != null && docBuilder.verboseDebug
          && Context.FULL_DUMP.equals(currentProcess())) {
        // debug is not yet implemented properly for deltas
        entity.dataSrc = docBuilder.writer.getDebugLogger().wrapDs(entity.dataSrc);
      }
      return entity.dataSrc;
    }

Cheers, Alexei
Re: Solr: Images, Docs and Binary data
On Wed, Apr 6, 2011 at 3:31 PM, Adam Estrada estrada.adam.gro...@gmail.com wrote: Well, by default there is a pretty decent schema that you can use as a template in the example project that builds with Solr. Tika is the library that does the actual content extraction, so it would be a good idea to try the example project out first.

I wanted to know how a large field's size affects performance. But I wasn't sure how to design the schema.

On Wed, Apr 6, 2011 at 4:23 PM, Stefan Matheis matheis.ste...@googlemail.com wrote: you may have a look at http://www.danga.com/mogilefs/ ? :)

Stefan, we looked at MogileFS, and also at CouchDB and MongoDB. AFAIR (as far as I read :P), MogileFS runs on *nix OSes, while we are using Microsoft as the OS. (Yeah, we are the open source evangelists in our company :P) Just for the moment we will start using Solr for storing and indexing (some info at least) images and docs. We have yet to see what the needs are in terms of scalability before choosing between the options. Thanks all... If you have more info, send it :) -- Ezequiel. Http://www.ironicnet.com
unindexable Chars?
Once in a while, my post.jar seems to fail on commit. During the commit process, I have gotten a few errors: one is "EOF character found", another is "semicolon expected after the", and I have also come across a "was expected". So my question is: what characters do I need to strip out of the source text to ensure all posts are successful? One side note: I have placed the text fields within <![CDATA[ ]]> before adding the document. Thanks, Charlie
Re: Solr: Images, Docs and Binary data
I wanted to know how a large field's size affects performance.

If you use replication, then it's a huge impact on performance, as the data gets sent over the network. It's also a memory hog, so there's less memory available and more garbage collection. Indexing and merging are slower because of the additional bytes being copied. If there's a lot of binary data, performance is important, and disk space is scarce, then you shouldn't store it in the index; the index size can double during optimizing.
Re: unindexable Chars?
Once in a while, my post.jar seems to fail on commit. ... So my question is: what characters do I need to strip out of the source text to ensure all posts are successful?

The usual: it _must_ be valid XML.

One side note: I have placed the text fields within <![CDATA[ ]]> before adding the document.

That's not a bad idea; then at least nothing bad can happen with the data embedded in the element. Usually these errors indicate invalid XML. Try xmllint on an XML body that gives errors.
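If the source text can contain arbitrary control characters, one defensive option is to drop everything XML 1.0 does not allow before building the update message. A small sketch (the allowed ranges come from the XML 1.0 spec):

    public class XmlSanitizer {
        // Legal XML 1.0 characters:
        // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
        public static String stripInvalidXml(String in) {
            StringBuilder out = new StringBuilder(in.length());
            int i = 0;
            while (i < in.length()) {
                int cp = in.codePointAt(i);
                i += Character.charCount(cp);
                boolean ok = cp == 0x9 || cp == 0xA || cp == 0xD
                        || (cp >= 0x20 && cp <= 0xD7FF)
                        || (cp >= 0xE000 && cp <= 0xFFFD)
                        || (cp >= 0x10000 && cp <= 0x10FFFF);
                if (ok) {
                    out.appendCodePoint(cp);
                }
            }
            return out.toString();
        }
    }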
ClobTransformer Issues
Hi all, I'm hoping someone can give me some pointers. I've got Solr 1.4.1 and am using DIH to import a table from an Ingres database. The table contains a column which is a CLOB type. I've tried to use the ClobTransformer to transform the CLOB to a String, but the index only contains something like INGRES-CLOB:(Loc 10). Does anyone have any idea why the ClobTransformer is not transforming this column? Thanks, Stephen
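For comparison, the documented wiring puts the transformer on the entity and marks each CLOB column with clob="true"; roughly like this (table and column names here are made up):

    <entity name="docs" transformer="ClobTransformer"
            query="select id, body from documents">
        <field column="body" clob="true"/>
    </entity>

If that attribute is missing, DIH ends up indexing the driver's toString() of the CLOB handle, which looks a lot like the INGRES-CLOB:(Loc 10) value above.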
Re: Eclipse: Invalid character constant
Hi Stefan, Thanks, my Eclipse is now perfectly configured. It makes it very easy for amateurs like me! For other amateurs, the steps are:

1. Check out the sources: svn checkout https://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_3_1/

2. The root folder (lucene_solr_3_1 in this example) contains a special build.xml to create project settings for either Eclipse or IntelliJ IDEA (it is not the build.xml in the solr subfolder that compiles for Tomcat/Jetty). Run: ant eclipse

3. Create a new Eclipse Java project; we need to specify an external folder. GALILEO: Create project from existing source. HELIOS: untick Use Default Location. Select the root svn folder (lucene_solr_3_1). Click Finish and you should have Solr configured in Eclipse!

Regards Ericz

On Tue, Apr 5, 2011 at 11:34 PM, Stefan Matheis matheis.ste...@googlemail.com wrote: Eric, have a look at line #67 in build.xml :) <target name="eclipse" description="Setup Eclipse configuration"> -- only available with an SVN checkout. Regards Stefan

On 06.04.2011 00:28, Eric Grobler wrote: Hi Robert, Thanks for the fast response! I used https://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_3_1/ but did not find 'ant eclipse'. However, setting my project's resource encoding to UTF-8 worked. Thanks for your help and have a nice day :-) Regards Ericz

On Tue, Apr 5, 2011 at 11:14 PM, Robert Muir rcm...@gmail.com wrote: In Eclipse you need to set your project's character encoding to UTF-8. If you are checking out the source code from svn, you can run 'ant eclipse' from the top level and then hit refresh on your project. It will set up your encoding and your classpath.

On Tue, Apr 5, 2011 at 6:10 PM, Eric Grobler impalah...@googlemail.com wrote: Hi everyone, some language-specific classes like GermanLightStemmer give invalid character constant compiler errors for code like: switch(s[i]) { case 'ä': case 'à': case 'á': in Eclipse with JDK 1.6. How do I get rid of these errors? Thanks Regards Ericz
where is INFOSTREAM.txt located?
Re: Synonym-time Reindexing Issues
Reply inline. On Apr 6, 2011, at 8:12 AM, Erick Erickson wrote: Hmmm, this should work just fine. Here are my questions.

1) Are you absolutely sure that the new synonym file is available when reindexing?

Not sure what you mean here; Solr is running as root, and the file is never moved around or anything crazy.

2) Does the sunspot program do anything wonky with the IDs? The documents will only be replaced if the IDs are identical.

Is there a way I can add debugging to show what it's doing with the IDs, or something to view the index? I tried using Luke, but I can't get it to show me the actual data of the objects, only the name and some other basic info.

3) Are you sure that a commit is done at the end?

It appears that it commits a few times during reindexing.

4) What happens if you optimize? At that point, maxDocs and numDocs should be the same, and should be the count of documents. If they differ by a factor of 2, I'd suspect your id field isn't being used correctly.

I'm unaware of what you mean by optimizing, or even by viewing maxDocs and numDocs, but I will RTFM to find out. I did notice something strange earlier, though, that may relate to this: when I ran a search, there were duplicate results.

If the hypothesis holds that your id field isn't working correctly, your number of hits should be going up after re-indexing... If none of that is relevant, let us know what you find and we'll try something else. Best Erick

On Tue, Apr 5, 2011 at 10:46 PM, Preston Marshall pres...@synergyeoc.com wrote: Hello all, I am having an issue with Solr and the SynonymFilterFactory. I am using a library to interface with Solr called sunspot. I realize that is not what this list is for, but I believe this may be an issue with Solr, not the library (plus the lib author doesn't know the answer). I am using the SynonymFilterFactory in my index-time analyzer, and it works great. My only problem is when it comes to changing the synonyms file. I would expect to be able to edit the file, run a reindex (this is through the library), and have the new synonyms function when the reindex is complete. Unfortunately this is not the case, as changing the synonyms file doesn't actually affect the search results. What DOES work is deleting the existing index and starting from scratch. This is unacceptable for my usage though, because I need the old index to remain online while the new one is being built, so there is no downtime. Here's my schema in case anyone needs it: https://gist.github.com/88f8fb763e99abe4d5b8 Thanks, Preston
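One thing worth ruling out here, since index-time synonyms are applied during analysis: the SynonymFilterFactory only re-reads synonyms.txt when the core (and thus the analyzer) is reloaded, so reindexing against a still-running core can silently keep using the old file. With a multicore setup, a reload is one request (core name is assumed):

    curl 'http://localhost:8983/solr/admin/cores?action=RELOAD&core=core0'

After the reload, reindex and the new synonyms should take effect without rebuilding from scratch.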
RE: what happens to docsPending if stop solr before commit
Oh woe is me... lol NP good to know. I'll get them on the next go 'round. :) Thanks for the answer! -Original Message- From: Erick Erickson [mailto:erickerick...@gmail.com] Sent: Wednesday, April 06, 2011 6:05 AM To: solr-user@lucene.apache.org Subject: Re: what happens to docsPending if stop solr before commit They're lost, never to be seen again. You'll have to reindex them. Best Erick On Tue, Apr 5, 2011 at 4:25 PM, Robert Petersen rober...@buy.com wrote: Hello fellow enthusiastic solr users, I tried to find the answer to this simple question online, but failed. I was wondering about this, what happens to uncommitted docsPending if I stop solr and then restart solr? Are they lost? Are they still there but still uncommitted? Do they get committed at startup? I noticed after a restart my 250K pending doc count went to 0 is what got me wondering. TIA! Robi
RE: where is INFOSTREAM.txt located?
Thanks all, I figured it out. http://lucene.472066.n3.nabble.com/general-debugging-techniques-td868300.html See the last line on this page.

-Original Message- From: Tirthankar Chatterjee [mailto:tchatter...@commvault.com] Sent: Wednesday, April 06, 2011 6:15 PM To: solr-user@lucene.apache.org Subject: where is INFOSTREAM.txt located?
Re: Embedded Solr constructor not returning
Sounds good. Please go ahead and make this change yourself.

Done. Ta, Greg

On 6 April 2011 22:52, Steven A Rowe sar...@syr.edu wrote: Hi Greg, I need the servlet API in my app for it to work, despite being command line. So adding this to the Maven POM fixed everything:

    <dependency>
      <groupId>javax.servlet</groupId>
      <artifactId>servlet-api</artifactId>
      <version>2.5</version>
    </dependency>

Perhaps this dependency could be listed on the wiki, alongside the sample code for using embedded Solr? http://wiki.apache.org/solr/Solrj

Sounds good. Please go ahead and make this change yourself. FYI, the Solr 3.1 POM has a servlet-api dependency, but the scope is "provided", because the servlet container includes this dependency. When *you* are the container, you have to provide it. Steve
Re: what happens to docsPending if stop solr before commit
(11/04/06 5:25), Robert Petersen wrote: I tried to find the answer to this simple question online, but failed. What happens to uncommitted docsPending if I stop Solr and then restart it? Are they lost? Are they still there but still uncommitted? Do they get committed at startup? I noticed that after a restart my 250K pending doc count went to 0, which is what got me wondering.

Robi, usually they are not lost; they get committed. When you stop Solr, the servlet container (Jetty) calls the servlets'/filters' destroy() methods. This causes all SolrCores to close. SolrCore.close() then calls UpdateHandler.close(), which calls SolrIndexWriter.close(). At that point pending docs are flushed and committed. Koji -- http://www.rondhuit.com/en/
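If you would rather not depend on a clean shutdown order, an explicit commit just before stopping the container costs nothing:

    curl 'http://localhost:8983/solr/update?commit=true'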
RE: what happens to docsPending if stop solr before commit
Really? Great! I was wondering if there was some cleanup cycle like that which would occur upon shutdown. That sounds like much more logical behavior!

-Original Message- From: Koji Sekiguchi [mailto:k...@r.email.ne.jp] Sent: Wednesday, April 06, 2011 4:03 PM To: solr-user@lucene.apache.org Subject: Re: what happens to docsPending if stop solr before commit
Shared conf
Is there a configuration value I can specify for multiple cores to use the same conf directory? Thanks
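One approach that should work (a sketch, with made-up names and paths): point several cores in solr.xml at the same instanceDir, so they share one conf/ directory, and give each core its own dataDir:

    <cores adminPath="/admin/cores">
        <core name="core-a" instanceDir="shared/" dataDir="/var/solr/core-a/data"/>
        <core name="core-b" instanceDir="shared/" dataDir="/var/solr/core-b/data"/>
    </cores>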
difference between geospatial search from database angle and from solr angle
I understand Solr can do pretty powerful geospatial search (http://www.ibm.com/developerworks/java/library/j-spatial/). But I also understand that lots of DB researchers have done lots of geospatial-related work. Can someone give an overview of the difference between the two angles? Thanks, -- Sean
Re: Very very large scale Solr Deployment = how to do (Expert Question)?
I would not use replication. LinkedIn consumer search is a flat system where one process indexes new entries and does queries simultaneously. It's a custom Lucene app called Zoie; their stuff is on GitHub. I would get documents to indexers via a multicast IP-based queueing system. This scales very well, and there's a lot of hardware support.

The problem with distributed search is that it is a) inherently slower and b) has inherently more and longer jitter: the airplane-wing distribution of query times becomes longer and flatter. This is going to have to be a federated system, where the front-end app aggregates results rather than Solr.

On Mon, Apr 4, 2011 at 6:25 PM, Jens Mueller supidupi...@googlemail.com wrote: Hello experts, I am a Solr newbie but have read quite a lot of docs. I still do not understand what would be the best way to set up very large scale deployments:

Goal (theoretical): A) Index size: 1 petabyte (1 document is about 5 KB in size). B) Queries: 10 queries per second. C) Updates: 10 updates per second.

Solr offers: 1) Replication: scales well for B), BUT A) and C) are not satisfied. 2) Sharding: scales well for A), BUT B) and C) are not satisfied. (As I understand the sharding approach, everything goes through a central server that dispatches the updates and assembles the queries retrieved from the different shards. But this central server also has some capacity limits...)

What is the right approach to handle such large deployments? I would be thankful for just a rough sketch of the concepts so I can experiment/search further... Maybe I am missing something very trivial, as I think some of the "Solr Users/Use Cases" on the homepage are that kind of large deployment. How are they implemented? Thank you very much!!! Jens

-- Lance Norskog goks...@gmail.com
Re: Using MLT feature
A fuzzy signature system will not work here. You are right, you want to try MLT instead. Lance

On Wed, Apr 6, 2011 at 9:47 AM, Frederico Azeiteiro frederico.azeite...@cision.com wrote: Yes, I had already checked the code for it and used it to compile a C# method that returns the same signature. But I have a strange issue. For instance, using minTokenLen=2 and the default QUANT_RATE, passing the text frederico (simple text, no big deal here):

1. Using my C# app, I get 8b92e01d67591dfc60adf9576f76a055.
2. Using Solr, passing a doc with HeadLine frederico, I get 8d9a5c35812ba75b8383d4538b91080f in my signature field.
3. Using a Java app I created (I'm not a Java expert...) from the code in Solr's SignatureUpdateProcessorFactory class (please check the code below), I get 8b92e01d67591dfc60adf9576f76a055.

Java app code:

    TextProfileSignature textProfileSignature = new TextProfileSignature();
    NamedList<String> params = new NamedList<String>();
    params.add("", "");
    SolrParams solrParams = SolrParams.toSolrParams(params);
    textProfileSignature.init(solrParams);
    textProfileSignature.add("frederico");
    byte[] signature = textProfileSignature.getSignature();
    char[] arr = new char[signature.length << 1];
    for (int i = 0; i < signature.length; i++) {
      int b = signature[i];
      int idx = i << 1;
      arr[idx] = StrUtils.HEX_DIGITS[(b >> 4) & 0xf];
      arr[idx + 1] = StrUtils.HEX_DIGITS[b & 0xf];
    }
    String sigString = new String(arr);
    System.out.println(sigString);

Here are my processor configs:

    <updateRequestProcessorChain name="dedupe">
      <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
        <bool name="enabled">true</bool>
        <str name="signatureField">sig</str>
        <bool name="overwriteDupes">false</bool>
        <str name="fields">HeadLine</str>
        <str name="signatureClass">org.apache.solr.update.processor.TextProfileSignature</str>
        <str name="minTokenLen">2</str>
      </processor>
      <processor class="solr.LogUpdateProcessorFactory" />
      <processor class="solr.RunUpdateProcessorFactory" />
    </updateRequestProcessorChain>

So both my apps (Java and C#) return the same signature, but Solr returns a different one. Can anyone see what I might be doing wrong? Thank you once again. Frederico

-Original Message- From: Markus Jelsma [mailto:markus.jel...@openindex.io] Sent: Tuesday, April 5, 2011 15:20 To: solr-user@lucene.apache.org Cc: Frederico Azeiteiro Subject: Re: Using MLT feature

If you check the code for TextProfileSignature [1], you'll notice the init method reading params. You can set those params as you did. Reading the Javadoc [2] might help as well. But what's not documented in the Javadoc is how QUANT is computed; it rounds.

[1]: http://svn.apache.org/viewvc/lucene/solr/branches/branch-1.4/src/java/org/apache/solr/update/processor/TextProfileSignature.java?view=markup
[2]: http://lucene.apache.org/solr/api/org/apache/solr/update/processor/TextProfileSignature.html

On Tuesday 05 April 2011 16:10:08 Frederico Azeiteiro wrote: Thank you, I'll try to create a C# method to create the same sig as Solr, and then compare both sigs before indexing the doc. This way I can avoid indexing existing docs. If anyone needs to use this parameter (as this info is not on the wiki), you can add the option <str name="minTokenLen">5</str> on the processor tag.
Best regards, Frederico

-Original Message- From: Markus Jelsma [mailto:markus.jel...@openindex.io] Sent: Tuesday, April 5, 2011 12:01 To: solr-user@lucene.apache.org Cc: Frederico Azeiteiro Subject: Re: Using MLT feature

On Tuesday 05 April 2011 12:19:33 Frederico Azeiteiro wrote: Sorry, the reply I made yesterday was directed to Markus and not the list... Here are my thoughts on this. At this point I'm a little confused as to whether Solr is a good option for finding near-duplicate docs.

Yes there is: try setting overwriteDupes to true, and documents yielding the same signature will be overwritten.

The problem is that I don't want to overwrite the doc; I need to maintain the original version (because the doc has other fields I need to keep).

If you need both fuzzy and exact matching, then add a second update processor inside the chain and create another signature field.

I just need the fuzzy search, but the quick tests I made return different signatures for what I consider duplicate docs. "Army deploys as clan war kills 11 in Philippine south" and "Army deploys as clan war kills 11 in Philippine south." get the same sig, and that's OK. But a different sig was created for: "Army deploys as clan war kills 11 in Philippine south the". Is there a way to setup the
Re: SOLR - problems with non-english symbols when extracting HTML
Tomcat has to be configured to use UTF-8: http://wiki.apache.org/solr/SolrTomcat?highlight=%28tomcat%29#URI_Charset_Config

On Fri, Mar 25, 2011 at 6:58 PM, kushti sandyl...@gmail.com wrote: Grijesh wrote: Try to send the HTML data using the CDATA format. That doesn't work with $content = ; and my goal is not to avoid extraction, but to have no problems with non-English chars.

-- Lance Norskog goks...@gmail.com
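Concretely, that wiki section boils down to adding URIEncoding="UTF-8" to the HTTP connector in Tomcat's server.xml, roughly:

    <Connector port="8080" protocol="HTTP/1.1"
               connectionTimeout="20000" redirectPort="8443"
               URIEncoding="UTF-8"/>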
Re: Very very large scale Solr Deployment = how to do (Expert Question)?
The bigger answer is that you cannot get to this size by just configuring Solr. You may have to invent a lot of stuff -- like all of Google. Where did you get these numbers? The proposed query rate is twice as big as Google's (Feb 2010 estimate: 34K qps). I work at MarkLogic, and we scale to hundreds of terabytes with fast update and query rates. If you want a real system that handles that, you might want to look at our product. wunder

On Apr 6, 2011, at 8:06 PM, Lance Norskog wrote: I would not use replication. LinkedIn consumer search is a flat system where one process indexes new entries and does queries simultaneously.
solr-2351 patch
Hi, can anyone tell me which Solr version the SOLR-2351 patch (https://issues.apache.org/jira/secure/attachment/12470560/mlt.patch) is meant for? Regards! Isha
Re: difference between geospatial search from database angle and from solr angle
Sean, Geospatial search in Lucene/Solr is of course implemented on top of Lucene's underlying index technology. That technology was originally just for text, but it's been adapted very successfully for numerics and range queries too. The only mature geospatial field type in Solr 3.1 is LatLonType, which under the hood is simply a pair of latitude and longitude numeric fields. There really isn't anything sophisticated (geospatially speaking) in Solr 3.1. I'm not sure what sort of geospatial DB research you have in mind, but I would expect other systems to be free to use an indexing strategy designed for spatial data, such as R-trees. Nevertheless, I think Lucene offers the underlying primitives to compete with systems built on other technologies. A case in point is my patch SOLR-2155, which indexes a single point in the form of a geohash at multiple resolutions (geohash lengths, AKA spatial prefixes/grids) and uses a recursive algorithm to efficiently query an arbitrary shape. It's quite fast and bests LatLonType already, and there's a lot more I can do to make it faster. This is definitely a field of interest, and a growing one, in the Lucene/Solr community. There are even some external spatial providers (JTeam, MetaCarta), and I'm partnering with other individuals to create a new one. Expect to see more in the coming months. If you're looking for specific geospatial capabilities, let us know. ~ David Smiley Author: http://www.packtpub.com/solr-1-4-enterprise-search-server/
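To make the LatLonType case concrete, the standard Solr 3.1 radius-filter-plus-distance-sort request looks like this (the field name store and the point/distance are just the wiki's example values):

    q=*:*&fq={!geofilt}&sfield=store&pt=45.15,-93.85&d=5&sort=geodist() asc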
Tips for getting unique results?
Hi, I have documents with a field that holds alphanumeric values like 1A2B3C. I can query for * and sort results on this field; however, I'd like to unique these results (remove duplicates) so that I can get the 5 largest unique values. I can't use the StatsComponent because my values have letters in them too. Faceting (and ignoring the counts) gets me half of the way there, but I can only sort ascending; if I could also sort facet results descending, I'd be done. I'd rather not return all documents and just parse the last few results to work around this. Any ideas? -Pete
Re: Tips for getting unique results?
Hi, I think you are saying dupes are the main problem? If so, see http://wiki.apache.org/solr/Deduplication. Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/

- Original Message From: Peter Spam ps...@mac.com To: solr-user@lucene.apache.org Sent: Thu, April 7, 2011 1:13:44 AM Subject: Tips for getting unique results?
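If dedup isn't an option, the faceting route Pete describes can at least stay cheap on the client: pull the values index-sorted and take the tail. A SolrJ sketch (the field name code is assumed, and lexical order only matches numeric "largest" if the values are fixed-width):

    import java.util.List;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.response.FacetField;

    public class TopFacetValues {
        public static void main(String[] args) throws Exception {
            CommonsHttpSolrServer solr =
                    new CommonsHttpSolrServer("http://localhost:8983/solr");
            SolrQuery q = new SolrQuery("*:*");
            q.setRows(0);            // no documents, facets only
            q.setFacet(true);
            q.addFacetField("code"); // the alphanumeric field (assumed name)
            q.setFacetLimit(-1);     // return every distinct value
            q.setFacetSort("index"); // ascending lexical order
            List<FacetField.Count> vals =
                    solr.query(q).getFacetField("code").getValues();
            // The five largest unique values are the tail of the ascending list.
            for (int i = Math.max(0, vals.size() - 5); i < vals.size(); i++) {
                System.out.println(vals.get(i).getName());
            }
        }
    }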
Re: Very very large scale Solr Deployment = how to do (Expert Question)?
Hello Ephraim, hello Lance, hello Walter, thanks for your replies.

Ephraim, thanks very much for the further detailed explanation. I will try to set up a demo system in the next few days and use your advice. Load balancers are an important aspect of your design; can you recommend one LB specifically? (I would be using haproxy.1wt.eu.) I think the idea of uploading your document is very good. However, Google Docs seemed not to be working (at least for me, with the docx format?), but maybe you can simply output the document as PDF; then I think Google Docs works, so all the others can also have a look at your concept. The best approach would be for you to upload your advice directly somewhere on the Solr wiki, as it is really helpful. I found some other documents meanwhile, but yours is much clearer and more complete, with the LBs and the aggregators (http://lucene-eurocon.org/slides/Solr-In-The-Cloud_Mark-Miller.pdf).

Lance, thanks, I will have a look at what LinkedIn is doing.

Walter, thanks for the advice. Well, you are right mentioning Google: my question was also about understanding how such large systems like Google/Facebook actually work, so my numbers are just theoretical and made up. My system will be smaller, but I would be very happy to understand how such large systems are built, and I think the approach Ephraim showed should work quite well at large scale. If you know of good documents (besides the Bigtable research paper, which I already know) that technically describe how Google works in detail, that would be of great interest. You seem to work for a company that handles large datasets. Does Google use this approach, sharding the index across N writers, with the produced index then replicated to N read-only searchers?

Thank you all. Best regards, Jens
Re: Very very large scale Solr Deployment = how to do (Expert Question)?
Just a quick comment re LinkedIn's stuff: you can look at Zoie (also covered in Lucene in Action, 2nd edition), but you may be more interested in Sensei. And yes, big systems like that need sharding and replication: multiple masters and lots of slaves. Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/

- Original Message From: Jens Mueller supidupi...@googlemail.com To: solr-user@lucene.apache.org Sent: Thu, April 7, 2011 1:29:40 AM Subject: Re: Very very large scale Solr Deployment = how to do (Expert Question)?