Re: Solr edismax clarification
Please provide your full query, including your qf parameter and all other request parameters, as well as the relevant fields/field types from your schema. Do you use stopwords? Can you also add debugQuery=true and paste in the parsedQuery?

-- Jan Høydahl, search solution architect, Cominvent AS - www.cominvent.com, Solr Training - www.solrtraining.com

On 16. feb. 2012, at 18:07, Indika Tantrigoda wrote: Hi All, I am using the edismax SearchHandler in my search and I have some issues with the search results. As I understand it, if the defaultOperator is set to OR, the query "The quick brown fox" will implicitly be treated as "The OR quick OR brown OR fox". However, if I search for "The quick brown fox" I get fewer results than when I explicitly add the ORs. Another issue is that if I search for "The quick brown fox", other documents that contain the word "fox" are not in the search results. Thanks.
Re: custom scoring
Thanks Em, Robert, Chris for your time and valuable advice. We'll make some tests and will let you know soon.

On Thu, Feb 16, 2012 at 11:43 PM, Em mailformailingli...@yahoo.de wrote:

Hello Carlos, I think we misunderstood each other. As an example:

BooleanQuery (
  clauses: (
    MustMatch(
      DisjunctionMaxQuery(
        TermQuery(stopword_field, barcelona),
        TermQuery(stopword_field, hoteles)
      )
    ),
    ShouldMatch(
      FunctionQuery( *please insert your function here* )
    )
  )
)

Explanation: You construct an artificial BooleanQuery which wraps your user's query as well as your function query. Your user's query - in that case - is just a DisjunctionMaxQuery consisting of two TermQueries. In the real world you might construct another BooleanQuery around your DisjunctionMaxQuery in order to have more flexibility. The interesting part of the example, however, is that we specify the user's query as a MustMatch condition of the BooleanQuery and the FunctionQuery only as a ShouldMatch. Constructed that way, I expect the FunctionQuery to score only those documents which fit the MustMatch condition. I conclude that from the fact that the FunctionQuery class also has a skipTo method, and I would expect the scorer to use it to score only matching documents (however, I did not check where and how it gets called). If my conclusion is wrong, then hopefully Robert Muir (as far as I can see, the author of that class) can tell us what the intention was in constructing an every-time-match-all function query. Can you validate whether your QueryParser constructs a query in the form I drew above? Regards, Em

Am 16.02.2012 20:29, schrieb Carlos Gonzalez-Cadenas:

Hello Em:

1) Here's a printout of an example DisMax query (as you can see, mostly MUST terms except for some SHOULD terms used for boosting scores for stopwords):

((+stopword_shortened_phrase:hoteles +stopword_shortened_phrase:barcelona stopword_shortened_phrase:en) | (+stopword_phrase:hoteles +stopword_phrase:barcelona stopword_phrase:en) | (+stopword_shortened_phrase:hoteles +stopword_shortened_phrase:barcelona stopword_shortened_phrase:en) | (+stopword_phrase:hoteles +stopword_phrase:barcelona stopword_phrase:en) | (+stopword_shortened_phrase:hoteles +wildcard_stopword_shortened_phrase:barcelona stopword_shortened_phrase:en) | (+stopword_phrase:hoteles +wildcard_stopword_phrase:barcelona stopword_phrase:en) | (+stopword_shortened_phrase:hoteles +wildcard_stopword_shortened_phrase:barcelona stopword_shortened_phrase:en) | (+stopword_phrase:hoteles +wildcard_stopword_phrase:barcelona stopword_phrase:en))

2) The collector is inserted in the SolrIndexSearcher (replacing the TimeLimitingCollector). We trigger it through the Solr interface by passing the timeAllowed parameter. We know this is a hack, but AFAIK there's no out-of-the-box way to specify custom collectors as of now (https://issues.apache.org/jira/browse/SOLR-1680). In any case the collector part works perfectly as of now, so clearly this is not the problem.

3) Re: your sentence: "I would expect that with a shrinking set of matching documents to the overall query, the function query only checks those documents that are guaranteed to be within the result set."

Yes, I agree with this, but this snippet of code in FunctionQuery.java seems to say otherwise:

// instead of matching all docs, we could also embed a query.
// the score could either ignore the subscore, or boost it.
// Containment: floatline(foo:myTerm, myFloatField, 1.0, 0.0f)
// Boost: foo:myTerm^floatline(myFloatField,1.0,0.0f)
@Override
public int nextDoc() throws IOException {
  for (;;) {
    ++doc;
    if (doc >= maxDoc) {
      return doc = NO_MORE_DOCS;
    }
    if (acceptDocs != null && !acceptDocs.get(doc)) continue;
    return doc;
  }
}

It seems that the author also thought of maybe embedding a query in order to restrict matches, but this doesn't seem to be in place as of now (or maybe I'm not understanding how the whole thing works :) ).

Thanks, Carlos

Carlos Gonzalez-Cadenas, CEO, ExperienceOn - New generation search, http://www.experienceon.com
Mobile: +34 652 911 201, Skype: carlosgonzalezcadenas, LinkedIn: http://www.linkedin.com/in/carlosgonzalezcadenas

On Thu, Feb 16, 2012 at 8:09 PM, Em mailformailingli...@yahoo.de wrote: Hello Carlos, We have done some more tests on that matter: now we're moving from issuing this large query through the Solr interface to creating our own QueryParser. The initial tests we've done in our QParser (that internally creates multiple queries and inserts them inside a
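For reference, the wrapper structure Em describes above translates roughly into the Lucene sketch below. The ValueSource and its field name score_field are placeholders, and the function-query package names are from recent trunk, so adjust to your branch:

import org.apache.lucene.index.Term;
import org.apache.lucene.queries.function.FunctionQuery;
import org.apache.lucene.queries.function.valuesource.FloatFieldSource;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.DisjunctionMaxQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class WrappedFunctionQuery {
  public static Query build() {
    // the user's query: a DisjunctionMaxQuery over two TermQueries
    DisjunctionMaxQuery userQuery = new DisjunctionMaxQuery(0.0f);
    userQuery.add(new TermQuery(new Term("stopword_field", "barcelona")));
    userQuery.add(new TermQuery(new Term("stopword_field", "hoteles")));

    // placeholder function; substitute your own ValueSource here
    // (score_field is a hypothetical field name)
    FunctionQuery function = new FunctionQuery(new FloatFieldSource("score_field"));

    BooleanQuery bq = new BooleanQuery();
    bq.add(userQuery, BooleanClause.Occur.MUST);  // restricts the match set
    bq.add(function, BooleanClause.Occur.SHOULD); // only contributes to the score
    return bq;
  }
}

Whether the function scorer actually skips over non-matching documents is exactly the open question in this thread; the sketch only shows the query shape Em is asking Carlos to verify.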
Re: Solr edismax clarification
Indika Tantrigoda wrote: Hi All, I am using the edismax SearchHandler in my search and I have some issues with the search results. As I understand it, if the defaultOperator is set to OR, the search query will implicitly be passed as "The OR quick OR brown OR fox".

Did you also remove mm? If not, defaultOperator is ignored and the mm setting takes effect. http://wiki.apache.org/solr/DisMaxQParserPlugin#mm_.28Minimum_.27Should.27_Match.29
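To check this per request from SolrJ, something like the minimal sketch below should work (the qf field "text" and the server URL are assumptions); mm=1 means at least one clause must match, i.e. the OR-like behavior described above:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class EdismaxMmTest {
  public static void main(String[] args) throws Exception {
    // 3.x SolrJ; newer builds use HttpSolrServer instead
    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
    SolrQuery q = new SolrQuery("The quick brown fox");
    q.set("defType", "edismax");
    q.set("qf", "text");         // hypothetical field name
    q.set("mm", "1");            // at least one clause must match -> OR-like
    q.set("debugQuery", "true"); // inspect the parsed query in the response
    QueryResponse rsp = server.query(q);
    System.out.println(rsp.getResults().getNumFound());
  }
}

Comparing numFound with and without the explicit mm should show whether mm, rather than defaultOperator, is driving the result counts.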
How to connect embedded solr with each other by sharding
I have been using sharding with multiple basic Solr servers for clustering. I have also used one embedded Solr server (via the SolrJ Java API) with many basic Solr servers, connecting them by sharding with the embedded Solr server as the caller. I used the code below for this purpose:

SolrQuery query = new SolrQuery();
query.set("shards", "solr1URL,solr2URL,...");

Now I have many embedded Solr servers running on different computers, and they are unaware of each other. I want them to communicate with each other by sharding. Is this possible? If yes, how? If not, what other options can you advise for using embedded Solr servers?
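For comparison, this is roughly what the working HTTP-based setup looks like in SolrJ (host names are hypothetical). Note that the shards parameter takes host:port/path entries without an http:// prefix, and each shard must be reachable over HTTP; a purely embedded server exposes no HTTP endpoint, which is the crux of the question:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class ShardedQuery {
  public static void main(String[] args) throws Exception {
    // the server we send the request to; it fans out to the shards listed below
    SolrServer server = new CommonsHttpSolrServer("http://solr1:8983/solr");
    SolrQuery query = new SolrQuery("*:*");
    // shard entries have no http:// prefix
    query.set("shards", "solr1:8983/solr,solr2:8983/solr");
    System.out.println(server.query(query).getResults().getNumFound());
  }
}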
Re: Error Indexing in solr 3.5
Hi Chantal, I checked my client. It was pointing to the old SolrJ. After changing that, it got indexed properly. Thanks a lot.
Removing empty dynamic fields from a Solr 1.4 index
Hi all (note: this question is cross-posted on Stack Overflow: http://stackoverflow.com/questions/9327542/removing-empty-dynamic-fields-from-a-solr-1-4-index)

I have a Solr index that uses quite a few dynamic fields. I've recently changed my code to reduce the amount of data we index with Solr, significantly reducing the number of dynamic fields that are in use. I've reindexed my data, and the doc count (as displayed in the admin schema browser) for the old fields has dropped to zero. But I'm confused as to why the fields still exist. I've done an optimize and restarted the server, but I can't find any information on whether there's a way to get these fields to disappear. Am I now stuck with these fields unless I recreate the index from scratch? We're talking about a significant reduction in fields (from about 200 down to about 30), and I'm worried about the performance impact of keeping them floating around. Thanks, Andrew Ingram
How to handle to run testcases in ruby code for solr
Hi all, I am writing a Rails application that uses the solr_ruby gem to access Solr. Can anybody suggest how to handle test cases for the Solr code and connections in functional testing?
Re: Realtime search with multi clients updating index simultaneously.
See below.

On Thu, Feb 16, 2012 at 6:18 AM, v_shan varun.c...@gmail.com wrote: I have a helpdesk application developed in PHP/MySQL. I want to implement real-time full-text search and I have shortlisted Solr. The MySQL database will store all the tickets and their updates, and that data will be imported for building the Solr index. All search requests will be handled by Solr. What I want is real-time search: the moment someone updates a ticket, it should be available for search. As per my understanding of Solr, this is how I think the system will work: a user updates a ticket -> the database record is modified -> a request is sent to the Solr server to modify the corresponding document in the index.

The first thing to understand: Solr does not update a document; it deletes the old one and adds a new one based on the uniqueKey.

I have read a book on Solr and the questions below are troubling me.

1. The book mentions that commits are slow in Solr: "Depending on the index size, Solr's auto-warming configuration, and Solr's cache state prior to committing, a commit can take a non-trivial amount of time. Typically, it takes a few seconds, but it can take some number of minutes in extreme cases." If this is true, then how will I know when the data will be available for search, and how can I implement real-time search? Also, I don't want the ticket update operation to be slowed down (by adding the extra step of updating the Solr index).

Well, Solr trunk is in the midst of getting NRT (Near Real Time) searching, so that may be of interest. Otherwise, there is some latency defined by time until commit + replication time + autowarming time. You haven't indicated how big your data set is, so what those numbers really are is hard to even guess. Even if you do know how many records there will be, the answer is still "try it and see". Replication time may not apply: if you have a small enough system, it is possible to index and search on the same machine. On larger installations, a latency of a few minutes is common.

2. It is also mentioned that there is no transaction isolation: "This means that if more than one Solr client were to submit modifications and commit them at overlapping times, it is possible for part of one client's set of changes to be committed before that client told Solr to commit. This applies to rollback as well. If this is a problem for your architecture then consider using one client process responsible for updating Solr." Does it mean that, due to the lack of transactional commits, Solr can mess up the updates when multiple people update a ticket simultaneously?

As above, Solr deletes and replaces complete documents, so in this case your update process would simply honor the last received. But I think you're missing a bit here: users won't update your Solr index. Somewhere you'll have a process that queries your MySQL database and updates any changed records. The MySQL database is your system of record and where your transactional integrity is maintained. The process that queries the database and sends the results to Solr will just see the results of the aggregate changes to the underlying database as single records, so I don't think this is an issue.

Best, Erick

Now the question before me is: Is Solr fit for my case? If yes, how?

Can't answer this for you.
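To make the delete-then-add point concrete, here is a minimal SolrJ sketch (the field names "id" and "subject" are hypothetical; "id" is assumed to be the uniqueKey). Re-adding a document with the same key replaces the old one wholesale:

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class TicketUpdate {
  public static void main(String[] args) throws Exception {
    // 3.x SolrJ; newer builds use HttpSolrServer instead
    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "ticket-42");            // uniqueKey
    doc.addField("subject", "Printer on fire");
    server.add(doc);
    server.commit();

    // "updating" the ticket: same id, full document sent again;
    // Solr deletes the old doc and indexes this one in its place
    SolrInputDocument updated = new SolrInputDocument();
    updated.addField("id", "ticket-42");
    updated.addField("subject", "Printer on fire [resolved]");
    server.add(updated);
    server.commit();                            // searchable only after commit
  }
}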
Re: Payload and exact search - 2
OK, payloads are a bit of a mystery to me, so this may be way off base. But... the ordering of your analysis chain is suspicious; the admin/analysis page is a life-saver here. WordDelimiterFilterFactory is breaking up your input before it gets to the payload filter, I think, so your payload information is completely disassociated from your terms and treated as individual terms all by itself. At that point, what you get in your index *probably* has no payloads attached at all! Use the admin/schema browser link to actually look at the data (or just go straight to Luke) and I believe you'll see that your position information is being treated just like any other token in the input stream. There should be nothing about payloads that prevents a normal text query on the text part, though.

Best, Erick

On Thu, Feb 16, 2012 at 9:18 AM, leonardo2 leonardo.rigut...@gmail.com wrote: Hello, I already posted this question but for some reason it was attached to a thread with a different topic. Is there a way to perform an 'exact search' on a payload field? I have to index text with auxiliary info for each word. In particular, each word is associated with the bounding box containing it in the original PDF page (it is used for highlighting the search terms in the PDF). I used the payload to store that information. In schema.xml, the fieldType definition is:

---
<fieldtype name="wppayloads" stored="false" indexed="true" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.DelimitedPayloadTokenFilterFactory" encoder="identity"/>
  </analyzer>
</fieldtype>
---

while the field definition is:

---
<field name="words" type="wppayloads" indexed="true" stored="true" required="true" multiValued="true"/>
---

When indexing, the field 'words' contains a list of word|box entries, as in the following example:

---
doc_id=example
words={Fonte:|307.62,948.16,324.62,954.25 Comune|326.29,948.16,349.07,954.25 di|350.74,948.16,355.62,954.25 Bologna|358.95,948.16,381.28,954.25}
---

Such a solution works well except in the case of an exact search. For example, assuming the only indexed doc is the 'example' doc shown above, the query words:"Comune di Bologna" returns no results. Does someone know if there is a way to perform an 'exact search' on a payload field? Thanks in advance, Leonardo
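One way to see Erick's ordering point is to run the payload filter directly on the whitespace tokens, with no WordDelimiterFilter in between; a minimal Lucene 3.x sketch (version constant assumed), using data from the example above:

import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.payloads.DelimitedPayloadTokenFilter;
import org.apache.lucene.analysis.payloads.IdentityEncoder;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
import org.apache.lucene.util.Version;

public class PayloadChainTest {
  public static void main(String[] args) throws Exception {
    String text = "Comune|326.29,948.16,349.07,954.25 di|350.74,948.16,355.62,954.25";
    // whitespace tokens go straight into the payload filter
    TokenStream ts = new DelimitedPayloadTokenFilter(
        new WhitespaceTokenizer(Version.LUCENE_35, new StringReader(text)),
        '|', new IdentityEncoder());
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    PayloadAttribute payload = ts.addAttribute(PayloadAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
      // each term comes out clean, with the box coordinates as its payload
      System.out.println(term.toString() + " -> "
          + new String(payload.getPayload().getData()));
    }
  }
}

If the same inspection through the full chain (with WordDelimiterFilter inserted before the payload filter) shows the coordinates emerging as separate tokens, that confirms the diagnosis.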
customizing standard tokenizer
Hi, is it possible to extend the standard tokenizer, or use a custom one (possibly by extending the standard one), to have some custom tokens like "Lucene-Core" treated as one token? Regards
Re: problem to indexing pdf directory
Thanks Gora for your help. I installed Maven and downloaded Tika following the guide, but I get an error during the build of Tika about the 'tika compiler', and the Maven installation of Tika stops. Is there another way? Thank you, a.

2012/2/16 Gora Mohanty g...@mimirtech.com:

On 16 February 2012 21:37, alessio crisantemi alessio.crisant...@gmail.com wrote: here the log:

org.apache.solr.handler.dataimport.DataImporter doFullImport
Grave: Full Import failed
org.apache.solr.handler.dataimport.DataImportHandlerException: 'baseDir' is a required attribute Processing Document # 1
[...]

The exception message above is pretty clear. You need to define a baseDir attribute for the second entity. However, even if you fix this, the setup will *not* work for indexing PDFs. Did you read the URLs that I sent earlier? Regards, Gora
Re: problem to indexing pdf directory
You should not have to do anything with Maven; the instructions you followed were from 1.4.1 days. Assuming you're working with a 3.x build, here's a data-config that worked for me, just a straight distro. But note a couple of things:

1) For simplicity, I changed the schema.xml to NOT require the id field. You'll probably have to change this back and select a good uniqueKey.

2) I had to add this line to solrconfig.xml to find the path:

<lib dir="../../dist/" regex="apache-solr-dataimporthandler-extras-\d.*\.jar"/>

3) If this all works without errors in the Solr log and you still can't find anything, be sure you issue a commit.

Best, Erick

<dataConfig>
  <dataSource name="bin" type="BinFileDataSource"/>
  <document>
    <entity baseDir="/Users/Erick/testdocs" fileName=".*pdf" name="sd"
            processor="FileListEntityProcessor" recursive="true" rootEntity="false">
      <entity dataSource="bin" format="text" name="tika-test"
              processor="TikaEntityProcessor" url="${sd.fileAbsolutePath}">
        <field column="Author" meta="true" name="author"/>
        <field column="Content-Type" meta="true" name="title"/>
        <!-- <field column="title" name="title" meta="true"/> -->
        <field column="text" name="text"/>
      </entity>
      <!-- <field column="fileLastModified" name="date" dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss"/> -->
      <field column="fileSize" meta="true" name="size"/>
    </entity>
  </document>
</dataConfig>

On Fri, Feb 17, 2012 at 9:35 AM, alessio crisantemi alessio.crisant...@gmail.com wrote: Thanks Gora for your help. I installed Maven and downloaded Tika following the guide, but I get an error during the build of Tika about the 'tika compiler', and the Maven installation of Tika stops. Is there another way? [...]
Re: distributed deletes working?
Thanks Mark. I'm still seeing some issues while indexing, though. I have the same setup described in my previous email. I do some indexing to the cluster with everything up, and everything looks good. I then take down one instance which is running 2 cores (shard2 slice1 and shard1 slice2) and do some more inserts. I then bring this second instance back up, expecting that the system will recover the missing documents from the other instance, but this isn't happening. I see the following log message:

Feb 17, 2012 9:53:11 AM org.apache.solr.cloud.RecoveryStrategy run
INFO: Sync Recovery was succesful - registering as Active

which leads me to believe things should be in sync, but they are not. I've made no changes to the default solrconfig.xml; not sure if I need to or not, but it looks like everything should work now. Am I missing a configuration somewhere?

Initial state:

{"collection1":{
  "slice1":{
    "JamiesMac.local:8501_solr_slice1_shard1":{"shard_id":"slice1","leader":"true","state":"active","core":"slice1_shard1","collection":"collection1","node_name":"JamiesMac.local:8501_solr","base_url":"http://JamiesMac.local:8501/solr"},
    "JamiesMac.local:8502_solr_slice1_shard2":{"shard_id":"slice1","state":"active","core":"slice1_shard2","collection":"collection1","node_name":"JamiesMac.local:8502_solr","base_url":"http://JamiesMac.local:8502/solr"}},
  "slice2":{
    "JamiesMac.local:8501_solr_slice2_shard2":{"shard_id":"slice2","leader":"true","state":"active","core":"slice2_shard2","collection":"collection1","node_name":"JamiesMac.local:8501_solr","base_url":"http://JamiesMac.local:8501/solr"},
    "JamiesMac.local:8502_solr_slice2_shard1":{"shard_id":"slice2","state":"active","core":"slice2_shard1","collection":"collection1","node_name":"JamiesMac.local:8502_solr","base_url":"http://JamiesMac.local:8502/solr"}}}}

State with 1 Solr instance down: [identical to the initial state above]

State when everything comes back up after adding documents: [identical to the initial state above]

On Thu, Feb 16, 2012 at 10:24 PM, Mark Miller markrmil...@gmail.com wrote: Yup - deletes are fine.

On Thu, Feb 16, 2012 at 8:56 PM, Jamie Johnson jej2...@gmail.com wrote: With SOLR-2358 being committed to trunk, do deletes and updates get distributed/routed like adds do? Also, when a down shard comes back up, are the deletes/updates forwarded as well? Reading the JIRA I believe the answer is yes; I just want to verify before bringing the latest into my environment.

-- - Mark http://www.lucidimagination.com
Re: distributed deletes working?
and having looked at this closer, shouldn't the down node not be marked as active when I stop that Solr instance?

On Fri, Feb 17, 2012 at 10:04 AM, Jamie Johnson jej2...@gmail.com wrote: Thanks Mark. I'm still seeing some issues while indexing, though. I have the same setup described in my previous email. [...]
Cloud tab hanging?
Hi, I'm pretty new to Solr and especially SolrCloud, so hopefully this isn't too dumb: I followed the wiki instructions for setting up a small cloud. Things seem to work, *except* that in the UI [using Chrome and Safari] the Cloud tab hangs. It says "Zookeeper Data", and then there's a loading symbol. The old UI allows me to see what's in ZooKeeper, so I'm pretty sure it's mostly working. There's nothing in the logs at all about a connection timing out -- any help? Thanks, Ranjan
Re: distributed deletes working?
On Fri, Feb 17, 2012 at 5:10 PM, Jamie Johnson jej2...@gmail.com wrote: and having looked at this closer, shouldn't the down node not be marked as active when I stop that solr instance?

Currently the shard state is not updated in the cloudstate when a node goes down. This behavior should probably be changed at some point. -- Sami Siren
Re: How to handle to run testcases in ruby code for solr
Just FYI, solr-ruby (hyphen, not underscore, to be precise) is deprecated in that the source no longer lives under Apache's svn. The gem is still out there, and it's still a useful library, but the Ruby/Solr world seems to use RSolr the most. Both have their pros and cons, but solr-ruby works just fine, as you'll see. The source code for it was relocated to my personal github account for posterity: https://github.com/erikhatcher/solr-ruby-flare

All that being said, the solr-ruby library itself has extensive coverage with unit and functional tests. For the functional side, you can look here: https://github.com/erikhatcher/solr-ruby-flare/blob/master/solr-ruby/test/functional/server_test.rb which ends up getting wrapped with a test Solr instance and leveraged in the :test Rake task here: https://github.com/erikhatcher/solr-ruby-flare/blob/master/solr-ruby/Rakefile

Hope that helps. Erik

On Feb 17, 2012, at 07:12, solr wrote: Hi all, I am writing a Rails application that uses the solr_ruby gem to access Solr. Can anybody suggest how to handle test cases for the Solr code and connections in functional testing?
Re: distributed deletes working?
Thanks Sami, so long as it's expected ;) Regarding the replication not working the way I think it should: am I missing something, or is it simply not working the way I think?

On Fri, Feb 17, 2012 at 11:01 AM, Sami Siren ssi...@gmail.com wrote: Currently the shard state is not updated in the cloudstate when a node goes down. This behavior should probably be changed at some point. -- Sami Siren
Re: Frequent garbage collections after a day of operation
A wonderful writeup on various garbage collection concerns: http://www.lucidimagination.com/blog/2011/03/27/garbage-collection-bootcamp-1-0/

On Fri, Feb 17, 2012 at 12:27 AM, Jason Rutherglen jason.rutherg...@gmail.com wrote:

One thing that could fit the pattern you describe would be Solr caches filling up and getting you too close to your JVM or memory limit.

This [uncommitted] issue would solve that problem by allowing the GC to collect caches that become too large, though in practice the cache setting would need to be fairly large for an OOM to occur from them: https://issues.apache.org/jira/browse/SOLR-1513

On Thu, Feb 16, 2012 at 7:14 PM, Bryan Loofbourrow bloofbour...@knowledgemosaic.com wrote:

A couple of thoughts:

We wound up doing a bunch of tuning on the Java garbage collection. However, the pattern we were seeing was periodic, very extreme slowdowns, because we were then using the default garbage collector, which blocks when it has to do a major collection. This doesn't sound like your problem, but it's something to be aware of.

One thing that could fit the pattern you describe would be Solr caches filling up and getting you too close to your JVM or memory limit. For example, if you have large documents and have defined a large document cache, that might do it.

I found it useful to point jconsole (free with the JDK) at my JVM and watch the pattern of memory usage. If the troughs at the bottom of the GC cycles keep rising, you know you've got something that is continuing to grab more memory and not let go of it. Now that our JVM is running smoothly, we just see a sawtooth pattern, with the troughs approximately level. When the system is under load, the frequency of the wave rises. Try it and see what sort of pattern you're getting.

-- Bryan

-----Original Message-----
From: Matthias Käppler [mailto:matth...@qype.com]
Sent: Thursday, February 16, 2012 7:23 AM
To: solr-user@lucene.apache.org
Subject: Frequent garbage collections after a day of operation

Hey everyone, we're running into some operational problems with our Solr production setup here and were wondering if anyone else is affected or has even solved these problems before. We're running a vanilla Solr 3.4.0 in several Tomcat 6 instances, so nothing out of the ordinary, but after a day or so of operation we see increased response times from Solr, up to 3x increases on average. During this time we see increased CPU load due to heavy garbage collection in the JVM, which bogs down the whole system, so throughput decreases, naturally. When restarting the slaves, everything goes back to normal, but that's more of a brute-force solution.

The thing is, we don't know what's causing this, and we don't have that much experience with Java stacks since we're for the most part a Rails company. Are Tomcat 6 or Solr known to leak memory? Is anyone else seeing this, or can you think of a reason for this? Most of our queries to Solr involve the DismaxHandler and the spatial search query components. We don't use any custom request handlers so far.

Thanks in advance, -Matthias

-- Matthias Käppler, Lead Developer API Mobile, Qype GmbH, Großer Burstah 50-52, 20457 Hamburg
Telephone: +49 (0)40 - 219 019 2 - 160, Skype: m_kaeppler, Email: matth...@qype.com
Managing Director: Ian Brotherston, Amtsgericht Hamburg HRB 95913
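A side note on Bryan's jconsole suggestion: if attaching jconsole to the production JVM is awkward, the same heap and collector numbers are available in-process via the standard management beans. A minimal sketch (you'd run something like this in a background thread inside the webapp rather than as its own main):

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;

public class GcWatcher {
  public static void main(String[] args) throws InterruptedException {
    MemoryMXBean heap = ManagementFactory.getMemoryMXBean();
    while (true) {
      StringBuilder line = new StringBuilder();
      line.append("heap used=")
          .append(heap.getHeapMemoryUsage().getUsed() / (1024 * 1024))
          .append("MB");
      for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
        line.append(' ').append(gc.getName())
            .append(" count=").append(gc.getCollectionCount())
            .append(" timeMs=").append(gc.getCollectionTime());
      }
      // rising heap troughs between collections suggest something holding on to memory
      System.out.println(line);
      Thread.sleep(10000);
    }
  }
}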
Re: customizing standard tokenizer
Hi Torsten, did you have a look at WordDelimiterTokenFilter? It sounds like it fits your needs. Regards, Em

Am 17.02.2012 15:14, schrieb Torsten Krah: Hi, is it possible to extend the standard tokenizer, or use a custom one (possibly by extending the standard one), to have some custom tokens like "Lucene-Core" treated as one token? Regards
Re: distributed deletes working?
On Fri, Feb 17, 2012 at 6:03 PM, Jamie Johnson jej2...@gmail.com wrote: Thanks Sami, so long as it's expected ;) Regarding the replication not working the way I think it should: am I missing something, or is it simply not working the way I think?

It should work. I also tried to reproduce your issue but was not able to. Could you try to reproduce your problem with the provided scripts that are in solr/cloud-dev/? I think example2.sh might be a good start. It's not identical to your situation (it has 1 core per instance), but it would be great if you could verify whether you see the issue with that setup or not. -- Sami Siren
Re: distributed deletes working?
On Feb 17, 2012, at 11:03 AM, Jamie Johnson wrote: Thanks Sami, so long as it's expected ;)

Yeah, it's expected - we always use both the live nodes info and the state to determine the full state for a shard.

Regarding the replication not working the way I think it should: am I missing something, or is it simply not working the way I think?

This should work - in fact, I just did the same testing this morning. Are you indexing while you bring the shard down and then up (it should still work fine)? Or do you stop indexing, bring down the shard, index, bring up the shard? How far out of sync is it? When exactly is this build from?

On Fri, Feb 17, 2012 at 11:01 AM, Sami Siren ssi...@gmail.com wrote: Currently the shard state is not updated in the cloudstate when a node goes down. [...]

- Mark Miller lucidimagination.com
Re: distributed deletes working?
On Fri, Feb 17, 2012 at 11:13 AM, Mark Miller markrmil...@gmail.com wrote: When exactly is this build from? Yeah... I just checked in a fix yesterday dealing with sync while indexing is going on. -Yonik lucidimagination.com
Re: distributed deletes working?
I stop the indexing, stop the shard, then start indexing again. So I shouldn't need Yonik's latest fix? Regarding how far out of sync it is: it's completely out of sync. Meaning: index 100 documents to the cluster (40 on shard1, 60 on shard2), then stop the instance, index 100 more; when I bring the instance back up, if I issue queries to just the Solr instance I brought up, the counts are the old counts. I'll start up the same test without using multiple cores. Give me a few and I'll provide the details.

On Fri, Feb 17, 2012 at 11:19 AM, Yonik Seeley yo...@lucidimagination.com wrote: Yeah... I just checked in a fix yesterday dealing with sync while indexing is going on. -Yonik lucidimagination.com
Re: Cloud tab hanging?
On Feb 17, 2012, at 11:00 AM, Ranjan Bagchi wrote: Hi, I'm pretty new to Solr and especially SolrCloud, so hopefully this isn't too dumb: I followed the wiki instructions for setting up a small cloud. Things seem to work, *except* that in the UI [using Chrome and Safari] the Cloud tab hangs. It says "Zookeeper Data", and then there's a loading symbol. The old UI allows me to see what's in ZooKeeper, so I'm pretty sure it's mostly working. There's nothing in the logs at all about a connection timing out -- any help? Thanks, Ranjan

I've intermittently seen this myself, I think - it's hard to debug without something like Firebug to see what is actually failing (in the past, with the new UI, I've seen it choke on a JSON response that was valid according to other tools). You might want to file a JIRA issue on it and, in the meantime, try using the old UI for this: localhost:8983/solr/collection1/admin/zookeeper.jsp It's still more full-featured anyhow, in that you can actually inspect what data is on each node (important for being able to see the clusterstate.json!).

- Mark Miller lucidimagination.com
Re: distributed deletes working?
I'm seeing the following. Do I need a _version_ long field in my schema?

Feb 17, 2012 1:15:50 PM org.apache.solr.update.processor.LogUpdateProcessor finish
INFO: {delete=[f2c29abe-2e48-4965-adfb-8bd611293ff0]} 0 0
Feb 17, 2012 1:15:50 PM org.apache.solr.common.SolrException log
SEVERE: org.apache.solr.common.SolrException: missing _version_ on update from leader
  at org.apache.solr.update.processor.DistributedUpdateProcessor.versionDelete(DistributedUpdateProcessor.java:707)
  at org.apache.solr.update.processor.DistributedUpdateProcessor.processDelete(DistributedUpdateProcessor.java:478)
  at org.apache.solr.update.processor.LogUpdateProcessor.processDelete(LogUpdateProcessorFactory.java:137)
  at org.apache.solr.handler.XMLLoader.processDelete(XMLLoader.java:235)
  at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:166)
  at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:79)
  at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:59)
  at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
  at org.apache.solr.core.SolrCore.execute(SolrCore.java:1523)
  at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:405)
  at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:255)
  at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
  at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
  at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
  at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
  at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
  at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
  at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
  at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
  at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
  at org.mortbay.jetty.Server.handle(Server.java:326)
  at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
  at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:945)
  at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:756)
  at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:218)
  at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
  at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
  at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)

On Fri, Feb 17, 2012 at 11:25 AM, Jamie Johnson jej2...@gmail.com wrote: I stop the indexing, stop the shard, then start indexing again. So I shouldn't need Yonik's latest fix? [...]
Re: distributed deletes working?
On Fri, Feb 17, 2012 at 1:27 PM, Jamie Johnson jej2...@gmail.com wrote: I'm seeing the following. Do I need a _version_ long field in my schema? Yep... versions are the way we keep things sane (shuffled updates to a replica can be correctly reordered, etc). -Yonik lucidimagination.com
Re: distributed deletes working?
Ok, so I'm making some progress now. With _version_ in the schema (I forgot about this because I remember asking about it before), deletes across the cluster work when I delete by id. Updates work as well; if a node was down, it recovered fine. Something that didn't work, though: if a node was down when a delete happened and then comes back up, that node still lists the id I deleted. Is this currently supported?

On Fri, Feb 17, 2012 at 1:33 PM, Yonik Seeley yo...@lucidimagination.com wrote: Yep... versions are the way we keep things sane (shuffled updates to a replica can be correctly reordered, etc). -Yonik lucidimagination.com
Re: distributed deletes working?
On Fri, Feb 17, 2012 at 1:38 PM, Jamie Johnson jej2...@gmail.com wrote: Something that didn't work, though: if a node was down when a delete happened and then comes back up, that node still lists the id I deleted. Is this currently supported?

Yes, that should work fine. Are you still seeing that behavior? -Yonik lucidimagination.com
Custom Query Component: parameters are not appended to query
Hello folks, I built a simple custom component for the "hl.q" query. My use case was to inject hl.q params on the fly, with filter params like fields which were in my standard query. These were getting highlighted, because Solr/Lucene has no way of interpreting an extended q clause and saying "this part is a query and should be highlighted, and this part isn't". If it works, the community can have it :)

Facts: q=roomba AND irobot AND language:de

My component extends SearchComponent. I use the ResponseBuilder to get all needed params like field names from the schema, q params, etc. My component is called first (that part works; verified by debugging and debugQuery) from my SearchHandler:

<arr name="first-components">
  <str>highlightQuery</str>
</arr>

Important clippings from the source code:

public class HighlightQueryComponent extends SearchComponent {
  ...
  public void process(ResponseBuilder rb) throws IOException {
    if (rb.doHighlights) {
      List<String> terms = new ArrayList<String>(0);
      SolrQueryRequest req = rb.req;
      IndexSchema schema = req.getSchema();
      Map<String,SchemaField> fields = schema.getFields();
      SolrParams params = req.getParams();
      // ... magic ...
      Query hlq = new TermQuery(new Term("text", hlQuery.toString()));
      rb.setHighlightQuery(hlq); // hlq = text:(roomba AND irobot)

Problem: In the last step my query is adjusted (the hlq value from debugging is "text:(roomba AND irobot)"). It looks fine; the magic in the process() method works. But nothing happens. If I continue to debug, the next components are called, but my query is the same, without changes. Either setHighlightQuery doesn't work, or my params are overridden in the following components. What can it be? Best Regards, Vadim
Re: distributed deletes working?
Yes, still seeing that. Master has 8 items, replica has 9. So the delete didn't seem to work when the node was down. On Fri, Feb 17, 2012 at 1:41 PM, Yonik Seeley yo...@lucidimagination.com wrote: On Fri, Feb 17, 2012 at 1:38 PM, Jamie Johnson jej2...@gmail.com wrote: Something that didn't work though was if a node was down when a delete happened and then comes back up, that node still listed the id I deleted. Is this currently supported? Yes, that should work fine. Are you still seing that behavior? -Yonik lucidimagination.com
Re: distributed deletes working?
Hmm... just tried this with only deletes, and the replica sync'd fine for me. Is this with your multi-core setup, or were you trying it with separate instances?

On Feb 17, 2012, at 1:52 PM, Jamie Johnson wrote: Yes, still seeing that. Master has 8 items, replica has 9. So the delete didn't seem to work when the node was down. [...]

- Mark Miller lucidimagination.com
Re: distributed deletes working?
This was with the cloud-dev solrcloud-start.sh script (after that I've used solrcloud-start-existing.sh). Essentially I run:

1) ./solrcloud-start-existing.sh
2) index docs
3) kill 1 of the Solr instances (using kill -9 on the pid)
4) delete a doc from the running instances
5) restart the killed Solr instance

On doing this, the deleted document is still lingering in the instance that was down.

On Fri, Feb 17, 2012 at 2:04 PM, Mark Miller markrmil...@gmail.com wrote: Hmm... just tried this with only deletes, and the replica sync'd fine for me. Is this with your multi-core setup, or were you trying it with separate instances? [...]

- Mark Miller lucidimagination.com
RE: customizing standard tokenizer
Hi Torsten,

The Lucene StandardTokenizer is written in JFlex (http://jflex.de) - you can see the version 3.X specification at: http://svn.apache.org/viewvc/lucene/dev/branches/branch_3x/lucene/core/src/java/org/apache/lucene/analysis/standard/StandardTokenizerImpl.jflex?view=markup

You can make changes to this file, then run "ant jflex-StandardAnalyzer" from the checked-out branch_3x sources or a source release (in the lucene/core/ directory in branch_3x, and in the lucene/ directory in a pre-3.6 source release), to generate the corresponding Java source code at: lucene/core/src/java/org/apache/lucene/analysis/standard/StandardTokenizerImpl.java

However, I recommend a simpler strategy: use a MappingCharFilter [1] in front of your tokenizer to map the tokens you want left intact to strings that will not be broken up by the tokenizer. For example, "Lucene-Core" could be mapped to "Lucene_Core", because UAX#29 [2], upon which StandardTokenizer is based, considers the underscore to be a word character, and so will leave "Lucene_Core" as a single token. You would need to use this strategy at both index time and query time. (I was going to add that if you wanted your indexed tokens to be the same as their original form, you could add a MappingTokenFilter after your tokenizer to do the reverse mapping, but such a thing does not yet exist :( - however, there is a JIRA issue for this idea: https://issues.apache.org/jira/browse/SOLR-1978.)

Steve

[1] http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/api/all/org/apache/lucene/analysis/MappingCharFilter.html
[2] http://unicode.org/reports/tr29/

-----Original Message-----
From: Torsten Krah [mailto:tk...@fachschaft.imn.htwk-leipzig.de]
Sent: Friday, February 17, 2012 9:15 AM
To: solr-user@lucene.apache.org
Subject: customizing standard tokenizer

Hi, is it possible to extend the standard tokenizer, or use a custom one (possibly by extending the standard one), to have some custom tokens like "Lucene-Core" treated as one token? Regards
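To illustrate Steve's second strategy end to end, here is a minimal Lucene 3.x sketch (version constant assumed); the mapping keeps "Lucene-Core" together by rewriting it to "Lucene_Core" before StandardTokenizer sees it:

import java.io.StringReader;

import org.apache.lucene.analysis.CharReader;
import org.apache.lucene.analysis.MappingCharFilter;
import org.apache.lucene.analysis.NormalizeCharMap;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class MappedTokenizerDemo {
  public static void main(String[] args) throws Exception {
    NormalizeCharMap map = new NormalizeCharMap();
    map.add("Lucene-Core", "Lucene_Core"); // underscore survives UAX#29 word rules

    // the char filter rewrites the stream before tokenization
    StandardTokenizer ts = new StandardTokenizer(Version.LUCENE_35,
        new MappingCharFilter(map, CharReader.get(new StringReader("the Lucene-Core jar"))));
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
      System.out.println(term.toString()); // prints: the, Lucene_Core, jar
    }
  }
}

In Solr the equivalent is a charFilter entry in the fieldType's analyzer, applied to both the index and query analyzers so the mapped form matches at search time.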
Re: distributed deletes working?
On Fri, Feb 17, 2012 at 2:07 PM, Jamie Johnson jej2...@gmail.com wrote: This was with the cloud-dev solrcloud-start.sh script (after that I've used solrcloud-start-existing.sh). [...]

Hmmm. Shot in the dark: is your id field type something other than string? -Yonik lucidimagination.com
Re: distributed deletes working?
You are committing in that mix, right?

On Feb 17, 2012, at 2:07 PM, Jamie Johnson wrote: This was with the cloud-dev solrcloud-start.sh script (after that I've used solrcloud-start-existing.sh). [...]

- Mark Miller lucidimagination.com
Re: problem to indexing pdf directory
I tried... but I work with Solr 1.4.1.

On 17 February 2012 15:59, Erick Erickson erickerick...@gmail.com wrote: You should not have to do anything with Maven; the instructions you followed were from 1.4.1 days. [...]
Re: problem to indexing pdf directory
Sorry, my error! In that case you *do* have to do some fiddling to get it all to work. Good luck! Erick

On Fri, Feb 17, 2012 at 3:27 PM, alessio crisantemi alessio.crisant...@gmail.com wrote: I tried... but I work with Solr 1.4.1. [...]
Re: problem to indexing pdf directory
I'm confused now.. so, my last question: I add this in my solrconfig.xml: requestHandler name=/dataimport class=org.apache.solr.handler.dataimport.DataImportHandler lst name=defaults str name=configc:\solr\conf\db-config.xml/str /lst /requestHandler And I wrote my db-config.xml like this: dataConfig dataSource type=BinFileDataSource name=bin / document entity name=sd processor=FileListEntityProcessor newerThan='NOW-30DAYS' fileName=.*pdf$ baseDir=D:\myfiles recursive=true rootEntity=false transformer=DateFormatTransformer entity name=tika-test processor=TikaEntityProcessor url=${sd.fileAbsolutePath} format=text dataSource=bin field column=author name=author meta=true/ field column=title name=title meta=true/ field column=description name=description / field column=comments name=comments / field column=content_type name=content_type / field column=last_modified name=last_modified / /entity !-- field column=fileLastModified name=date dateTimeFormat=-MM-dd'T'hh:mm:ss / -- field column=fileSize name=size/ field column=file name=filename/ /entity /document /dataConfig that's must work, in your opinion, or you see an error in this code? thanks, alessio Il giorno 17 febbraio 2012 21:29, Erick Erickson erickerick...@gmail.comha scritto: Sorry, my error! In that case you *do* have to do some fiddling to get it all to work. Good Luck! Erick On Fri, Feb 17, 2012 at 3:27 PM, alessio crisantemi alessio.crisant...@gmail.com wrote: i try...but i works with solr 1.4.1 Il giorno 17 febbraio 2012 15:59, Erick Erickson erickerick...@gmail.comha scritto: You should not have to do anything with Maven, the instructions you followed were from 1.4.1 days.. Assuming you're working with a 3.x build, here's a data-config that worked for me, just a straight distro. But note a couple of things: 1 for simplicity, I changed the schema.xml to NOT require the id field. You'll have to change this back probably and select a good uniqueKey 2 I had to add this line to solrconfig.xml to find the path: lib dir=../../dist/ regex=apache-solr-dataimporthandler-extras-\d.*\.jar/ 3 If this all works without errors in the Solr log and you still can't find anything, be sure you issue a commit. Best Erick dataConfig dataSource name=bin type=BinFileDataSource/ document entity baseDir=/Users/Erick/testdocs fileName=.*pdf name=sd processor=FileListEntityProcessor recursive=true rootEntity=false entity dataSource=bin format=text name=tika-test processor=TikaEntityProcessor url=${sd.fileAbsolutePath} field column=Author meta=true name=author/ field column=Content-Type meta=true name=title/ !-- field column=title name=title meta=true/ -- field column=text name=text/ /entity !-- field column=fileLastModified name=date dateTimeFormat=-MM-dd'T'hh:mm:ss / -- field column=fileSize meta=true name=size/ /entity /document /dataConfig On Fri, Feb 17, 2012 at 9:35 AM, alessio crisantemi alessio.crisant...@gmail.com wrote: thanks gora for your help. I installed Maven and downloaded Tika following the guide: But I have an errore during the built of Tika about 'tika compiler', and the maven installation of Tika is stopped. there is another way? thank you a. 2012/2/16 Gora Mohanty g...@mimirtech.com On 16 February 2012 21:37, alessio crisantemi alessio.crisant...@gmail.com wrote: here the log: org.apache.solr.handler.dataimport.DataImporter doFullImport Grave: Full Import failed org.apache.solr.handler.dataimport.DataImportHandlerException: 'baseDir' is a required attribute Processing Document # 1 [...] The exception message above is pretty clear. 
You need to define a baseDir attribute for the second entity. However, even if you fix this, the setup will *not* work for indexing PDFs. Did you read the URLs that I sent earlier? Regards, Gora
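(For completeness, once the handler is registered the import is kicked off over HTTP, and per Erick's point 3 a commit can be forced at the same time. The URL assumes the default example port and core layout:

http://localhost:8983/solr/dataimport?command=full-import&clean=false&commit=true

command, clean, and commit are all standard DataImportHandler request parameters.)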
Re: Solritas: Modify $content in layout.vm
Why do you want to? That is, what are you trying to accomplish by modifying that variable? You may not really need to... This seems like an XY problem... Best Erick On Thu, Feb 16, 2012 at 11:06 PM, remi tassing tassingr...@gmail.com wrote: Hi all, How do we modify the $content variable in the layout.vm file? I managed to change other stuff in doc.vm or header.vm but not this one. Is there any tutorial on this? Remi
Re: distributed deletes working?
yes, committing in the mix. id field is a UUID.

On Fri, Feb 17, 2012 at 3:22 PM, Mark Miller markrmil...@gmail.com wrote:
You are committing in that mix, right?
On Feb 17, 2012, at 2:07 PM, Jamie Johnson wrote:
This was with the cloud-dev solrcloud-start.sh script (after that I've used solrcloud-start-existing.sh). Essentially I run ./solrcloud-start-existing.sh, index docs, kill one of the Solr instances (using kill -9 on the pid), delete a doc from the running instances, restart the killed Solr instance. On doing this the deleted document is still lingering in the instance that was down.
On Fri, Feb 17, 2012 at 2:04 PM, Mark Miller markrmil...@gmail.com wrote:
Hmm... just tried this with only deletes, and the replica sync'd fine for me. Is this with your multi-core setup or were you trying with instances?
On Feb 17, 2012, at 1:52 PM, Jamie Johnson wrote:
Yes, still seeing that. Master has 8 items, replica has 9. So the delete didn't seem to work when the node was down.
On Fri, Feb 17, 2012 at 1:41 PM, Yonik Seeley yo...@lucidimagination.com wrote:
On Fri, Feb 17, 2012 at 1:38 PM, Jamie Johnson jej2...@gmail.com wrote:
Something that didn't work though was if a node was down when a delete happened and then comes back up, that node still listed the id I deleted. Is this currently supported?
Yes, that should work fine. Are you still seeing that behavior? -Yonik lucidimagination.com
- Mark Miller lucidimagination.com
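(For reference, a delete-by-id in that sequence can be issued against the standard update handler roughly like this; the UUID value is just a placeholder and the URL assumes the default example port:

curl 'http://localhost:8983/solr/update?commit=true' -H 'Content-Type: text/xml' \
  --data-binary '<delete><id>xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx</id></delete>'

The <delete><id>...</id></delete> message is the standard Solr XML update syntax.)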
Re: distributed deletes working?
On Feb 17, 2012, at 3:56 PM, Jamie Johnson wrote: id field is a UUID. Strange - was using UUIDs myself in the same test this morning... I'll try again soon. - Mark Miller lucidimagination.com
proper syntax for using sort query parameter in responseHandler
what is the proper syntax for including a sort directive in my requestHandler? i tried this but got an error:

<requestHandler name="partItemNoSearch" class="solr.SearchHandler" default="false">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="echoParams">all</str>
    <int name="rows">10</int>
    <str name="qf">itemNo^1.0</str>
    <str name="q.alt">*:*</str>
    <str name="sort">rankNo desc</str>
  </lst>
  <lst name="appends">
    <str name="fq">itemType:1</str>
  </lst>
  <lst name="invariants">
    <str name="facet">false</str>
  </lst>
</requestHandler>

thank you mark -- View this message in context: http://lucene.472066.n3.nabble.com/proper-syntax-for-using-sort-query-parameter-in-responseHandler-tp3755077p3755077.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solritas: Modify $content in layout.vm
$content is the output of the main template rendered. To modify what goes into $content, modify the main template or the sub-#parsed templates (which is what you've discovered, it looks like) that are rendered (browse.vm, perhaps, if you're using the default example setup). The main template that is rendered is specified as v.template (in the /browse handler definition in solrconfig.xml, again if you're using the example configuration). Does that help? If not, let us know what you're trying to do exactly. Erik On Feb 16, 2012, at 23:06, remi tassing wrote: Hi all, How do we modify the $content variable in the layout.vm file? I managed to change other stuff in doc.vm or header.vm but not this one. Is there any tutorial on this? Remi
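(A minimal sketch of the moving parts, assuming the stock example configuration. The /browse handler in solrconfig.xml selects the main template and the layout:

<str name="v.template">browse</str>
<str name="v.layout">layout</str>

layout.vm then wraps whatever the main template produced, along these lines:

#parse("header.vm")
<div id="content">$content</div>
#parse("footer.vm")

So $content cannot be edited directly in layout.vm; it is whatever browse.vm and its sub-templates rendered.)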
Indexing 100Gb of readonly numeric data
Hi guys, I'm cross-posting this from the Lucene list as I guess I can get better help here for this scenario. Suppose I want to index 100Gb+ of numeric data. I'm not yet sure of the specifics, but I can expect the following:
- The data is expected to be in one gigantic table. Conceptually, it is like a spreadsheet: rows are objects and columns are properties.
- Values are mostly floating-point numbers, and I expect them to be, let's say, unique or discrete, almost randomly distributed (1.89868776E+50, 1.434E-12).
- The data is read-only. It will never change.

Now I need to query this data based mostly on range queries over the columns. Something like:

SELECT * FROM Table WHERE (Col1 > 1.2E2 AND Col1 < 1.8E2) OR (Col3 == 0)

which basically means: give me all the rows that satisfy this criteria. I believe this could be easily done with a standard RDBMS, but I would like to avoid that route. While thinking about this, and assuming this could work well with Solr, there were some things I couldn't answer:
- In this case, it makes total sense to store the data in the index. If I will index all columns, I might as well have the data right there.
- Does it make any sense to index this whole thing once, while offline, and then upload only the index to the servers?
- I'm almost sure I will have to shard the index in some way, and this isn't difficult. But what are the possible hardware requirements to host this thing? I know this depends on lots of information I didn't provide (searches/sec for example), but can someone throw out a number? I have completely no idea...

Thanks -- Pedro Ferreira mobile: 00 44 7712 557303 skype: pedrosilvaferreira email: psilvaferre...@gmail.com linkedin: http://uk.linkedin.com/in/pedrosilvaferreira
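(For concreteness, the SQL above maps onto a Solr range query along these lines; the col1/col3 field names are just illustrative, assuming the columns are indexed as trie doubles:

<fieldType name="tdouble" class="solr.TrieDoubleField" precisionStep="8" positionIncrementGap="0"/>
<field name="col1" type="tdouble" indexed="true" stored="true"/>
<field name="col3" type="tdouble" indexed="true" stored="true"/>

q=col1:{1.2E2 TO 1.8E2} OR col3:0

Curly braces give an exclusive range, matching the > and < in the SQL, and trie fields make such range queries cheap.)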
Re: Indexing 100Gb of readonly numeric data
Ouch... sorry about the format... I have no idea why Gmail turned my text into that... On Fri, Feb 17, 2012 at 10:07 PM, Pedro Ferreira psilvaferre...@gmail.com wrote: [...] -- Pedro Ferreira mobile: 00 44 7712 557303 skype: pedrosilvaferreira email: psilvaferre...@gmail.com linkedin: http://uk.linkedin.com/in/pedrosilvaferreira
Re: proper syntax for using sort query parameter in responseHandler
Hi Mark, Having a look at that requestHandler, it looks ok [1]; are you experiencing any errors? If so, did you check the wiki page FieldOptionsByUseCase [2]? Maybe that field's (rankNo) options contain indexed=false or multiValued=true. HTH, Tommaso [1] : http://wiki.apache.org/solr/CommonQueryParameters#sort [2] : http://wiki.apache.org/solr/FieldOptionsByUseCase 2012/2/17 geeky2 gee...@hotmail.com: [...]
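(If that is indeed the problem, the usual fix is a schema.xml definition along these lines; the tint type name comes from the stock example schema, so adjust it to whatever integer type yours defines:

<field name="rankNo" type="tint" indexed="true" stored="true" multiValued="false"/>

Sorting requires the field to be indexed and single-valued.)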
Solr Wiki and mailing lists
The Apache Solr main page does not mention the mailing lists. The wiki main page has a broken link. I have had to search my incoming mail to find out how to unsubscribe from solr-user. Someone with full access, please fix these problems. Thanks, -- Lance Norskog goks...@gmail.com
Re: Solr Wiki and mailing lists
To unsubscribe, e-mail: solr-user-unsubscr...@lucene.apache.org. You can also request the FAQ by e-mailing: solr-user-...@lucene.apache.org. On Sat, Feb 18, 2012 at 12:38 AM, Lance Norskog goks...@gmail.com wrote: The Apache Solr main page does not mention the mailing lists. The wiki main page has a broken link. I have had to search my incoming mail to find out how to unsubscribe from solr-user. Someone with full access, please fix these problems. Thanks, -- Lance Norskog goks...@gmail.com -- Best regards, Artem Lokotosh mailto:arco...@gmail.com
RE: Improving proximity search performance
Apologies. I meant to type “1.4 TB” and somehow typed “1.4 GB.” Little wonder that no one thought the question was interesting, or figured I must be using Sneakernet to run my searches. -- Bryan Loofbourrow

*From:* Bryan Loofbourrow [mailto:bloofbour...@knowledgemosaic.com]
*Sent:* Thursday, February 16, 2012 7:07 PM
*To:* 'solr-user@lucene.apache.org'
*Subject:* Improving proximity search performance

Here’s my use case. I expect to set up a Solr index that is approximately 1.4GB (this is a real number from the proof-of-concept using the real data, which consists of about 10 million documents, many of significant size, and making use of the FastVectorHighlighter to do highlighting on the body text field, which is of course stored, and with termVectors, termPositions, and termOffsets on). I no longer have the proof-of-concept Solr core available (our live site uses Solr 1.4 and the ordinary Highlighter), so I can’t get an empirical answer to this question: will storing that extra information about the location of terms help the performance of proximity searches?

A significant and important subset of my users make extensive use of proximity searches. These sophisticated users have found that they are best able to locate what they want by doing searches about THISWORD within 5 words of THATWORD, or much more sophisticated variants on that theme, including plenty of booleans and wildcards. The problem I’m facing is performance. Some of these searches, when common words are used, can take many minutes, even with the index on an SSD. The question is how to improve the performance.

It occurred to me as possible that all of that term vector information, stored for the benefit of the FastVectorHighlighter, might be a significant aid to the performance of these searches.

First question: is that already the case? Will storing this extra information automatically improve my proximity search performance?

Second question: if not, I’m very willing to dive into the code and come up with a patch that would do this. Can someone with knowledge of the internals comment on whether this is a plausible strategy for improving performance, and, if so, give tips about the outlines of what a successful approach to the problem might look like?

Third question: any tips in general for improving the performance of these proximity searches? I have explored the question of whether the customers might be weaned off of them, and that does not appear to be an option.

Thanks, -- Bryan Loofbourrow
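(For reference, the kind of query in question is standard Lucene sloppy-phrase syntax; the body field name here is illustrative:

body:"THISWORD THATWORD"~5

The ~5 allows the two terms to match within five positions of each other, and such queries get expensive when one of the terms has very long postings lists, which is why common words hurt.)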
Using nested entities in FileDataSource import of xml file contents
Can anybody help me understand the right way to define a data-config.xml file with nested entities for indexing the contents of an XML file? I used this data-config.xml file to index a database containing sample patient records:

<dataConfig>
  <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost:3306/bioscope" user="db_user" password=""/>
  <document name="bioscope">
    <entity name="docs" pk="doc_id" query="SELECT doc_id, type FROM bioscope.docs">
      <field column="doc_id" name="doc_id"/>
      <field column="type" name="doc_type"/>
      <entity name="codes" query="SELECT id, origin, type, code FROM bioscope.codes WHERE doc_id='${docs.doc_id}'">
        <field column="origin" name="code_origin"/>
        <field column="type" name="code_type"/>
        <field column="code" name="code_value"/>
      </entity>
      <entity name="notes" query="SELECT id, origin, type, text FROM bioscope.texts WHERE doc_id='${docs.doc_id}'">
        <field column="origin" name="note_origin"/>
        <field column="type" name="note_type"/>
        <field column="text" name="note_text"/>
      </entity>
    </entity>
  </document>
</dataConfig>

I would like to do the same thing with an XML file containing the same data as is in the database. That XML file looks like this:

<docs>
  <doc id="97634811" type="RADIOLOGY_REPORT">
    <codes>
      <code origin="CMC_MAJORITY" type="ICD-9-CM">786.2</code>
      <code origin="COMPANY3" type="ICD-9-CM">786.2</code>
      <code origin="COMPANY1" type="ICD-9-CM">786.2</code>
      <code origin="COMPANY2" type="ICD-9-CM">786.2</code>
    </codes>
    <texts>
      <text origin="CCHMC_RADIOLOGY" type="CLINICAL_HISTORY">Seventeen year old with cough.</text>
      <text origin="CCHMC_RADIOLOGY" type="IMPRESSION">Normal.</text>
    </texts>
  </doc>
</docs>

I tried using this data-config.xml file, in order to preserve the nested entity structure used in the database case:

<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8"/>
  <document name="bioscope">
    <entity name="doc" processor="XPathEntityProcessor" stream="true" forEach="/docs/doc" url="C:/data/bioscope.xml">
      <field column="doc_id" xpath="/docs/doc/@id"/>
      <field column="doc_type" xpath="/docs/doc/@type"/>
      <entity name="code" processor="XPathEntityProcessor" stream="true" forEach="/docs/doc[@id='${doc.doc_id}']/codes/code" url="C:/data/bioscope.xml">
        <field column="code_origin" xpath="/docs/doc[@id='${doc.doc_id}']/codes/code/@origin"/>
        <field column="code_type" xpath="/docs/doc[@id='${doc.doc_id}']/codes/code/@type"/>
        <field column="code_value" xpath="/docs/doc[@id='${doc.doc_id}']/codes/code"/>
      </entity>
      <entity name="note" processor="XPathEntityProcessor" stream="true" forEach="/docs/doc[@id='${doc.doc_id}']/texts/text" url="C:/data/bioscope.xml">
        <field column="note_origin" xpath="/docs/doc[@id='${doc.doc_id}']/texts/text/@origin"/>
        <field column="note_type" xpath="/docs/doc[@id='${doc.doc_id}']/texts/text/@type"/>
        <field column="note_text" xpath="/docs/doc[@id='${doc.doc_id}']/texts/text"/>
      </entity>
    </entity>
  </document>
</dataConfig>

This is wrong, and it fails to index any of the codes and texts blocks in the XML file. 
I'm sure that part of the problem must be that the xpath expressions such as /docs/doc[@id='${doc.doc_id}']/texts/text/@origin fail to match anything in the XML file, because when I try the same import without nested entities, using this data-config.xml file, the codes and texts blocks are also not indexed:

<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8"/>
  <document name="bioscope">
    <entity name="doc" processor="XPathEntityProcessor" stream="true" forEach="/docs/doc" url="C:/data/bioscope.xml">
      <field column="doc_id" xpath="/docs/doc/@id"/>
      <field column="doc_type" xpath="/docs/doc/@type"/>
      <field column="code_origin" xpath="/docs/doc[@id='${doc.doc_id}']/codes/code/@origin"/>
      <field column="code_type" xpath="/docs/doc[@id='${doc.doc_id}']/codes/code/@type"/>
      <field column="code_value" xpath="/docs/doc[@id='${doc.doc_id}']/codes/code"/>
      <field column="note_origin" xpath="/docs/doc[@id='${doc.doc_id}']/texts/text/@origin"/>
      <field column="note_type" xpath="/docs/doc[@id='${doc.doc_id}']/texts/text/@type"/>
      <field column="note_text" xpath="/docs/doc[@id='${doc.doc_id}']/texts/text"/>
    </entity>
  </document>
</dataConfig>

However, when I use this data-config.xml file, which doesn't use nested entities, all of the fields are included in the index:

<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8"/>
  <document name="bioscope">
    <entity name="doc" processor="XPathEntityProcessor" stream="true" forEach="/docs/doc" url="C:/data/bioscope.xml">
      <field column="doc_id" xpath="/docs/doc/@id"/>
      <field column="doc_type" xpath="/docs/doc/@type"/>
      <field column="code_origin" xpath="/docs/doc/codes/code/@origin"/>
      <field column="code_type" xpath="/docs/doc/codes/code/@type"/>
      <field column="code_value" xpath="/docs/doc/codes/code"/>
      <field column="note_origin" xpath="/docs/doc/texts/text/@origin"/>
      <field column="note_type" xpath="/docs/doc/texts/text/@type"/>
      <field column="note_text" xpath="/docs/doc/texts/text"/>
    </entity>
  </document>
</dataConfig>
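(For what it's worth, the DIH wiki documents a commonField attribute aimed at this kind of parent/child correlation within a single XPathEntityProcessor entity, by listing each repeating element in forEach. A sketch, untested against this file, and note it would emit one Solr document per code/text rather than one per doc:

<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8"/>
  <document name="bioscope">
    <!-- One entity, one pass over the file; the doc-level values are
         carried forward onto the code/text rows via commonField. -->
    <entity name="doc" processor="XPathEntityProcessor" stream="true"
            forEach="/docs/doc | /docs/doc/codes/code | /docs/doc/texts/text"
            url="C:/data/bioscope.xml">
      <field column="doc_id" xpath="/docs/doc/@id" commonField="true"/>
      <field column="doc_type" xpath="/docs/doc/@type" commonField="true"/>
      <field column="code_origin" xpath="/docs/doc/codes/code/@origin"/>
      <field column="code_value" xpath="/docs/doc/codes/code"/>
      <field column="note_origin" xpath="/docs/doc/texts/text/@origin"/>
      <field column="note_text" xpath="/docs/doc/texts/text"/>
    </entity>
  </document>
</dataConfig>

XPathEntityProcessor supports only a limited XPath subset, which is likely why the ${doc.doc_id} predicates above never match.)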
Re: how to delta index linked entities in 3.5.0
Thanks for your thoughts Shawn. I did notice 3.x tightened up a lot, and I did account for it by making sure I had pk defined and columns explicitly aliased with the same name (and I will make sure the bug text reflects that). To help others that are having the same problem: I just found a thread describing a workaround using group_concat() in MySQL and then a transformer on the Solr side. So far this appears to work and also seems to delta around 10x faster. The only disadvantage is that the delta index process doesn't tell you how many rows have changed. It just says 1 row, because you are hacking deltaQuery to return a single dummy row and making deltaImportQuery take in last_index_time and return all rows that have changed. Quote:

The following (MySQL) query concatenates 3 lang_code fields from the main table into one field and multiple emails from a secondary table into another field:

SELECT u.id, u.name,
  IF((u.lang_code1 IS NULL AND u.lang_code2 IS NULL AND u.lang_code3 IS NULL),
     NULL,
     CONVERT(CONCAT_WS('|', u.lang_code1, u.lang_code2, u.lang_code3) USING ascii)) AS multi_lang_codes,
  GROUP_CONCAT(e.email SEPARATOR '|') AS multiple_emails
FROM users_tb u
LEFT JOIN emails_tb e ON u.id = e.id
GROUP BY u.id

The entity in data-config.xml looks something like:

<entity name="my_entity" query="call get_solr_full();" transformer="RegexTransformer">
  <field name="email" column="multiple_emails" splitBy="\|"/>
  <field name="lang_code" column="multiple_lang_codes" splitBy="\|"/>
</entity>

Full Thread: https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201008.mbox/%3c9f8b39cb3b7c6d4594293ea29ccf438b01702...@icq-mail.icq.il.office.aol.com%3E

So until the bug is fixed or the docs are changed, I hope this helps someone else searching for this same error message. Adam -- View this message in context: http://lucene.472066.n3.nabble.com/how-to-delta-index-linked-entities-in-3-5-0-tp3752455p3755453.html Sent from the Solr - User mailing list archive at Nabble.com.
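(Spelled out, the dummy-row hack described above looks roughly like this; get_solr_delta is a hypothetical stored procedure, while ${dataimporter.last_index_time} is the real DIH variable:

<entity name="my_entity"
        query="call get_solr_full();"
        deltaQuery="SELECT 1 AS id"
        deltaImportQuery="call get_solr_delta('${dataimporter.last_index_time}');"
        transformer="RegexTransformer">
  <field name="email" column="multiple_emails" splitBy="\|"/>
  <field name="lang_code" column="multiple_lang_codes" splitBy="\|"/>
</entity>

deltaQuery always returns exactly one dummy row, so DIH runs deltaImportQuery once, and that query returns every row changed since the last import, which is why the reported row count is always 1.)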
PointType hard-coded to Doubles?
The PointType seems to be hard-coded to use doubles. Where in the code does this happen? -- Lance Norskog goks...@gmail.com