Re: Index not getting refreshed
Is it possible you have two Solr instances running off the same index folder? This was a mistake I stumbled into early on: I was writing with one and reading with the other, so I didn't see updates.

-Mike

On 09/15/2011 12:37 AM, Pawan Darira wrote:

I am committing, but not doing replication now. My sort order also includes the last login timestamp. The new profiles are reflected in my Solr admin and DB, but they are not listed on my website.

On Thu, Sep 15, 2011 at 4:25 AM, Chris Hostetter wrote:

: I am using Solr 3.2 on a live website. i get live user's data of about 2000
: per day. I do an incremental index every 8 hours. but my search results
: always show the same result with same sorting order. when i check the same

Are you committing? Are you using replication? Are you using a sort order that might not make it obvious that the new docs are actually there? (i.e. sort=timestamp asc)

-Hoss
Re: Field Collapsing and Record Filtering
We have the identical problem in our system. Our plan is to encode the most recent version of a document using an explicit field/value, i.e. version=current (or maybe current=true).

We also need to allow users to search for the most current version, but only within the versions they have access to (which might be the 2010 and the 2007 versions only). I can't see any way to do this other than to index each document with a "most current as of 2010" flag, or something like that. But if anyone has brighter ideas on how to do this with a query, I'd be excited to hear them!

-Mike

On 10/07/2011 05:21 AM, Martijn v Groningen wrote:

I don't think this is possible in only one search with what Solr currently has to offer. I guess the only way to support this is by post-processing your results on the client side. So for each group you display, you query what the latest version is. If that doesn't match, you omit the result from rendering or execute a second grouped search to get more groups. The downsides are that pagination will never be correct and that the overall search will take more time.

Martijn

On 5 October 2011 21:55, Daniel Skiles wrote:

A while back I sent a question to the list about only returning the most recent version of a document, based on a numerical version field stored in each record. Someone suggested that I use field collapsing to do so, and in most cases it seems to work well. However, I've hit a snag and I'd appreciate it if anyone could offer some pointers.

At the moment, my schema looks roughly like this (not using exact data types):

contents : string
documentId : string
version : float

When I query on contents, I can use field collapsing to group by documentId, only return one instance of documentId, and sort each group by version in descending order. If the newest version of the document is returned by the query, everything works great.

What I've realized, though, is that field collapsing doesn't necessarily get me the most recent version of the document (if it matches the query), but rather the most recent version among those that match the query. Is there any good way to get the version of the document that matches the query, but only if it's the record with the highest version number?

For example, with the following record set:

contents: angry horse
documentId: 1a
version: 1.0

contents: distraught horse
documentId: 1a
version: 1.1

contents: peevish horse
documentId: 1a
version: 2.0

Searching for "horse" will return version 2.0 of 1a using collapsing in the manner that I described above. If I search for "angry", I'll get back version 1.0 of 1a. I'd rather get back nothing at all. Is this possible?
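The client-side post-processing that Martijn describes can be sketched in plain Java. This is only an illustration under assumptions, not Solr code: the `Rec` class and the `matching` and `latestOnly` methods are hypothetical names, `matching` is a stand-in for the real search, and the sketch assumes the client already knows every version of each document (in practice that would take a second query).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class LatestVersionFilter {
    /** Hypothetical stand-in for one indexed record (contents, documentId, version). */
    static final class Rec {
        final String contents;
        final String documentId;
        final double version;

        Rec(String contents, String documentId, double version) {
            this.contents = contents;
            this.documentId = documentId;
            this.version = version;
        }
    }

    /** Naive substring match, used here only in place of the real search. */
    static List<Rec> matching(List<Rec> all, String term) {
        List<Rec> hits = new ArrayList<>();
        for (Rec r : all) {
            if (r.contents.contains(term)) {
                hits.add(r);
            }
        }
        return hits;
    }

    /** Keeps only the hits whose version is the highest one known for their documentId. */
    static List<Rec> latestOnly(List<Rec> all, List<Rec> hits) {
        Map<String, Double> newest = new HashMap<>();
        for (Rec r : all) {
            newest.merge(r.documentId, r.version, Double::max);
        }
        List<Rec> out = new ArrayList<>();
        for (Rec h : hits) {
            Double max = newest.get(h.documentId);
            if (max != null && h.version == max) {
                out.add(h);
            }
        }
        return out;
    }
}
```

With Daniel's three records, a search for "angry" matches only version 1.0, which is not the newest version of 1a, so the filter returns nothing; a search for "horse" matches all three and only version 2.0 survives. As Martijn notes, filtering after the search makes pagination and hit counts unreliable.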
Re: In-document highlighting DocValues?
Is there some reason you don't want to leverage Highlighter to do this work? It has all the necessary code for using the analyzed version of your query, so it will only match tokens that really contribute to the search match.

You might also be interested in LUCENE-2878 (which is still under development on a branch, though). It aims to provide first-class access to payloads and positions during scoring, and this will be very useful for complex highlighting tasks.

Another possible solution to the OCR problem could be: generate an XML file with a tag for each word encoding its x,y coords, like : x="3" y="10">This; index that file using XmlCharFilter or HTMLStripCharFilter. Then when you search, use the Solr highlighter to highlight the entire document, and process it using XML tools to find the locations of the matches.

-Mike

On 10/10/2011 10:19 AM, Jan Høydahl wrote:

Hi,

We index structured documents, with numbered chapters, paragraphs and sentences. After doing a (rather complex) search, we may get multiple matches in each result doc. We want to highlight those matches in our front-end, and currently we do a simple string match of the query words against the raw text. However, this highlights some words that do not satisfy the original query, and it also fails to highlight other words where the match was on a stem, synonym or wildcard.

We thus need to improve this, and my plan was to utilize DocValues (Payloads). Would the following work?

1. For each term in the field "text", index DocValues with info about chapter#, paragraph#, sentence# and word#. This can be done in our application code, e.g. "foo|1,2,3,4" for chapter 1, paragraph 2, sentence 3 and word 4.
2. Then, for a specific document in the result list, retrieve a list of all matches in field "text", and for each match, retrieve the associated DocValues.
3. The client application can now use this information to highlight matches, as well as "jump to next match" etc., and would highlight the correct words only, e.g. it would be able to highlight "colour" even if the match was on the synonym "color".

Another use case for this technique would be OCR applications where we store with each term its x,y offsets for where it occurs in the original TIFF image scan.

What is already in place and what code needs to be written? I don't currently see how to get a complete list of matches for a particular document.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com
Re: org.apache.pdfbox.pdmodel.PDPage Error
On 10/24/2011 02:35 PM, MBD wrote:

Is this really a stumper? This is my first experience with Solr, and having spent only an hour or so with it I hit this barrier (below). I'm sure *I* am doing something completely wrong; I'm just hoping someone more familiar with the platform can help me identify & fix it.

For starters... what does "Could not initialize class ..." mean in Java exactly? Maybe that the class (i.e. code) itself doesn't exist, so perhaps I haven't downloaded all the pieces of the project? Or could it be a hint that my kit is just not configured correctly? Sorry, I'm not a Java expert... but would like to get this stabilized... if possible.

Yeah - that's the problem. It looks like the pdfbox jar is not installed in a place where Solr can find it (on its classpath).

If this is the wrong mailing list then just tell me and I'll go away...

Thanks!

On Oct 20, 2011, at 2:54 PM, MBD wrote:
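On the question of what "Could not initialize class ..." means in Java: the JVM reports it (as a NoClassDefFoundError) when a class was found but its static initialization had already failed on an earlier attempt, which can happen when something the class depends on is missing. The sketch below reproduces the message with a deliberately broken class; `InitFailureDemo`, `Broken` and `touch` are made-up names, and nothing here is taken from PDFBox or from MBD's stack trace.

```java
final class InitFailureDemo {
    /** A class whose static initializer always fails, standing in for one with an unmet dependency. */
    static final class Broken {
        static final int VALUE = compute();

        private static int compute() {
            throw new IllegalStateException("dependency missing");
        }
    }

    /** Uses Broken and returns a short description of whatever error the JVM throws. */
    static String touch() {
        try {
            return "ok " + Broken.VALUE;
        } catch (Throwable t) {
            return t.getClass().getSimpleName() + ": " + t.getMessage();
        }
    }

    public static void main(String[] args) {
        // The first use fails inside the static initializer (ExceptionInInitializerError).
        System.out.println(touch());
        // Every later use reports NoClassDefFoundError: "Could not initialize class ...".
        System.out.println(touch());
    }
}
```

The practical consequence is that the first error in the log usually names the real cause; the "Could not initialize class" lines that follow are repeats of the same failure.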
Re: How to delete a SOLR document if that particular data doesnt exist in DB?
Since you are performing a complete reload of all of your data, I don't understand why you can't create a new core, load your new data, swap your application to look at the new core, and then erase the old one, if you want.

Even so, you could track the timestamps on all your documents, which will be updated when you update the content. Then when you're done you could delete anything with a timestamp prior to the time you started the latest import.

-Mike

On 10/20/2010 11:59 AM, bbarani wrote:

ironicnet,

Thanks for your reply. We actually use a virtual DB modelling tool to fetch the data from various sources at run time, hence we don't have any control over the source. We consolidate the data from more than one source and index the consolidated data using Solr. We don't have any kind of update / access rights to the source data.

Thanks.
Barani
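The timestamp idea could look roughly like the following. It is a sketch under assumptions: the field name `timestamp` is hypothetical (any date field that is refreshed on every update would do), and the method only builds the delete-by-query message; sending it to Solr's update handler is left out.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;

final class StaleDocCleanup {
    /**
     * Builds a Solr XML delete-by-query that removes documents whose timestamp
     * is strictly earlier than the moment the latest import started.
     */
    static String deleteOlderThan(String timestampField, Instant importStart) {
        // Range bounds in [a TO b] are inclusive, so stop one second short of the
        // import start; whole seconds keep the date in plain ISO-8601 UTC form.
        Instant bound = importStart.truncatedTo(ChronoUnit.SECONDS).minusSeconds(1);
        return "<delete><query>" + timestampField + ":[* TO " + bound + "]</query></delete>";
    }
}
```

Record the start time before the import begins, run the import, then post the generated message and commit. Documents that the import re-sent carry a newer timestamp and are left alone; anything the sources no longer contain keeps its old timestamp and is removed.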
different results depending on result format
I'm experiencing something really weird: I get different results depending on whether I specify wt=javabin, and retrieve using SolrJ, or wt=xml. I spent quite a while staring at query params to make sure everything else is the same, and they do seem to be. At first I thought the problem related to the javabin format change that has been talked about recently, but I am using solr 1.4.0 and solrj 1.4.0.

Notice in the two entries that the wt param is different and the hits result count is different.

Oct 21, 2010 4:22:19 PM org.apache.solr.core.SolrCore execute
INFO: [bopp.ba] webapp=/solr path=/select/ params={wt=xml&rows=20&start=0&facet=true&facet.field=ref_taxid_ms&q=*:*&fl=uri,meta_ss&version=1} hits=261 status=0 QTime=1

Oct 21, 2010 4:22:28 PM org.apache.solr.core.SolrCore execute
INFO: [bopp.ba] webapp=/solr path=/select params={wt=javabin&rows=20&start=0&facet=true&facet.field=ref_taxid_ms&q=*:*&fl=uri,meta_ss&version=1} hits=57 status=0 QTime=0

The xml format results seem to be the correct ones. So one thought I had is that I could somehow fall back to using xml format in solrj, but I tried SolrQuery.set('wt','xml') and that didn't have the desired effect (I get '&wt=javabin&wt=javabin' in the log - i.e. the param is repeated, but still javabin).

Am I crazy? Is this a known issue?

Thanks for any suggestions

--
Michael Sokolov
Engineering Director
www.ifactory.com
@iFactoryBoston
PubFactory: the revolutionary e-publishing platform from iFactory
Re: different results depending on result format
quick follow-up: I also notice that the query from solrj gets version=1, whereas the admin webapp puts version=2.2 on the query string, although this param doesn't seem to change the xml results at all. Does this indicate an older version of solrj perhaps?

-Mike
Re: different results depending on result format
Yes - I really only have the one solr instance. And I have plenty of other cases where I am getting good results back via solrj. It's really a mystery. Unfortunately I have to catch up on other stuff I have been neglecting, but I'll follow up when I'm able to get a solution...

-Mike

On 10/22/2010 06:58 AM, Savvas-Andreas Moysidis wrote:

strange..are you absolutely sure the two queries are directed to the same Solr instance? I'm running the same query from the admin page (which specifies the xml format) and I get the exact same results as solrj.
Re: different results depending on result format
OK I solved the problem. It turns out that I was connecting to the server using its FQDN (rosen.ifactory.com). When, instead, I connect to it using the name "rosen" (which maps to the same IP using the default domain name configured in my resolver, ifactory.com), I get results back.

I am looking into the virtual hosts config in tomcat; it seems as if there must indeed be another solr instance running; in fact I'm now concerned there might be two solr instances running against the same data folder. yargh.

-Mike
Re: How do I this in Solr?
Right - my point was to combine this with the previous approaches to form a query like:

samsung AND android AND GPS AND word_count:3

in order to exclude documents containing additional words. This would avoid the combinatoric explosion problem others had alluded to earlier. Of course this would fail because android is "mis-" spelled :)

-Mike

On 10/27/2010 08:45 AM, Steven A Rowe wrote:

I'm pretty sure the word-count strategy won't work.

If I search with the text "samsung andriod GPS", search results should only contain "samsung", "GPS", "andriod" and "samsung andriod".

Using the word-count strategy, a document containing "samsung andriod PDQ" would be a hit, but Varun doesn't want it, because it contains a word that is not in the query.

Steve

-----Original Message-----
From: Michael Sokolov [mailto:soko...@ifactory.com]
Sent: Wednesday, October 27, 2010 7:44 AM
To: solr-user@lucene.apache.org
Subject: RE: How do I this in Solr?

You might try adding a field containing the word count and making sure that matches the query's word count? This would require you to tokenize the query and document yourself, perhaps.

-Mike

-----Original Message-----
From: Varun Gupta [mailto:varun.vgu...@gmail.com]
Sent: Tuesday, October 26, 2010 11:26 PM
To: solr-user@lucene.apache.org
Subject: Re: How do I this in Solr?

Thanks everybody for the inputs. Looks like Steven's solution is the closest one, but it will lead to performance issues when the query string has many terms. I will try to implement the two filters suggested by Steven and see how the performance matches up.

--
Thanks
Varun Gupta

On Wed, Oct 27, 2010 at 8:04 AM, scott chu (???) wrote:

I think you have to write a "yet exact match" handler yourself (I say "yet" because it's not quite the exact match we normally know). Steve's answer is quite near your request. You can do further work based on his solution.

As the last step, I'd suggest you strip all blanks from the query string and from each query result, and only return those results that have the same string length as the query string. For example, given:

*query string = "Samsung with GPS"
*query results:
result 1 = "Samsung has lots of mobile with GPS"
result 2 = "with GPS Samsung"
result 3 = "GPS mobile with vendors, such as Sony, Samsung"

they become:

*query string = "SamsungwithGPS" (length = 14)
*query results:
result 1 = "SamsunghaslotsofmobilewithGPS" (length = 29)
result 2 = "withGPSSamsung" (length = 14)
result 3 = "GPSmobilewithvendors,suchasSony,Samsung" (length = 39)

so result 2 matches your request.

In this way, you avoid the work of handling case sensitivity and word reordering. Furthermore, you can refine this, for example by removing other whitespace characters, etc.

Scott @ Taiwan

----- Original Message -----
From: "Varun Gupta"
To:
Sent: Tuesday, October 26, 2010 9:07 PM
Subject: How do I this in Solr?

Hi,

I have a lot of small documents (each containing 1 to 15 words) indexed in Solr. For the search query, I want the search results to contain only those documents that satisfy this criterion: "All of the words of the search result document are present in the search query".

For example, if I have the following documents indexed: "nokia n95", "GPS", "android", "samsung", "samsung andriod", "nokia andriod", "mobile with GPS"

If I search with the text "samsung andriod GPS", search results should only contain "samsung", "GPS", "andriod" and "samsung andriod".

Is there a way to do this in Solr?

--
Thanks
Varun Gupta
Re: How do I this in Solr?
Yes I missed that requirement (as Steven also pointed out in a private e-mail). I now agree that the combinatorics are required.

Another possibility to consider (if the queries are large, which actually seems unlikely) is to use the default behavior where all terms are optional, sort by relevance, and truncate the result list on the client side after some unwanted term is found. I *think* the scoring should find only docs with the searched-for terms first, although if there are a lot of repeated terms maybe not? Also result counts will be screwy.

-Mike

On 10/27/2010 09:34 AM, Toke Eskildsen wrote:

That does not work either as it requires that all the terms in the query are present in the document. The original poster did not state this requirement. On the contrary, his examples were mostly single-word matches, implying an OR-search at the core.

The query-explosion still seems like the only working idea. Maybe Varun could comment on the maximum numbers of terms that his queries will contain?

Regards,
Toke Eskildsen
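Mike's fallback (search with all terms optional, then discard unwanted hits on the client) needs a check for "every word of the document is in the query". A minimal sketch of that check follows; the class and method names are invented for illustration, tokenizing is plain whitespace splitting with lower-casing rather than whatever the Solr analyzer does, and "android" is spelled consistently here, unlike in the thread's example data.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class SubsetMatch {
    /** Lower-cases the text and splits it on whitespace into a set of words. */
    static Set<String> tokens(String text) {
        Set<String> out = new HashSet<>();
        for (String t : text.trim().toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    /** Keeps the documents whose words all occur in the query, preserving input order. */
    static List<String> onlyQueryWords(List<String> docs, String query) {
        Set<String> queryWords = tokens(query);
        List<String> keep = new ArrayList<>();
        for (String doc : docs) {
            Set<String> docWords = tokens(doc);
            if (!docWords.isEmpty() && queryWords.containsAll(docWords)) {
                keep.add(doc);
            }
        }
        return keep;
    }
}
```

Applied to the documents "nokia n95", "GPS", "android", "samsung", "samsung android", "nokia android" and "mobile with GPS" with the query "samsung android GPS", the filter keeps "GPS", "android", "samsung" and "samsung android". As noted above, hit counts and paging from Solr will no longer line up with what the user sees.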
Re: Query question
Another alternative (prettier to my eye) would be:

(city:Chicago AND Romantic AND View)^10 OR (Romantic AND View)

-Mike

On 11/03/2010 09:28 AM, kenf_nc wrote:

Unfortunately the default operator is set to AND and I can't change that at this time. If I do (city:Chicago^10 OR Romantic OR View) it returns way too many unwanted results. If I do (city:Chicago^10 OR (Romantic AND View)) it returns fewer unwanted results, but still a lot.

iorixxx's solution of (Romantic AND View AND (city:Chicago^10 OR (*:* -city:Chicago))) does seem to work. Chicago results are at the top, and the remaining results seem to fit the other search parameters. It's an ugly query, but it does seem to do the trick for now until I master Dismax.

Thanks all!
Re: Is there a way to implement a IntRangeField in Solr?
If your ranges are always contiguous, you could index two fields, range-start and range-end, and then perform queries like:

range-start:[* TO 30] AND range-end:[5 TO *]

If you have multiple ranges which could have gaps in between then you need something more complicated :)

On 02/27/2012 04:09 PM, federico.wachs wrote:

Hi all! Here's my dreadful case, thank you for helping out! I want to have a document like this:

... -- multivalued range field 1 TO 10 5 TO 15 ...

And the reason why I want to do this is that it's so much lighter than having all the numbers in there, of course. Just to be clear, I want to avoid having this in Solr:

... -- multivalued range field 1 2 3 4 5 6 7 8 9 10 ...

And then perform range queries on this range field like:

fq=-occupiedDays:[5 TO 30]

Does anybody have any idea? I have asked and searched all over the internet and it seems Solr does not support this. Any help would be really appreciated! Thanks in advance.

Federico

--
View this message in context: http://lucene.472066.n3.nabble.com/Is-there-a-way-to-implement-a-IntRangeField-in-Solr-tp3782083p3782083.html
Sent from the Solr - User mailing list archive at Nabble.com.
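The two-field query above is the usual interval-overlap test: a stored range [start, end] overlaps the queried range [5, 30] exactly when start <= 30 and end >= 5. A small sketch of the same logic in Java, for a single range per document; `RangeOverlap` and its methods are illustrative names, and the field names are the ones used in the reply.

```java
final class RangeOverlap {
    /** True when [rangeStart, rangeEnd] and [queryStart, queryEnd] share at least one value. */
    static boolean overlaps(int rangeStart, int rangeEnd, int queryStart, int queryEnd) {
        return rangeStart <= queryEnd && rangeEnd >= queryStart;
    }

    /** Builds the equivalent Solr query string over the range-start and range-end fields. */
    static String overlapQuery(int queryStart, int queryEnd) {
        return "range-start:[* TO " + queryEnd + "] AND range-end:[" + queryStart + " TO *]";
    }
}
```

For example, `overlapQuery(5, 30)` produces the query shown in the reply, and a document holding the range 1 TO 10 satisfies it while one holding 31 TO 40 does not. With several ranges in one multi-valued document the two clauses could be satisfied by different ranges, which is why gaps call for something more complicated.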
Re: Is there a way to implement a IntRangeField in Solr?
I think your example case would end up like this:

... 1 -- single-valued range field 15 ...

On 02/27/2012 04:26 PM, federico.wachs wrote:

Michael, thanks a lot for your quick answer, but I'm not exactly sure I understand your solution. What would the document you are proposing look like? Do you mind showing me a simple XML example?

Again, thank you for your cooperation. And yes, the ranges are contiguous!
Re: Is there a way to implement a IntRangeField in Solr?
No; contiguous means there are no gaps between them. You need something like what you described initially.

Another approach is to de-normalize your data so that you have a single document for every range. But this might or might not suit your application. You haven't said anything about the context in which this is to be used.

-Mike

On 02/27/2012 04:43 PM, federico.wachs wrote:

Oh no, I think I misunderstood when you said that my ranges were contiguous. I could have ranges like this:

1 TO 15
5 TO 30
50 TO 60

And so on... I'm not sure that what you proposed would work, right?
Re: Is there a way to implement a IntRangeField in Solr?
Yes, I see - I think your best bet is to index every day as a distinct value. Don't worry about having 100's of values.

-Mike

On 02/27/2012 05:11 PM, federico.wachs wrote:

This is used in an apartment booking system, and what I store as Solr documents can be seen as apartments. These apartments can be booked for a certain number of days, with a check-in and a check-out date, hence the ranges I was speaking of before.

What I want to do is filter out the apartments that are booked, so my users won't have a bad experience trying to book an apartment that suits their needs.

Did I make any sense? Please let me know, otherwise I can explain further.
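Indexing every day as a distinct value means expanding each booking into its individual day numbers before the document is sent to Solr. A minimal sketch, assuming days are plain integers as in Federico's examples; `OccupiedDays` and `expand` are hypothetical names.

```java
import java.util.SortedSet;
import java.util.TreeSet;

final class OccupiedDays {
    /**
     * Expands bookings given as {firstDay, lastDay} pairs (both inclusive) into the
     * sorted set of distinct occupied day numbers; overlapping bookings are merged.
     */
    static SortedSet<Integer> expand(int[][] bookings) {
        SortedSet<Integer> days = new TreeSet<>();
        for (int[] booking : bookings) {
            for (int day = booking[0]; day <= booking[1]; day++) {
                days.add(day);
            }
        }
        return days;
    }
}
```

Each resulting number would become one value of the multi-valued occupiedDays field, so the filter from the original post, fq=-occupiedDays:[5 TO 30], excludes any apartment with at least one occupied day in that window. Gaps between bookings need no special handling because only occupied days are stored.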
Re: Is there a way to implement a IntRangeField in Solr?
I don't know if this would help with OOM conditions, but are you using a tint type field for this? That should be more efficient to search than a regular int or string.

-Mike

On 02/27/2012 05:27 PM, federico.wachs wrote:

Yeah that's what I'm doing right now. But whenever I try to index an apartment that has many wide ranges, my master solr server throws OutOfMemoryError ( I have set max heap to 1024m). So I thought this could be a good workaround but puf it is a lot harder than it seems!
Re: StreamingUpdateSolrServer - exceptions not propagated
On 3/27/2012 11:14 AM, Mark Miller wrote:

On Mar 27, 2012, at 10:51 AM, Shawn Heisey wrote:

On 3/26/2012 6:43 PM, Mark Miller wrote:

It doesn't get thrown because that logic needs to continue - you don't necessarily want one bad document to stop all the following documents from being added. So the exception is sent to that method with the idea that you can override it and do what you would like. I've written sample code around stopping and throwing an exception, but I guess it's not totally trivial. Other ideas for reporting errors have been thrown around in the past, but no work on it has gotten any traction.

It looks like StreamingUpdateSolrServer is not meant for situations where strict error checking is required. I think the documentation should reflect that. Would you be opposed to a javadoc update at the class level (plus a wiki addition) like the following?

"Because document inserts are handled as background tasks, exceptions and errors that occur during those operations will not be available to the calling program, but they will be logged. For example, if the Solr server is down, your program must determine this on its own. If you need strict error handling, use CommonsHttpSolrServer."

If my wording is bad, feel free to make suggestions.

It might make sense to accumulate the errors in a fixed-size queue and report them either when the queue fills up or when the client commits (assuming the commit will wait for all outstanding inserts to complete or fail). This is what we do client-side when performing multi-threaded inserts.

Sounds great in theory, I think, but then I haven't delved into SUSS at all ... just a suggestion, take it or leave it.

Actually I wonder whether SUSS is necessary if you do the threading client-side? You might get a similar perf gain; I know we see a substantial speedup that way, because then your updates spawn multiple threads in the server anyway, don't they?

- Mike
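The fixed-size error queue suggested above can be sketched with the standard library alone. This is an illustration of the idea, not SolrJ code: `ErrorCollector`, `record` and `drain` are invented names, and hooking `record` into whatever callback receives the background exceptions is left to the caller.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

final class ErrorCollector {
    private final BlockingQueue<Throwable> errors;

    ErrorCollector(int capacity) {
        this.errors = new ArrayBlockingQueue<>(capacity);
    }

    /**
     * Called from worker threads when an insert fails. Returns false when the
     * queue is already full, signalling that the caller should stop and report.
     */
    boolean record(Throwable error) {
        return errors.offer(error);
    }

    /** Called at commit time: removes and returns everything recorded so far. */
    List<Throwable> drain() {
        List<Throwable> out = new ArrayList<>();
        errors.drainTo(out);
        return out;
    }
}
```

A client would call `drain()` after its commit returns and fail the batch if the list is non-empty. The bounded queue keeps memory use fixed even when every insert fails, for instance when the server is down.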
Re: Populating 'multivalue' fields (m:1 relationships)
You can specify a Solr field as "multi-valued", and then supply multiple values for it. What that really does is concatenate all the values with a positional gap between them to prevent phrases and other positional queries from traversing the boundary between the distinct values. -Mike On 05/10/2012 12:22 PM, Klostermeyer, Michael wrote: I am attempting to index a DB schema that has a many:one relationship. I assume I would index this within Solr as a 'multivalue=true' field, is that correct? I am currently populating the Solr index w/ a stored procedure in which each DB record is "flattened" into a single document in Solr. I would like one of those Solr document fields to contain multiple values from the m:1 table (i.e. [fieldName]=1,3,6,8,7). I then need to be able to do a "fq=fieldname:3" and return the previous record. My question is: how do I populate Solr with a multi-valued field for many:1 relationships? My first guess would be to concatenate all the values from the 'many' side into a single DB column in the SP, then pipe that column into a multivalue=true Solr field. The DB side of that will be ugly, but would the Solr side index this properly? If so, what would be the delimiter that would allow Solr to index each element of the multivalued field? [Warning: possible tangent below...but I think this question is relevant. If not, tell me and I'll break it out] I have gone out of my way to "flatten" the data within my SP prior to giving it to Solr. For my solution stated above, I would have the following data (Title being the "many" side of the m:1, and PK being the Solr unique ID):
PK   | Name    | Title
Pk_1 | Dwight  | Sales, Assistant To The Regional Manager
Pk_2 | Jim     | Sales
Pk_3 | Michael | Regional Manager
Below is an example of a non-flattened record set. How would Solr handle a data set in which the following data was indexed:
PK   | Name    | Title
Pk_1 | Dwight  | Sales
Pk_1 | Dwight  | Assistant To The Regional Manager
Pk_2 | Jim     | Sales
Pk_3 | Michael | Regional Manager
My assumption is that the second Pk_1 record would overwrite the first, thereby losing the "Sales" title from Pk_1. Am I correct on that assumption? I'm new to this ballgame, so don't be shy about pointing me down a different path if I am doing anything incorrectly. Thanks! Mike Klostermeyer
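One way to avoid both the delimiter question and the overwrite concern is to group the non-flattened rows by PK on the client before building documents. The sketch below uses only the JDK; `TitleGrouper` and the `{pk, name, title}` row layout are assumptions for illustration. With SolrJ, each title in the resulting list would typically be passed to its own `addField` call on the same document, so no delimiter is needed.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: collapses m:1 rows into one list of values per key. */
final class TitleGrouper {
    /** Each row is {pk, name, title}; returns pk -> every title seen for that pk. */
    static Map<String, List<String>> titlesByPk(String[][] rows) {
        Map<String, List<String>> grouped = new LinkedHashMap<>();
        for (String[] row : rows) {
            // create the list on first sight of a pk, then append the title
            grouped.computeIfAbsent(row[0], pk -> new ArrayList<>()).add(row[2]);
        }
        return grouped;
    }
}
```

Applied to the second table above, Pk_1 ends up with two titles and one document, rather than two documents competing for the same unique ID.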
creating SchemaField and FieldType programmatically
I'm creating some Solr plugins that index and search documents in a special way, and I'd like to make them as easy as possible to configure. Ideally I'd like users to be able to just drop a jar in place without having to copy any configuration into schema.xml, although I suppose they will have to register the plugins in solrconfig.xml. I tried making my UpdateProcessor "core aware" and creating FieldTypes and SchemaFields in the inform(SolrCore) method. This was a good start, but I'm running into some issues getting the types properly initialized. One of my types, for example, derives from TextField, but this seems to require an initialization pass in order to get its properties set up properly. What I'm seeing is that my field values aren't being tokenized, even though I specify TOKENIZED when I create the SchemaField. I'm beginning to get the feeling I'm doing something not-quite anticipated by the API designers. My question is: is there a way to go about doing something like this that isn't swimming upstream? Should I just give up and require users to incorporate my schema in the xml config? Here is a code snippet for anyone willing to dig in a little:

/** Called when each core is initialized; we ensure that lux fields are configured. */
public void inform(SolrCore core) {
    IndexSchema schema = core.getSchema();
    Map<String, SchemaField> fields = schema.getFields();
    if (fields.containsKey("lux_path")) {
        return;
    }
    Map<String, FieldType> fieldTypes = schema.getFieldTypes();
    FieldType luxTextWs = fieldTypes.get("lux_text_ws");
    if (luxTextWs == null) {
        luxTextWs = new TextField();
        luxTextWs.setAnalyzer(new WhitespaceGapAnalyzer());
        luxTextWs.setQueryAnalyzer(new WhitespaceGapAnalyzer());
        fieldTypes.put("lux_text_ws", luxTextWs);
    }
    // 0x233 = INDEXED | TOKENIZED | OMIT_NORMS | OMIT_TF_POSITIONS | MULTIVALUED
    fields.put("lux_path", new SchemaField("lux_path", luxTextWs, 0x233, ""));
    // 0x231 = INDEXED | OMIT_NORMS | OMIT_TF_POSITIONS | MULTIVALUED
    fields.put("lux_elt_name", new SchemaField("lux_elt_name", new StrField(), 0x231, ""));
    fields.put("lux_att_name", new SchemaField("lux_att_name", new StrField(), 0x231, ""));
    // must call this after making changes to the field map:
    schema.refreshAnalyzers();
}
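The hex literals 0x233 and 0x231 are easier to audit when composed from named bits. The following is a stand-alone sketch: the constant values are assumptions chosen to be consistent with the comments in the snippet above, and the class `LuxFieldFlags` is hypothetical. Check the values against the field property constants in your Solr version before relying on them.

```java
/** Hypothetical constants mirroring the flag names used in the snippet's comments. */
final class LuxFieldFlags {
    // Assumed bit values; verify against your Solr version.
    static final int INDEXED = 0x001;
    static final int TOKENIZED = 0x002;
    static final int OMIT_NORMS = 0x010;
    static final int OMIT_TF_POSITIONS = 0x020;
    static final int MULTIVALUED = 0x200;

    /** Flags for the tokenized path field (0x233 in the snippet). */
    static final int PATH_FLAGS =
            INDEXED | TOKENIZED | OMIT_NORMS | OMIT_TF_POSITIONS | MULTIVALUED;

    /** Flags for the untokenized name fields (0x231 in the snippet). */
    static final int NAME_FLAGS =
            INDEXED | OMIT_NORMS | OMIT_TF_POSITIONS | MULTIVALUED;

    /** True if the given property bit is set in flags. */
    static boolean has(int flags, int property) {
        return (flags & property) != 0;
    }
}
```

Written this way, a mistake such as setting OMIT_TF_POSITIONS on a field that needs positions shows up by name rather than hiding inside a hex digit.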
Re: creating SchemaField and FieldType programmatically
ok, never mind all is well - I had a mismatch between the schema-declared field and my programmatic field, where I was overzealous in using OMIT_TF_POSITIONS. -Mike
Re: creating SchemaField and FieldType programmatically
Oh yes, final followup for the terminally curious; I also had to add this little class in order to get analysis turned on for my programmatic field:

class PathField extends TextField {
    PathField(IndexSchema schema) {
        setAnalyzer(new WhitespaceGapAnalyzer());
        setQueryAnalyzer(new WhitespaceGapAnalyzer());
    }

    protected Field.Index getFieldIndex(SchemaField field, String internalVal) {
        return Field.Index.ANALYZED;
    }
}
Re: Efficiently mining or parsing data out of XML source files
I agree, that seems odd. We routinely index XML using either HTMLStripCharFilter, or XmlCharFilter (see patch: https://issues.apache.org/jira/browse/SOLR-2597), both of which parse the XML, and we don't see such a huge speed difference from indexing other field types. XmlCharFilter also allows you to specify which elements to index if you don't want the whole file. -Mike On 6/3/2012 8:42 AM, Erick Erickson wrote: This seems really odd. How big are these XML files? Where are you parsing them? You could consider using a SolrJ program with a SAX-style parser. But the first question I'd answer is "what is slow?". The implications of your post is that parsing the XML is the slow part, it really shouldn't be taking anywhere near this long IMO... Best Erick On Thu, May 31, 2012 at 9:14 AM, Van Tassell, Kristian wrote: I'm just wondering what the general consensus is on indexing XML data to Solr in terms of parsing and mining the relevant data out of the file and putting them into Solr fields. Assume that this is the XML file and resulting Solr fields: XML data: foo garbage data Solr Fields: Id=1234 Title=foo Bar=val1 I'd previously set this process up using XSLT and have since tested using XMLBeans, JAXB, etc. to get the relevant data. The speed at which this occurs, however, is not acceptable. 2800 objects take 11 minutes to parse and index into Solr. The big slowdown appears to be that I'm parsing the data with an XML parser. So, now I'm testing mining the data by opening the file as just a text file (using Groovy) and picking out relevant data using regular expression matching. I'm now able to parse (mine) the data and index the 2800 files in 72 seconds. So I'm wondering if the typical solution people use is to go with a non-XML solution. It seems to make sense considering the search index would only want to store (as much data) as possible and not rely on the incoming documents being xml compliant. Thanks in advance for any thoughts on this! -Kristian
highlighting field boundary detection
Does anybody know of a way to detect when the highlight snippet begins at the beginning of the field or ends at the end of the field using one of the standard highlighters shipped w/Solr? We'd like to display ellipses only when there is additional text surrounding the snippet in the original -Mike
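This is not an answer from the highlighter itself, but one client-side workaround can be sketched, assuming the full stored field value is fetched along with the snippet and that the snippet, once its highlight tags are removed, is a verbatim substring of that value. The class `SnippetEllipsis` and its parameters are hypothetical.

```java
/** Hypothetical client-side check for whether a snippet touches a field boundary. */
final class SnippetEllipsis {
    /**
     * Strips the highlight tags from the snippet, compares the result against
     * the stored field text, and adds ellipses only on sides where text was cut.
     */
    static String decorate(String snippet, String fieldText, String preTag, String postTag) {
        String plain = snippet.replace(preTag, "").replace(postTag, "");
        boolean atStart = fieldText.startsWith(plain);
        boolean atEnd = fieldText.endsWith(plain);
        return (atStart ? "" : "... ") + snippet + (atEnd ? "" : " ...");
    }
}
```

The check can misfire when the same text occurs both at a boundary and in the middle of the field, so it is a heuristic rather than a true offset comparison.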
Re: multi-core solr, specifying the data directory
Yes - I commented out the <dataDir> element in solrconfig.xml and then got the expected behavior: the core used a data subdirectory in the core subdirectory. It seems like the problem arises from using the solrconfig.xml that's distributed as example/solr/conf/solrconfig.xml. The solrconfig.xml files in example/multicore/ don't have the <dataDir> element. -Mike On 03/01/2011 08:24 PM, Chris Hostetter wrote: : <dataDir>${solr.data.dir:./solr/data}</dataDir> that directive says "use the solr.data.dir system property to pick a path, if it is not set, use "./solr/data" (relative to the CWD)". If you want it to use the default, then you need to eliminate it completely, or you need to change it to the empty string... <dataDir>${solr.data.dir:}</dataDir> or... -Hoss
Re: Automatic synonyms for multiple variations of a word
Suppose your analysis stack includes lower-casing, but your synonyms are only supposed to apply to upper-case tokens. For example, "PET" might be a synonym of "positron emission tomography", but "pet" wouldn't be. -Mike On 04/26/2011 09:51 AM, Robert Muir wrote: On Tue, Apr 26, 2011 at 12:24 AM, Otis Gospodnetic wrote: But somehow this feels bad (well, so does sticking word variations in what's supposed to be a synonyms file), partly because it means that the person adding new synonyms would need to know what they stem to (or always check it against Solr before editing the file). when creating the synonym map from your input file, currently the factory actually uses your Tokenizer only to pre-process the synonyms file. One idea would be to use the tokenstream up to the synonymfilter itself (including filters). This way if you put a stemmer before the synonymfilter, it would stem your synonyms file, too. I haven't totally thought the whole thing through to see if theres a big reason why this wouldn't work (the synonymsfilter is complicated, sorry). But it does seem like it would produce more consistent results... and perhaps the inconsistency isnt so obvious since in the default configuration the synonymfilter is directly after the tokenizer.
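The "PET" versus "pet" case can be made concrete with a toy pipeline in plain Java. This is only an illustration of the ordering issue, not Solr's SynonymFilter: the class `CaseSensitiveSynonyms` and its one-entry synonym table are made up, and synonym tokens are simply appended rather than stacked at the same position.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Toy analysis chain: synonym lookup on the raw token, then lower-casing. */
final class CaseSensitiveSynonyms {
    // Made-up synonym table, keyed on the exact upper-case form.
    private static final Map<String, String> SYNONYMS =
            Collections.singletonMap("PET", "positron emission tomography");

    static List<String> analyze(String text) {
        List<String> out = new ArrayList<>();
        for (String token : text.split("\\s+")) {
            String synonym = SYNONYMS.get(token); // lookup happens BEFORE lower-casing
            out.add(token.toLowerCase(Locale.ROOT));
            if (synonym != null) {
                out.addAll(Arrays.asList(synonym.split(" ")));
            }
        }
        return out;
    }
}
```

If the synonym table itself were run through the lower-casing step, as the proposal in the quoted message would do whenever the lower-case filter sits before the synonym filter, the key would become "pet" and both inputs would expand.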
Re: Automatic synonyms for multiple variations of a word
Yes, I see. Makes sense. It is a bit hard to see a "bad" case for your proposal in that light. Here is one other example; I'm not sure whether it presents difficulties or not, and may be a bit contrived, but hey, food for thought at least: Say you have set up synonyms between names and commonly-used pseudonyms or alternate names that should not be stemmed: Malcolm X <=> Malcolm Little Prince <=> Rogers Nelson Prince Little Kim <=> Kimberly Denise Jones Biggy Smalls etc. You don't want "Malcolm Littler" or "Littlest Kim" or "Big Small" to match anything. And Princely shouldn't bring up the artist. But you also have regular linguistic synonyms (not names) that *should* be stemmed (as in the original example). So little <=> small should imply littler <=> smaller and so on via stemming. Ideally you could put one SynonymFilter before the stemming and the other one after. In that case do the SynonymFilters get composed? I can't think of a believable example where that would cause a problem, but maybe you can? -Mike On 04/26/2011 04:25 PM, Robert Muir wrote: Mike, thanks a lot for your example: the idea here would be you would put the lowercasefilter after the synonymfilter, and then you get this exact flexibility? e.g. WhitespaceTokenizer SynonymFilter -> no lowercasing of tokens are done as it "analyzes" your synonyms with just the tokenizer LowerCaseFilter but WhitespaceTokenizer LowerCaseFilter SynonymFilter -> the synonyms are lowercased, as it "analyzes" synonyms with the tokenizer+filter its already inconsistent today, because if you do: LowerCaseTokenizer SynonymFilter then your synonyms are in fact all being lowercased... its just arbitrary that they are only being analyzed with the "tokenizer". On Tue, Apr 26, 2011 at 4:13 PM, Mike Sokolov wrote: Suppose your analysis stack includes lower-casing, but your synonyms are only supposed to apply to upper-case tokens. For example, "PET" might be a synonym of "positron emission tomography", but "pet" wouldn't be. 
Re: Searching for escaped characters
StandardTokenizer will have stripped punctuation I think. You might try searching for all the entity names though: (agrave | egrave | omacron | etc... ) The names are pretty distinctive. Although you might have problems with Greek letters. -Mike On 04/28/2011 12:10 PM, Paul wrote: I'm trying to create a test to make sure that character sequences like "&egrave;" are successfully converted to their equivalent utf character (that is, in this case, "è"). So, I'd like to search my solr index using the equivalent of the following regular expression: &\w{1,6}; To find any escaped sequences that might have slipped through. Is this possible? I have indexed these fields with text_lu, which looks like this: Thanks, Paul
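Since the tokenizer is likely to drop the "&" and ";", another option is to run the same check on the source text before it is sent to Solr. The sketch below applies the regular expression from the question with `java.util.regex`; the class name `EntityScanner` is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical pre-indexing check for unconverted character entities. */
final class EntityScanner {
    // Same pattern as in the question: '&', one to six word characters, ';'
    private static final Pattern ENTITY = Pattern.compile("&\\w{1,6};");

    /** Returns every escaped sequence found in the text, in order of appearance. */
    static List<String> findEscapes(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = ENTITY.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }
}
```

Note that the `{1,6}` bound misses longer names: "omacron" has seven letters, so `&omacron;` would slip past this pattern unless the upper bound is raised.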
Re: Replicaiton Fails with Unreachable error when master host is responding.
No clue. Try wireshark to gather more data? On 04/28/2011 02:53 PM, Jed Glazner wrote: Anybody? On 04/27/2011 01:51 PM, Jed Glazner wrote: Hello All, I'm having a very strange problem that I just can't figure out. The slave is not able to replicate from the master, even though the master is reachable from the slave machine. I can telnet to the port it's running on, I can use text based browsers to navigate the master from the slave. I just don't understand why it won't replicate. The admin screen gives me an Unreachable in the status, and in the log there is an exception thrown. Details below:
BACKGROUND:
OS: Arch Linux
Solr Version: svn revision 1096983 from https://svn.apache.org/repos/asf/lucene/dev/branches/branch_3x/
No custom plugins, just whatever came with the version above.
Java Setup: java version "1.6.0_22" OpenJDK Runtime Environment (IcedTea6 1.10) (ArchLinux-6.b22_1.10-1-x86_64) OpenJDK 64-Bit Server VM (build 19.0-b09, mixed mode)
We have 3 cores running, all 3 cores are not able to replicate. The admin on the slave shows the Master as http://solr-master-01_dev.la.bo:8983/solr/music/replication - *Unreachable*
Replication def on the slave:
<lst name="slave">
  <str name="masterUrl">http://solr-master-01_dev.la.bo:8983/solr/music/replication</str>
  <str name="pollInterval">00:15:00</str>
</lst>
Replication def on the master:
<lst name="master">
  <str name="replicateAfter">commit</str>
  <str name="replicateAfter">startup</str>
  <str name="confFiles">schema.xml,stopwords.txt</str>
</lst>
Below is the log start to finish for replication attempts, note that it says connection refused, however, I can telnet to 8983 from the slave to the master, so I know it's up and reachable from the slave:
telnet solr-master-01_dev.la.bo 8983
Trying 172.12.65.58...
Connected to solr-master-01_dev.la.bo.
Escape character is '^]'.
I double checked the master to make sure that it didn't have replication turned off, and it's not. So I should be able to replicate but it can't. I just don't know what else to check. The log from the slave is below.
Apr 27, 2011 7:39:45 PM org.apache.solr.request.SolrQueryResponse WARNING: org.apache.solr.request.SolrQueryResponse is deprecated. Please use the corresponding class in org.apache.solr.response Apr 27, 2011 7:39:45 PM org.apache.commons.httpclient.HttpMethodDirector executeWithRetry INFO: I/O exception (java.net.ConnectException) caught when processing request: Connection refused Apr 27, 2011 7:39:45 PM org.apache.commons.httpclient.HttpMethodDirector executeWithRetry INFO: Retrying request Apr 27, 2011 7:39:45 PM org.apache.commons.httpclient.HttpMethodDirector executeWithRetry INFO: I/O exception (java.net.ConnectException) caught when processing request: Connection refused Apr 27, 2011 7:39:45 PM org.apache.commons.httpclient.HttpMethodDirector executeWithRetry INFO: Retrying request Apr 27, 2011 7:39:45 PM org.apache.commons.httpclient.HttpMethodDirector executeWithRetry INFO: I/O exception (java.net.ConnectException) caught when processing request: Connection refused Apr 27, 2011 7:39:45 PM org.apache.commons.httpclient.HttpMethodDirector executeWithRetry INFO: Retrying request Apr 27, 2011 7:39:45 PM org.apache.solr.handler.ReplicationHandler getReplicationDetails WARNING: Exception while invoking 'details' method for replication on master java.net.ConnectException: Connection refused at java.net.PlainSocketImpl.socketConnect(Native Method) at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:327) at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:193) at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:180) at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:384) at java.net.Socket.connect(Socket.java:546) at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:616) at 
org.apache.commons.httpclient.protocol.ReflectionSocketFactory.createSocket(ReflectionSocketFactory.java:140) at org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:125) at org.apache.commons.httpclient.HttpConnection.open(HttpConnection.java:707) at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$HttpConnectionAdapter.open(MultiThreadedHttpConnectionManager.java:1361) at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:387) at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:171) at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397) at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323) at org.apache.solr.handler.SnapPuller.getNamedListResponse(SnapPuller.java:193) at org.apache.solr.handler.SnapPuller.getCommandResponse(SnapPuller.java:188) at org.apache.solr.
updates not reflected in solr admin
This is in 1.4 - we push updates via SolrJ; our application sees the updates, but when we use the solr admin screens to run test queries, or use Luke to view the schema and field values, it sees the database in its state prior to the commit. I think eventually this seems to propagate, but I'm not clear how often since we generally restart the (tomcat) server in order to get the new commit to be visible. I saw a comment recently (from Lance) that there is (annoying) HTTP caching enabled by default in solrconfig.xml. Does this sound like something that would be caused by that cache? If so, I'd probably want to disable it. Does that affect performance of queries run via SolrJ? Also: why isn't that cache flushed by a commit? Seems weird... -- Michael Sokolov Engineering Director www.ifactory.com
Re: updates not reflected in solr admin
Thanks - we are issuing a commit via SolrJ; I think that's the same thing, right? Or are you saying really we need to do a separate commit (via HTTP) to update the admin console's view? -Mike On 05/02/2011 11:49 AM, Ahmet Arslan wrote: This is in 1.4 - we push updates via SolrJ; our application sees the updates, but when we use the solr admin screens to run test queries, or use Luke to view the schema and field values, it sees the database in its state prior to the commit. I think eventually this seems to propagate, but I'm not clear how often since we generally restart the (tomcat) server in order to get the new commit to be visible. You need to issue a commit from HTTP interface to see the changes made by embedded solr server. solr/update?commit=true
Re: updates not reflected in solr admin
Ah - I didn't expect that. Thank you! On 05/02/2011 12:07 PM, Ahmet Arslan wrote: Thanks - we are issuing a commit via SolrJ; I think that's the same thing, right? Or are you saying really we need to do a separate commit (via HTTP) to update the admin console's view? Yes separate commit is needed.
Re: how to do offline adding/updating index
I think the key question here is what's the best way to perform indexing without affecting search performance, or without affecting it much. If you have a batch of documents to index (say a daily batch that takes an hour to index and merge), you'd like to do that on an offline system, and then when ready, bring that index up for searching. But using Lucene's multiple commit points assumes you use the same box for search and indexing, doesn't it? Something like this is what I have in mind (simple 2-server config here):
Box 1 is live and searching
Box 2 is offline and ready to index
loading begins on Box 2...
loading complete on Box 2 ... commit, optimize
Swap Box 1 and Box 2 (with a load balancer or application config?)
Box 2 is live and searching
Box 1 is offline and ready to index
To make the best use of your resources, you'd then like to start using Box 1 for searching (until indexing starts up again). Perhaps if your load balancing is clever enough, it could be sensitive to the decreased performance of the indexing box and just send more requests to the other one(s). That's probably ideal. -Mike S
Under the hood, Lucene can support this by keeping multiple commit points in the index. So you'd make a new commit whenever you finish indexing the updates from each hour, and record that this is the last "searchable" commit. Then you are free to commit while indexing the next hour's worth of changes, but these commits are not marked as searchable. But... this is a low level Lucene capability and I don't know of any plans for Solr to support multiple commit points in the index.
Mike http://blog.mikemccandless.com On Tue, May 10, 2011 at 9:22 AM, vrpar...@gmail.com wrote: Hello all, indexing with dataimporthandler runs every hour (new records will be added, some records will be updated) note :large data requirement is when indexing is in progress, searching (on already indexed data) should not affect so should i use multicore-with merge and swap or delta query or any other way? Thanks -- View this message in context: http://lucene.472066.n3.nabble.com/how-to-do-offline-adding-updating-index-tp2923035p2923035.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: how to do offline adding/updating index
Thanks - that sounds like what I was hoping for. So the I/O during replication will have *some* impact on search performance, but presumably much less than reindexing and merging/optimizing? -Mike Master/slave replication does this out of the box, easily. Just set the slave to update on Optimize only. Then you can update the master as much as you want. When you are ready to update the slave (the search instance), just optimize the master. On the slave's next cycle check it will refresh itself, quickly, efficiently, minimal impact to search performance. No need to build extra moving parts for swapping search servers or anything like that. -- View this message in context: http://lucene.472066.n3.nabble.com/how-to-do-offline-adding-updating-index-tp2923035p2924426.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: What is correct use of HTMLStripCharFilter in Solr 3.1
It preserves the location of the terms in the original HTML document so that you can highlight terms in HTML. This makes it possible (for instance) to display the entire document, with all the search terms highlighted, or (with some careful surgery) to display formatted HTML (bold, italic, etc) in your search results. -Mike On 05/12/2011 03:42 PM, Jonathan Rochkind wrote: On 5/12/2011 2:55 PM, Ahmet Arslan wrote: I recently upgraded from Solr 1.3 to Solr 3.1 in order to take advantage of the HTMLStripCharFilter. But it isn't working as I expected. You need to strip html tag before analysis phase. If you are using DIH, you can use stripHTML="true" transformer. Wait, then what's the HTMLStripCharFilter for?
document storage
Would anyone care to comment on the merits of storing indexed full-text documents in Solr versus storing them externally? It seems there are three options for us: 1) store documents both in Solr and externally - this is what we are doing now, and gives us all sorts of flexibility, but doesn't seem like the most scalable option, at least in terms of storage space and I/O required when updating/inserting documents. 2) store documents externally: For the moment, the only thing that requires us to store documents in Solr is the need to highlight them, both in search result snippets and in full document views. We are considering hunting for or writing a Highlighter extension that could pull in the document text from an external source (eg filesystem). 3) store documents only in Solr. We'd just retrieve document text as a Solr field value rather than reading from the filesystem. Somehow this strikes me as the wrong thing to do, but it could work: I'm not sure why. A lot of unnecessary merging I/O activity perhaps. Makes it hard to grep the documents or use other filesystem tools, I suppose. Which one of these sounds best to you? Under which circumstances? Are there other possibilities? Thanks! -- Michael Sokolov Engineering Director www.ifactory.com
Re: document storage
On 05/15/2011 11:48 AM, Erick Erickson wrote: Where are the documents coming from? Because storing them ONLY in Solr risks losing them if your index is somehow hosed. In our case, we generally have source documents and can reproduce the index if need be, but that's a good point. Storing them externally only has the advantage that your index will be much smaller, which helps when replicating as you scale. The downside here is that highlighting will be more resource-intensive since you're re-analyzing text in order to highlight. I had been imagining that the Highlighter could use stored term positions so as to avoid re-analysis. Is this incompatible with external storage? We might conceivably need to replicate the documents anyway, even if they are stored externally, in order to make them available to a farm of servers, although a SAN is another possibility here. My main concern about storing internally was the cost of merging (optimizing) the index. Presumably that would be increased if the docs are stored in it. So, as usual, "it depends" (tm). What is the scale you need? What is the QPS you're thinking of supporting? Things are working well at a small scale, and in that environment I think all of these solutions work more or less equally well. We're worrying about 10's of millions of documents and QPS around 50, so I expect we will have some significant challenges in coordinating a cluster of servers, and we're trying to plan as well as we can for that. We expect updates to be performed in a "batch" mode - they don't have to be real-time, but they might need to be daily. -Mike
Re: boolean versus non-boolean search
On 05/16/2011 09:24 AM, Dmitry Kan wrote: Dear list, Might have missed it from the literature and the list, sorry if so, but: SOLR 1.4.1 Consider the query: term1 term2 OR "term1 term2" OR "term1 term3" I think what's happening is that your query gets rewritten into something like: +term1 + (term2? "term1 term2"? term3?) where in my notation "?" means optional, and "+" means required. So any document would match the second clause. -Mike
Re: [POLL] How do you (like to) do logging with Solr
We use log4j explicitly and find it irritating to deal with the built-in JDK logging default. We also have conflicts with other packages that have their own ideas about how to bind slf4j, so the less of this the better, IMO. The 1.6.1 no-op default behavior seems a bit unfortunate as out-of-the-box behavior to me though. Not sure if there's anything to be done about that. Can you log to stderr when there's no logger available? -Mike On 05/16/2011 04:43 AM, Jan Høydahl wrote: Hi, This poll is to investigate how you currently do or would like to do logging with Solr when deploying solr.war to a SEPARATE java application server (such as Tomcat, Resin etc) outside of the bundled "solr/example". For background on how things work in Solr now, see http://wiki.apache.org/solr/SolrLogging and for more info on the SLF4J framework, see http://www.slf4j.org/manual.html Please tick one of the options below with an [X]: [ ] I always use the JDK logging as bundled in solr.war, that's perfect [ ] I sometimes use log4j or another framework and am happy with re-packaging solr.war [X] Give me solr.war WITHOUT an slf4j logger binding, so I can choose at deploy time [ ] Let me choose whether to bundle a binding or not at build time, using an ANT option [ ] What's wrong with the "solr/example" Jetty? I never run Solr elsewhere! [ ] What? Solr can do logging? How cool! Note that NOT bundling a logger binding with solr.war means defaulting to the NOP logger after outputting these lines to stderr: SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder". SLF4J: Defaulting to no-operation (NOP) logger implementation SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details. -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com
Re: Storing, indexing and searching XML documents in Solr
You might want to create a field that's analyzed using HtmlStripCharFilter - this will index all the non-tag/non-attribute text in the document, and if you store the value, will store the entire XML document as well. I've done some work on an XmlStripCharFilter, which does the same thing (only for well-formed XML) using the WSTX XML parser, which provides a little bit of extra XML goodness (like entity resolution and xinclude processing) that HtmlStripCharFilter doesn't. I could share if there's interest. -Mike On 05/18/2011 05:27 PM, Judioo wrote: Great document. I can see how to import the data direct from the database. However it seems as though I need to write xpath's in the config to extract the fields that I wish to transform into an solr document. So it seems that there is no way of storing the document structure in solr as is? 2011/5/18 Yury Kats On 5/18/2011 4:19 PM, Judioo wrote: Any help is greatly appreciated. Pointers to documentation that address my issues is even more helpful. I think this would be a good start: http://wiki.apache.org/solr/DataImportHandler#Usage_with_XML.2BAC8-HTTP_Datasource
Re: [Contribution] Multiword Inline-Prefix Autocomplete Idea
Cool! suggestion: you might want to replace externalVal.toLowerCase().split(" "); with externalVal.toLowerCase().split("\\s+"); also I bet folks might have different ideas about what to do with hyphens, so maybe: externalVal.toLowerCase().split("[-\\s]+"); In fact why not make it a configurable parameter? Or - even better - use some other existing token analysis chain? I'm not sure how to fit that into Solr's architecture: can you analyze a field value and still access the unanalyzed text? -Mike
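To make the difference between the three delimiters suggested above concrete, here is a minimal sketch. The class name SplitDemo and the helper tokenize are made up for illustration; only String.toLowerCase() and String.split() come from the original suggestion.

```java
import java.util.Arrays;

// Hypothetical demo class; not part of the contributed autocomplete code.
final class SplitDemo {

    // Lowercases the value and splits it on the given delimiter regex,
    // mirroring externalVal.toLowerCase().split(...) from the message above.
    static String[] tokenize(String externalVal, String delimiterRegex) {
        return externalVal.toLowerCase().split(delimiterRegex);
    }

    public static void main(String[] args) {
        String externalVal = "Multi-Word  Prefix"; // note the two spaces

        // A single literal space leaves an empty token between the two spaces.
        System.out.println(Arrays.toString(tokenize(externalVal, " ")));        // [multi-word, , prefix]
        // \s+ collapses any run of whitespace into one delimiter.
        System.out.println(Arrays.toString(tokenize(externalVal, "\\s+")));     // [multi-word, prefix]
        // [-\s]+ also treats hyphens as delimiters.
        System.out.println(Arrays.toString(tokenize(externalVal, "[-\\s]+")));  // [multi, word, prefix]
    }
}
```

Passing the regex in as a parameter is one way to get the "configurable" behavior mentioned above without touching the splitting code.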
Re: Solr Highlight Component
A possible workaround is to re-fetch the documents in your result set with a query that is: +id=(id1 or id2 or ... id20) () where id1..20 are the doc ids in your result set would require two round-trips though -Mike On 05/24/2011 08:19 AM, Koji Sekiguchi wrote: (11/05/24 20:56), Lord Khan Han wrote: Hi , Can I limit the terms that the HighlightComponent uses. My query is generally long and I want specific ones to be highlighted and the rest is not highlighted. Is there an option like the SpellCheckComponent. it uses q unless spellcheck.q if specified. Is a hl.q parameter possible? No, but hl.q was proposed by me a year ago: https://issues.apache.org/jira/browse/SOLR-1926 I'm sorry but no progress is there at this moment. koji
Re: solr Invalid Date in Date Math String/Invalid Date String
The "*" endpoint for range terms wasn't implemented yet in 1.4.1 As a workaround, we use very large and very small values. -Mike On 05/27/2011 12:55 AM, alucard001 wrote: Hi all I am using SOLR 1.4.1 (according to solr info), but no matter what date field I use (date or tdate) defined in default schema.xml, I cannot do a search in solr-admin analysis.jsp: fieldtype: date(or tdate) fieldvalue(index): 2006-12-22T13:52:13Z (I type it in manually, no trailing space) fieldvalue(query): The only success case: 2006-12-22T13:52:13Z All search below are failed: * TO NOW [* TO NOW] 2006-12-22T00:00:00Z TO 2006-12-22T23:59:59Z 2006\-12\-22T00\:00\:00Z TO 2006\-12\-22T23\:59\:59Z [2006-12-22T00:00:00Z TO 2006-12-22T23:59:59Z] [2006\-12\-22T00\:00\:00Z TO 2006\-12\-22T23\:59\:59Z] 2006-12-22T00:00:00.000Z TO 2006-12-22T23:59:59.999Z 2006\-12\-22T00\:00\:00\.000Z TO 2006\-12\-22T23\:59\:59\.999Z [2006-12-22T00:00:00.000Z TO 2006-12-22T23:59:59.999Z] [2006\-12\-22T00\:00\:00\.000Z TO 2006\-12\-22T23\:59\:59\.999Z] 2006-12-22T00:00:00Z TO * 2006\-12\-22T00\:00\:00Z TO * [2006-12-22T00:00:00Z TO *] [2006\-12\-22T00\:00\:00Z TO *] 2006-12-22T00:00:00.000Z TO * 2006\-12\-22T00\:00\:00\.000Z TO * [2006-12-22T00:00:00.000Z TO *] [2006\-12\-22T00\:00\:00\.000Z TO *] (vice versa) I get either: Invalid Date in Date Math String or Invalid Date String error What's wrong with it? Can anyone please help me on that? Thank you. -- View this message in context: http://lucene.472066.n3.nabble.com/solr-Invalid-Date-in-Date-Math-String-Invalid-Date-String-tp2991763p2991763.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Obtaining query AST?
I believe there is a query parser that accepts queries formatted in XML, allowing you to provide a parse tree to Solr; perhaps that would get you the control you're after. -Mike On 05/31/2011 02:24 PM, dar...@ontrenet.com wrote: Hi, I want to write my own query expander. It needs to obtain the AST (abstract syntax tree) of an already parsed query string, navigate to certain parts of it (words) and make logical phrases of those words by adding to the AST - where necessary. This cannot be done to the string because the query logic cannot be semantically altered. (e.g. AND, OR, paren's etc) so it must be parsed first. How can this be done with SolrJ? thanks for any tips. Darren
Re: Text field case sensitivity problem
Wildcard queries aren't analyzed, I think? I'm not completely sure what the best workaround is here: perhaps simply lowercasing the query terms yourself in the application. Also - I hope someone more knowledgeable will say that the new HighlightQuery in trunk doesn't have this restriction, but I'm not sure about that. -Mike On 06/14/2011 05:13 PM, Jamie Johnson wrote: Also of interest to me is this returns results http://localhost:8983/solr/select?defType=lucene&q=Person_Name:Kristine On Tue, Jun 14, 2011 at 5:08 PM, Jamie Johnson wrote: I am using the following for my text field: I have a field defined as when I execute a go to the following url I get results http://localhost:8983/solr/select?defType=lucene&q=Person_Name:kris* but if I do http://localhost:8983/solr/select?defType=lucene&q=Person_Name:Kris* I get nothing. I thought the LowerCaseFilterFactory would have handled lowercasing both the query and what is being indexed, am I missing something?
Re: Text field case sensitivity problem
oops, please s/Highlight/Wildcard/ On 06/14/2011 05:31 PM, Mike Sokolov wrote: Wildcard queries aren't analyzed, I think? I'm not completely sure what the best workaround is here: perhaps simply lowercasing the query terms yourself in the application. Also - I hope someone more knowledgeable will say that the new HighlightQuery in trunk doesn't have this restriction, but I'm not sure about that. -Mike On 06/14/2011 05:13 PM, Jamie Johnson wrote: Also of interest to me is this returns results http://localhost:8983/solr/select?defType=lucene&q=Person_Name:Kristine On Tue, Jun 14, 2011 at 5:08 PM, Jamie Johnson wrote: I am using the following for my text field: I have a field defined as when I execute a go to the following url I get results http://localhost:8983/solr/select?defType=lucene&q=Person_Name:kris* but if I do http://localhost:8983/solr/select?defType=lucene&q=Person_Name:Kris* I get nothing. I thought the LowerCaseFilterFactory would have handled lowercasing both the query and what is being indexed, am I missing something?
Re: Text field case sensitivity problem
I wonder whether CharFilters are applied to wildcard terms? I suspect they might be. If that's the case, you could use the MappingCharFilter to perform lowercasing (and strip diacritics too if you want that) -Mike On 06/15/2011 10:12 AM, Jamie Johnson wrote: So simply lower casing the works but can get complex. The query that I'm executing may have things like ranges which require some words to be upper case (i.e. TO). I think this would be much better solved on Solrs end, is there a JIRA about this? On Tue, Jun 14, 2011 at 5:33 PM, Mike Sokolov <mailto:soko...@ifactory.com>> wrote: opps, please s/Highlight/Wildcard/ On 06/14/2011 05:31 PM, Mike Sokolov wrote: Wildcard queries aren't analyzed, I think? I'm not completely sure what the best workaround is here: perhaps simply lowercasing the query terms yourself in the application. Also - I hope someone more knowledgeable will say that the new HighlightQuery in trunk doesn't have this restriction, but I'm not sure about that. -Mike On 06/14/2011 05:13 PM, Jamie Johnson wrote: Also of interest to me is this returns results http://localhost:8983/solr/select?defType=lucene&q=Person_Name:Kristine <http://localhost:8983/solr/select?defType=lucene&q=Person_Name:Kristine> On Tue, Jun 14, 2011 at 5:08 PM, Jamie Johnsonmailto:jej2...@gmail.com>> wrote: I am using the following for my text field: I have a field defined as when I execute a go to the following url I get results http://localhost:8983/solr/select?defType=lucene&q=Person_Name:kris* <http://localhost:8983/solr/select?defType=lucene&q=Person_Name:kris*> but if I do http://localhost:8983/solr/select?defType=lucene&q=Person_Name:Kris* <http://localhost:8983/solr/select?defType=lucene&q=Person_Name:Kris*> I get nothing. I thought the LowerCaseFilterFactory would have handled lowercasing both the query and what is being indexed, am I missing something?
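For the application-side workaround discussed in this thread, lowercasing only the wildcard terms while leaving operators such as TO, AND and OR alone, something along these lines might be enough. This is a rough sketch under assumptions, not anything Solr provides: the class and method names are hypothetical, the query is assumed to be whitespace-separated, and tokens that are part of a range or a quoted phrase are simply left untouched.

```java
import java.util.Locale;

// Hypothetical helper; runs in the client before the query is sent to Solr.
final class WildcardLowercaser {

    // Lowercases the term part of tokens that contain * or ?,
    // e.g. "Person_Name:Kris*" becomes "Person_Name:kris*".
    static String lowercaseWildcardTerms(String query) {
        String[] tokens = query.trim().split("\\s+");
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < tokens.length; i++) {
            if (i > 0) {
                out.append(' ');
            }
            out.append(isPlainWildcard(tokens[i]) ? lowercaseTerm(tokens[i]) : tokens[i]);
        }
        return out.toString();
    }

    // True for tokens with a wildcard that are not part of a range ([..], {..})
    // or a quoted phrase; those are passed through unchanged.
    private static boolean isPlainWildcard(String token) {
        boolean hasWildcard = token.indexOf('*') >= 0 || token.indexOf('?') >= 0;
        boolean rangeOrPhrase = token.matches(".*[\\[\\]{}\"].*");
        return hasWildcard && !rangeOrPhrase;
    }

    // Keeps an optional "field:" prefix as-is and lowercases only the term.
    private static String lowercaseTerm(String token) {
        int colon = token.lastIndexOf(':');
        return token.substring(0, colon + 1)
                + token.substring(colon + 1).toLowerCase(Locale.ROOT);
    }
}
```

With this, Person_Name:Kris* AND date:[* TO NOW] would be sent as Person_Name:kris* AND date:[* TO NOW], so the range keyword TO is not affected. Escaped characters and nested parentheses are not handled.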
Re: Extending Solr Highlighter to pull information from external source
I'd be very interested in this, as well, if you do it before me and are willing to share... A related question I have tried to ask on this list, and have never really gotten a good answer to, is whether it makes sense to just chuck the external storage and treat the lucene index as the primary storage for documents. I have a feeling the answer is no; perhaps because of increased I/O costs for lucene and solr, but I don't really know. I've been considering doing some experimentation, but would really love an expert opinion... -Mike On 06/20/2011 08:41 AM, Jamie Johnson wrote: I am trying to index data where I'm concerned that storing the contents of a specific field will be a bit of a hog so we are planning to retrieve this information as needed for highlighting from an external source. I am looking to extend the default solr highlighting capability to work with information pulled from this external source and it looks like this is possible by extending DefaultSolrHighlighter (line 418 to pull a particular field from external source) for standard highlighting and BaseFragmentsBuilder (line 99) for FastVectorHighlighter. I could just hard code this to say if the field name is a specific value look into the external source, is this the best way to accomplish this? Are there any other extension points to do what I'm suggesting?
Re: Extending Solr Highlighter to pull information from external source
Another option for determining whether to go to external storage would be to examine the SchemaField, see if it is stored, and if not, try to fetch from a file or whatever. That way you won't have to configure anything. -Mike On 06/20/2011 09:46 AM, Jamie Johnson wrote: In my case chucking the external storage is simply not an option. I'll definitely share anything I find, the following is a very simple example of adding text to the default solr highlighter (had to copy a large portion of the class since the method that actually does the highlighting is private along with some classes to get this to run). If you look at the source it should hopefully make sense. String[] docTexts = null; if(fieldName.equals("title")){ SchemaField keyField = schema.getUniqueKeyField(); String key = doc.getValues(keyField.getName())[0]; //I know this field exists and is not multivalued docTexts = doc.getValues(fieldName); //this would be loaded from external store, but below just appends some information if(key != null && key.length > 0){ for(int x = 0; x < docTexts.length; x++){ docTexts[x] = docTexts[x] + " some added text"; } } } I have cheated since I know the name of the field that (title) which I am doing this for but it would probably be useful to allow this to be set on the highlighter class through configuration in solrconfig (I'm not familiar at all with doing this and have spent 0 time looking into it). Once configured the if(fieldName.equals("title")) line would be replaced with something like if(externalFields.contains(fieldName)){...} or something like that. Thoughts/comments? On Mon, Jun 20, 2011 at 9:05 AM, Mike Sokolov <mailto:soko...@ifactory.com>> wrote: I'd be very interested in this, as well, if you do it before me and are willing to share... 
A related question I have tried to ask on this list, and have never really gotten a good answer to, is whether it makes sense to just chuck the external storage and treat the lucene index as the primary storage for documents. I have a feeling the answer is no; perhaps because of increased I/O costs for lucene and solr, but I don't really know. I've been considering doing some experimentation, but would really love an expert opinion... -Mike On 06/20/2011 08:41 AM, Jamie Johnson wrote: I am trying to index data where I'm concerned that storing the contents of a specific field will be a bit of a hog so we are planning to retrieve this information as needed for highlighting from an external source. I am looking to extend the default solr highlighting capability to work with information pulled from this external source and it looks like this is possible by extending DefaultSolrHighlighter (line 418 to pull a particular field from external source) for standard highlighting and BaseFragmentsBuilder (line 99) for FastVectorHighlighter. I could just hard code this to say if the field name is a specific value look into the external source, is this the best way to accomplish this? Are there any other extension points to do what I'm suggesting?
Re: Extending Solr Highlighter to pull information from external source
Yes that sounds about right. I also have in mind an optimization for highlighting so it doesn't need to pull the whole field value. The fast vector highlighter is working with offsets into the field, and should work better w/random access into the field value(s). But that should come as a later optimization. Another thing that bugs me about fvh is that it seems to need to recompute all the terms that matched the query for each retrieved field value when it seems like it ought to be able to make use of information gleaned during the actual query process, but that probably involves some deep change to cache that info during query scoring, and that is beyond my ken at the moment. -Mike On 06/20/2011 10:00 AM, Jamie Johnson wrote: perhaps it should be an array that gets returned to be consistent with getValues(fieldName); On Mon, Jun 20, 2011 at 9:59 AM, Jamie Johnson <mailto:jej2...@gmail.com>> wrote: Yes, in that case the code becomes if(!schemaField.stored()){ SchemaField keyField = schema.getUniqueKeyField(); String key = doc.getValues(keyField.getName())[0]; docTexts = doc.getValues(fieldName); if(key != null && key.length() > 0){ for(int x = 0; x < docTexts.length; x++){ docTexts[x] = docTexts[x] + " some added text"; } } } I'd imagine that we'd want some type of interface to actually pull the text so you can plugin different providers, something like ISolrExternalFieldProvider { public String getFieldContent(String key, SchemaField field); } not sure if there is anything else that interface should include but that's all I would need at present. On Mon, Jun 20, 2011 at 9:54 AM, Mike Sokolov mailto:soko...@ifactory.com>> wrote: Another option for determining whether to go to external storage would be to examine the SchemaField, see if it is stored, and if not, try to fetch from a file or whatever. That way you won't have to configure anything. -Mike On 06/20/2011 09:46 AM, Jamie Johnson wrote: In my case chucking the external storage is simply not an option. 
I'll definitely share anything I find, the following is a very simple example of adding text to the default solr highlighter (had to copy a large portion of the class since the method that actually does the highlighting is private along with some classes to get this to run). If you look at the source it should hopefully make sense. String[] docTexts = null; if(fieldName.equals("title")){ SchemaField keyField = schema.getUniqueKeyField(); String key = doc.getValues(keyField.getName())[0]; //I know this field exists and is not multivalued docTexts = doc.getValues(fieldName); //this would be loaded from external store, but below just appends some information if(key != null && key.length > 0){ for(int x = 0; x < docTexts.length; x++){ docTexts[x] = docTexts[x] + " some added text"; } } } I have cheated since I know the name of the field that (title) which I am doing this for but it would probably be useful to allow this to be set on the highlighter class through configuration in solrconfig (I'm not familiar at all with doing this and have spent 0 time looking into it). Once configured the if(fieldName.equals("title")) line would be replaced with something like if(externalFields.contains(fieldName)){...} or something like that. Thoughts/comments? On Mon, Jun 20, 2011 at 9:05 AM, Mike Sokolov mailto:soko...@ifactory.com>> wrote: I'd be very interested in this, as well, if you do it before me and are willing to share... A related question I have tried to ask on this list, and have never really gotten a good answer to, is whether it makes sense to just chuck the external storage and treat the lucene index as the primary storage for documents. I have a feeling the answer is no; perhaps because of increased I/O costs for lucene and solr, but I don't really know. I've been considering doing some experimentation, but would really love an expert opinion... -Mike On 06/20/2011 08:41 AM, Jamie Johnson wrote: I am trying to in
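The provider idea from this thread (check whether the field is stored, and if not, ask a pluggable provider for the text) could be sketched as below. To keep the example self-contained it does not use any Solr classes: the boolean stored stands in for schemaField.stored(), and ExternalFieldProvider, MapBackedProvider and HighlightTextResolver are hypothetical names. The provider returns an array, as suggested above, to stay consistent with getValues(fieldName).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical plug-in point: fetches field text that is not stored in the index.
interface ExternalFieldProvider {
    String[] getFieldContent(String key, String fieldName);
}

// Toy provider backed by an in-memory map; a real one might read files or a database.
final class MapBackedProvider implements ExternalFieldProvider {
    private final Map<String, String[]> store = new HashMap<String, String[]>();

    void put(String key, String fieldName, String... values) {
        store.put(key + "|" + fieldName, values);
    }

    @Override
    public String[] getFieldContent(String key, String fieldName) {
        String[] values = store.get(key + "|" + fieldName);
        return values != null ? values : new String[0];
    }
}

final class HighlightTextResolver {
    // 'stored' stands in for schemaField.stored(); 'storedValues' for doc.getValues(fieldName).
    static String[] resolve(boolean stored, String[] storedValues,
                            String key, String fieldName, ExternalFieldProvider provider) {
        if (stored) {
            return storedValues;            // normal path: text comes from the index
        }
        if (key == null || key.length() == 0) {
            return new String[0];           // no unique key, nothing to look up
        }
        return provider.getFieldContent(key, fieldName);
    }
}
```

Note that String.length() is a method call, so the check is key.length() > 0 rather than key.length > 0.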
Re: MultiValued facet behavior question
On 06/22/2011 04:01 AM, Dennis de Boer wrote: Hi Bill, as far as I understood now, with the help of my friend, you can't. Multivalued fields don't work that way. You can however always filter the facet results manually in the JSP. You know what the user chose as a facet.
Yes - that is the most sensible suggestion: if you want to display the facets the user chose, and only those, regardless of what was found in the index, then I think you know what to do!
The issue I ran into is when you have additional facet fields. For example when you also have country as a facet field. Now when you search for Cardiologist, it also returns Internist and family doctor as you described. What Solr now also returns for the country list are the countries for Cardiologist, but also for Internist and family doctor. This is not what you want.
I don't think this is accurate. Your query matches some set of documents - the facet values shown will only be those that occur in that set. If some internists' countries are shown when the user selects Cardiologist, that is because those internists are also cardiologists, right? -Mike
Re: MultiValued facet behavior question
We always remove the facet filter when faceting: in other words, for a good user experience, you generally want to show facets based on the query excluding any restriction based on the facets. So in your example (facet B selected), we would continue to show *all* facets. Only if you performed a search using some other filter (proximity, gender, etc), would we restrict the facet list. -Mike On 06/22/2011 09:42 AM, Dennis de Boer wrote: Well, the use case is rather simple. It is not a use case but more a user experience. If I have a list of values I can facet on, for example : A B C D E And I click on B, does it make sense for the user to display B C E after the selection ? Just because items in B are C and E items as well? As a user I chose B because I'm interested in B items. I do not care if they are also C and E items. Technically this is correct, but functional wise, the user doesn't care because it is not what they searched for. In this case they were searching for a cardiologist. Do I care that a cardiologist is also a family doctor? No. So I also do not want to see this as a facet value presented to me in frontend logic. In the item details you can show that the cardiologist is also a family doctor. That is fine, but not as an available facet option, if you just chose a speciality you want to filter on. Does it make sense? On Wed, Jun 22, 2011 at 3:31 PM, lee carroll wrote: Hi Dennis, I think maybe I just disagree. You're not showing facet counts for cardiologists and Family Doctors independently. The Family Doctor count will be all Family Doctors who are also Cardiologists. This allows users to further filter Cardiologists who are also family Doctors. (this could be of use to them ??) If your front end app implements the filtering as a list of fq=xxx then that would make for consistent results ? I don't see how not showing that some cardiologists are also Family Doctors is a better user experience... But again you might have a very specific use case? 
On 22 June 2011 13:44, Dennis de Boer wrote: Hi Lee, since I have the same problem, I might as well try to answer this question. You want this behaviour to make things clear for your users. If they select cardiologists, does it make sense to also show family doctors as a facet value to the user. The same thing goes for the facets that are related to family doctors. They are returned as well, thus making it even more unclear for the end-user. On Wed, Jun 22, 2011 at 2:27 PM, lee carroll wrote: Hi Bill, So that part works. Then when I output the facet, I need a different behavior than the default. I need The facet to only output the value that matches (scored) - NOT ALL VALUES in the multiValued field. I think it makes sense? Why do you need this ? If your use case is faceted navigation then not showing all the facet terms which match your query would be misleading to your users. The fact is your data indicates Ben the cardiologist is also a GP etc. Is it not valid for your users to be able to further filter on cardiologists who are also specialists in x other disciplines ? If the specialisms are mutually exclusive then your data will reflect this. The fact is x number of cardiologists match and x number of GP's match etc I may be missing the point here as you have not said why you need to do this ? cheers lee c On 22 June 2011 09:34, Michael Kuhlmann wrote: On 22.06.2011 09:49, Bill Bell wrote: You can type q=cardiology and match on cardiologist. If stemming did not work you can just add a synonym: cardiology,cardiologist Okay, synonyms are the only way I can think of a realistic match. Stemming won't work on a facet field; you wouldn't get "Cardiologist: 3" as the result but "cardiolog: 3" or something like that instead. Normally, you declare facet fields explicitly for faceting, and not for searching, exactly because stemming and tokenizing on facet fields don't make sense. And the short answer is: No, that's not possible. -Kuli
Re: Solr 3.1 indexing error Invalid UTF-8 character 0xffff
Actually - you are both wrong! It is true that 0xffff is a valid UTF8 character, and not a valid UTF8 byte sequence. But the parser is reporting (or trying to) that 0xffff is an invalid XML character. And Robert - if the wording offends you, you might want to send a note to Tatu (http://jira.codehaus.org/) suggesting that he alter the wording of the error message :) -Mike On 06/27/2011 09:01 AM, Bernd Fehling wrote: On 27.06.2011 14:48, Robert Muir wrote: On Mon, Jun 27, 2011 at 8:47 AM, Bernd Fehling wrote: correct!!! but what i said, is totally different than what you said. you are still wrong. http://www.unicode.org/faq//utf_bom.html see Q: What is a UTF?
Re: Solr 3.1 indexing error Invalid UTF-8 character 0xffff
OK - re-reading your message it seems maybe that is what you were trying to say too, Robert. FWIW I agree with you that XML is rigid, sometimes for purely arbitrary reasons. But nobody has really helped Markus here - unfortunately, there is no easy way out of this mess. What I do to handle issues like this is to wrap the stream I'm handing to the parser in some kind of cleanup stream that handles a few yucky issues. You could, e.g., just strip out invalid XML characters. Maybe Nutch should be doing this, or at least handling the error better? -Mike On 06/27/2011 09:19 AM, Mike Sokolov wrote: Actually - you are both wrong! It is true that 0xffff is a valid UTF8 character, and not a valid UTF8 byte sequence. But the parser is reporting (or trying to) that 0xffff is an invalid XML character. And Robert - if the wording offends you, you might want to send a note to Tatu (http://jira.codehaus.org/) suggesting that he alter the wording of the error message :) -Mike On 06/27/2011 09:01 AM, Bernd Fehling wrote: On 27.06.2011 14:48, Robert Muir wrote: On Mon, Jun 27, 2011 at 8:47 AM, Bernd Fehling wrote: correct!!! but what i said, is totally different than what you said. you are still wrong. http://www.unicode.org/faq//utf_bom.html see Q: What is a UTF?
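The "strip out invalid XML characters" idea could look roughly like the following. It is a sketch that works on a String rather than wrapping a stream as described above, and the class and method names are made up; the character ranges are the Char production from the XML 1.0 specification (#x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD, #x10000-#x10FFFF).

```java
// Hypothetical helper; apply it to field values before they are serialized to XML.
final class XmlCharCleaner {

    // True if the code point is allowed by the XML 1.0 "Char" production.
    static boolean isValidXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Returns a copy of the input with every disallowed code point removed.
    // Unpaired surrogates show up as code points in 0xD800-0xDFFF and are dropped too.
    static String stripInvalidXmlChars(String in) {
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); ) {
            int cp = in.codePointAt(i);
            if (isValidXml10(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

Iterating by code point rather than by char keeps valid supplementary characters (surrogate pairs) intact while still removing U+FFFE, U+FFFF and control characters such as \0.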
Re: Solr 3.1 indexing error Invalid UTF-8 character 0xffff
I don't think this is a BOM - that would be 0xfeff. Anyway the problem we usually see w/processing XML with BOMs is in UTF8 (which really doesn't need a BOM since it's a byte stream anyway), in which if you transform the stream (bytes) into a reader (chars) before the xml parser can see it, the parser treats the BOM as white space. But in that case you typically get a more specific error about invalid characters in the XML prolog, not just a random invalid character error. -Mike On 06/27/2011 10:33 AM, lee carroll wrote: Hi Markus I've seen similar issue before (but not with solr) when processing files as xml. In our case the problem was due to processing a utf16 file with a byte order mark. This presents itself as 0xffff to the xml parser which is not used by utf8 (the bom unicode would be represented as efbfbf in utf8) This caused the utf8 aware parser to choke. I don't want to get involved in any unicode / utf war as I'm confused enough as it stands but could you check for utf16 files before processing ? lee c On 27 June 2011 14:26, Thomas Fischer wrote: Hello, On 27.06.2011 at 12:40, Markus Jelsma wrote: Hi, I came across the indexing error below. It happened in a huge batch update from Nutch with SolrJ 3.1. Since the crawl was huge it is very hard to trace the error back to a specific document. So i try my luck here: anyone seen this before with SolrJ 3.1? Anything else on the Nutch part i should have taken care of? Thanks! Jun 27, 2011 10:24:28 AM org.apache.solr.core.SolrCore execute INFO: [] webapp=/solr path=/update params={wt=javabin&version=2} status=500 QTime=423 Jun 27, 2011 10:24:28 AM org.apache.solr.common.SolrException log SEVERE: java.lang.RuntimeException: [was class java.io.CharConversionException] Invalid UTF-8 character 0xffff at char #1142033, byte #1155068) at com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18) and loads of other rubbish and ... 26 more I see this as a problem of solr error-reporting. 
This is not only obnoxiously "loud" (white on grey with oversized fonts), but less useful than it should be. Instead of telling the user where the error occurred (i.e. while reading which file, which column at which line) it unravels the stack. This is useless if the program just choked on some unexpected input, like a typo in a schema of config file or an invalid character in a file to be indexed. I don't know if this is due to the Tomcat, the logging system of solr itself, but it is annoying. And yes, I've seen something like this before and found the error not by inspecting solr but by opening the suspected files with an appropriate browser (e.g. Firefox) which tells me exactly where something goes wrong. All the best Thomas
Re: Solr 3.1 indexing error Invalid UTF-8 character 0xffff
Markus - if you want to make sure not to offend XML parsers, you should strip all characters not in this list: http://en.wikipedia.org/wiki/XML#Valid_characters You'll see that article talks about XML 1.1, which accepts a wider range of characters than XML 1.0, and I believe the Woodstox parser used in Solr adheres to that convention. But note the restriction about control characters needing to be encoded - I'm not sure, but it might also be best to strip out chars < 32 except for \r, \n and \t. You definitely need to remove \0 also... On 06/27/2011 11:59 AM, Markus Jelsma wrote: Of course it doesn't work like this: use AND instead of OR! On Monday 27 June 2011 17:50:01 Markus Jelsma wrote: Hi all, thanks for your comments. I seem to have fixed it by now by simply stripping away all non-character codepoints [1] by iterating over the individual chars and checking them against: if (ch % 0x10000 != 0xffff || ch % 0x10000 != 0xfffe || (ch <= 0xfdd0 && ch >= 0xfdef)) { pass; } Comments? [1]: http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:Noncharacter_Code_Point=True:] On Monday 27 June 2011 12:40:16 Markus Jelsma wrote: Hi, I came across the indexing error below. It happened in a huge batch update from Nutch with SolrJ 3.1. Since the crawl was huge it is very hard to trace the error back to a specific document. So i try my luck here: anyone seen this before with SolrJ 3.1? Anything else on the Nutch part i should have taken care of? Thanks! 
Jun 27, 2011 10:24:28 AM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/update params={wt=javabin&version=2} status=500 QTime=423
Jun 27, 2011 10:24:28 AM org.apache.solr.common.SolrException log
SEVERE: java.lang.RuntimeException: [was class java.io.CharConversionException] Invalid UTF-8 character 0xffff at char #1142033, byte #1155068)
at com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)
at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731)
at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3657)
at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809)
at org.apache.solr.handler.XMLLoader.readDoc(XMLLoader.java:287)
at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:146)
at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:77)
at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:67)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1368)
at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
at org.mortbay.jetty.Server.handle(Server.java:326)
at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:945)
at org.m
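For reference, the noncharacter check discussed in this thread can be written as a single predicate. This is a sketch under assumptions, not the code Markus ended up using: the class and method names are hypothetical, and the ranges follow the Unicode definition of noncharacters (U+FDD0 to U+FDEF, plus the last two code points of every plane, i.e. those ending in FFFE or FFFF).

```java
// Hypothetical helper for stripping Unicode noncharacters before indexing.
final class NoncharacterFilter {

    // True for U+FDD0..U+FDEF and for any code point whose low 16 bits are FFFE or FFFF.
    static boolean isNoncharacter(int cp) {
        return (cp & 0xFFFE) == 0xFFFE || (cp >= 0xFDD0 && cp <= 0xFDEF);
    }

    // Returns a copy of the input without noncharacter code points.
    static String strip(String in) {
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); ) {
            int cp = in.codePointAt(i);
            if (!isNoncharacter(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

Writing the test as "is a noncharacter" and negating it once avoids the AND/OR mix-up mentioned above, and the range check reads lower bound first.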
Re: Looking for Custom Highlighting guidance
Does the phonetic analysis preserve the offsets of the original text field? If so, you should probably be able to hack up FastVectorHighlighter to do what you want. -Mike On 06/29/2011 02:22 PM, Jamie Johnson wrote: I have a schema with a text field and a text_phonetic field and would like to perform highlighting on them in such a way that the tokens that match are combined. What would be a reasonable way to accomplish this?
Re: Looking for Custom Highlighting guidance
It's going to be a bit complicated, but I would start by looking at providing a facility for merging an array of FieldTermStacks. The constructor for FieldTermStack() takes a fieldName and builds up a list of TermInfos (terms with positions and offsets): I *think* that if you make two of these, merge them, and hand that to the FieldPhraseList constructor (this is done in the main FVH class), you should get what you want. This is a bit speculative; I haven't tried it. -Mike On 06/30/2011 08:26 AM, Jamie Johnson wrote: Thanks for the suggestion Mike, I will give that a shot. Having no familiarity with FastVectorHighlighter is there somewhere specific I should be looking? On Wed, Jun 29, 2011 at 3:20 PM, Mike Sokolov wrote: Does the phonetic analysis preserve the offsets of the original text field? If so, you should probably be able to hack up FastVectorHighlighter to do what you want. -Mike On 06/29/2011 02:22 PM, Jamie Johnson wrote: I have a schema with a text field and a text_phonetic field and would like to perform highlighting on them in such a way that the tokens that match are combined. What would be a reasonable way to accomplish this?
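To make the merge step concrete, here is a rough sketch in Python rather than against Lucene's Java classes. The `TermInfo` record and `merge_term_stacks` function are hypothetical stand-ins for the idea only; they are not the FastVectorHighlighter API, and whether FieldPhraseList would accept a merged stack is just as untested here as in the suggestion above. It assumes each stack is already sorted by position and start offset, and that a phonetic hit covering the same character span as a text hit should be collapsed into one entry.

```python
from dataclasses import dataclass
from heapq import merge


@dataclass(frozen=True)
class TermInfo:
    """Hypothetical stand-in for a matched term with its offsets and position."""
    text: str
    start: int
    end: int
    position: int


def merge_term_stacks(*stacks):
    """Merge offset-sorted term lists, keeping one entry per character span.

    Each input list must already be sorted by (position, start).
    When two fields match the same span, the entry from the earlier
    argument wins, because heapq.merge is stable across its inputs.
    """
    merged = []
    seen_spans = set()
    for info in merge(*stacks, key=lambda t: (t.position, t.start)):
        span = (info.start, info.end)
        if span not in seen_spans:
            seen_spans.add(span)
            merged.append(info)
    return merged


# Illustrative data only: one hit from the text field, two from text_phonetic.
text_stack = [TermInfo("smith", 0, 5, 0)]
phonetic_stack = [TermInfo("SM0", 0, 5, 0), TermInfo("JNS", 6, 11, 1)]
combined = merge_term_stacks(text_stack, phonetic_stack)
```

This only works if both fields report offsets into the same original text, which is the question raised at the start of the thread.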
Re: Text field case sensitivity problem
Yes, after posting that response, I read some more and came to the same conclusion... there seems to be some interest on the dev list in building a capability to specify an analysis chain for use with wildcard and related queries, but it doesn't exist now. -Mike On 06/30/2011 10:34 AM, Jamie Johnson wrote: I think my answer is here... "On wildcard and fuzzy searches, no text analysis is performed on the search word. " taken from http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#Analyzers On Thu, Jun 30, 2011 at 10:23 AM, Jamie Johnson wrote: I'm not familiar with the CharFilters, I'll look into those now. Is the solr.LowerCaseFilterFactory not handling wildcards the expected result or is this a bug? On Wed, Jun 15, 2011 at 4:34 PM, Mike Sokolov wrote: I wonder whether CharFilters are applied to wildcard terms? I suspect they might be. If that's the case, you could use the MappingCharFilter to perform lowercasing (and strip diacritics too if you want that) -Mike On 06/15/2011 10:12 AM, Jamie Johnson wrote: So simply lower casing the works but can get complex. The query that I'm executing may have things like ranges which require some words to be upper case (i.e. TO). I think this would be much better solved on Solrs end, is there a JIRA about this? On Tue, Jun 14, 2011 at 5:33 PM, Mike Sokolov wrote: opps, please s/Highlight/Wildcard/ On 06/14/2011 05:31 PM, Mike Sokolov wrote: Wildcard queries aren't analyzed, I think? I'm not completely sure what the best workaround is here: perhaps simply lowercasing the query terms yourself in the application. Also - I hope someone more knowledgeable will say that the new HighlightQuery in trunk doesn't have this restriction, but I'm not sure about that. 
-Mike On 06/14/2011 05:13 PM, Jamie Johnson wrote: Also of interest to me is this returns results http://localhost:8983/solr/select?defType=lucene&q=Person_Name:Kristine On Tue, Jun 14, 2011 at 5:08 PM, Jamie Johnson wrote: I am using the following for my text field: I have a field defined as when I execute a go to the following url I get results http://localhost:8983/solr/select?defType=lucene&q=Person_Name:kris* but if I do http://localhost:8983/solr/select?defType=lucene&q=Person_Name:Kris* I get nothing. I thought the LowerCaseFilterFactory would have handled lowercasing both the query and what is being indexed, am I missing something?
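A minimal sketch of the application-side workaround discussed in this thread, assuming the client builds the query string itself. The function name `lowercase_wildcard_terms` is made up for illustration. It lowercases only whitespace-separated tokens that contain a wildcard character, so operators such as TO and AND, and terms that Solr analyzes normally, are left alone. It does not handle quoted phrases or escaped characters.

```python
import re


def lowercase_wildcard_terms(query):
    """Lowercase the value part of tokens that contain * or ?.

    Tokens without wildcards are returned unchanged, since the field's
    analyzer (e.g. LowerCaseFilterFactory) still applies to them.
    """
    def fix(match):
        token = match.group(0)
        if "*" not in token and "?" not in token:
            return token
        field, sep, value = token.partition(":")
        if not sep:
            # No field prefix: the whole token is the term.
            return token.lower()
        return field + sep + value.lower()

    return re.sub(r"\S+", fix, query)


example = lowercase_wildcard_terms("Person_Name:Kris* AND date:[A TO B]")
```

With this in place, the failing query from the original post would be sent as `Person_Name:kris*`, which is the form that returned results.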
Re: Text field case sensitivity problem
Yes, and this too: https://issues.apache.org/jira/browse/SOLR-219 On 06/30/2011 12:46 PM, Erik Hatcher wrote: Jamie - there is a JIRA about this, at least one:<https://issues.apache.org/jira/browse/SOLR-218> Erik On Jun 15, 2011, at 10:12 , Jamie Johnson wrote: So simply lower casing the works but can get complex. The query that I'm executing may have things like ranges which require some words to be upper case (i.e. TO). I think this would be much better solved on Solrs end, is there a JIRA about this? On Tue, Jun 14, 2011 at 5:33 PM, Mike Sokolov wrote: opps, please s/Highlight/Wildcard/ On 06/14/2011 05:31 PM, Mike Sokolov wrote: Wildcard queries aren't analyzed, I think? I'm not completely sure what the best workaround is here: perhaps simply lowercasing the query terms yourself in the application. Also - I hope someone more knowledgeable will say that the new HighlightQuery in trunk doesn't have this restriction, but I'm not sure about that. -Mike On 06/14/2011 05:13 PM, Jamie Johnson wrote: Also of interest to me is this returns results http://localhost:8983/solr/select?defType=lucene&q=Person_Name:Kristine On Tue, Jun 14, 2011 at 5:08 PM, Jamie Johnson wrote: I am using the following for my text field: I have a field defined as when I execute a go to the following url I get results http://localhost:8983/solr/select?defType=lucene&q=Person_Name:kris* but if I do http://localhost:8983/solr/select?defType=lucene&q=Person_Name:Kris* I get nothing. I thought the LowerCaseFilterFactory would have handled lowercasing both the query and what is being indexed, am I missing something?
Re: TermVectors and custom queries
Yes, that's right. But at the moment the HL code basically has to reconstruct and re-run your query - it doesn't have any special knowledge. There's some work going on to try and fix that, but it seems like it's going to require some fairly major deep re-plumbing. -Mike On 07/01/2011 07:54 AM, Jamie Johnson wrote: How would I know which ones were the ones I wanted? I don't see how from a query I couldn't match up the term vectors that met the query. Seems like what needs to be done is have the highlighting on the solr end where you have more access to the information I'm looking for. Sound about right? On Fri, Jul 1, 2011 at 7:26 AM, Michael Sokolov wrote: I think that's all you can do, although there is a callback-style interface that might save some time (or space). You still need to iterate over all of the vectors, at least until you get the one you want. -Mike On 6/30/2011 4:53 PM, Jamie Johnson wrote: Perhaps a better question, is this possible? On Mon, Jun 27, 2011 at 5:15 PM, Jamie Johnsonwrote: I have a field named content with the following definition I'm now trying to execute a query against content and get back the term vectors for the pieces that matched my query, but I must be messing something up. My query is as follows: http://localhost:8983/solr/select/?qt=tvrh&q=content:test&fl=content&tv.all=true where the word test is in my content field. When I get information back though I am getting the term vectors for all of the tokens in that field. How do I get back just the ones that match my search?
Re: How do I add a custom field?
Did you ever commit? On 07/07/2011 01:58 PM, Gabriele Kahlout wrote: so, how about this:
    Document doc = searcher.doc(i); // i get the doc
    doc.removeField("wc"); // remove the field in case there's
    addWc(doc, docLength); // add the new field
    writer.updateDocument(new Term("id", Integer.toString(i++)), doc); // update the doc
For some reason it doesn't get added to the index. Should it? On 7/3/11, Michael Sokolov wrote: You'll need to index the field. I would think you would want to index/store the field along with the associated document, in which case you'll have to reindex the documents as well - there's no single-field update capability in Lucene (yet?). -Mike On 7/3/2011 1:09 PM, Gabriele Kahlout wrote: Is there a way I can compute and add the field to all indexed documents without re-indexing? MyField counts the number of terms per document (unique word count). On Sun, Jul 3, 2011 at 12:24 PM, lee carroll wrote: Hi Gabriele, Did you index any docs with your new field? The results will just bring back docs and what fields they have. They won't bring back "null" fields just because they are in your schema. Lucene is schema-less. Solr adds the schema to make it nice to administer and very powerful to use. On 3 July 2011 11:01, Gabriele Kahlout wrote: Hello, I want to have an additional field that appears for every document in search results. I understand that I should do this by adding the field to the schema.xml, so I add: Then I restart Solr (so that it loads the new schema.xml) and make a query specifying that it should return myField too, but it doesn't. Will it do so only for newly indexed documents? Am I missing something?
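The commit question can be illustrated with a toy model. The `ToyIndex` class below is purely hypothetical and is not Lucene's IndexWriter or IndexSearcher; it only mimics the visibility rule that buffered updates are not searchable until they are committed. In real Lucene an already-open searcher also generally keeps its old view until it is reopened.

```python
class ToyIndex:
    """Toy model of commit visibility; not Lucene's API."""

    def __init__(self):
        self._committed = {}  # what searches can see
        self._pending = {}    # buffered updates, invisible until commit

    def update_document(self, doc_id, doc):
        # Replace the whole document; there is no single-field update.
        self._pending[doc_id] = dict(doc)

    def commit(self):
        self._committed.update(self._pending)
        self._pending.clear()

    def search(self, doc_id):
        doc = self._committed.get(doc_id)
        return dict(doc) if doc is not None else None


index = ToyIndex()
index.update_document("1", {"id": "1", "wc": 42})
before_commit = index.search("1")   # update is still pending
index.commit()
after_commit = index.search("1")    # update is now visible
```

In the toy model the new `wc` field only shows up in `after_commit`, which matches the symptom described in the thread if the writer was never committed.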
Re: How do I specify a different analyzer at search-time?
There is a syntax that allows you to specify different analyzers to use for indexing and querying, in schema.xml (separate <analyzer type="index"> and <analyzer type="query"> elements within a fieldType). But if you don't do that, it should use the same analyzer in both cases. -Mike On 07/11/2011 10:58 AM, Gabriele Kahlout wrote: With a lucene QueryParser instance it's possible to set the analyzer in use. I suspect Solr doesn't use the same analyzer it used at indexing, defined in schema.xml, but I cannot verify that without the queryparser instance. From Jan's diagram it seems this is set in the SearchHandler's init. Is it? How? On Sun, Apr 10, 2011 at 11:05 AM, Jan Høydahl wrote: Looks really good, but two bits that i think might confuse people are the implications that a "Query Parser" then invokes a series of search components; and that "analysis" (and the pieces of an analyzer chain) are what do lookups in the underlying lucene index. the first might just be the ambiguity of "Query" .. using the term "request parser" might make more sense, in comparison to the "update parsing" from the other side of the diagram. Thanks for commenting. Yea, the purpose is more to show a conceptual rather than actual relation between the different components, focusing on the flow. A 100% technically correct diagram would be too complex for beginners to comprehend, although it could certainly be useful for developers. I've removed the arrow between QueryParser and search components to clarify. The boxes first and foremost show that query parsing and response writers are within the realm of search request handler. the analysis piece is a little harder to fix cleanly. you really want the end of the analysis chain to feed back up to the search components, and then show it (most of the search components really) talking to the Lucene index. Yea, I know. Showing how Faceting communicates with the main index and spellchecker with its spellchecker index could also be useful, but I think that would be for another more detailed diagram.
I felt it was more important for beginners to realize visually that analysis happens both at index and search time, and that the analyzers align 1:1. At this stage in the diagram I often explain the importance of matching up the analysis on both sides to get a match in the index. -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com
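The point about matching analysis on both sides can be shown with a toy example. The functions below are hypothetical and only imitate what an analyzer chain does (whitespace tokenizing plus optional lowercasing); they are not Solr or Lucene code.

```python
def index_analyzer(text):
    """Toy index-time chain: whitespace tokenizer followed by lowercasing."""
    return [token.lower() for token in text.split()]


def matching_query_analyzer(text):
    """Toy query-time chain that mirrors the index-time chain."""
    return [token.lower() for token in text.split()]


def mismatched_query_analyzer(text):
    """Toy query-time chain that forgets to lowercase."""
    return text.split()


def matches(indexed_terms, query_terms):
    """A document matches only if every query term was indexed verbatim."""
    return all(term in indexed_terms for term in query_terms)


indexed = set(index_analyzer("Kristine Johnson"))
hit = matches(indexed, matching_query_analyzer("Kristine"))
miss = matches(indexed, mismatched_query_analyzer("Kristine"))
```

Here `hit` is true and `miss` is false: the index holds `kristine`, so a query-side chain that emits `Kristine` finds nothing.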
Re: strip html from data
I think you need to list the charFilter earlier in the analysis chain, before the tokenizer. Probably Solr should tell you this... -Mike On 07/25/2011 09:03 AM, Merlin Morgenstern wrote: Sounds logical. I just changed it to the following, restarted and reindexed with commit: Unfortunately that did not fix the error. There are still tags inside the data. Although I believe there are fewer than before, I cannot prove that. Fact is, there are still html tags inside the data. Any other ideas what the problem could be? 2011/7/25 Markus Jelsma wrote: You've three analyzer elements, I wonder what that would do. You need to add the char filter to the index-time analyzer. On Monday 25 July 2011 13:09:14 Merlin Morgenstern wrote: Hi there, I am trying to strip html tags from the data before adding the documents to the index. To do that I altered schema.xml like this: Unfortunately this does not work, the html tags like are still present after restarting and reindexing. I also tried HTMLStripTransformer, but this did not work either. Has anybody an idea how to get this done? Thank you in advance for any hint. Merlin -- Markus Jelsma - CTO - Openindex http://www.linkedin.com/in/markus17 050-8536620 / 06-50258350
Re: strip html from data
Hmm - I'm not sure about that; see https://issues.apache.org/jira/browse/SOLR-2119 On 07/25/2011 12:01 PM, Markus Jelsma wrote: charFilters are executed first regardless of their position in the analyzer. On Monday 25 July 2011 17:53:59 Mike Sokolov wrote: I think you need to list the charFilter earlier in the analysis chain, before the tokenizer. Probably Solr should tell you this... -Mike On 07/25/2011 09:03 AM, Merlin Morgenstern wrote: Sounds logical. I just changed it to the following, restarted and reindexed with commit: Unfortunately that did not fix the error. There are still tags inside the data. Although I believe there are fewer than before, I cannot prove that. Fact is, there are still html tags inside the data. Any other ideas what the problem could be? 2011/7/25 Markus Jelsma wrote: You've three analyzer elements, I wonder what that would do. You need to add the char filter to the index-time analyzer. On Monday 25 July 2011 13:09:14 Merlin Morgenstern wrote: Hi there, I am trying to strip html tags from the data before adding the documents to the index. To do that I altered schema.xml like this: Unfortunately this does not work, the html tags like are still present after restarting and reindexing. I also tried HTMLStripTransformer, but this did not work either. Has anybody an idea how to get this done? Thank you in advance for any hint. Merlin -- Markus Jelsma - CTO - Openindex http://www.linkedin.com/in/markus17 050-8536620 / 06-50258350
Re: strip html from data
Hmm that looks like it's working fine. I stand corrected. On 07/25/2011 12:24 PM, Markus Jelsma wrote: I've seen that issue too and read comments on the list, yet I've never had trouble with the order; don't know what's going on. Check this analyzer, I've moved the charFilter to the bottom: The analysis chain still does its job as I expect for the input: bla bla

Index Analyzer
org.apache.solr.analysis.HTMLStripCharFilterFactory {luceneMatchVersion=LUCENE_34}
    text: bla bla
org.apache.solr.analysis.WhitespaceTokenizerFactory {luceneMatchVersion=LUCENE_34}
    position: 1 2
    term text: bla bla
    startOffset: 6 10
    endOffset: 9 13
org.apache.solr.analysis.WordDelimiterFilterFactory {splitOnCaseChange=1, generateNumberParts=1, catenateWords=1, luceneMatchVersion=LUCENE_34, generateWordParts=1, catenateAll=0, catenateNumbers=1}
    position: 1 2
    term text: bla bla
    startOffset: 6 10
    endOffset: 9 13
    type: word word
org.apache.solr.analysis.LowerCaseFilterFactory {luceneMatchVersion=LUCENE_34}
    position: 1 2
    term text: bla bla
    startOffset: 6 10
    endOffset: 9 13
    type: word word
org.apache.solr.analysis.SynonymFilterFactory {synonyms=synonyms.txt, expand=true, ignoreCase=false, luceneMatchVersion=LUCENE_34}
    position: 1 2
    term text: bla bla
    type: word word
    startOffset: 6 10
    endOffset: 9 13
org.apache.solr.analysis.StopFilterFactory {words=stopwords.txt, ignoreCase=false, luceneMatchVersion=LUCENE_34}
    position: 1 2
    term text: bla bla
    type: word word
    startOffset: 6 10
    endOffset: 9 13
org.apache.solr.analysis.ASCIIFoldingFilterFactory {luceneMatchVersion=LUCENE_34}
    position: 1 2
    term text: bla bla
    type: word word
    startOffset: 6 10
    endOffset: 9 13
org.apache.solr.analysis.SnowballPorterFilterFactory {protected=protwords.txt, language=Dutch, luceneMatchVersion=LUCENE_34}
    position: 1 2
    term text: bla bla
    keyword: false false
    type: word word
    startOffset: 6 10
    endOffset: 9 13
org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {luceneMatchVersion=LUCENE_34}
    position: 1 2
    term text: bla bla
    keyword: false false
    type: word word
    startOffset: 6 10
    endOffset: 9 13

On Monday 25 July 2011 18:07:29 Mike Sokolov wrote: Hmm - I'm not sure about that; see https://issues.apache.org/jira/browse/SOLR-2119 On 07/25/2011 12:01 PM, Markus Jelsma wrote: charFilters are executed first regardless of their position in the analyzer. On Monday 25 July 2011 17:53:59 Mike Sokolov wrote: I think you need to list the charFilter earlier in the analysis chain, before the tokenizer. Probably Solr should tell you this... -Mike On 07/25/2011 09:03 AM, Merlin Morgenstern wrote: Sounds logical. I just changed it to the following, restarted and reindexed with commit: Unfortunately that did not fix the error. There are still tags inside the data. Although I believe there are fewer than before, I cannot prove that. Fact is, there are still html tags inside the data. Any other ideas what the problem could be? 2011/7/25 Markus Jelsma wrote: You've three analyzer elements, I wonder what that would do. You need to add the char filter to the index-time analyzer. On Monday 25 July 2011 13:09:14 Merlin Morgenstern wrote: Hi there, I am trying to strip html tags from the data before adding the documents to the index. To do that I altered schema.xml like this: Unfortunately this does not work, the html tags like are still present after restarting and reindexing. I also tried HTMLStripTransformer, but this did not work either. Has anybody an idea how to get this done? Thank you in advance for any hint. Merlin -- Markus Jelsma - CTO - Openindex
Re: slow highlighting because of stemming
I'm not sure I would identify stemming as the culprit here. Do you have very large documents? If so, there is a patch for FVH committed to limit the number of phrases it looks at; see hl.phraseLimit, but this won't be available until 3.4 is released. You can also limit the amount of each document that is analyzed by the regular Highlighter using maxDocCharsToAnalyze (and maybe this applies to FVH? not sure) Using RegexFragmenter is also probably slower than something like SimpleFragmenter. There is work to implement faster highlighting for Solr/Lucene, but it depends on some basic changes to the search architecture so it might be a while before that becomes available. See https://issues.apache.org/jira/browse/LUCENE-3318 if you're interested in following that development. -Mike On 07/29/2011 04:55 AM, Orosz György wrote: Dear all, I am quite new about using Solr, but would like to ask your help. I am developing an application which should be able to highlight the results of a query. For this I am using regex fragmenter: 500 0.5 true [-\w ,/\n\"']{20,300}[.?!] dokumentum_syn_query The field is indexed with term vectors and offsets: The highlighting works well, excepts that its really slow. I realized that this is because the highlighter/fragmenter does stemming for all the results documents again. Could you please help me why does it happen an how should I avoid this? (I thought that using fastvectorhighlighter will solve my problem, but it didn't) Thanks in advance! Gyuri Orosz
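As a sketch of the tuning knobs mentioned in this reply, the snippet below builds a highlighting request that caps how much of each document is analyzed and swaps the regex fragmenter for the simpler gap fragmenter. It assumes the standard Solr request parameters `hl.maxAnalyzedChars` and `hl.fragmenter`; the host, the field name (taken from the configuration quoted above) and the query term `teszt` are illustrative assumptions, not values from the original setup.

```python
from urllib.parse import urlencode


def highlight_params(query, field, max_chars=51200, fragmenter="gap"):
    """Build request parameters for a highlighted search.

    max_chars limits how many characters of each document the
    highlighter analyzes; fragmenter is "gap" or "regex".
    """
    return {
        "q": query,
        "hl": "true",
        "hl.fl": field,
        "hl.maxAnalyzedChars": max_chars,
        "hl.fragmenter": fragmenter,
    }


params = highlight_params("dokumentum_syn_query:teszt", "dokumentum_syn_query")
url = "http://localhost:8983/solr/select?" + urlencode(params)
```

Lowering `max_chars` trades completeness for speed on very large documents, so it is worth measuring both settings against real data before changing fragmenters.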
ideas for versioning query?
A customer has an interesting problem: some documents will have multiple versions. In search results, only the most recent version of a given document should be shown. The trick is that each user has access to a different set of document versions, and each user should see only the most recent version of a document that they have access to. Is this something that can reasonably be solved with grouping? In 3.x? I haven't followed the grouping discussions closely: would someone point me in the right direction please? -- Michael Sokolov Engineering Director www.ifactory.com
Re: ideas for versioning query?
Thanks, Tomas. Yes we are planning to keep a "current" flag in the most current document. But there are cases where, for a given user, the most current document is not that one, because they only have access to some older documents. I took a look at http://wiki.apache.org/solr/FieldCollapsing and it seems as if it will do what we need here. My one concern is that it might not be efficient at computing group.ngroups for a very large number of groups, which we would ideally want. Is that something I should be worried about? -Mike On 08/01/2011 10:08 AM, Tomás Fernández Löbbe wrote: Hi Michael, I guess this could be solved using grouping as you said. Documents inside a group can be sorted on a field (in your case, the version field, see parameter group.sort), and you can show only the first one. It will be more complex to show facets (post grouping faceting is work in progress but still not committed to the trunk). It would be easier from the Solr side if you could do something at index time, like indicating which document is the "current" one and which one is an old one (you would need to update the old document whenever a new version is indexed). Regards, Tomás On Mon, Aug 1, 2011 at 10:47 AM, Mike Sokolov wrote: A customer has an interesting problem: some documents will have multiple versions. In search results, only the most recent version of a given document should be shown. The trick is that each user has access to a different set of document versions, and each user should see only the most recent version of a document that they have access to. Is this something that can reasonably be solved with grouping? In 3.x? I haven't followed the grouping discussions closely: would someone point me in the right direction please? -- Michael Sokolov Engineering Director www.ifactory.com
Re: ideas for versioning query?
I think a 30% increase is acceptable. Yes, I think we'll try it. Although our case is more like # groups ~ # documents / N, where N is a smallish number (~1-5?). We are planning for a variety of different index sizes, but aiming for a sweet spot around a few M docs. -Mike On 08/01/2011 11:00 AM, Martijn v Groningen wrote: Hi Mike, how many docs and groups do you have in your index? I think the group.sort option fits your requirements. If I remember correctly group.ngroups=true adds something like 30% extra time on top of the search request with grouping, but that was on my local test dataset (~30M docs, ~8000 groups) and my machine. You might encounter different search times when setting group.ngroups=true. Martijn 2011/8/1 Mike Sokolov wrote: Thanks, Tomas. Yes we are planning to keep a "current" flag in the most current document. But there are cases where, for a given user, the most current document is not that one, because they only have access to some older documents. I took a look at http://wiki.apache.org/solr/FieldCollapsing and it seems as if it will do what we need here. My one concern is that it might not be efficient at computing group.ngroups for a very large number of groups, which we would ideally want. Is that something I should be worried about? -Mike On 08/01/2011 10:08 AM, Tomás Fernández Löbbe wrote: Hi Michael, I guess this could be solved using grouping as you said. Documents inside a group can be sorted on a field (in your case, the version field, see parameter group.sort), and you can show only the first one. It will be more complex to show facets (post grouping faceting is work in progress but still not committed to the trunk). It would be easier from the Solr side if you could do something at index time, like indicating which document is the "current" one and which one is an old one (you would need to update the old document whenever a new version is indexed). Regards, Tomás On Mon, Aug 1, 2011 at 10:47 AM, Mike Sokolov wrote: A customer has an interesting problem: some documents will have multiple versions. In search results, only the most recent version of a given document should be shown. The trick is that each user has access to a different set of document versions, and each user should see only the most recent version of a document that they have access to. Is this something that can reasonably be solved with grouping? In 3.x? I haven't followed the grouping discussions closely: would someone point me in the right direction please? -- Michael Sokolov Engineering Director www.ifactory.com
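A minimal sketch of the grouped request discussed in this thread, assuming the grouping parameters described on the FieldCollapsing wiki page (`group`, `group.field`, `group.sort`, `group.ngroups`). The field names `documentId` and `version`, and the `allowedUsers` access-control field used in the filter query, are hypothetical; substitute whatever the real schema uses.

```python
from urllib.parse import urlencode


def latest_version_params(query, user):
    """Request one document per group, newest accessible version first."""
    return [
        ("q", query),
        ("fq", "allowedUsers:" + user),   # restrict to versions this user may see
        ("group", "true"),
        ("group.field", "documentId"),    # one group per logical document
        ("group.sort", "version desc"),   # newest version leads each group
        ("group.ngroups", "true"),        # total group count, at some extra cost
    ]


url = "http://localhost:8983/solr/select?" + urlencode(
    latest_version_params("contents:budget", "alice")
)
```

Because the access filter is applied before grouping, the first document in each group is the newest version that the given user is allowed to see, rather than the newest version overall.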
Re: German language specific problem (automatic Spelling correction, automatic Synonyms ?)
If you want to avoid re-indexing, you could consider building a synonym file that is generated using your rule set, and then using that to expand your queries. You'd need to get a list of all terms in your index and then process them to generate synonyms. Actually, I don't know how to get a list of all the terms without Java programming, though: is there a way? -Mike On 08/01/2011 12:35 PM, thomas wrote: Thanks Alexei, Thanks Paul, I played with the solr.PhoneticFilterFactory. Analysing my query in the Solr admin backend showed me how and that it is working. My major problem is that this filter needs to be applied to the index chain as well as to the query chain to generate matches for our search. We have a huge index at this point and I'm not really happy to reindex all content. Is there maybe a more subtle solution which works by manipulating the query chain only? Otherwise I need to back up the whole index and try to reindex overnight when CMS users are sleeping. I will have a look into the ColognePhonetic encoder. I'm just afraid I'll have to reindex the whole content there as well. Thomas
Re: Matching queries on a per-element basis against a multivalued field
You have a few choices: 1) flatten your field structure - like your "undesirable" example, but wouldn't you want to have the document identifier as a field value also? 2) use phrase queries to make sure the key/value pairs are adjacent 3) use a join query That's all I can think of -Mike On 08/01/2011 08:08 PM, Suk-Hyun Cho wrote: I'm sure someone asked this before, but I couldn't find a previous post regarding this. The problem: Let's say that I have a multivalued field called myFriends that tokenizes on whitespaces. Basically, I'm treating it like a List of Lists (attributes of friends): Document A: myFriends = [ "isCool=true SOME_JUNK_HERE gender=male bloodType=A" ] Document B: myFriends = [ "isCool=true SOME_JUNK_HERE gender=female bloodType=O", "isCool=false SOME_JUNK_HERE gender=male bloodType=AB" ] Now, let's say that I want to search for all the cool male friends I have. Naively, I can query q=myFriends:isCool=true+AND+myFriends:gender=male. However, this returns documents A and B, because the two criteria are tested against the entire collection, rather than against individual elements. I could work around this by not tokenizing on whitespaces and using wildcards: q=myFriends:isCool=true\ *\ gender=male but this becomes painful when the query becomes more complex. What if I wanted to find cool friends who are either type A or type O? I could do q=myFriends:(isCool=true\ *\ bloodType=A+OR+isCool=true\ *\ bloodType=O). And you can see that the number of criteria will just explode as queries get more complex. There are other methods that I've considered, such as duplicating documents for every friend, like so: Document A1: myFriend = [ "isCool=true", "gender=male", "bloodType=A" ] Document B1: myFriend = [ "isCool=true", "gender=female", "bloodType=O" ] Document B2: myFriend = [ "isCool=false", "gender=male", "bloodType=AB" ] but this would be less than desirable. 
I would like to hear any other ideas around solving this problem, but going back to the original question, is there a way to match multiple criteria on a per-item basis rather than against the entire multifield? -- View this message in context: http://lucene.472066.n3.nabble.com/Matching-queries-on-a-per-element-basis-against-a-multivalued-field-tp3217432p3217432.html Sent from the Solr - User mailing list archive at Nabble.com.
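The difference between matching against the whole multivalued field and matching per element can be sketched as follows. The functions are plain Python written for illustration, not Solr query syntax; the sample values are the ones from the question.

```python
doc_a = ["isCool=true SOME_JUNK_HERE gender=male bloodType=A"]
doc_b = [
    "isCool=true SOME_JUNK_HERE gender=female bloodType=O",
    "isCool=false SOME_JUNK_HERE gender=male bloodType=AB",
]


def field_level_match(values, criteria):
    """Every criterion appears somewhere in the field, possibly in different elements."""
    tokens = {token for value in values for token in value.split()}
    return all(criterion in tokens for criterion in criteria)


def per_element_match(values, criteria):
    """At least one single element satisfies every criterion."""
    return any(
        all(criterion in value.split() for criterion in criteria)
        for value in values
    )


cool_male = ["isCool=true", "gender=male"]
```

`field_level_match(doc_b, cool_male)` is true even though no single friend in Document B is both cool and male, which is the false positive described above; `per_element_match` accepts Document A only.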
Re: Strategies for sorting by array, when you can't sort by array?
Although you weren't very clear about it, it sounds as if you want the results to be sorted by a name that actually matched the query? In general that is not going to be easy, since it is not something that can be computed in advance and thus indexed. -Mike On 08/03/2011 10:39 AM, Olson, Ron wrote: Hi all- Well, this is a problem. I have a list of names as a multi-valued field and I am searching on this field and need to return the results sorted. I know from searching and reading the documentation (and getting the error) that sorting on a multi-valued field isn't possible. Okay, so, what I haven't found is any real good solution/workaround to the problem. I was wondering what strategies others have done to overcome this particular situation; collapsing the individual names into a single field with copyField doesn't work because the name searched may not be the first name in the field. Thanks for any hints/tips/tricks. Ron
solr-user@lucene.apache.org
I don't have any experience with DIH: maybe XPathEntityProcessor doesn't use a true XML parser? You might want to try passing your documents through "xmllint -noent" (basically parse and reserialize) - that should inline the characters as UTF-8? On 07/09/2012 03:18 PM, Michael Belenki wrote: Somebody any idea? Solr seems to ignore the DTD definition and therefore does not understand the entities likeü orä that are defined in dtd. Is it the problem? If yes how can I tell SOLR to consider the DTD definition? On Fri, 06 Jul 2012 10:58:59 +0200, Michael Belenki wrote: Dear community, I am experiencing strange problem while trying to index / to import XML document to SOLR via DataImportHandler. The XML document contains some special characters (e.g. german ü) that are represented as XML entities ü or ä. There is also DTD file that defines these entities () (I tried to use dtd file as well as to include the DTD definition to the xml itself). After I start the import command full-import, the import process throws an exception as soon as it tries to parse ü: "Un declared general entity "uuml". Did anyone already face such a problem? best regards, Michael My data-config for importing is: The XML file looks e.g. 
like this: ]> Marco Riccardi Solution of Cubic and Quartic Equations.&uuml; 117-122 2009 17 Formalized Mathematics 1-4 http://dx.doi.org/10.2478/v10037-009-0012-z db/journals/fm/fm17.html#Riccardi09

The stack-trace is:

05.07.2012 17:37:19 org.apache.solr.update.processor.LogUpdateProcessor finish
INFO: {deleteByQuery=*:*,add=[persons/Codd71a, persons/Hall74]} 0 1
05.07.2012 17:37:19 org.apache.solr.common.SolrException log
SCHWERWIEGEND: Full Import failed:java.lang.RuntimeException: java.lang.RuntimeException: org.apache.solr.handler.dataimport.DataImportHandlerException: Parsing failed for xml, url:documents/dblp.xml rows processed in this xml:2 last row in this xml:{title=Common Subexpression Identification in General Algebraic Systems., $forEach=/dblp/article, key=persons/Hall74} Processing Document # 3
    at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:264)
    at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:375)
    at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:445)
    at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:426)
Caused by: java.lang.RuntimeException: org.apache.solr.handler.dataimport.DataImportHandlerException: Parsing failed for xml, url:documents/dblp.xml rows processed in this xml:2 last row in this xml:{title=Common Subexpression Identification in General Algebraic Systems., $forEach=/dblp/article, key=persons/Hall74} Processing Document # 3
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:621)
    at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:327)
    at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:225)
    ... 3 more
Caused by: org.apache.solr.handler.dataimport.DataImportHandlerException: Parsing failed for xml, url:documents/dblp.xml rows processed in this xml:2 last row in this xml:{title=Common Subexpression Identification in General Algebraic Systems., $forEach=/dblp/article, key=persons/Hall74} Processing Document # 3
    at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:72)
    at org.apache.solr.handler.dataimport.XPathEntityProcessor$3.next(XPathEntityProcessor.java:504)
    at org.apache.solr.handler.dataimport.XPathEntityProcessor$3.next(XPathEntityProcessor.java:517)
    at org.apache.solr.handler.dataimport.EntityProcessorBase.getNext(EntityProcessorBase.java:120)
    at org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:225)
    at org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:204)
    at org.apache.solr.handler.dataimport.EntityProcessorWrapper.pullRow(EntityProcessorWrapper.java:330)
    at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:296)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:683)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:619)
    ... 5 more
Caused by: java.lang.RuntimeException: com.ctc.wstx.exc.WstxParsingException: Undeclared general entity "uuml" at [row,col {unknown-source}]: [26,42]
    at org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:187)
    at org.apache.solr.handler.dataimport
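
The "parse and reserialize" workaround suggested at the top of this message could be sketched roughly as follows. This is not xmllint itself but a minimal Python stand-in that uses the standard library's expat-based parser. It assumes the entity declarations sit in the document's internal DTD subset (the "DTD definition included in the XML itself" variant); as far as I know this parser does not load an external DTD file, so the external-file variant would still need xmllint or similar. The sample document and the function name inline_entities are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the entity is declared in the internal DTD subset,
# so the parser can expand it without reading any external file.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE dblp [
  <!ENTITY uuml "&#252;">
]>
<dblp><article key="example/1"><title>M&uuml;ller</title></article></dblp>
"""


def inline_entities(xml_text: str) -> str:
    """Parse the document and reserialize it with entities expanded.

    The DOCTYPE is dropped on output, which is harmless here because the
    output no longer contains any entity references that would need it.
    """
    root = ET.fromstring(xml_text.encode("utf-8"))
    return ET.tostring(root, encoding="unicode")


resolved = inline_entities(SAMPLE)
print(resolved)
```

The reserialized output contains the literal character instead of the entity reference, so a parser with no DTD support at all should be able to read it. Whether DIH then indexes it cleanly is something that would still have to be tried.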
solr-user@lucene.apache.org
I think the issue here is that DIH uses Woodstox "BasicStreamReader" (see http://woodstox.codehaus.org/3.2.9/javadoc/com/ctc/wstx/sr/BasicStreamReader.html), which has only minimal DTD support. It might be best to use ValidatingStreamReader (http://woodstox.codehaus.org/3.2.9/javadoc/com/ctc/wstx/sr/ValidatingStreamReader.html) instead. I think you could get this by requesting a validating XMLStreamReader; that's a setting exposed at the level of the factory that returns the parser (i.e. an XMLStreamReader). But then you would probably also get validation turned on, which might not be so great in all cases. There should probably be a user setting for this somewhere in XPathEntityProcessor? -Mike On 07/10/2012 07:10 PM, Chris Hostetter wrote: : Somebody any idea? Solr seems to ignore the DTD definition and therefore : does not understand the entities like &uuml; or &auml; that are defined in : dtd. Is it the problem? If yes how can I tell SOLR to consider the DTD : definition? Solr is just utilizing the built-in Java XML parser for this, so there's nothing you can tell Solr to make it "consider the DTD", but it is odd that this isn't working by default with Java's parser -- I suspect there is some "hint" XPathEntityProcessor should be giving the parser to ask it to look at these ENTITY declarations. I've filed a Jira issue to try and track this (and included a test case) but unfortunately I don't really know what the fix is... https://issues.apache.org/jira/browse/SOLR-3614 -Hoss