Re: is indexing single-threaded?
Multiple threads work well. If you are using SolrJ, check the StreamingUpdateSolrServer for an implementation that will keep X number of threads busy. Your mileage will vary, but in general I find a reasonable thread count is ~ (number of cores) + 1. On Wed, Sep 22, 2010 at 5:52 AM, Andy angelf...@yahoo.com wrote: Does Solr index data in a single thread or can data be indexed concurrently in multiple threads? Thanks Andy
Re: How can I delete the entire contents of the index?
<delete><query>*:*</query></delete> will leave you a fresh index. On Thu, Sep 23, 2010 at 12:50 AM, xu cheng xcheng@gmail.com wrote: <delete><query>the query that fetches the data you want to delete</query></delete> is what I did to delete my data. best regards 2010/9/23 Igor Chudov ichu...@gmail.com: Let's say that I added a number of elements to Solr (I use WebService::Solr as the interface to do so). Then I change my mind and want to delete them all. How can I delete all contents of the database, but leave the database itself, just empty? Thanks
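For reference, the full update messages you would POST to Solr's /update handler for a delete-all look like this (a sketch; the exact URL and core path vary by setup):

```xml
<!-- POST to http://localhost:8983/solr/update with Content-Type: text/xml -->
<delete><query>*:*</query></delete>

<!-- then commit, or the deletion stays invisible to searchers -->
<commit/>
```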
Re: Concurrent DB updates and delta import misses few records
Thanks for the pointer, Shawn. It definitely is useful. I am wondering if you could retrieve minDid from Solr rather than storing it externally. Max id from the Solr index and max id from the DB should define the lower and upper thresholds, respectively, of the delta range. Am I missing something? --shashi On Wed, Sep 22, 2010 at 6:47 PM, Shawn Heisey s...@elyograg.org wrote: On 9/22/2010 1:39 AM, Shashikant Kore wrote: Hi, I'm using DIH to index records from a database. After every update on the (MySQL) DB, Solr DIH is invoked for a delta import. In my tests, I have observed that if DB updates and DIH imports happen concurrently, the import misses a few records. Here is how it happens. The table has a column 'lastUpdated' which has a default value of the current timestamp. Many records are added to the database in a single transaction that takes several seconds. For example, if 10,000 rows are being inserted, the rows may get timestamp values from '2010-09-20 18:21:20' to '2010-09-20 18:21:26'. These rows become visible only after the transaction is committed. That happens at, say, '2010-09-20 18:21:30'. If a Solr import gets triggered at '18:21:29', it will use the timestamp of the last import for the delta query. This import will not see the records added in the aforementioned transaction, as the transaction was not committed at that instant. After this import, dataimport.properties will have the last index time as '18:21:29'. The next import will not be able to get all the rows of the previously referred transaction, as some of the rows have timestamps earlier than '18:21:29'. While I am testing extreme conditions, there is a possibility of missing out on some data. I could not find any solution in the Solr framework to handle this. The table has an auto-increment key, and all updates are deletes followed by inserts. So having last_indexed_id would have helped, where last_indexed_id is the max value of id fetched in that import. The query would then become SELECT id WHERE id > last_indexed_id.
I suppose Solr does not have any provision like this. Two options I could think of are: (a) Ensure at the application level that there are no concurrent DB updates and DIH import requests. (b) Use exclusive locking during DB updates. What is the best way to address this problem? Shashi, I was not solving the same problem, but perhaps you can adapt my solution to yours. My main problem was that I don't have a modified date in my database, and due to the size of the table, it is impractical to add one. Instead, I chose to track the database primary key (a simple autoincrement) outside of Solr and pass min/max values into DIH for it to use in the SELECT statement. You can see a simplified version of my entity here, with a URL showing how to send the parameters in via the dataimport GET: http://www.mail-archive.com/solr-user@lucene.apache.org/msg40466.html The update script that runs every two minutes gets MAX(did) from the database, retrieves the minDid from a file on an NFS share, and runs a delta-import with those two values. When the import is reported successful, it writes the maxDid value to the minDid file on the network share for the next run. If the import fails, it sends an alarm and doesn't update the minDid. Shawn
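Shawn's min/max approach can be sketched as a DIH entity that takes the bounds from request parameters (the minDid/maxDid names follow his example; the table and column names here are placeholders to adapt to your schema):

```xml
<entity name="item" pk="did"
        query="SELECT * FROM item
               WHERE did &gt; ${dataimporter.request.minDid}
                 AND did &lt;= ${dataimporter.request.maxDid}"/>
```

The update script would then call something like /solr/dataimport?command=full-import&clean=false&minDid=10000&maxDid=12000, and persist maxDid as the next minDid only after the import reports success.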
Re: Solr Reporting
Hi Adeel, I would use the first approach since it is more flexible and easier to use. Please consider the XsltResponseWriter, which allows you to transform the result set from Solr's default xml structure into a custom one using a provided xslt template. Myron 2010/9/23 Adeel Qureshi adeelmahm...@gmail.com This probably isnt directly a solr user type question but its close enough so I am gonna post it here. I have been using solr for a few months now and it works just out of this world so I definitely love the software (and obviously lucene too) .. but I feel that solr output xml is in kind of a weird format .. I mean its in a format that simply makes it difficult to plug solr output xml into any xml reading tool or api .. this whole concept of using <str name="id">123</str> instead of <id>123</id> doesnt make sense to me .. what I am trying to do now is setup a reporting system off of solr .. and the concept is simple .. let the user do all the searches, facets etc and once they have finalized on some results .. simply allow them to export those results in an excel or pdf file .. what I have setup right now is I simply let the export feature use the same solr query that the user used to search their results .. send that query to solr again and get all results back and simply iterate over the xml and dump all data in an excel file this has worked fine in most situations but I want to improve this process and specifically use jasper reports for reporting .. and I want to use ireport to design my report templates .. thats where the solr output xml format is causing problems .. as I cant figure out how to make it work with ireport because of solr xml not having any named nodes .. it all looks like the same nodes and ireport cant distinguish one column from another .. so I am thinking of a couple of solutions here and wanted to get some suggestions from you guys on how to do it best 1. receive solr output xml .. convert it to a more readable xml form ..
use named nodes instead of nodes by data type: <str name="id">123</str> <str name="title">xyz</str> becomes <id>123</id> <title>xyz</title> and then feed that to the jasper report template 2. use SolrJ to receive solr output in the NamedList resultset as it returns .. I havent tried this method so I am not sure how useful or easy to work with this NamedList structure is .. in this I would be feeding a Collection of NamedList items to jasper .. havent played around with this so not sure how well its gonna work out .. if you have tried something like this please let me know how it worked out for you I would appreciate absolutely any kind of comments on this Thanks Adeel
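As a concrete illustration of option 1, here is a minimal XSLT sketch, usable with the XsltResponseWriter Myron suggests or standalone, that renames each typed element (<str name="id">123</str> and so on) to a named node. It assumes every Solr field name is a legal XML element name and does not handle multivalued (arr) fields:

```xml
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/">
    <docs>
      <xsl:for-each select="response/result/doc">
        <doc>
          <!-- rename each typed element (str/int/...) after its name attribute,
               keeping the original type as an attribute -->
          <xsl:for-each select="*">
            <xsl:element name="{@name}">
              <xsl:attribute name="type">
                <xsl:value-of select="local-name()"/>
              </xsl:attribute>
              <xsl:value-of select="."/>
            </xsl:element>
          </xsl:for-each>
        </doc>
      </xsl:for-each>
    </docs>
  </xsl:template>
</xsl:stylesheet>
```

This produces output like <id type="str">123</id> <title type="str">xyz</title>, which gives iReport distinct node names to bind columns to.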
Re: Autocomplete: match words anywhere in the token
On Wed, 2010-09-22 at 20:14 +0200, Arunkumar Ayyavu wrote: Thanks for the responses. Now, I included the EdgeNGramFilter. But I get the following results when I search for "canon pixma": "Canon PIXMA MP500 All-In-One Photo Printer" and "Canon PowerShot SD500". As you can guess, I'm not expecting the 2nd result entry. Though I understand why I'm getting the 2nd entry, I don't know how to ask Solr to exclude it (I could filter it in my application though). :-( Looks like I should study more of Solr's capabilities to get the solution. This has not so much to do with autosuggest anymore? You put those quotes in to denote the search input, not to say that the search input was a phrase, I suppose. Searching for the phrase (quoted), only the first line should have been found. If you want hits returned that include most of the searched terms, and in the case of only two input terms both of them, you can configure such sophisticated rules with the http://wiki.apache.org/solr/DisMaxQParserPlugin Have a look at the mm parameter (Minimum Should Match). Chantal
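A sketch of the dismax setup Chantal describes; with mm set to 100%, a two-term input like canon pixma only matches documents containing both terms. The handler name and qf field here are placeholders:

```xml
<requestHandler name="/suggest" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="qf">name</str>
    <!-- require every query term to match -->
    <str name="mm">100%</str>
  </lst>
</requestHandler>
```

With a lower setting such as 75%, mm would tolerate one non-matching term out of four, which is often a better fit for autocomplete than strict phrase matching.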
Custom Sorting with function queries
I need to 'rank' the documents in a solr index based on some field values and the query. Is this possible using function queries? Two examples to illustrate what I am trying to achieve: 1. The index contains two fields, min_rooms and max_rooms, both integers, both optional. If I query the index for a value (rooms), I would like the documents that place this value between min and max to be ranked higher than those that don't. The smaller the difference between min and max is, the more exact a match the document is and the higher the document will be ranked. If either min or max or both are not specified, then the document gets a 'negative rank'. 2. The index contains a float field. If, and only if, the query contains a search for this field (field:1 or field:on), then the value of the field affects the ranking of the document. (1, on, yes, etc. can be solved with synonyms.) Lastly, once this custom ranking works, how do I switch off solr's built-in ranking calculations?
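One possible sketch for the first example, not a complete answer: a dismax bf (boost function) that uses the query() function to add score only when the requested room count falls inside [min_rooms, max_rooms]. The field names come from the question; the value 3 stands in for the user's rooms input and would be substituted by the application:

```text
defType=dismax
q=<user query>
bf=query($roomfit)
roomfit={!lucene}min_rooms:[* TO 3] AND max_rooms:[3 TO *]
```

Rewarding narrow min/max ranges or penalizing missing values would need additional functions layered on top, and switching off the built-in relevance entirely usually means sorting on a field or function instead of score.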
bi-grams for common terms - any analyzers do that?
Hi, I was going thru this LucidImagination presentation on analysis: http://www.slideshare.net/LucidImagination/analyze-this-tips-and-tricks-on-getting-the-lucene-solr-analyzer-to-index-and-search-your-content-right 1) on p. 31-33, it talks about forming bi-grams for the 32 most common terms during indexing. Is there an analyzer that does that? 2) on p. 34, it mentions that the default Solr configuration would turn L'art into the phrase query "L art", but it is much more efficient to turn it into a single token 'L art'. Which analyzer would do that? Thanks. Andy
Re: Solr Reporting
keep in mind that the <str name="id"> paradigm isn't completely useless; the str is a data type (string), and it can be int, float, double, date, and others. So to not lose any information you may want to do something like: <id type="int">123</id> <title type="str">xyz</title> Which I agree makes more sense to me. The name of the field is more important than its datatype, but I don't want to lose track of the data type. Ken -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-Reporting-tp1565271p1567604.html Sent from the Solr - User mailing list archive at Nabble.com.
RE: bi-grams for common terms - any analyzers do that?
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.CommonGramsFilterFactory -Original Message- From: Andy [mailto:angelf...@yahoo.com] Sent: Thursday, September 23, 2010 6:05 AM To: solr-user@lucene.apache.org Subject: bi-grams for common terms - any analyzers do that?
Re: How can I delete the entire contents of the index?
Quick tangent... I went to the link you provided, and the delete part makes sense. But the next tip, how to re-index after a schema change: what is the point of step 5, "Send an <optimize/> command"? Why do you need to optimize an empty index? Or is my understanding of Optimize incorrect?
Re: Searches with a period (.) in the query
Do you have any other Analyzers or Formatters involved? I use delimiters in certain string fields all the time, usually a colon : or slash /, but it should be the same for a period. I've never seen this behavior. But if you have any kind of tokenizer or formatter involved beyond <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/> then you may be introducing something extra to the party. What does your fieldType definition look like?
RE: How can I delete the entire contents of the index?
Because even after you've deleted every document from the index, there are still actually index _files_ on disk taking up space. Lucene organizes its files for quick access, and a consequence of this is that deleting a document does not necessarily reclaim the disk space. Optimize will reclaim that disk space. For deleting ALL documents in your index there's actually a shortcut though. Delete the entire solr 'data' directory and restart Solr; Solr will recreate the data directory with starter index files. (Note you have to delete the directory itself; if you just delete all the files inside it, Solr will get unhappy.) I am somewhat suspicious of doing this and would never do it on a production index, but for just development playing around, where it's not that disastrous if something goes wrong, it's a lot lot quicker than an actual delete command followed by an optimize. From: kenf_nc [ken.fos...@realestate.com] Sent: Thursday, September 23, 2010 8:22 AM To: solr-user@lucene.apache.org Subject: Re: How can I delete the entire contents of the index?
RE: bi-grams for common terms - any analyzers do that?
I've been thinking about the CommonGramsFilter for a while, and am confused about how it works. Can anyone provide examples? Are you meant to include the analyzer at both index and query time? The description on the wiki says, among other things: "The CommonGramsQueryFilter converts the phrase query 'the cat' into the single term query 'the_cat'." -- does that mean it _only_ works on phrase queries? If you've indexed with commongrams, what will happen at query time to a non-phrase query the cat? Very confused. From: Steven A Rowe [sar...@syr.edu] Sent: Thursday, September 23, 2010 8:21 AM To: solr-user@lucene.apache.org Subject: RE: bi-grams for common terms - any analyzers do that? http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.CommonGramsFilterFactory
Re: Xpath extract element name
Great. XSL worked like a charm! Thx lots.
Re: How can I delete the entire contents of the index?
Lucene has an API for very fast deletion of the index (ie, it removes the files): IndexWriter.deleteAll(). It's part of the transaction, ie, you still must call .commit() to make the change visible to external readers. But I don't know whether this is exposed in Solr... Mike On Thu, Sep 23, 2010 at 8:50 AM, Jonathan Rochkind rochk...@jhu.edu wrote: Because even after you've deleted every document from the index, there are still actually index _files_ on disk taking up space. [...]
Re: Solr Reporting
Thank you for your suggestions .. makes sense and I didn't know about the XsltResponseWriter .. that opens up the door to all kinds of possibilities .. so its great to know about that but before I go that route .. what about performance .. In the Solr Wiki it mentions that XSLT transformation isn't so bad in terms of memory usage but I guess its all relative to the amount of data and obviously system resources .. my data set will be around 15,000 to 30,000 records at the most .. I do have about 30 some fields but all fields are either small strings (less than 500 chars) or dates, ints, booleans etc .. so should I be worried about performance problems while doing the XSLT translations .. secondly for reports I'll have to request solr to send all 15,000 some records at the same time to be entered into report output files .. is there a way to kind of stream that process .. well I think Solr native xml is already streamed to you but sounds like for the translation it will have to load the whole thing in RAM .. and again what about SolrJ .. isn't that supposed to provide better performance since its in java .. well I guess it shouldn't be much different since it also uses HTTP calls to communicate with Solr .. Thanks for your help Adeel On Thu, Sep 23, 2010 at 7:16 AM, kenf_nc ken.fos...@realestate.com wrote: keep in mind that the <str name="id"> paradigm isn't completely useless [...]
Re: How can I delete the entire contents of the index?
: Lucene has an API for very fast deletion of the index (ie, it removes : the files): IndexWriter.deleteAll(). It's part of the transaction, ... : But I don't know whether this is exposed in Solr... Solr definitely has optimized the delete *:* case (but i don't know if it's using that specific method). I believe the poster is getting confused because immediately following this FAQ... http://wiki.apache.org/solr/FAQ#How_can_I_delete_all_documents_from_my_index.3F which says to use <delete><query>*:*</query></delete> and specifically notes: "This has been optimized to be more efficient than deleting by some arbitrary query which matches all docs because of the nature of the data." ...was this FAQ... http://wiki.apache.org/solr/FAQ#How_can_I_rebuild_my_index_from_scratch_if_I_change_my_schema.3F ...which until a moment ago gave outdated advice. -Hoss -- http://lucenerevolution.org/ ... October 7-8, Boston http://bit.ly/stump-hoss ... Stump The Chump!
Re: Searches with a period (.) in the query
Hey Ken, The fieldType definition that I am using is: <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/> Thanks, Sid On Thu, Sep 23, 2010 at 5:29 AM, kenf_nc ken.fos...@realestate.com wrote: Do you have any other Analyzers or Formatters involved? [...]
RE: bi-grams for common terms - any analyzers do that?
Hi all, The CommonGrams filter is designed to only work on phrase queries. It is designed to solve the problem of slow phrase queries with phrases containing common words, when you don't want to use stop words. It would not make sense for Boolean queries; Boolean queries just get passed through unchanged. For background on the CommonGramsFilter please see: http://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-words-part-2 There are two filters, CommonGramsFilter and CommonGramsQueryFilter: you use CommonGramsFilter at indexing time and CommonGramsQueryFilter for query processing. CommonGramsFilter outputs both CommonGrams and unigrams so that Boolean queries (i.e. non-phrase queries) will work. For example "the rain" would produce 3 tokens: "the" (position 1), "rain" (position 2), "the-rain" (position 1). When you have a phrase query, you want Solr to search for the token "the-rain", so you don't want the unigrams. When you have a Boolean query, the CommonGramsQueryFilter only gets one token as input and simply outputs it. Appended below is a sample config from our schema.xml. For background on the problem with l'art please see: http://www.hathitrust.org/blogs/large-scale-search/tuning-search-performance We used a custom filter to change all punctuation to spaces. You could probably use one of the other filters to do this. (See the comments from David Smiley at the end of the blog post regarding possible approaches.) At the time, I just couldn't get WordDelimiterFilter to behave as documented with various combinations of parameters and was not aware of the other filters David mentions. The problem with l'art is actually due to a bug or feature in the QueryParser. Currently the QueryParser interacts with the token chain and decides whether the tokens coming back from a token filter should be treated as a phrase query based on whether or not more than one non-synonym token comes back from the token stream for a single 'queryparser token'.
It also splits on whitespace, which causes all CJK queries to be treated as phrase queries regardless of the CJK tokenizer you use. This is a contentious issue. See https://issues.apache.org/jira/browse/LUCENE-2458. There is a semi-workaround using PositionFilter, but it has many undesirable side effects. I believe Robert Muir, who is an expert on the various problems involved and opened LUCENE-2458, is working on a better fix. Tom Burton-West http://www.hathitrust.org/blogs/large-scale-search

<fieldType name="CommonGramTest" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="ISOLatin1AccentFilterFactory"/>
    <filter class="solr.PunctuationFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.CommonGramsFilterFactory" words="new400common.txt"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="ISOLatin1AccentFilterFactory"/>
    <filter class="solr.PunctuationFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.CommonGramsQueryFilterFactory" words="new400common.txt"/>
  </analyzer>
</fieldType>
Re: is indexing single-threaded?
I was kind of wondering what magic had been done to achieve multiple writing to the index file :-) BTW, wouldn't it be possible to have separate segments per thread? Set up the index with a minimum (desired?) segment count, and write each individually? Is there any organization in the segments? Or can adjacent data be found in different segments? I seem to remember that the new stuff gets committed to its own segment until some sort of 'consolidate' command takes place. Dennis Gearon Signature Warning EARTH has a Right To Life, otherwise we all die. Read 'Hot, Flat, and Crowded' Laugh at http://www.yert.com/film.php --- On Thu, 9/23/10, Jan Høydahl / Cominvent jan@cominvent.com wrote: SolrJ threads speed up feeding throughput. But building the index is still single threaded (per core), isn't it? Don't know about analysis. But you cannot have two threads write to the same file... -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com On 23. sep. 2010, at 08.01, Ryan McKinley wrote: Multiple threads work well. If you are using solrj, check the StreamingSolrServer for an implementation that will keep X number of threads busy. [...]
Re: matches in result grouping
(10/09/23 18:14), Koji Sekiguchi wrote: I'm using the recently committed field collapsing / result grouping feature in trunk. I'm confused by the matches parameter in the result at the second sample output of the Wiki: http://wiki.apache.org/solr/FieldCollapsing#Quick_Start I cannot understand why there are two matches:5 entries in the result. Can anyone explain it? Probably multiple GroupCollectors are generated for each of group.field, group.func and group.query, and matches can be counted per collector. Koji -- http://www.rondhuit.com/en/
Re: matches in result grouping
2010/9/23 Koji Sekiguchi k...@r.email.ne.jp: Probably multiple GroupCollectors are generated for each of group.field, group.func and group.query, and matches can be counted per collector. Correct. The matches is the doc count before any grouping (and for group.query that means before the restriction given by group.query is applied). It won't always be the same though - for example we might implement filter excludes like we do with faceting, etc. -Yonik http://lucenerevolution.org Lucene/Solr Conference, Boston Oct 7-8
Re: Solr Reporting
Hi, Are you going to generate a report with 30,000 records in it? That will be a very large report - will anyone really want to read through that? If you want/need 'summary' reports - i.e. stats on the 30k records, it is much more efficient to set up faceting and/or server-side analysis to do this, rather than download 30,000 records to a client and then do statistical analysis on the result. It will take a while to stream 30,000 records over an http connection, and, if you're building, say, a PDF table for 30k records, that will take some time as well. Server-side analysis, then just sending the results, will work better, if that fits your remit for reporting. Peter On Thu, Sep 23, 2010 at 4:14 PM, Adeel Qureshi adeelmahm...@gmail.com wrote: Thank you for your suggestions .. makes sense and I didn't know about the XsltResponseWriter .. [...]
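For the summary-report route Peter suggests, faceting and the StatsComponent can do the aggregation server-side so only the summary travels over HTTP; a sketch (the category and price fields are placeholders, and stats.field needs a numeric or date field):

```text
http://localhost:8983/solr/select?q=*:*&rows=0
    &facet=true&facet.field=category
    &stats=true&stats.field=price
```

rows=0 keeps the 30k documents out of the response entirely; the reply contains only the facet counts and the min/max/sum/mean style statistics.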
Re: bi-grams for common terms - any analyzers do that?
On Thu, Sep 23, 2010 at 12:02 PM, Burton-West, Tom tburt...@umich.edu wrote: The problem with l'art is actually due to a bug or feature in the QueryParser. Currently the QueryParser interacts with the token chain and decides whether the tokens coming back from a token filter should be treated as a phrase query based on whether or not more than one non-synonym token comes back from the token stream for a single 'queryparser token'. Just a note: in solr's trunk or 3x branch you have a lot more flexibility already with this stuff: 1. for the specific problem of l'art: you can use the ElisionFilterFactory, it's actually designed to address this. But before it was a bit unwieldy to use (you had to supply your own list of french contractions: l', m', etc): with trunk or 3x you can just add it to your analyzer, and if you don't specify a list it uses the default list from Lucene's FrenchAnalyzer. 2. if you are using WordDelimiterFilter, you can customize how it splits on a per-character basis. See https://issues.apache.org/jira/browse/SOLR-2059 , a user gave a nice example there of how you can treat '#' and '@' specially for twitter messages. 3. in all cases, if you don't want phrase queries automatically formed unless the user put them in quotes, you can turn it off in your fieldtype: <fieldType name="text" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="false"> (somewhat related) Tom, thanks for posting your schema. Given your problems with huge amounts of terms, I looked at your previous messages, ran some quick math, and guesstimated your average term length must be quite large. Yet I notice from your website ( http://www.hathitrust.org/visualizations_languages ) it says you have 18,329 Thai books (and you have no ThaiWordFilter in your schema). Are you sure that your terms are not filled with tons of very long untokenized Thai sentences? (Thai uses no spaces between words) just an idea :) -- Robert Muir rcm...@gmail.com
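Putting Robert's points 1 and 3 together, a fieldtype sketch for the trunk/3x versions he mentions (the tokenizer choice and filter order here are illustrative, not a recommendation):

```xml
<fieldType name="text_fr" class="solr.TextField"
           positionIncrementGap="100"
           autoGeneratePhraseQueries="false">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- strips French contractions such as l', m';
         with no articles list given, the Lucene default list is used -->
    <filter class="solr.ElisionFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With this, l'art indexes as the single token "art", and unquoted multi-token input no longer silently becomes a phrase query.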
Re: Issue with Solr Boosting
I think if you don't need to add more categories, just increasing the boost factor of Electronics would work. As you said, because of the DocFreq of Mobile Phones, the scoring algorithm is working as expected. On Thu, Sep 23, 2010 at 3:42 PM, Jayant Patil jayan...@peopleinteractive.in wrote: Hi, We are using Solr for our searches. We are facing issues while applying boost on particular fields. E.g. We have a field Category, which contains values like Electronics, Computers, Home Appliances, Mobile Phones etc. We want to boost the categories Electronics and Mobile Phones, so we are using the following query: (category:Electronics^2 OR category:"Mobile Phones"^1 OR category:[* TO *]^0) The results are unexpected as the category Mobile Phones gets more boost than Electronics even though we are specifying the boost factors 2 for Electronics and 1 for Mobile Phones respectively. On debugging we found that DocFreq is manipulating the scores and hence affecting the overall boost. The no. of docs for Mobile Phones is much lower than that for Electronics and Solr is giving a higher score to Mobile Phones for this reason. Please suggest a solution. Regards, Jayant People Interactive DISCLAIMER and CONFIDENTIALITY CAUTION This email and any files transmitted with it are confidential and intended solely for the use of the individual or entity to whom they are addressed. Unauthorized reading, dissemination, distribution or copying of this communication is prohibited. If you are not the intended recipient, any disclosure, copying, distribution or any action taken or omitted to be taken in reliance on it, is prohibited and may be unlawful. If you have received this communication in error, please notify us immediately and promptly destroy the original communication. Thank you for your cooperation. Please note that any views or opinions presented in this email are solely those of the author and may not necessarily represent those of the company.
Communicating through email is not secure and capable of interception, corruption and delays. Anyone communicating with People Interactive (I) Private Limited by email accepts the risks involved and their consequences. The recipient should check this email and any attachments for the presence of viruses. People Interactive (I) Private Limited accepts no liability for any damage caused by any virus transmitted by this email.
Re: Searches with a period (.) in the query
Siddharth, did you check tokenizer and filter behaviour from the ../admin/analysis.jsp page? That would be quite informative to you. On Thu, Sep 23, 2010 at 6:42 PM, Siddharth Powar powar.siddha...@gmail.com wrote: Hey Ken, The fieldType definition that I am using is: <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/> Thanks, Sid On Thu, Sep 23, 2010 at 5:29 AM, kenf_nc ken.fos...@realestate.com wrote: Do you have any other Analyzers or Formatters involved? I use delimiters in certain string fields all the time. Usually a colon : or slash / but it should be the same for a period. I've never seen this behavior. But if you have any kind of tokenizer or formatter involved beyond <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/> then you may be introducing something extra to the party. What does your fieldType definition look like? -- View this message in context: http://lucene.472066.n3.nabble.com/Searches-with-a-period-in-the-query-tp1564780p1567666.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr Reporting
Hi Peter I understand what you are saying but I think you are thinking more of report as graph and analysis and summary kind of data .. for my reports I do need to include all records that qualify certain criteria .. e.g. a listing of all orders placed in last 6 months .. now that could be 10,000 orders and yes I will probably need a report that summarizes all that data but at the same time .. I need all those 10,000 records to be exported in an excel file .. those are the reports that I am talking about .. and 30,000 probably is a stretch .. it might be 10-15000 at the most but I guess it's still the same idea .. and yes I realize that it's a lot of data to be transferred over http .. but that's exactly why I am asking for suggestions on how to do it .. I find it hard to believe that this is an unusual requirement .. I think most companies do reports that dump all records from databases in excel files .. so again to clarify I definitely need reports that present statistics and averages and yes I will be using facets and all kinds of stuff there and I am not so concerned about those reports because like you pointed out, for those reports there will be very little data transfer but it's the full data dump reports that I am trying to figure out the best way to handle. Thanks for your help Adeel On Thu, Sep 23, 2010 at 11:43 AM, Peter Sturge peter.stu...@gmail.com wrote: Hi, Are you going to generate a report with 30,000 records in it? That will be a very large report - will anyone really want to read through that? If you want/need 'summary' reports - i.e. stats on the 30k records, it is much more efficient to set up faceting and/or server-side analysis to do this, rather than download 30,000 records to a client, then do statistical analysis on the result. It will take a while to stream 30,000 records over an http connection, and, if you're building, say, a PDF table for 30k records, that will take some time as well. 
Server-side analysis, then just sending the results, will work better, if that fits your remit for reporting. Peter On Thu, Sep 23, 2010 at 4:14 PM, Adeel Qureshi adeelmahm...@gmail.com wrote: Thank you for your suggestions .. makes sense and I didn't know about the XsltResponseWriter .. that opens up the door to all kinds of possibilities .. so it's great to know about that but before I go that route .. what about performance .. In the Solr Wiki it mentions that XSLT transformation isn't so bad in terms of memory usage but I guess it's all relative to the amount of data and obviously system resources .. my data set will be around 15000 - 30'000 records at the most .. I do have about 30 some fields but all fields are either small strings (less than 500 chars) or dates, ints, booleans etc .. so should I be worried about performance problems while doing the XSLT translations .. secondly for reports I'll have to request solr to send all 15000 some records at the same time to be entered in report output files .. is there a way to kind of stream that process .. well I think Solr native xml is already streamed to you but it sounds like for the translation it will have to load the whole thing in RAM .. and again what about SolrJ .. isn't that supposed to provide better performance since it's in java .. well I guess it shouldn't be much different since it also uses HTTP calls to communicate with Solr .. Thanks for your help Adeel On Thu, Sep 23, 2010 at 7:16 AM, kenf_nc ken.fos...@realestate.com wrote: keep in mind that the <str name="id"> paradigm isn't completely useless, the str is a data type (string), it can be int, float, double, date, and others. So as to not lose any information you may want to do something like: <id type="int">123</id> <title type="str">xyz</title> Which I agree makes more sense to me. The name of the field is more important than its datatype, but I don't want to lose track of the data type. 
Ken -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-Reporting-tp1565271p1567604.html Sent from the Solr - User mailing list archive at Nabble.com.
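A minimal sketch of the XsltResponseWriter route discussed above. The stylesheet name excel.xsl and the id/title field names are made-up examples; the file would go in the core's conf/xslt/ directory:

```xml
<!-- conf/xslt/excel.xsl (hypothetical): emit one tab-separated line per doc -->
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:template match="/">
    <xsl:for-each select="response/result/doc">
      <xsl:value-of select="str[@name='id']"/>
      <xsl:text>&#9;</xsl:text>
      <xsl:value-of select="str[@name='title']"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

A request such as /solr/select?q=*:*&rows=30000&wt=xslt&tr=excel.xsl would then return the flat file directly, ready to open in Excel.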
Calgary Solr Consultant?
Hi, I'm looking for a Solr expert local to Calgary, Alberta to help us jumpstart a search project. Ryan Courtnage PS: apologies if this is the wrong list for this type of request.
Grouping in solr ?
Hi all, is it possible somehow to group documents? I have services as documents, and I would like to show the filtered services grouped by company. So I filter services by given criteria, but I show the results grouped by company. If I got 1000 services, maybe I need to show just 100 companies (this will affect pagination as well), and how could I get the company info? Should I store the company info in each service (I don't need the company info to be indexed)? regards, Rich __ Information from ESET NOD32 Antivirus, version of virus signature database 5419 (20100902) __ The message was checked by ESET NOD32 Antivirus. http://www.eset.com
RE: Grouping in solr ?
http://wiki.apache.org/solr/FieldCollapsing https://issues.apache.org/jira/browse/SOLR-236 -Original message- From: Papp Richard ccode...@gmail.com Sent: Thu 23-09-2010 21:29 To: solr-user@lucene.apache.org; Subject: Grouping in solr ? Hi all, is it possible somehow to group documents? I have services as documents, and I would like to show the filtered services grouped by company. So I filter services by given criteria, but I show the results grouped by company. If I got 1000 services, maybe I need to show just 100 companies (this will affect pagination as well), and how could I get the company info? Should I store the company info in each service (I don't need the company info to be indexed)? regards, Rich __ Information from ESET NOD32 Antivirus, version of virus signature database 5419 (20100902) __ The message was checked by ESET NOD32 Antivirus. http://www.eset.com
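For reference, on trunk the FieldCollapsing wiki describes requests along these lines (the company_id field is an assumption — each service document would carry its company's id, and stored company fields come back with each group's documents):

```
/solr/select?q=*:*&group=true&group.field=company_id&group.limit=1&rows=100
```

In the grouped response, rows then counts groups (companies) rather than individual services, which is what makes the pagination work out.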
Re: Solr Reporting
Yes, that makes sense. So, more of a bulk data export requirement. If the excel data doesn't have to go out on the web, you could export to a local file (using a local solrj streamer), then publish it, which might save some external http bandwidth if that's a concern. We do this all the time using a local solrj client, so if you've got a big data stream (e.g. an entire core), you don't have to send it through your outward-facing web servers. Using a replica to retrieve/export the data might be worth considering as well. On Thu, Sep 23, 2010 at 7:21 PM, Adeel Qureshi adeelmahm...@gmail.com wrote: Hi Peter I understand what you are saying but I think you are thinking more of report as graph and analysis and summary kind of data .. for my reports I do need to include all records that qualify certain criteria .. e.g. a listing of all orders placed in last 6 months .. now that could be 10,000 orders and yes I will probably need a report that summarizes all that data but at the same time .. I need all those 10,000 records to be exported in an excel file .. those are the reports that I am talking about .. and 30,000 probably is a stretch .. it might be 10-15000 at the most but I guess it's still the same idea .. and yes I realize that it's a lot of data to be transferred over http .. but that's exactly why I am asking for suggestions on how to do it .. I find it hard to believe that this is an unusual requirement .. I think most companies do reports that dump all records from databases in excel files .. so again to clarify I definitely need reports that present statistics and averages and yes I will be using facets and all kinds of stuff there and I am not so concerned about those reports because like you pointed out, for those reports there will be very little data transfer but it's the full data dump reports that I am trying to figure out the best way to handle. 
Thanks for your help Adeel On Thu, Sep 23, 2010 at 11:43 AM, Peter Sturge peter.stu...@gmail.com wrote: Hi, Are you going to generate a report with 30,000 records in it? That will be a very large report - will anyone really want to read through that? If you want/need 'summary' reports - i.e. stats on the 30k records, it is much more efficient to set up faceting and/or server-side analysis to do this, rather than download 30,000 records to a client, then do statistical analysis on the result. It will take a while to stream 30,000 records over an http connection, and, if you're building, say, a PDF table for 30k records, that will take some time as well. Server-side analysis, then just sending the results, will work better, if that fits your remit for reporting. Peter On Thu, Sep 23, 2010 at 4:14 PM, Adeel Qureshi adeelmahm...@gmail.com wrote: Thank you for your suggestions .. makes sense and I didn't know about the XsltResponseWriter .. that opens up the door to all kinds of possibilities .. so it's great to know about that but before I go that route .. what about performance .. In the Solr Wiki it mentions that XSLT transformation isn't so bad in terms of memory usage but I guess it's all relative to the amount of data and obviously system resources .. my data set will be around 15000 - 30'000 records at the most .. I do have about 30 some fields but all fields are either small strings (less than 500 chars) or dates, ints, booleans etc .. so should I be worried about performance problems while doing the XSLT translations .. secondly for reports I'll have to request solr to send all 15000 some records at the same time to be entered in report output files .. is there a way to kind of stream that process .. well I think Solr native xml is already streamed to you but it sounds like for the translation it will have to load the whole thing in RAM .. and again what about SolrJ .. isn't that supposed to provide better performance since it's in java .. 
well I guess it shouldn't be much different since it also uses HTTP calls to communicate with Solr .. Thanks for your help Adeel On Thu, Sep 23, 2010 at 7:16 AM, kenf_nc ken.fos...@realestate.com wrote: keep in mind that the <str name="id"> paradigm isn't completely useless, the str is a data type (string), it can be int, float, double, date, and others. So as to not lose any information you may want to do something like: <id type="int">123</id> <title type="str">xyz</title> Which I agree makes more sense to me. The name of the field is more important than its datatype, but I don't want to lose track of the data type. Ken -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-Reporting-tp1565271p1567604.html Sent from the Solr - User mailing list archive at Nabble.com.
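For the full-dump case, one hedged alternative to pulling everything in a single response is to page through the result set with start/rows and append each page to the output file (parameter values are illustrative; sorting on a stable field keeps the pages consistent):

```
/solr/select?q=*:*&sort=id+asc&start=0&rows=1000
/solr/select?q=*:*&sort=id+asc&start=1000&rows=1000
```

The same loop works from SolrJ by incrementing the start parameter, which keeps per-request memory bounded on both client and server.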
Re: Autocomplete: match words anywhere in the token
This works with _one_ entry per document, right? If you've actually found a clever trick to use this technique when you have more than one entry for auto-suggest per document, do let me know. Cause I haven't been able to come up with one. Jonathan Chantal Ackermann wrote: What works very well for me: 1.) Keep the tokenized field (KeywordTokenizerFilter, WordDelimiterFilter) (like you described you had) 2.) create an additional field that uses the String type with the same content (use copyField to fill either) 3.) use facet.prefix instead of terms.prefix for searching the suggestions 4.) to your query also add the String field as a facet, and return the results from that field as the suggestion list. They will include the complete String "canon pixma mp500" for example. The other field can only return facets based on tokens. You probably never want those as facets. So your query was alright and the canon (2) facet count probably is the two occurrences that you listed, but as the field was tokenized, only tokens would be returned as facets. You need to have an additional field of pure String type to get the complete value back as a facet. In general, it worked out fine for me to create String fields as return values for facets while using the tokenized fields for searching and the actual facet queries. Cheers, Chantal On Wed, 2010-09-22 at 16:39 +0200, Jason Rutherglen wrote: This may be what you're looking for. http://www.lucidimagination.com/blog/2009/09/08/auto-suggest-from-popular-queries-using-edgengrams/ On Wed, Sep 22, 2010 at 4:41 AM, Arunkumar Ayyavu arunkumar.ayy...@gmail.com wrote: It's been over a week since I started learning Solr. Now, I'm using the electronics store example to explore the autocomplete feature in Solr. When I send the query terms.fl=name&terms.prefix=canon to the terms request handler, I get the following response <lst name="terms"> <lst name="name"> <int name="canon">2</int> </lst> </lst> But I expect the following results in the response. 
canon pixma mp500 all-in-one photo printer canon powershot sd500 So, I changed the schema for the textgen fieldType to use KeywordTokenizerFactory and also removed WordDelimiterFilterFactory. That gives me the expected result. Now, I also want Solr to return canon pixma mp500 all-in-one photo printer when I send the query terms.fl=name&terms.prefix=pixma. Could you gurus help me get the expected result? BTW, I couldn't quite understand the behavior of terms.lower and terms.upper (I tried these with the electronics store example). Could you also help me understand these 2 query fields? Thanks. -- Arun
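Chantal's recipe above can be sketched as a single request: query the tokenized field, then facet on the String copy (the name_exact field is an assumed copyField target of pure String type holding the full product name — both names are illustrative):

```
/solr/select?q=name:pixma&rows=0&facet=true&facet.field=name_exact
```

The facet counts over name_exact are then complete values such as canon pixma mp500 all-in-one photo printer, usable directly as the suggestion list.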
Range query not working
I have this in my query: q=*:*&facet.query=location_rating_total:[3 TO 100] And this document: <result name="response" numFound="6" start="0" maxScore="1.0"> <doc> <float name="score">1.0</float> <str name="id">1</str> <int name="location_rating_total">2</int> </doc> But still my total results equals 6 (the total population) and not 0 as I would expect. Why? -- View this message in context: http://lucene.472066.n3.nabble.com/Range-query-not-working-tp1570324p1570324.html Sent from the Solr - User mailing list archive at Nabble.com.
Search a URL
Is there a tokenizer that will allow me to search for parts of a URL? For example, the search google would match on the data "http://mail.google.com/dlkjadf". This tokenizer factory doesn't seem to be sufficient:

<fieldType name="text_standard" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
  </analyzer>
</fieldType>

Thanks.
Re: Range query not working
On Thu, Sep 23, 2010 at 4:30 PM, PeterKerk vettepa...@hotmail.com wrote: I have this in my query: q=*:*&facet.query=location_rating_total:[3 TO 100] And this document: <result name="response" numFound="6" start="0" maxScore="1.0"> <doc> <float name="score">1.0</float> <str name="id">1</str> <int name="location_rating_total">2</int> </doc> But still my total results equals 6 (the total population) and not 0 as I would expect. Why? facet.query will give you the number of docs matching location_rating_total:[3 TO 100], it does not restrict the results list. If you want that, you want a filter. Try q=*:*&fq=location_rating_total:[3 TO 100] -Yonik http://lucenerevolution.org Lucene/Solr Conference, Boston Oct 7-8
Re: Range query not working
Forgot to mention..I tried that too already. So when I have: location_rating_total:[0 TO 100] It shows only the location for which the location_rating_total is EXACTLY 0...locations that have location_rating_total value of 2 are NOT included. Any other suggestions? -- View this message in context: http://lucene.472066.n3.nabble.com/Range-query-not-working-tp1570324p1570502.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Search a URL
LetterTokenizerFactory will use each contiguous sequence of letters and discard the rest. http, https, com, etc. would need to be stopwords. Alternatively you can try PatternTokenizerFactory with a regular expression if you are looking for a specific part of the URL. On Sep 23, 2010, at 10:59 PM, Max Lynch wrote: Is there a tokenizer that will allow me to search for parts of a URL? For example, the search google would match on the data "http://mail.google.com/dlkjadf". This tokenizer factory doesn't seem to be sufficient:

<fieldType name="text_standard" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
  </analyzer>
</fieldType>

Thanks.
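A minimal sketch of the PatternTokenizerFactory suggestion (the field type name and pattern are illustrative; group="-1" tells the factory to use the pattern as a delimiter, so any run of non-alphanumeric characters splits the URL into tokens like http, mail, google, com):

```xml
<fieldType name="text_url" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- split on anything that is not a letter or digit -->
    <tokenizer class="solr.PatternTokenizerFactory" pattern="[^A-Za-z0-9]+" group="-1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- a StopFilterFactory could then drop scheme tokens such as http/https -->
  </analyzer>
</fieldType>
```

A search for google would then match http://mail.google.com/dlkjadf, since google is indexed as its own token.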
Re: Range query not working
This is the field in my schema.xml: <field name="location_rating_total" type="integer" indexed="true" stored="true"/> Also in the response it clearly shows: <int name="location_rating_total">0</int> What else can I do? -- View this message in context: http://lucene.472066.n3.nabble.com/Range-query-not-working-tp1570324p1570580.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Generating a sitemap
Hi all, Hate to bring forward a zombified thread (Mar 2010 though, not too bad), but I also am tasked to generate a sitemap for items indexed in a Solr index. Been at this job for only a few weeks, so Solr and Lucene are all new to me, but I think my path forward on this is to create a requesthandler that creates a flat datafile upon request, then program a script (Php) that calls this request, reformats the data into the appropriate xml format, then posts it for Google to find and crawl. Attach this script to a crontab item (daily, weekly, whatever schedule the Google Webmaster Tools has set for the site), and Boom! Problem solved. Anyone else try this method? Any successes, failures, advice, etc? Dave -- View this message in context: http://lucene.472066.n3.nabble.com/Generating-a-sitemap-tp478346p1570641.html Sent from the Solr - User mailing list archive at Nabble.com.
RE: Grouping in solr ?
thank you! this is really helpful. just tried it and it's amazing. do you know how trustworthy a nightly build version (solr4) is? Rich -Original Message- From: Markus Jelsma [mailto:markus.jel...@buyways.nl] Sent: Thursday, September 23, 2010 22:38 To: solr-user@lucene.apache.org Subject: RE: Grouping in solr ? http://wiki.apache.org/solr/FieldCollapsing https://issues.apache.org/jira/browse/SOLR-236 -Original message- From: Papp Richard ccode...@gmail.com Sent: Thu 23-09-2010 21:29 To: solr-user@lucene.apache.org; Subject: Grouping in solr ? Hi all, is it possible somehow to group documents? I have services as documents, and I would like to show the filtered services grouped by company. So I filter services by given criteria, but I show the results grouped by company. If I got 1000 services, maybe I need to show just 100 companies (this will affect pagination as well), and how could I get the company info? Should I store the company info in each service (I don't need the company info to be indexed)? regards, Rich __ Information from ESET NOD32 Antivirus, version of virus signature database 5419 (20100902) __ The message was checked by ESET NOD32 Antivirus. http://www.eset.com
Re: Range query not working
On Thu, Sep 23, 2010 at 5:44 PM, Jonathan Rochkind rochk...@jhu.edu wrote: The field type in a standard schema.xml that's defined as integer is NOT sortable. Right - before 1.4. There is no integer field type in 1.4 and beyond in the example schema. You can not sort on this and get what you want. (What's the point of it even existing then, if it pretty much does the same thing as a string field?) You can sort on it... you just can't do range queries on it because the term order isn't correct for numerics. It's there only for support of legacy lucene indexes that indexed numerics as plain strings. They are now named pint for plain integer in 1.4 and above. Perhaps we should retain support for that, but remove them from the example schema and only document them somewhere (under supporting lucene indexes built by other software or something?) -Yonik http://lucenerevolution.org Lucene/Solr Conference, Boston Oct 7-8
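Concretely, the usual fix on 1.4 is to switch the field to one of the trie-based int types from the 1.4 example schema and re-index (the tint definition below follows the 1.4 example schema; the field name comes from the thread):

```xml
<!-- a trie int supports correct numeric sorting and fast range queries -->
<fieldType name="tint" class="solr.TrieIntField" precisionStep="8"
           omitNorms="true" positionIncrementGap="0"/>

<field name="location_rating_total" type="tint" indexed="true" stored="true"/>
```

After re-indexing, fq=location_rating_total:[3 TO 100] filters numerically as expected.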
RE: Search a URL
WDF is not WTF (what I think when I see WDF), right ;-) What is WDF? Dennis Gearon Signature Warning EARTH has a Right To Life, otherwise we all die. Read 'Hot, Flat, and Crowded' Laugh at http://www.yert.com/film.php --- On Thu, 9/23/10, Markus Jelsma markus.jel...@buyways.nl wrote: From: Markus Jelsma markus.jel...@buyways.nl Subject: RE: Search a URL To: solr-user@lucene.apache.org Date: Thursday, September 23, 2010, 2:11 PM Try setting generateWordParts="1" in your WDF (WordDelimiterFilter). Also, having a WhitespaceTokenizer makes little sense for URLs; there should be no whitespace in a URL, and the StandardTokenizer can tokenize a URL. Anyway, the problem is your WDF. -Original message- From: Max Lynch ihas...@gmail.com Sent: Thu 23-09-2010 23:00 To: solr-user@lucene.apache.org; Subject: Search a URL Is there a tokenizer that will allow me to search for parts of a URL? For example, the search google would match on the data "http://mail.google.com/dlkjadf". This tokenizer factory doesn't seem to be sufficient:

<fieldType name="text_standard" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
  </analyzer>
</fieldType>

Thanks.
Re: Can Solr do approximate matching?
Eric, it appears that the /solr/mlt handler is missing, at least based on the URL that I typed. How can I verify the existence of the MoreLikeThis handler and install it? Thanks a lot! Igor On Wed, Sep 22, 2010 at 11:18 AM, Erik Hatcher erik.hatc...@gmail.com wrote: http://www.lucidimagination.com/search/?q=%22find+similar%22 (then narrow to wiki to find things in documentation) which will get you to http://wiki.apache.org/solr/MoreLikeThisHandler Erik On Sep 22, 2010, at 12:12 PM, Li Li wrote: It seems there is a SimilarLikeThis in lucene. I don't know whether there is a counterpart in solr. It just uses the found document as a query to find similar documents. Or you could just use a boolean OR query, and similar questions will get a higher score. Of course, you can analyse the question using some NLP techniques such as identifying entities and ignoring less useful words such as which is ... but I guess the tf*idf score function will also work well 2010/9/22 Igor Chudov ichu...@gmail.com: Hi guys. I am new here. So if I am unwittingly violating any rules, let me know. I am working with Solr because I own algebra.com, where I have a database of 250,000 or so answered math questions. I want to use Solr to provide approximate matching functionality called similar items. So that users looking at a problem could see how similar ones were answered. And my question is, does Solr support some find similar functionality? For example, in my mind, the sentence I like tasty strawberries is 'similar' to a sentence such as I like yummy strawberries, just because both have a few of the same words. So, to end my long winded query, how would I implement a find top ten similar items to this one functionality? Thanks!
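To Igor's question: per the MoreLikeThisHandler wiki page, the handler is registered in solrconfig.xml roughly like this (the /mlt path is the conventional choice; mlt.fl must name one or more indexed fields from your own schema):

```xml
<!-- registers the MoreLikeThis handler; reload the core afterwards -->
<requestHandler name="/mlt" class="solr.MoreLikeThisHandler"/>
```

A request such as /solr/mlt?q=id:123&mlt.fl=text&mlt.mintf=1&mlt.mindf=1 should then return documents similar to the matched one; if that URL 404s, the handler simply isn't defined in solrconfig.xml.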
TokenFilter that removes payload ?
Is there an existing TokenFilter that simply removes payloads from the token stream? Teruhiko Kuro Kurosaka RLP + Lucene Solr = powerful search for global contents