Re: Memory use with sorting problem
Thanks for your reply. I made some memory-saving changes, as per your advice, but the problem remains.

> Set the max warming searchers to 1 to ensure that you never have more than one warming at the same time.

Done.

> How many documents are in your index?

Currently about 8 million.

> If you don't need range queries on these numeric fields, you might try switching from sfloat to float and from sint to int. The fieldCache representation will be smaller.

As far as I can see, slong etc. is also needed for sorting queries (which I do, as mentioned). Anyway, I got an error message when I tried sorting on a long field.

>> Is it normal to need that much memory for such a small index?
> Some things are more related to the number of unique terms or the number of documents than to the size of the index.

Is there a manageable way to find out / limit the number of unique terms in Solr?

Cheers, Chris
Re: Performance problems for OR-queries
>> 1. Does Solr support this kind of index access with better performance? Is there anything special to define in schema.xml?
> No... Solr uses Lucene at its core, and all matching documents for a query are scored.

So it is not possible to have Google-like performance with Solr, i.e. to search for a set of keywords so that only the 10 best documents are listed, without touching the other millions of (web) documents matching fewer keywords. In fact I would not know how to program such an index, though Google has done it somehow..

>> 2. Can one switch off this ordering and just return any 100 documents fulfilling the query (though getting best-matching documents would be a nice feature if it were fast)?
> a feature like this could be developed... but what is the use case for this? What are you trying to accomplish where either relevancy or complete matching doesn't matter? There may be an easier workaround for your specific case.

This is not an actual use case for my project; I just wanted to know if it would be possible. Because of the performance results, we designed a new type of query, and I would like to know how fast it would be before I implement it: I have N keywords and execute a query of the form

keyword1 AND keyword2 AND ... AND keywordN

There may again be some millions of matching documents, and I want to get the first 100 documents. To have an ordering criterion, each Solr document has a field named REV which holds a natural number. The returned 100 documents shall be those with the lowest numbers in the REV field. My questions now are:

(1) Will the query perform in O(100) or in O(all possible matches)?
(2) If the answer to (1) is O(all possible matches), what will be the performance if I don't order by the REV field? Will Solr order by the point in time when a document was created/modified? What do I have to do to get O(100) complexity?

Thanks, Jörg
Re: Document update based on ID
> Yes, SOLR-139 will eventually do what you need. The most recent patch should not be *too* hard to get running (it may not apply cleanly though). The patch as is needs to be reworked before it will go into trunk. I hope this will happen in the next month or so. As for production? It depends ;) The API will most likely change, so if you base your code on the current patch, it will need to change when things finalize. As for stability, it has worked well for me (and I think for Erik).

A useful feature would be update based on query, so that all documents matching the query condition are modified in the same way on the given update fields. Will this feature also be available in the future?
Re: Strange behavior MoreLikeThis Feature
> Now when I run the following query:
> http://localhost:8080/solr/mlt?q=id:neardup06&mlt.fl=features&mlt.mindf=1&mlt.mintf=1&mlt.displayTerms=details&wt=json&indent=on

try adding:

debugQuery=on

to your query string and you can see why each document matches... My guess is that "features" uses a text field with stemming and a stemmed word matches

ryan
Re: Document update based on ID
Jörg Kiegeland wrote:
>> Yes, SOLR-139 will eventually do what you need. The most recent patch should not be *too* hard to get running (it may not apply cleanly though). [...]
> A useful feature would be update based on query, so that all documents matching the query condition are modified in the same way on the given update fields. Will this feature also be available in the future?

interesting, I had not thought of that - but it could be useful. (Potentially dangerous and resource intensive, but so is rm.) Can you add a comment to SOLR-139 with this idea? Once SOLR-139 is more stable, it would make sense to do this as a new issue.

ryan
Grouping multiValued fields
Let's say I have a class Item that has a collection of Sell objects. Sell objects have two properties: sellingTime (Date) and salesPerson (String). So in my Solr schema I have something like the following fields defined:

<field name="id" type="string" indexed="true" stored="true" required="true" />
<field name="sellingTime" type="date" indexed="true" stored="false" multiValued="true" />
<field name="salesPerson" type="text" indexed="true" stored="false" multiValued="true" />

An add might look like the following:

<add>
  <doc>
    <field name="id">1</field>
    <field name="sellingTime">2007-11-23T23:01:00Z</field>
    <field name="salesPerson">John Doe</field>
  </doc>
  <doc>
    <field name="id">2</field>
    <field name="sellingTime">2007-12-24T01:15:00Z</field>
    <field name="salesPerson">John Doe</field>
    <field name="sellingTime">2007-11-23T21:11:00Z</field>
    <field name="salesPerson">Jack Smith</field>
  </doc>
</add>

My problem is that all the historical sales data for the items are getting flattened out. I need the sellingTime and salesPerson fields to be kept as a pair somehow, but I need to store the data as a separate date field so that I can do range searches. Specifically I want to be able to do the following search:

salesPerson:"John Doe" AND sellingTime:[2007-11-23T00:00:00Z TO 2007-11-24T00:00:00Z]

Right now that query would return both items 1 and 2, but I want it to only return item 1. Is there some trick to get this query to work as I want it to? Or do I need to totally restructure my data?
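One workaround (my own suggestion, not something Solr gives you out of the box) is to index one document per Sell rather than flattening all sales onto the Item. A rough in-memory sketch of why that fixes the cross-matching; the `search` function is a stand-in for the equivalent Solr query, and the field names follow the schema above:

```python
from datetime import datetime

# The two items from the example add, with their (person, time) sale pairs.
items = {
    1: [("John Doe", "2007-11-23T23:01:00Z")],
    2: [("John Doe", "2007-12-24T01:15:00Z"),
        ("Jack Smith", "2007-11-23T21:11:00Z")],
}

# One "sale document" per pair, carrying the parent item id. Because each
# document holds exactly one person and one time, the pair stays intact.
sale_docs = [
    {"item_id": item_id, "salesPerson": person, "sellingTime": when}
    for item_id, sales in items.items()
    for person, when in sales
]

def search(person, start, end):
    """Return item ids where the *same* sale satisfies both constraints."""
    lo, hi = datetime.fromisoformat(start), datetime.fromisoformat(end)
    return sorted({
        d["item_id"] for d in sale_docs
        if d["salesPerson"] == person
        and lo <= datetime.fromisoformat(d["sellingTime"].rstrip("Z")) <= hi
    })

# Item 2 is no longer matched: its John Doe sale is outside the range, and
# its in-range sale belongs to Jack Smith.
print(search("John Doe", "2007-11-23T00:00:00", "2007-11-24T00:00:00"))
```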
Heritrix and Solr
I'm looking for a web crawler to use with Solr. The objective is to crawl about a dozen public web sites regarding a specific topic. After a lot of googling, I came across Heritrix, which seems to be the most robust, well-supported open source crawler out there. Heritrix has an integration with Nutch (NutchWax), but not with Solr. I'm wondering if anybody can share any experience using Heritrix with Solr. It seems that there are three options for integration:

1. Write a custom Heritrix "Writer" class which submits documents to Solr for indexing.
2. Write an ARC to Solr input XML format converter to import the ARC files.
3. Use the filesystem mirror writer and then another program to walk the downloaded files.

Has anybody looked into this or have any suggestions on an alternative approach? The optimal answer would be "You dummy, just use XXX to crawl your web sites - there's no 'integration' required at all. Can you believe the temerity? What a poltroon."

Yours in Revolution, George
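For what it's worth, option 3 is easy to prototype. A hedged Python sketch of walking a mirror directory and building a Solr add payload; the field names `id` and `text` are my own invention, and a real version would strip HTML and POST the result to Solr's /update handler:

```python
import os
from xml.sax.saxutils import escape

def build_add_xml(mirror_root):
    """Walk a filesystem-mirror directory (option 3 above) and build a
    Solr <add> payload with one document per downloaded file."""
    docs = []
    for dirpath, _dirnames, filenames in os.walk(mirror_root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="replace") as f:
                body = f.read()
            # Use the path relative to the mirror root as a stable id.
            docs.append(
                "<doc>"
                f"<field name=\"id\">{escape(os.path.relpath(path, mirror_root))}</field>"
                f"<field name=\"text\">{escape(body)}</field>"
                "</doc>"
            )
    return "<add>" + "".join(docs) + "</add>"
```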
Re: Heritrix and Solr
I am interested in this too. Any ideas?

A. Banji Oyebisi
Choicegen, LLC.
Email: [EMAIL PROTECTED]
Web URL: http://www.choicegen.com
Choicegen... Helping you make better choices!

Notice: This email message, together with any attachments, may contain information of Choicegen, LLC., its subsidiaries and affiliated entities, that may be confidential, proprietary, copyrighted and/or legally privileged, and is intended solely for the use of the individual or entity named in this message. If you are not the intended recipient, and have received this message in error, please immediately return this by email and then delete it.

George Everitt wrote:
> I'm looking for a web crawler to use with Solr. The objective is to crawl about a dozen public web sites regarding a specific topic. [...]
Re: Heritrix and Solr
I have some sort of the same requirement, where I need to move to a good crawler. Currently I am using a custom crawler (I mean my own crawler) to crawl some public domains, and I use Lucene to index all the downloaded pages. After doing lots of research I came across JSpider with Lucene. Also, I was looking at Nutch for doing the crawler job, but I don't think that is feasible.

- BR

A. Banji Oyebisi [EMAIL PROTECTED] wrote:
> I am interested in this too. Any ideas?

George Everitt wrote:
> I'm looking for a web crawler to use with Solr. The objective is to crawl about a dozen public web sites regarding a specific topic. [...]
Re: Performance problems for OR-queries
On 22-Nov-07, at 6:02 AM, Jörg Kiegeland wrote:
>>> 1. Does Solr support this kind of index access with better performance? Is there anything special to define in schema.xml?
>> No... Solr uses Lucene at its core, and all matching documents for a query are scored.
> So it is not possible to have Google-like performance with Solr, i.e. to search for a set of keywords so that only the 10 best documents are listed, without touching the other millions of (web) documents matching fewer keywords. In fact I would not know how to program such an index, though Google has done it somehow..

I can be fairly certain that Google does not execute queries that match millions of documents on a single machine. The default query operator is (mostly) AND, so the possible match set is much smaller. Also, I imagine they have relatively few documents per machine.

>>> 2. Can one switch off this ordering and just return any 100 documents fulfilling the query (though getting best-matching documents would be a nice feature if it were fast)?
>> a feature like this could be developed... but what is the use case for this? What are you trying to accomplish where either relevancy or complete matching doesn't matter? There may be an easier workaround for your specific case.
> This is not an actual use case for my project; I just wanted to know if it would be possible. Because of the performance results, we designed a new type of query, and I would like to know how fast it would be before I implement it: I have N keywords and execute a query of the form keyword1 AND keyword2 AND ... AND keywordN. There may again be some millions of matching documents, and I want to get the first 100 documents. To have an ordering criterion, each Solr document has a field named REV which holds a natural number. The returned 100 documents shall be those with the lowest numbers in the REV field. My questions now are:
> (1) Will the query perform in O(100) or in O(all possible matches)?

O(all possible matches)

> (2) If the answer to (1) is O(all possible matches), what will be the performance if I don't order by the REV field? Will Solr order by the point in time when a document was created/modified? What do I have to do to get O(100) complexity?

Ordering by natural document order in the index is sufficient to achieve O(100), but you'll have to insert code in Solr to stop after 100 docs (another alternative is to stop processing after a given amount of time). Also, using O() in this case isn't quite accurate: there are costs that vary based on the number of docs in the index too.

-Mike
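To illustrate the difference Mike describes, here is a small Python sketch (not Solr code, just the collection logic): gathering the k lowest-REV documents over all matches versus stopping after k hits when the index order already agrees with REV.

```python
import heapq

def top_k_lowest(matches, k=100):
    """Collect the k docs with the lowest REV across *all* matches.
    This is O(n log k): every matching document is touched, which is
    the behaviour the answer above describes."""
    return heapq.nsmallest(k, matches, key=lambda d: d["REV"])

def first_k_in_index_order(docs, predicate, k=100):
    """If documents were indexed in ascending REV order, scanning in
    index order and stopping after k hits yields the same result
    while only touching ~k matches (plus the skipped non-matches)."""
    out = []
    for d in docs:  # docs assumed sorted by REV at index time
        if predicate(d):
            out.append(d)
            if len(out) == k:
                break
    return out
```

Both return the same documents when the index-order assumption holds; the point is only the second one can stop early, which is exactly the code Solr would need to gain.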
Re: Heritrix and Solr
On Thu, 22 Nov 2007 10:41:41 -0500 George Everitt [EMAIL PROTECTED] wrote:
> After a lot of googling, I came across Heritrix, which seems to be the most robust, well-supported open source crawler out there. Heritrix has an integration with Nutch (NutchWax), but not with Solr. I'm wondering if anybody can share any experience using Heritrix with Solr.

out on a limb here... both Nutch and SOLR use Lucene for the actual indexing / searching. Would the indexes generated with Nutch be compatible / readable with SOLR?

_
{Beto|Norberto|Numard} Meijome

"Why do you sit there looking like an envelope without any address on it?" Mark Twain

I speak for myself, not my employer. Contents may be hot. Slippery when wet. Reading disclaimers makes you go blind. Writing them is worse. You have been Warned.
Re: Any tips for indexing large amounts of data?
Brendan - yes, 64-bit Linux this is, and the JVM got a 5.5 GB heap, though it could have worked with less.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
From: Brendan Grainger [EMAIL PROTECTED]
To: solr-user@lucene.apache.org
Sent: Wednesday, November 21, 2007 1:24:05 PM
Subject: Re: Any tips for indexing large amounts of data?

Hi Otis,

Thanks for this. Are you using a flavor of Linux, and is it 64-bit? How much heap are you giving your JVM?

Thanks again,
Brendan

On Nov 21, 2007, at 2:03 AM, Otis Gospodnetic wrote:
> Mike is right about the occasional slow-down, which appears as a pause and is due to large Lucene index segment merging. This should go away with newer versions of Lucene where this is happening in the background. That said, we just indexed about 20MM documents on a single 8-core machine with 8 GB of RAM, resulting in a nearly 20 GB index. The whole process took a little less than 10 hours - that's over 550 docs/second. The vanilla approach before some of our changes apparently required several days to index the same amount of data.
>
> ----- Original Message ----
> From: Mike Klaas [EMAIL PROTECTED]
> Sent: Monday, November 19, 2007 5:50:19 PM
> Subject: Re: Any tips for indexing large amounts of data?
>
>> There should be some slowdown in larger indices, as occasionally large segment merge operations must occur. However, this shouldn't really affect overall speed too much. You haven't really given us enough data to tell you anything useful. I would recommend trying to do the indexing via a webapp to eliminate all your code as a possible factor. Then, look for signs of what is happening when indexing slows. For instance, is Solr high in CPU, is the computer thrashing, etc?
>> -Mike
>>
>> On 19-Nov-07, at 2:44 PM, Brendan Grainger wrote:
>>> Hi, thanks for answering this question a while back. I have made some of the suggestions you mentioned, i.e. not committing until I've finished indexing. What I am seeing, though, is that as the index gets larger (around 1 GB), indexing takes a lot longer. In fact it slows down to a crawl. Have you got any pointers as to what I might be doing wrong? Also, I was looking at using multicore Solr. Could this help in some way? Thank you, Brendan
>>>
>>> On Oct 31, 2007, at 10:09 PM, Chris Hostetter wrote:
>>>> : I would think you would see better performance by allowing auto commit
>>>> : to handle the commit size instead of reopening the connection all the
>>>> : time.
>>>> if your goal is fast indexing, don't use autoCommit at all ... just index everything, and don't commit until you are completely done. autoCommitting will slow your indexing down (the benefit being that more results will be visible to searchers as you proceed)
>>>> -Hoss
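Hoss's advice boils down to: batch the adds and commit exactly once at the end. A schematic sketch of that pattern; the `post` callable here is a stand-in for whatever sends an update request to Solr (e.g. an HTTP POST to /update), not an actual client API:

```python
def index_all(docs, post, batch_size=500):
    """Send documents in batches and issue a single commit at the end.
    No autoCommit, no per-batch commit: indexing stays fast, at the cost
    of results only becoming searchable once everything is done."""
    batch = []
    for doc in docs:
        batch.append(doc)
        if len(batch) >= batch_size:
            post({"add": list(batch)})  # one update request per batch
            batch.clear()
    if batch:
        post({"add": batch})            # flush the final partial batch
    post({"commit": True})              # single commit after all adds
```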
Re: Heritrix and Solr
On Thu, 22 Nov 2007 19:10:46 -0800 (PST) Otis Gospodnetic [EMAIL PROTECTED] wrote:
> The answer to that question, Norberto, would depend on versions.

Otis, would that relate to what underlying version of Lucene is being used in either Solr / Nutch?

_
{Beto|Norberto|Numard} Meijome

"Web2.0 is outsourced R&D from Web1.0 companies." The Reverend

I speak for myself, not my employer. Contents may be hot. Slippery when wet. Reading disclaimers makes you go blind. Writing them is worse. You have been Warned.
C++ type of analysis issues
Hi there,

I haven't found any existing filter/tokenizer that can deal with "C++"-type search keywords. I'm using WordDelimiterFilter, which removes the "++". One way I am thinking of right now is to use a synonym filter before the WordDelimiterFilter to replace c++ (after lowercasing it) with, say, cpp, and to use the synonym filter for both indexing and querying. That would cause a "cpp" string to be found as a result of a search for c++ (or C++), but I guess this is not a big problem. Anyway, I feel this is a common issue and must have been solved by someone already, so does anyone have a better solution?

Thanks, -Hui
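The proposed chain can be mocked up outside Solr to check the idea. A rough Python approximation (these are not the actual Solr filters, just the logic of mapping whole tokens to synonyms before delimiter splitting would otherwise discard the "++"):

```python
import re

# Whole-token synonym map, applied after lowercasing and *before* the
# delimiter-splitting step, mirroring the proposed filter order.
SYNONYMS = {"c++": "cpp"}

def analyze(text):
    """Sketch of the chain: lowercase, map whole-token synonyms like
    c++ -> cpp, then split on non-alphanumerics the way a
    WordDelimiterFilter-style step would."""
    out = []
    for token in text.lower().split():
        token = SYNONYMS.get(token, token)
        out.extend(t for t in re.split(r"[^a-z0-9]+", token) if t)
    return out

print(analyze("Senior C++ developer"))  # -> ['senior', 'cpp', 'developer']
```

Without the synonym step, "c++" would come out as plain "c", so the mapping has to run first, at both index and query time, exactly as proposed.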
Re: Document update based on ID
This can be useful, but it is limited. At Infoseek, we used this for demoting porn and spam in the index in 1996, but replaced it with more precise approaches.

wunder

On 11/22/07 6:49 AM, Ryan McKinley [EMAIL PROTECTED] wrote:
> Jörg Kiegeland wrote:
>> A useful feature would be update based on query, so that all documents matching the query condition are modified in the same way on the given update fields. Will this feature also be available in the future?
> interesting, I had not thought of that - but it could be useful. (Potentially dangerous and resource intensive, but so is rm.) Can you add a comment to SOLR-139 with this idea? Once SOLR-139 is more stable, it would make sense to do this as a new issue.
> ryan
Re: Heritrix and Solr
The answer to that question, Norberto, would depend on versions.

George: why not just use straight Nutch and forget about Heritrix?

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
From: Norberto Meijome [EMAIL PROTECTED]
To: solr-user@lucene.apache.org
Sent: Thursday, November 22, 2007 5:54:32 PM
Subject: Re: Heritrix and Solr

> out on a limb here... both Nutch and SOLR use Lucene for the actual indexing / searching. Would the indexes generated with Nutch be compatible / readable with SOLR?
Re: Heritrix and Solr
Hi George,

Thank you for your kind words about Lucene in Action. :)

I wouldn't compare Solr and Nutch; they are really made for different things. I was suggesting Nutch instead of Heritrix, not instead of Solr. The Solr+Nutch patch is in JIRA, and there is a fresh patch in there, still warm. Try it out.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
From: George Everitt [EMAIL PROTECTED]
To: solr-user@lucene.apache.org
Sent: Thursday, November 22, 2007 10:58:08 PM
Subject: Re: Heritrix and Solr

> Otis: There are many reasons I prefer Solr to Nutch: [...]
Re: Heritrix and Solr
Otis:

There are many reasons I prefer Solr to Nutch:

1. I actually tried to do some of the crawling with Nutch, but found the crawling options less flexible than I would have liked.
2. I prefer the Solr approach in general. I have a long background in Verity and Autonomy search, and Solr is a bit closer to them than Nutch.
3. I really like the schema support in Solr.
4. I really really like the facets/parametric search in Solr.
5. I really really really like the REST interface in Solr.
6. Finally, and not to put too fine a point on it, Hadoop frightens the bejeebers out of me. I've skimmed some of the papers and it looks like a lot of study before I will fully understand it. I'm not saying I'm stupid and lazy, but if the map-reduce algorithm fits, I'll wear it. Plus, I'm trying to get a mental handle on Jeff Hawkins' HTM and its application to the real world. It all makes my cerebral cortex itchy.

Thanks for the suggestion, though. I'll probably revisit Nutch again if Heritrix lets me down. I had no luck getting the Nutch crawler Solr patch to work, either. Sadly, I'm the David Lee Roth of Java programmers - I may think that I'm hard-core, but I'm not, really. And my groupies are getting a bit saggy.

BTW - add my voice to the paeans of praise for Lucene in Action. You and Erik did a bang-up job, and I surely appreciate all the feedback you give on this forum, especially over the past few months as I feel my way through Solr and Lucene.

On Nov 22, 2007, at 10:10 PM, Otis Gospodnetic wrote:
> The answer to that question, Norberto, would depend on versions.
> George: why not just use straight Nutch and forget about Heritrix?
Re: Memory use with sorting problem
I'd have to check, but the Luke handler might spit that out. If not, Lucene's TermEnum & co. are your friends. :)

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
From: Chris Laux [EMAIL PROTECTED]
To: solr-user@lucene.apache.org
Sent: Thursday, November 22, 2007 7:22:56 AM
Subject: Re: Memory use with sorting problem

> Thanks for your reply. I made some memory-saving changes, as per your advice, but the problem remains. [...] Is there a manageable way to find out / limit the number of unique terms in Solr?
> Cheers, Chris
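On the memory side of the question, a back-of-envelope estimate can help judge whether 8 million documents should need that much heap for sorting. The constants below are assumptions for illustration, not Lucene's exact FieldCache accounting:

```python
def sort_cache_estimate(num_docs, num_fields_int=0, num_fields_str=0,
                        unique_terms=0, avg_term_len=8):
    """Rough FieldCache sizing. Assumptions: each int/float sort field
    holds one 4-byte value per document; each string sort field holds a
    4-byte ordinal per document plus the unique term strings themselves
    (Java chars are 2 bytes each). Real overheads are higher."""
    ints = num_fields_int * num_docs * 4
    strs = num_fields_str * (num_docs * 4 + unique_terms * avg_term_len * 2)
    return ints + strs

# e.g. 8M docs with three numeric sort fields: ~90+ MB just for the caches,
# per searcher, and doubled while a warming searcher overlaps the old one.
print(sort_cache_estimate(8_000_000, num_fields_int=3) / 2**20, "MiB")
```

This is why sorting cost tracks document count (and unique-term count for string fields) rather than on-disk index size, as noted above.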
Re: Strange behavior MoreLikeThis Feature
Thanks Ryan. I now know the reason why. Before I explain the reason, let me correct the mistake I made in my earlier mail. I was not using the first document mentioned in the XML. Instead it was this one:

<doc>
  <field name="id">IW-02</field>
  <field name="name">iPod &amp; iPod Mini USB 2.0 Cable</field>
  <field name="manu">Belkin</field>
  <field name="cat">electronics</field>
  <field name="cat">connector</field>
  <field name="features">car power adapter for iPod, white</field>
  <field name="weight">2</field>
  <field name="price">11.50</field>
  <field name="popularity">1</field>
  <field name="inStock">false</field>
</doc>

The reason I was getting a strange result was the character "i". Here is what I learnt from the debug info:

"debug": {
  "rawquerystring": "id:neardup06",
  "querystring": "id:neardup06",
  "parsedquery": "features:og features:en features:til features:er features:af features:der features:ts features:se features:i features:p features:pet features:brag features:efter features:zombier features:k features:tilbag features:ala features:sviner features:folk features:klassisk features:resid features:horder features:lidt features:man features:denn",
  "explain": {
    "id=IW-02, internal_docid=8":
      0.0050230525 = (MATCH) product of:
        0.12557632 = (MATCH) sum of:
          0.12557632 = (MATCH) weight(features:i in 8), product of:
            0.17474915 = queryWeight(features:i), product of:
              1.9162908 = idf(docFreq=3)
              0.09119135 = queryNorm
            0.71860904 = (MATCH) fieldWeight(features:i in 8), product of:
              1.0 = tf(termFreq(features:i)=1)
              1.9162908 = idf(docFreq=3)
              0.375 = fieldNorm(field=features, doc=8)
        0.04 = coord(1/25)
  }
}

The field "features" uses the default fieldtype "text" in the schema.xml. The problem was solved by adding the character "i" to the stopwords.txt file: the "i" terms in document 2 were matched with the "i" in "iPod" of document 1. I still have to figure out why a single character, "i", matched the "i" in a word, "iPod".

Regards, Rishabh

On 22/11/2007, Ryan McKinley [EMAIL PROTECTED] wrote:
>> Now when I run the following query:
>> http://localhost:8080/solr/mlt?q=id:neardup06&mlt.fl=features&mlt.mindf=1&mlt.mintf=1&mlt.displayTerms=details&wt=json&indent=on
> try adding: debugQuery=on to your query string and you can see why each document matches... My guess is that "features" uses a text field with stemming and a stemmed word matches
> ryan
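A likely explanation for the standalone "i" (my assumption; check the analyzer chain for the "features" field type in your schema) is that WordDelimiterFilter splits on case transitions, so indexing "iPod" also produces an "i" term. A rough approximation of that splitting behaviour:

```python
import re

def word_delimiter_split(token):
    """Approximation of WordDelimiterFilter's default case-change and
    non-alphanumeric splitting: 'iPod' -> ['i', 'Pod']. This would
    explain why document 1 contains a standalone 'i' term."""
    parts = re.split(r"[^0-9A-Za-z]+", token)
    out = []
    for p in parts:
        # Split each part at digit boundaries and lower->upper transitions.
        out.extend(m.group(0) for m in
                   re.finditer(r"[0-9]+|[A-Z]+(?![a-z])|[A-Z]?[a-z]+", p))
    return [t for t in out if t]

print(word_delimiter_split("iPod"))  # -> ['i', 'Pod']
```

If this is the cause, an alternative to stopping "i" globally via stopwords.txt would be adjusting the filter's splitting options for that field type.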