Facing problem with the FieldType of UniqueField
Hi,

Initially, the field type of my uniqueKey field in my schema was string, and everything worked fine. But then the users of my application wanted to search on the unique field, and they entered values in a different case than what was indexed. They never got proper results, and at times no results at all. I noticed this happened because the field type was string. I then changed it to a custom text type with only the whitespace tokenizer and the lowercase filter. It worked: users could search on the unique field regardless of the case in which the values were entered.

But now another problem has arisen. If I update a document, it is not really updated; it is added as a separate document with the same unique-field value, and the contents of the old document get merged with the new one. So what I want to achieve is this: users should be able to search on the unique field regardless of the case the values are entered in, and updating a document should not leave duplicate documents (documents with the same unique-field value) in the index. Can anyone help me with how this can be done?

Regards,
Rishabh
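A frequently suggested setup for this is a TextField whose analyzer emits exactly one lowercased token, so lookups become case-insensitive while the uniqueKey still maps to a single term. A minimal schema.xml sketch (type and field names invented here):

  <fieldType name="lowercaseKey" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

  <field name="docId" type="lowercaseKey" indexed="true" stored="true"/>
  <uniqueKey>docId</uniqueKey>

As the follow-up below shows, overwrite-on-add did not always honor the analyzed form of the key in 1.x releases, so test this against your version; the belt-and-braces alternative is to keep the key a plain string field and lowercase it in the client both when indexing and when querying.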
Re: Facing problem with the FieldType of UniqueField
Ryan,

Using the KeywordTokenizer does not help. And there are no spaces in the unique keys; the keys are alphanumeric, e.g. AA-23-E1.

Regards,
Rishabh

On Thu, Feb 14, 2008 at 10:28 PM, Ryan McKinley [EMAIL PROTECTED] wrote:

> I noticed this happened because the field type was string. I then changed it to a custom text type and had specified only the whitespace tokenizer and lowercase filter. It worked. The users were able to search on the

Are there spaces in your unique key? Try using the KeywordTokenizer -- the main concern with the field type for uniqueKey is to make sure it only produces one token.

ryan
Restrict values in a multivalued field
Hi,

In my schema I have a multivalued field whose values are stored and indexed. I wanted to know if it is possible to restrict the number of values returned from that field on a search, and how. If I have, say, thousands of values in that multivalued field, returning all of them would put a lot of load on the system, so I want to restrict it to return only, say, 50 values out of the thousands.

Regards,
Rishabh
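As far as I know, Solr 1.2 has no request parameter that caps how many values of a stored multivalued field are returned, so the trimming has to happen on the client after the response is parsed. A minimal helper, assuming the field's values have already been pulled out of the response:

  import java.util.ArrayList;
  import java.util.List;

  public class FieldTrim {
      // Keep at most 'limit' values of a multivalued field
      // before handing the document to the rest of the app.
      static <T> List<T> firstN(List<T> values, int limit) {
          return new ArrayList<T>(values.subList(0, Math.min(limit, values.size())));
      }
  }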
How to perform a double query in one
Hi,

Is there a way to perform two search queries in one search request and return their combined results? Currently I do the following. I have a document that consists of an id field (the unique identifier), an info field, and an xid field that contains the ids of other documents (also indexed) that it relates to; the xid field thus maps a document to two or more others. In my first search, I look up a document by a given id, say XYZ, on the id field. This gives me exactly one document, and I retrieve its xid content. Then I search for the same id, XYZ, in the xid field and retrieve the xid content of all the matching documents.

Can I perform the same operation in one query? If yes, how do I go about it? Do I need to write a custom request handler? If not, is there any other efficient way to do the same?

Regards,
Rishabh
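If the goal is just to fetch the XYZ document and every document whose xid references XYZ in a single round trip, one option worth trying is a boolean query across both fields (standard Lucene query syntax, no custom handler needed); the client can then tell the two cases apart by inspecting the id field of each hit:

  http://localhost:8983/solr/select?q=id:XYZ+OR+xid:XYZ

The host and port here are the example-server defaults; substitute your own.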
Retrieving Tokens
Hi,

I have created my own Tokenizer and I am indexing documents with it. I wanted to know if there is a way to retrieve the tokens (created by my custom tokenizer) from the index. Do we have to modify the code to get these tokens?

Regards,
Rishabh
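The Luke toolbox is the usual way to browse indexed terms interactively. Programmatically, the terms can be enumerated straight from the Lucene index that Solr writes; a sketch against the Lucene 2.x API of that era (the index path and field name are placeholders):

  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.index.Term;
  import org.apache.lucene.index.TermEnum;

  public class DumpTokens {
      public static void main(String[] args) throws Exception {
          IndexReader reader = IndexReader.open("/path/to/solr/data/index");
          // Position the enumerator at the first term of the field.
          TermEnum terms = reader.terms(new Term("myfield", ""));
          try {
              do {
                  Term t = terms.term();
                  if (t == null || !t.field().equals("myfield")) break;
                  System.out.println(t.text() + " (docFreq=" + terms.docFreq() + ")");
              } while (terms.next());
          } finally {
              terms.close();
              reader.close();
          }
      }
  }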
Creating user-defined field types
Hi,

Can anyone guide me on how to implement a user-defined field type in Solr? I could not find anything on the Solr wiki. Help of any kind would be appreciated.

Regards,
Rishabh
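Often no new Java is needed at all: a fieldType with class="solr.TextField" and a custom tokenizer/filter chain covers most cases. When behavior beyond analysis is required, a field type is a class extending org.apache.solr.schema.FieldType (or one of its subclasses) that Solr loads by class name from the schema. A hedged sketch against the Solr 1.x API; method signatures may differ between versions:

  import org.apache.solr.schema.StrField;

  // Indexes (and stores) every value lowercased, so lookups are
  // case-insensitive without an analyzer. Purely illustrative.
  public class LowerCaseStrField extends StrField {
      @Override
      public String toInternal(String val) {
          return val == null ? null : val.toLowerCase();
      }
  }

It would then be referenced from schema.xml as <fieldType name="lcstr" class="com.example.LowerCaseStrField"/> (the class name here is invented).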
How to store a HashSet in the index?
Hi,

Can anyone help me with how to go about efficiently indexing (actually, storing in the index) and retrieving a HashSet object that contains multiple string arrays? I just want to store the HashSet in the index, not search on it. The HashSet should be returned with the document when I perform a search on any other field.

Regards,
Rishabh
Re: How to store a HashSet in the index?
Thanks Erik!

Rishabh

On Dec 10, 2007 3:30 PM, Erik Hatcher [EMAIL PROTECTED] wrote:

On Dec 10, 2007, at 3:10 AM, Rishabh Joshi wrote:

> Can anyone help me with how to go about efficiently indexing (actually, storing in the index) and retrieving a HashSet object that contains multiple string arrays? I just want to store the HashSet in the index, not search on it. The HashSet should be returned with the document when I perform a search on any other field.

If you have Java on both the indexing and querying sides of things, you could simply serialize and stringify (via uuencoding, perhaps) the HashSet, and deserialize it on retrieval. Just be sure to set the field to be untokenized and stored.

Erik
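Along the lines Erik describes, a sketch of the round trip (java.util.Base64 is a Java 8+ convenience; the 2007-era equivalent would have been a codec such as commons-codec):

  import java.io.*;
  import java.util.Base64;
  import java.util.HashSet;

  public class SetCodec {
      // Serialize the HashSet and stringify it for an untokenized, stored field.
      static String encode(HashSet<String[]> set) throws IOException {
          ByteArrayOutputStream bytes = new ByteArrayOutputStream();
          ObjectOutputStream out = new ObjectOutputStream(bytes);
          out.writeObject(set); // HashSet and String[] are both Serializable
          out.close();
          return Base64.getEncoder().encodeToString(bytes.toByteArray());
      }

      // Reverse the process on retrieval.
      @SuppressWarnings("unchecked")
      static HashSet<String[]> decode(String stored) throws IOException, ClassNotFoundException {
          ObjectInputStream in = new ObjectInputStream(
                  new ByteArrayInputStream(Base64.getDecoder().decode(stored)));
          return (HashSet<String[]>) in.readObject();
      }
  }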
Re: Strange behavior MoreLikeThis Feature
Thanks Ryan. I now know the reason why. Before I explain it, let me correct the mistake I made in my earlier mail: I was not using the first document mentioned in the XML. Instead it was this one:

  <doc>
    <field name="id">IW-02</field>
    <field name="name">iPod &amp; iPod Mini USB 2.0 Cable</field>
    <field name="manu">Belkin</field>
    <field name="cat">electronics</field>
    <field name="cat">connector</field>
    <field name="features">car power adapter for iPod, white</field>
    <field name="weight">2</field>
    <field name="price">11.50</field>
    <field name="popularity">1</field>
    <field name="inStock">false</field>
  </doc>

The reason I was getting strange results was the character 'i'. Here is what I learnt from the debug info:

  "debug": {
    "rawquerystring": "id:neardup06",
    "querystring": "id:neardup06",
    "parsedquery": "features:og features:en features:til features:er features:af features:der features:ts features:se features:i features:p features:pet features:brag features:efter features:zombier features:k features:tilbag features:ala features:sviner features:folk features:klassisk features:resid features:horder features:lidt features:man features:denn",
    "parsedquery_toString": "features:og features:en features:til features:er features:af features:der features:ts features:se features:i features:p features:pet features:brag features:efter features:zombier features:k features:tilbag features:ala features:sviner features:folk features:klassisk features:resid features:horder features:lidt features:man features:denn",
    "explain": {
      "id=IW-02,internal_docid=8":
        0.0050230525 = (MATCH) product of:
          0.12557632 = (MATCH) sum of:
            0.12557632 = (MATCH) weight(features:i in 8), product of:
              0.17474915 = queryWeight(features:i), product of:
                1.9162908 = idf(docFreq=3)
                0.09119135 = queryNorm
              0.71860904 = (MATCH) fieldWeight(features:i in 8), product of:
                1.0 = tf(termFreq(features:i)=1)
                1.9162908 = idf(docFreq=3)
                0.375 = fieldNorm(field=features, doc=8)
          0.04 = coord(1/25)
    }
  }

The features field uses the default field type, text, from schema.xml. The problem was solved by adding the character 'i' to the stopwords.txt file: the 'i's in document 2 were being matched against the 'i' in 'iPod' in document 1. I still have to figure out why a single character, 'i', matched the 'i' inside a word like 'iPod'.

Regards,
Rishabh

On 22/11/2007, Ryan McKinley [EMAIL PROTECTED] wrote:

> Now when I run the following query:
> http://localhost:8080/solr/mlt?q=id:neardup06&mlt.fl=features&mlt.mindf=1&mlt.mintf=1&mlt.displayTerms=details&wt=json&indent=on

Try adding debugQuery=on to your query string and you can see why each document matches... My guess is that features uses a text field with stemming, and a stemmed word matches.

ryan
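For reference, the 'text' type in the example schema of that era ran roughly the index-time chain below (abbreviated); the WordDelimiterFilter is presumably what split 'iPod' into 'i' and 'Pod' in the first place, which would explain why a lone 'i' could match:

  <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
      <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
              generateNumberParts="1" catenateWords="1" catenateNumbers="1"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    </analyzer>
  </fieldType>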
Re: Near Duplicate Documents
Thanks for the info, Cuong!

Regards,
Rishabh

On Nov 21, 2007 1:59 PM, climbingrose [EMAIL PROTECTED] wrote:

The duplication detection mechanism in Nutch is quite primitive. I think it uses an MD5 signature generated from the content of a field. The generation algorithm is described here: http://lucene.apache.org/nutch/apidocs-0.8.x/org/apache/nutch/crawl/TextProfileSignature.html . The problem with this approach is that an MD5 hash is very sensitive: a one-letter difference will generate a completely different hash. You probably have to roll your own near-duplicate detection algorithm. My advice is to have a look at the existing literature on near-duplicate detection techniques and then implement one of them. I know Google has some papers that describe a technique called minhash. I read the paper and found it very interesting. I'm not sure you can implement that exact algorithm, because they have patented it. That said, there is plenty of literature on near-dup detection, so you should be able to get one for free!

On Nov 21, 2007 6:57 PM, Rishabh Joshi [EMAIL PROTECTED] wrote:

> Otis, thanks for your response. I just gave a quick look at the Nutch forum and found that there is an implementation to de-duplicate documents/pages, but none for near-duplicate documents. Can you guide me a little further as to where exactly under Nutch I should be concentrating regarding near-duplicate documents? Regards, Rishabh

On Nov 21, 2007 12:41 PM, Otis Gospodnetic [EMAIL PROTECTED] wrote:

> To whomever started this thread: look at Nutch. I believe something related to this already exists in Nutch for near-duplicate detection. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message -----
From: Mike Klaas [EMAIL PROTECTED]
To: solr-user@lucene.apache.org
Sent: Sunday, November 18, 2007 11:08:38 PM
Subject: Re: Near Duplicate Documents

On 18-Nov-07, at 8:17 AM, Eswar K wrote:

> Is there any idea of implementing that feature in the upcoming releases?

Not currently. Feel free to contribute something if you find a good solution <g>.

-Mike

On Nov 18, 2007 9:35 PM, Stuart Sierra [EMAIL PROTECTED] wrote:

On Nov 18, 2007 10:50 AM, Eswar K [EMAIL PROTECTED] wrote:

> We have a scenario where we want to find documents which are similar in content. To elaborate a little on what we mean here, let's take an example. This email chain we are interacting in can be used to illustrate the concept of near dupes (we are not confusing this with threads; they are two different things). Each email in this thread is treated as a document by the system. A reply to the original mail also includes the original mail, in which case it becomes a near duplicate of the original mail (depending on the percentage of similarity). And so on. The near dupes need not be limited to emails.

I think this is what's known as shingling. See http://en.wikipedia.org/wiki/W-shingling . Lucene (and therefore Solr) does not implement shingling. The MoreLikeThis query might be close enough, however.

-Stuart

--
Regards,
Cuong Hoang
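To make the shingling idea concrete: w-shingling reduces each document to the set of its overlapping w-word windows and calls two documents near-duplicates when the Jaccard overlap of those sets is high. A toy sketch, independent of Solr:

  import java.util.HashSet;
  import java.util.Set;

  public class Shingles {
      // Build the set of w-word shingles for a document (w-shingling).
      static Set<String> shingles(String text, int w) {
          String[] words = text.toLowerCase().trim().split("\\s+");
          Set<String> out = new HashSet<String>();
          for (int i = 0; i + w <= words.length; i++) {
              StringBuilder sb = new StringBuilder();
              for (int j = 0; j < w; j++) {
                  if (j > 0) sb.append(' ');
                  sb.append(words[i + j]);
              }
              out.add(sb.toString());
          }
          return out;
      }

      // Jaccard similarity of two shingle sets; close to 1.0 means near-duplicate.
      static double jaccard(Set<String> a, Set<String> b) {
          Set<String> inter = new HashSet<String>(a);
          inter.retainAll(b);
          Set<String> union = new HashSet<String>(a);
          union.addAll(b);
          return union.isEmpty() ? 1.0 : (double) inter.size() / union.size();
      }
  }

Minhash, mentioned above, is essentially a way to estimate this same Jaccard similarity without materializing the full shingle sets.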
Re: Performance of Solr on different Platforms
Eswar,

This link should give you a fair idea of how Solr is used by some sites/companies: http://wiki.apache.org/solr/SolrPerformanceData

Rishabh

On Nov 20, 2007 10:49 AM, Eswar K [EMAIL PROTECTED] wrote:

In our case the load is fairly distributed; on average the QPS would be much less than that. 1000 qps is the peak load we ever expect to reach. However, the number of documents is going to be in the range of 2 to 20 million. We would possibly distribute the indexes to different Solr instances and direct queries accordingly to reduce the QPS.

- Eswar

On Nov 20, 2007 10:42 AM, Walter Underwood [EMAIL PROTECTED] wrote:

1000 qps is a lot of load, at least 30M queries/day. We are running dual-CPU Power P5 machines and getting about 80 qps with worst-case response times of 5 seconds; 90% of responses are under 70 msec. Our expected peak load is 300 qps on our back-end Solr farm, and we execute multiple back-end queries for each query page. With N+1 sizing (full throughput with one server down), we have five servers to do that. We have a separate server for indexing and use the Solr distribution scripts. We have a relatively small index, about 250K docs.

wunder

On 11/19/07 8:48 PM, Eswar K [EMAIL PROTECTED] wrote:

It's not going to hit 1000 all the time; that is the expected peak value. I guess for distributing the load we should be using collections, and I was looking at the collection distribution documentation (http://wiki.apache.org/solr/CollectionDistribution).

- Eswar

On Nov 20, 2007 12:07 AM, Matthew Runo [EMAIL PROTECTED] wrote:

I'd think that any platform that can run Java would be fine to run Solr on. Maybe this is more a question of preferred platforms for Java deployments? That is quite a load for Solr, though; you may find that you want more than one server. Do you mean that you're expecting about 1000 QPS over an index with up to 20 million documents?

--Matthew

On Nov 19, 2007, at 6:00 AM, Eswar K wrote:

All, can you give some information on this, or at least let me know where I can find it if it is already listed anywhere? Regards, Eswar

On Nov 18, 2007 9:45 PM, Eswar K [EMAIL PROTECTED] wrote:

Hi, I understand that Solr can be used on different Linux flavors. Is there any preferred flavor (like Red Hat, Ubuntu, etc.)? Also, what kind of hardware configuration (processors, RAM, etc.) would be best suited for the install? We expect to load it with millions of documents (varying from 2 to 20 million), and there might be around 1000 concurrent users. Your help in this regard will be appreciated. Regards, Eswar
rows=VERY_LARGE_VALUE throws exception, and error in some cases
Hi,

We are using Solr 1.2 for our project and have come across the following exception and error.

Exception:

  SEVERE: java.lang.OutOfMemoryError: Java heap space
      at org.apache.lucene.util.PriorityQueue.initialize(PriorityQueue.java:36)

Steps to reproduce:
1. Restart your web server.
2. Enter a query with a VERY_LARGE_VALUE for the rows parameter, for example:
   http://xx.xx.xx.xx:8080/solr/select?q=unix&start=0&fl=id&indent=off&rows=VERY_LARGE_VALUE
3. Press Enter or click the 'Go' button in the browser.

Notes:
1. This exception is thrown if '9999999' (seven digits) <= VERY_LARGE_VALUE <= '999999999' (nine digits).
2. The exception DOES NOT APPEAR AGAIN if we change VERY_LARGE_VALUE to <= '999999' (six digits), execute the query, and then change VERY_LARGE_VALUE back to its original value and execute the query again.
3. If VERY_LARGE_VALUE = '9999999999' (ten digits), we get the following error instead:

   HTTP Status 400 - For input string: 9999999999

Has anyone come across this scenario before?

Regards,
Rishabh
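For what it's worth, both symptoms are reproducible in plain Java, which suggests what is going on: the stack trace shows Lucene sizing its result priority queue from rows up front, and a ten-digit value no longer fits in an int at all. A hedged illustration:

  public class RowsDemo {
      public static void main(String[] args) {
          // Ten digits exceed Integer.MAX_VALUE (2147483647), so parsing fails;
          // the exception message matches the HTTP 400 text above.
          try {
              Integer.parseInt("9999999999");
          } catch (NumberFormatException e) {
              System.out.println(e.getMessage()); // For input string: "9999999999"
          }
          // A parseable but huge rows value makes Lucene allocate a priority
          // queue of that size up front, hence the OutOfMemoryError on a
          // default heap. (This line will intentionally exhaust memory.)
          Object[] heap = new Object[999999999]; // several GB of references
      }
  }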
Re: Near Duplicate Documents
Otis,

Thanks for your response. I just gave a quick look at the Nutch forum and found that there is an implementation to de-duplicate documents/pages, but none for near-duplicate documents. Can you guide me a little further as to where exactly under Nutch I should be concentrating regarding near-duplicate documents?

Regards,
Rishabh

On Nov 21, 2007 12:41 PM, Otis Gospodnetic [EMAIL PROTECTED] wrote:

To whomever started this thread: look at Nutch. I believe something related to this already exists in Nutch for near-duplicate detection.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message -----
From: Mike Klaas [EMAIL PROTECTED]
To: solr-user@lucene.apache.org
Sent: Sunday, November 18, 2007 11:08:38 PM
Subject: Re: Near Duplicate Documents

On 18-Nov-07, at 8:17 AM, Eswar K wrote:

> Is there any idea of implementing that feature in the upcoming releases?

Not currently. Feel free to contribute something if you find a good solution <g>.

-Mike

On Nov 18, 2007 9:35 PM, Stuart Sierra [EMAIL PROTECTED] wrote:

On Nov 18, 2007 10:50 AM, Eswar K [EMAIL PROTECTED] wrote:

> We have a scenario where we want to find documents which are similar in content. To elaborate a little on what we mean here, let's take an example. This email chain we are interacting in can be used to illustrate the concept of near dupes (we are not confusing this with threads; they are two different things). Each email in this thread is treated as a document by the system. A reply to the original mail also includes the original mail, in which case it becomes a near duplicate of the original mail (depending on the percentage of similarity). And so on. The near dupes need not be limited to emails.

I think this is what's known as shingling. See http://en.wikipedia.org/wiki/W-shingling . Lucene (and therefore Solr) does not implement shingling. The MoreLikeThis query might be close enough, however.

-Stuart
Near Duplicate Documents
Hi,

I am evaluating Solr 1.2 for my project and wanted to know if it can return near-duplicate documents (near dups), and how I would go about it. I am not sure, but is MoreLikeThisHandler the implementation for near dups?

Rishabh
RE: Best way to create multiple indexes
Ryan,

We currently have 8-9 million documents to index, and this number will grow in the future. We will never have a query that searches across groups, but we will certainly have queries that search across sub-groups. With this in mind, we were wondering if we could have multiple indexes at the 'group' level, at least. Also, can multiple indexes be created dynamically? For example, if I create a 'logical group' in my application, an index should be created for that group.

Rishabh

-----Original Message-----
From: Ryan McKinley [mailto:[EMAIL PROTECTED]]
Sent: Monday, November 12, 2007 7:44 PM
To: solr-user@lucene.apache.org
Subject: Re: Best way to create multiple indexes

For starters, do you need to be able to search across groups or sub-groups in one query? If so, then you have to put everything in one index. You can add a field to each document saying what 'group' or 'sub-group' it is in and then limit it at query time:

  q=kittens +group:A

The advantage of splitting it into multiple indexes is that you could put each index on independent hardware. Depending on your queries and index size, that may make a big difference.

ryan

Rishabh Joshi wrote:

> Hi, I have a requirement and was wondering if someone could help me with how to go about it. We have to index about 8-9 million documents, and their size can be anywhere from a few KB to a couple of MB. These documents are categorized into many 'groups' and 'sub-groups'. I wanted to know if we can create multiple indexes based on 'groups' and then on 'sub-groups' in Solr, and if yes, how we go about it. I tried going through the section on 'Collections' in the Solr wiki, but could not make much use of it. Regards, Rishabh Joshi
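For the single-index route Ryan describes, the sub-group restriction can also go into filter queries, which Solr caches independently of the main query; a sketch, with the field names and values invented:

  http://localhost:8983/solr/select?q=kittens&fq=group:A&fq=subgroup:A7

Because each fq clause is cached in the filter cache separately, a query that hits the same group or sub-group again reuses the cached filter rather than re-evaluating it.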