Re: Near Duplicate Documents
Can anyone help me? Rishabh rishabh9 wrote: Hi, I am evaluating Solr 1.2 for my project and wanted to know if it can return near duplicate documents (near dups) and how I would go about it. I am not sure, but is MoreLikeThisHandler the implementation for near dups? Rishabh
Re: Query multiple fields
On Nov 18, 2007 1:50 AM, Dave C. [EMAIL PROTECTED] wrote: Maybe you can help me with this related problem I am having. My query is: q=description:(test)&!(type:10)&!(type:14). However, my results are not as expected (55 results instead of the expected 23). The response header shows: "responseHeader":{"status":0, "QTime":1, "params":{"wt":"json", "!(type:10)":"", "!(type:14)":"", "indent":"on", "q":"description:(test)", "fl":"*"}}. I am confused about why the &!(type:10)&!(type:14) is not in the 'q' parameter. Looks like the first & in your query is being interpreted as a divider in the query string. You probably need to escape every & as %26 in your query. -Stuart
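As an illustration of the escaping Stuart describes, here is a minimal Java sketch; the host, port, and field names are assumptions taken from the example query, not a prescribed setup. URL-encoding the whole q value turns each '&' into %26 so the servlet container no longer splits the query into separate parameters:

    import java.net.URLEncoder;

    public class EscapeQueryParam {
        public static void main(String[] args) throws Exception {
            // The Lucene query as the user intends it, with '&' acting as AND.
            String q = "description:(test) & !(type:10) & !(type:14)";
            // Encode the value so '&' becomes %26 and is no longer treated
            // as a parameter separator by the servlet container.
            String encoded = URLEncoder.encode(q, "UTF-8");
            String url = "http://localhost:8983/solr/select?q=" + encoded
                       + "&wt=json&indent=on&fl=*";
            System.out.println(url);
        }
    }

Running this prints a URL in which only the separators between real parameters (q, wt, indent, fl) remain as literal ampersands.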
Re: Near Duplicate Documents
On Nov 18, 2007 10:50 AM, Eswar K [EMAIL PROTECTED] wrote: We have a scenario where we want to find documents which are similar in content. To elaborate a little more on what we mean here, let's take an example. This email chain we are interacting in can best be used to illustrate the concept of near dupes (we are not confusing this with threads; they are two different things). Each email in this thread is treated as a document by the system. A reply to the original mail also includes the original mail, in which case it becomes a near duplicate of the original mail (depending on the percentage of similarity). Similarly it goes on. The near dupes need not be limited to emails. I think this is what's known as shingling. See http://en.wikipedia.org/wiki/W-shingling Lucene (and therefore Solr) does not implement shingling. The MoreLikeThis query might be close enough, however. -Stuart
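For context on what w-shingling involves, a minimal, illustrative Java sketch (not part of Lucene or Solr) that builds the set of w-shingles for a tokenized document; comparing two documents' shingle sets, for example with Jaccard similarity, gives a near-duplicate score:

    import java.util.*;

    public class Shingler {
        // Build the set of w-shingles: contiguous token windows of length w.
        static Set<String> shingles(List<String> tokens, int w) {
            Set<String> result = new HashSet<String>();
            for (int i = 0; i + w <= tokens.size(); i++) {
                result.add(String.join(" ", tokens.subList(i, i + w)));
            }
            return result;
        }

        // Jaccard similarity: size of the intersection divided by size of the union.
        static double jaccard(Set<String> a, Set<String> b) {
            Set<String> inter = new HashSet<String>(a);
            inter.retainAll(b);
            Set<String> union = new HashSet<String>(a);
            union.addAll(b);
            return union.isEmpty() ? 1.0 : (double) inter.size() / union.size();
        }

        public static void main(String[] args) {
            List<String> doc1 = Arrays.asList("a rose is a rose is a rose".split(" "));
            List<String> doc2 = Arrays.asList("a rose is a rose is a flower".split(" "));
            System.out.println(jaccard(shingles(doc1, 4), shingles(doc2, 4)));
        }
    }

A score close to 1.0 means the two documents share almost all of their shingles, i.e. they are near duplicates.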
Re: Near Duplicate Documents
We have a scenario where we want to find documents which are similar in content. To elaborate a little more on what we mean here, let's take an example. This email chain we are interacting in can best be used to illustrate the concept of near dupes (we are not confusing this with threads; they are two different things). Each email in this thread is treated as a document by the system. A reply to the original mail also includes the original mail, in which case it becomes a near duplicate of the original mail (depending on the percentage of similarity). Similarly it goes on. The near dupes need not be limited to emails. If we want to have such capability using Solr, can we use MoreLikeThisHandler, or is there any other appropriate handler in Solr which we can use? What is the best way of achieving such functionality? Regards, Eswar On Nov 18, 2007 9:06 PM, Ryan McKinley [EMAIL PROTECTED] wrote: I'm not sure I understand your question... A near duplicate document could mean a LOT of things depending on the context. Perhaps you just need fuzzy searching? http://lucene.apache.org/java/docs/queryparsersyntax.html#Fuzzy%20Searches or proximity searches? http://lucene.apache.org/java/docs/queryparsersyntax.html#Proximity%20Searches MoreLikeThisHandler (added in 1.3-dev) may be able to help, but it is used to search for other similar documents based on the results of another query. ryan
Re: Near Duplicate Documents
Is there any plan to implement that feature in the upcoming releases? Regards, Eswar On Nov 18, 2007 9:35 PM, Stuart Sierra [EMAIL PROTECTED] wrote: I think this is what's known as shingling. See http://en.wikipedia.org/wiki/W-shingling Lucene (and therefore Solr) does not implement shingling. The MoreLikeThis query might be close enough, however. -Stuart
Performance of Solr on different Platforms
Hi, I understand that Solr can be used on different Linux flavors. Is there any preferred flavor (like Red Hat, Ubuntu, etc.)? Also, what kind of hardware configuration (processors, RAM, etc.) would be best suited for the install? We expect to load it with millions of documents (varying from 2 to 20 million). There might be around 1000 concurrent users. Your help in this regard will be appreciated. Regards, Eswar
Re: Near Duplicate Documents
Eswar K wrote: If we want to have such capability using Solr, can we use MoreLikeThisHandler, or is there any other appropriate handler in Solr which we can use? What is the best way of achieving such functionality? Mess around with the MoreLikeThisHandler and see if it gives you what you are looking for. Check: http://wiki.apache.org/solr/MoreLikeThis For your example, you would want to make sure that the 'type' field (email) is in the mlt.fl param. Perhaps: mlt.fl=type,content
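To illustrate the mlt.fl suggestion above, a minimal Java sketch of issuing a MoreLikeThis request over HTTP. The handler path (/solr/mlt), the seed document id, and the field names type and content are assumptions for this example; the handler would have to be registered in solrconfig.xml:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.net.URLEncoder;

    public class MoreLikeThisExample {
        public static void main(String[] args) throws Exception {
            // Find documents similar to the one matching the seed query,
            // using the 'type' and 'content' fields as similarity fields.
            String q = URLEncoder.encode("id:original-mail-42", "UTF-8");
            URL url = new URL("http://localhost:8983/solr/mlt"
                    + "?q=" + q
                    + "&mlt.fl=type,content"
                    + "&mlt.mintf=1&mlt.mindf=1"
                    + "&fl=id,score&rows=10");
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(url.openStream(), "UTF-8"));
            for (String line; (line = in.readLine()) != null; ) {
                System.out.println(line);  // raw response listing similar docs
            }
            in.close();
        }
    }

The mlt.mintf and mlt.mindf parameters lower the minimum term and document frequency thresholds so that small documents still produce "interesting" terms.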
Finding all possible synonyms for a word
Hi All, I am new to Lucene / SOLR and developing a POC as part of research. Below are my requirement and problem statement. I need help on how I can index the data so that I have very good search functionality in my POC. -- Requirement: -- Assume my web application is an online book store and it sells all categories of books like Computers, Social Studies, Physical Sciences, etc. Each of these categories has sub-categories. For example, Computers has sub-categories like Software Engineering, Java, SQL Server, etc. I have a database table called Categories and it contains both parent category descriptions and child category descriptions. The data structure of the Category table is:
Category_ID_Primay_Key integer
Parent_Category_ID integer
Category_Name varchar(100)
Category_Description varchar(1000)
-- My Search UI: -- My search page is very simple. We have a text field with a Search button. -- User Action: -- The user enters the search text below in the text field and clicks on the Search button. Books on Data Center -- What is my expected behavior: -- Since the phrase Data Center is most relevant to Computers, I should show books related to Computers. -- My Problem statement and Question to you all: -- To have better search in my web application, what kind of strategy should I have, and how should I index the data accordingly in SOLR/Lucene? In my Lucene index I may or may not have the phrase data center; I should still be able to return results for data center. One thought I have is as follows: modify the Category table by adding one more column to it:
Category_ID_Primay_Key integer
Parent_Category_ID integer
Category_Name varchar(100)
Category_Description varchar(1000)
Category_Description_Keywords varchar(8000)
Now take each word in Category_Description, find synonyms of it, and store that data in the Category_Description_Keywords column. After doing that, index the Category table records in SOLR/Lucene. Below are my questions to you all: Question 1: I need your feedback on the above approach, or any other approach which would help me make my search better and return the most relevant results to the user. Question 2: Can you suggest good Java-based open source or commercial synonym engines? I want a synonym engine that gives me all possible synonyms of a word. Thanks in Advance, Kishore Veleti A.V.K.
RE: Query multiple fields
q=description:(test)&!(type:10)&!(type:14) You can't use an '&' symbol in your query (without escaping it). The boolean operator for 'and' in Lucene is 'AND', and it is case sensitive. Your query should probably look like: q=description:test AND -type:10 AND -type:14 See the Lucene query syntax here: http://lucene.apache.org/java/docs/queryparsersyntax.html#Boolean%20operators Thanks, Stu -Original Message- From: Dave C. [EMAIL PROTECTED] Sent: Sunday, November 18, 2007 1:50am To: solr-user@lucene.apache.org Subject: RE: Query multiple fields Hi Nick, Maybe you can help me with this related problem I am having. My query is: q=description:(test)&!(type:10)&!(type:14). However, my results are not as expected (55 results instead of the expected 23). The response header shows: "responseHeader":{"status":0, "QTime":1, "params":{"wt":"json", "!(type:10)":"", "!(type:14)":"", "indent":"on", "q":"description:(test)", "fl":"*"}}. I am confused about why the &!(type:10)&!(type:14) is not in the 'q' parameter. Any ideas? Thanks, David From: [EMAIL PROTECTED] To: solr-user@lucene.apache.org Subject: RE: Query multiple fields Date: Sun, 18 Nov 2007 03:18:12 + oh, awesome thanks -david Date: Sun, 18 Nov 2007 15:24:00 +1300 From: [EMAIL PROTECTED] To: solr-user@lucene.apache.org Subject: Re: Query multiple fields Hi David You had it right in your example :) description:test AND type:10 But it would probably be wise to wrap any text in parentheses: description:(test foo bar baz) AND type:10 You can find more info on the query syntax here: http://lucene.apache.org/java/docs/queryparsersyntax.html -Nick On 11/18/07, Dave C. [EMAIL PROTECTED] wrote: Hello, I've been trying to figure out how to query multiple fields at a time. For example, I want to do something like: description:test AND type:10. I've tried things like: ?q=description:test&type:10 etc, but I keep getting syntax errors. Can anyone tell me how this can be done? Thanks, David P.S. Perhaps the solution to this could/should be added to the FAQ/tutorial?
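For readers who build queries programmatically rather than through the query parser, the corrected query above maps onto Lucene's Boolean query API roughly as follows; this is an illustrative sketch using stock Lucene classes, with the field names taken from David's example:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.TermQuery;

    public class BuildQuery {
        public static void main(String[] args) {
            // Equivalent of: description:test AND -type:10 AND -type:14
            BooleanQuery query = new BooleanQuery();
            query.add(new TermQuery(new Term("description", "test")),
                      BooleanClause.Occur.MUST);
            query.add(new TermQuery(new Term("type", "10")),
                      BooleanClause.Occur.MUST_NOT);
            query.add(new TermQuery(new Term("type", "14")),
                      BooleanClause.Occur.MUST_NOT);
            // Prints: +description:test -type:10 -type:14
            System.out.println(query);
        }
    }

The MUST_NOT clauses correspond to the leading '-' (or '!') in the parsed query string, which is why the URL form has to survive the trip through the servlet container intact.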
Re: Payloads in Solr
Thanks for your comments, Yonik! All for it... depending on what one means by payload functionality of course. We should probably hold off on adding a new lucene version to Solr until the Payload API has stabilized (it will most likely be changing very soon). It sounds like Lucene 2.3 is going to be released soonish (http://www.nabble.com/How%27s-2.3-doing--tf4802426.html#a13740605). As best I can tell it will include the Payload stuff marked experimental. The new Lucene version will have many improvements besides Payloads which would benefit Solr (examples galore in CHANGES.txt http://svn.apache.org/viewvc/lucene/java/trunk/CHANGES.txt?view=log). So I find it hard to believe that the new release will not be included. I recognize that the experimental status would be worrisome. What will it take to get Payloads to the place that they would be accepted for use in the Solr community? You probably know more about the projected changes to the API than I. Care to fill me in or suggest who I should ask? On the [EMAIL PROTECTED] list Grant Ingersoll suggested that the Payload object would be done away with and the API would just deal with byte arrays directly. That's a lot of data to associate with every token... I wonder how others have accomplished this? One could compress it with a dictionary somewhere. I wonder if one could index special begin_tag and end_tag tokens, and somehow use span queries? I agree that is a lot of data to associate with every token - especially since the data is repetitive in nature. Erik Hatcher suggested I store a representation of the structure of the document in a separate field, store a numeric representation of the mapping of the token to the structure as the payload for each token, and do a lookup at query time based on the numeric mapping in the payload at the position hit to get the structure/context back for the token. I'm also wondering how others have accomplished this. Grant Ingersoll noted that one of the original use cases was XPath queries so I'm particularly interested in finding out if anyone has implemented that, and how. Yes, this will be an issue for many custom tokenizers that don't yet know about payloads but that create tokens. It's not clear what to do in some cases when multiple tokens are created from one... should identical payloads be created for the new tokens... it depends on what the semantics of those payloads are. I suppose that it is only fair to take this on a case by case basis. Maybe we will have to write new TokenFilters for each Tokenizer that uses Payloads (but I sure hope not!). Maybe we can build some optional configuration options into the TokenFilter constructor that guide their behavior with regard to Payloads. Maybe there is something stored in the TokenStream that dictates how the Payloads are handled by the TokenFilters. Maybe there is no case where identical payloads would not be created for new tokens and we can just change the TokenFilter to deal with payloads directly in a uniform way. Tricia
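As a concrete illustration of Erik Hatcher's suggestion quoted above (storing a numeric mapping to the document structure as each token's payload), a minimal sketch against the Lucene 2.3-era Token/Payload API; the encoding scheme and the idea of a per-document structure table are assumptions for this example, not anything Solr ships with:

    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.index.Payload;

    public class StructurePayload {
        // Encode a small structure id (e.g. an index into a per-document
        // list of XML paths stored in a separate field) as a 4-byte payload.
        static byte[] encode(int structureId) {
            return new byte[] {
                (byte) (structureId >>> 24), (byte) (structureId >>> 16),
                (byte) (structureId >>> 8),  (byte) structureId };
        }

        static int decode(byte[] bytes) {
            return ((bytes[0] & 0xff) << 24) | ((bytes[1] & 0xff) << 16)
                 | ((bytes[2] & 0xff) << 8)  |  (bytes[3] & 0xff);
        }

        public static void main(String[] args) {
            Token token = new Token("title", 0, 5);
            token.setPayload(new Payload(encode(7)));  // token came from structure entry 7
            System.out.println(decode(token.getPayload().getData()));  // prints 7
        }
    }

At query time the position hit's payload would be decoded back to the structure id and looked up in the stored structure field, which keeps the per-token payload down to a few bytes even when the structural context itself is large.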
solrj users -- API feedback, suggestions, etc
Hello- Solrj has been out there for a while, but is not yet baked into an official release. If there is anything major to change just so it feels better, now is the time. Here are a few things I'm thinking about: 1. The setFields() behavior. Currently: query.setFields( "name,id" ); generates: fl=name,id while: query.setFields( "name", "id" ); generates: fl=name&fl=id (undefined behavior, it will probably just use 'name') 2. Though maybe it is just because I'm looking at it: request and response are split into two packages, when it seems like the request/response pair should sit next to each other. 3. Interface vs abstract super class? I know interfaces are an OO standard, but I have found they are a pain to maintain across releases (you can't add a function to an interface without breaking existing implementations). Perhaps we should convert the interfaces to abstract super classes where possible. Other thoughts? ryan
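To make point 3 concrete, a small illustrative Java sketch (generic, not actual solrj code, and the type names are made up) of why adding a method to an interface breaks existing implementations, while adding it to an abstract base class with a default does not:

    // Version 1 of a client API shipped as an interface:
    interface ResultHandler {
        String contentType();
    }

    // A user's class compiled against version 1. If version 2 adds a new
    // method to ResultHandler, this class no longer compiles until updated.
    class MyHandler implements ResultHandler {
        public String contentType() { return "text/xml"; }
    }

    // The same API shipped as an abstract base class can grow safely:
    abstract class ResultHandlerBase {
        public abstract String contentType();
        // Added in version 2 with a default; existing subclasses keep compiling.
        public String version() { return "2.0"; }
    }

    class MyHandler2 extends ResultHandlerBase {
        public String contentType() { return "text/xml"; }
    }

    public class ApiEvolutionExample {
        public static void main(String[] args) {
            System.out.println(new MyHandler().contentType());
            System.out.println(new MyHandler2().version());
        }
    }

The trade-off is that an abstract base class uses up the single-inheritance slot of the implementing class, which is why the choice is worth settling before the API is frozen in a release.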
Re: Payloads, Tokenizers, and Filters. Oh My!
I apologize for cross-posting, but I believe both Solr and Lucene users and developers should be concerned with this, and I am not aware of a better way to reach both communities. In this email I'm looking for comments on:
* Do TokenFilters belong in the Solr code base at all?
* How to deal with TokenFilters that add new Tokens to the stream?
* How to patch TokenFilters and Tokenizers using the model of LUCENE-969 in the Solr code base and in Lucene contrib?
Earlier in this thread I identified that at least one TokenFilter is eating Payloads (WordDelimiterFilter). Yonik pointed out: Yes, this will be an issue for many custom tokenizers that don't yet know about payloads but that create tokens. It's not clear what to do in some cases when multiple tokens are created from one... should identical payloads be created for the new tokens... it depends on what the semantics of those payloads are. And I responded: I suppose that it is only fair to take this on a case by case basis. Maybe we will have to write new TokenFilters for each Tokenizer that uses Payloads (but I sure hope not!). Maybe we can build some optional configuration options into the TokenFilter constructor that guide their behavior with regard to Payloads. Maybe there is something stored in the TokenStream that dictates how the Payloads are handled by the TokenFilters. Maybe there is no case where identical payloads would not be created for new tokens and we can just change the TokenFilter to deal with payloads directly in a uniform way. I thought it might be useful to figure out which existing TokenFilters need to know about Payloads. To this end I have taken an inventory of the TokenFilters out there. I think it is fair to categorize them by Add (A), Delete (D), Modify (M), Observe (O):
org.apache.solr.analysis.HyphenatedWordsFilter, DM
org.apache.solr.analysis.KeepWordFilter, D
org.apache.solr.analysis.LengthFilter, D
org.apache.solr.analysis.PatternReplaceFilter, M
org.apache.solr.analysis.PhoneticFilter, AM
org.apache.solr.analysis.RemoveDuplicatesTokenFilter, D
org.apache.solr.analysis.SynonymFilter, ADM
org.apache.solr.analysis.TrimFilter, M
org.apache.solr.analysis.WordDelimiterFilter, AM
org.apache.lucene.analysis.CachingTokenFilter, O
org.apache.lucene.analysis.ISOLatin1AccentFilter, M
org.apache.lucene.analysis.LengthFilter, D
org.apache.lucene.analysis.LowerCaseFilter, M
org.apache.lucene.analysis.PorterStemFilter, M
org.apache.lucene.analysis.StopFilter, D
org.apache.lucene.analysis.standard.StandardFilter, M
org.apache.lucene.analysis.br.BrazilianStemFilter, M
org.apache.lucene.analysis.cn.ChineseFilter, D
org.apache.lucene.analysis.de.GermanStemFilter, M
org.apache.lucene.analysis.el.GreekLowerCaseFilter, M
org.apache.lucene.analysis.fr.ElisionFilter, M
org.apache.lucene.analysis.fr.FrenchStemFilter, M
org.apache.lucene.analysis.ngram.EdgeNGramTokenFilter, AM
org.apache.lucene.analysis.ngram.NGramTokenFilter, AM
org.apache.lucene.analysis.nl.DutchStemFilter, M
org.apache.lucene.analysis.ru.RussianLowerCaseFilter, M
org.apache.lucene.analysis.ru.RussianStemFilter, M
org.apache.lucene.analysis.th.ThaiWordFilter, AM
org.apache.lucene.analysis.snowball.SnowballFilter, M
Some characteristics of Add (A), Delete (D), Modify (M), Observe (O): Add: new Token() and a buffer of Tokens to consider before addressing input.next(). Delete: a loop ignoring tokens based on some criteria. Modify: new Token(), or use of Token set methods. Observe: rare; CachingTokenFilter. The categories of TokenFilters that are affected by Payloads are add and modify. The default behavior of TokenFilters which only delete or observe is to return the Token fed through intact, hence the Payload will remain intact. Maybe the Lucene community has thought about this problem? I noticed that the org.apache.lucene.analysis TokenFilters in the modify category (there are none in the add category) refrain from using new Token(). That led me to the comment in the JavaDocs: NOTE: As of 2.3, Token stores the term text internally as a malleable char[] termBuffer instead of String termText. The indexing code and core tokenizers have been changed to re-use a single Token instance, changing its buffer and other fields in-place as the Token is processed. This provides substantially better indexing performance as it saves the GC cost of new'ing a Token and String for every term. The APIs that accept String termText are still available, but a warning about the associated performance cost has been added (below). The termText() method (http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/analysis/Token.html#termText%28%29) has been deprecated. Tokenizers and filters should try to re-use a Token instance when possible for best performance, by implementing the TokenStream.next(Token)
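To illustrate the kind of fix being discussed for the modify-style filters, a minimal sketch of a TokenFilter written against the Lucene 2.3-era TokenStream.next(Token) API; because it changes the term buffer in place instead of calling new Token(), whatever payload the upstream tokenizer attached is carried through untouched. This is an illustrative example, not one of the filters inventoried above:

    import java.io.IOException;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;

    public class UpperCasePreservingFilter extends TokenFilter {
        public UpperCasePreservingFilter(TokenStream input) {
            super(input);
        }

        // Re-use the Token handed in by the consumer; modifying the term
        // buffer in place leaves the payload, offsets, and flags intact.
        public Token next(Token reusableToken) throws IOException {
            Token token = input.next(reusableToken);
            if (token == null) return null;
            char[] buffer = token.termBuffer();
            for (int i = 0; i < token.termLength(); i++) {
                buffer[i] = Character.toUpperCase(buffer[i]);
            }
            return token;
        }
    }

Filters in the add category cannot rely on this trick, since a brand-new Token has no payload until the filter decides what, if anything, to copy onto it, which is exactly the open semantic question raised above.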
Re: Query multiple fields
On Nov 18, 2007 9:58 PM, Dave C. [EMAIL PROTECTED] wrote: According to the Lucene query syntax: The symbol && can be used in place of the word AND. So, I shouldn't have to use 'AND'. Yes, but before the query parser can even get the query string, the servlet container parses query args and & is a delimiter. Hence you need to escape '&' for the sake of the servlet container. If I do the same query: q=description:(test)&!(type:10)&!(type:14) in the Solr admin interface, I get the correct results. Right, because the browser knows to escape '&' for you. You need to escape '&' as %26, since that is how URL escaping works (it has nothing to do with lucene syntax.) -Yonik
RE: Query multiple fields
okay, thanks for the details - David Date: Sun, 18 Nov 2007 22:14:23 -0500 From: [EMAIL PROTECTED] To: solr-user@lucene.apache.org Subject: Re: Query multiple fields On Nov 18, 2007 9:58 PM, Dave C. [EMAIL PROTECTED] wrote: According to the Lucene query syntax: The symbol && can be used in place of the word AND. So, I shouldn't have to use 'AND'. Yes, but before the query parser can even get the query string, the servlet container parses query args and & is a delimiter. Hence you need to escape '&' for the sake of the servlet container. If I do the same query: q=description:(test)&!(type:10)&!(type:14) in the Solr admin interface, I get the correct results. Right, because the browser knows to escape '&' for you. You need to escape '&' as %26, since that is how URL escaping works (it has nothing to do with lucene syntax.) -Yonik
Re: Near Duplicate Documents
On 18-Nov-07, at 8:17 AM, Eswar K wrote: Is there any plan to implement that feature in the upcoming releases? Not currently. Feel free to contribute something if you find a good solution <g>. -Mike On Nov 18, 2007 9:35 PM, Stuart Sierra [EMAIL PROTECTED] wrote: I think this is what's known as shingling. See http://en.wikipedia.org/wiki/W-shingling Lucene (and therefore Solr) does not implement shingling. The MoreLikeThis query might be close enough, however. -Stuart
Re: Finding all possible synonyms for a word
Kishore, Solr has a SynonymFilterFactory which might be of use to you (http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#head-2c461ac74b4ddd82e453dc68fcfc92da77358d46). Regards, Eswar On Nov 18, 2007 10:39 PM, Kishore AVK. Veleti [EMAIL PROTECTED] wrote: Hi All, I am new to Lucene / SOLR and developing a POC as part of research.
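To make the pointer above concrete, a sketch of how SynonymFilterFactory is typically wired in, in the same inline config style used elsewhere in this digest; the file name and synonym entries are hypothetical examples for the data center case, not anything that ships with Solr:

    # synonyms.txt -- comma-separated terms on one line are treated as equivalent
    data center, datacenter, server farm

    <!-- in the <analyzer> of the field type used for Category_Description (schema.xml) -->
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>

With expand="true" applied at index time, the synonym terms are written into the index alongside the original text, so a query for data center can match a category description that only says datacenter; this may remove the need for the separate Category_Description_Keywords column proposed in the original question.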
Re: multiple delete by id in one delete command?
The easiest solution I know is: <delete><query>id:1 OR id:2 OR ...</query></delete> If you know that all of these ids can be found by issuing a query, you can do delete by query: <delete><query>YOUR_DELETE_QUERY_HERE</query></delete> Cheers On Nov 19, 2007 4:18 PM, Norberto Meijome [EMAIL PROTECTED] wrote: Hi everyone, I'm trying to issue, via curl to SOLR (testing at the moment), 3 deletes by id. I tried sending: <delete><id>1</id><id>2</id><id>3</id></delete> and solr didn't like it at all. When I changed it to: <delete><id>1</id></delete><delete><id>2</id></delete><delete><id>3</id></delete> as in: curl http://localhost:8983/vcs/update -H "Content-Type: text/xml" --data-binary '<delete><id>816bc47fd52ffb9c6059e6975eafa168949d51dfa93dbe3c1eca169edd19b3</id></delete><delete><id>53f3f80e65482a5be353e7110f5308949d51dfa93dbe3c1eca169edd19b3</id></delete>' only the 1st (id = 1, or id = 816bc47fd52ffb9c6059e6975eafa168949d51dfa93dbe3c1eca169edd19b3) gets deleted (after a commit, of course). So I figure I will have to issue a series of independent <delete><id>xxx</id></delete> commands. Is it not possible to bunch them all together as it's possible with <add><doc>...</doc><doc>...</doc></add>? thanks!! Beto -- Regards, Cuong Hoang
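For completeness, a minimal Java sketch of sending the delete commands discussed above to the update handler; the URL follows Norberto's /vcs/update example, and since accepting several <delete> roots in one POST is exactly the open question in this thread, the sketch sends them one request at a time:

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class DeleteByIds {
        static void post(String xml) throws Exception {
            HttpURLConnection con = (HttpURLConnection)
                    new URL("http://localhost:8983/vcs/update").openConnection();
            con.setDoOutput(true);
            con.setRequestMethod("POST");
            con.setRequestProperty("Content-Type", "text/xml");
            OutputStream out = con.getOutputStream();
            out.write(xml.getBytes("UTF-8"));
            out.close();
            System.out.println(xml + " -> HTTP " + con.getResponseCode());
        }

        public static void main(String[] args) throws Exception {
            String[] ids = { "1", "2", "3" };
            for (String id : ids) {
                post("<delete><id>" + id + "</id></delete>");  // one delete per request
            }
            post("<commit/>");  // make the deletes visible to searchers
        }
    }

Alternatively, the delete-by-query form quoted at the top of this message (<delete><query>id:(1 OR 2 OR 3)</query></delete>) removes all three in a single request.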
RE: I18N with SOLR?
Hello, Does SOLR support searching for a keyword which is a combination of more than one language within the same search page? -Original Message- From: Guglielmo Celata [mailto:[EMAIL PROTECTED] Sent: Thursday, November 15, 2007 7:39 PM To: solr-user@lucene.apache.org; [EMAIL PROTECTED] Subject: Re: I18N with SOLR? Hi Dillip, don't know if this helps, but I have set up a TextIt field in the config/schema.xml file, in order to index Italian text. It works pretty well with non-ascii characters (we do have some accented vowels, even if not as many as the French). It also works with stopwords (and I assume with protwords as well, though I didn't try). I created an italian-stopwords.txt file in the config/ path. I think the SnowballPorterFilterFactory is a default usable class in Solr, although I remember having read it's a bit slower than other libraries. But I am no expert.
<fieldtype name="textIt" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="italian-stopwords.txt" ignoreCase="true"/>
    <filter class="solr.SnowballPorterFilterFactory" language="Italian"/>
  </analyzer>
</fieldtype>
On 15/11/2007, Dilip.TS [EMAIL PROTECTED] wrote: Hi Ed, Thanks for the help, but I have some queries. I understand that we need to have stopwords_french.txt and protwords_french.txt files, say for French, in the solr/conf directory. Is it that we need to write classes like FrenchStopFilterFactory and FrenchPorterFilterFactory for each language, or are these classes built into Solr? I didn't find them in the SOLR/Lucene APIs. I found some classes like org.apache.lucene.analysis.fr.FrenchAnalyzer etc. in lucene-analyzers.jar. Any idea what this class is used for? Thanks in advance, Regards Dilip -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Ed Summers Sent: Monday, November 12, 2007 7:00 PM To: solr-user@lucene.apache.org ; [EMAIL PROTECTED] Subject: Re: I18N with SOLR? I'd say yes. Solr supports Unicode and ships with language specific analyzers, and allows you to provide your own custom analyzers if you need them. This allows you to create different fieldType definitions for the languages you want to support. For example, here is an example field type for French text which uses a French stopword list and French stemming.
<fieldType name="text_french" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.FrenchStopFilterFactory" ignoreCase="true" words="stopwords_french.txt"/>
    <filter class="solr.FrenchPorterFilterFactory" protected="protwords_french.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
Then you can create dynamicField definitions that allow you to index and query your documents using the correct field type: <dynamicField name="*_french" type="text_french" indexed="true" stored="true"/> This means that when you index you need to know what language your data is in so that you know what field names to use in your document (e.g. title_french). And at search time you need to know what language you are in so you know which fields to search. Most user interfaces are in a single language context, so from the query perspective you'll most likely know the language they want to search in. If you don't know the language context in either case, you could try to guess using something like org.apache.nutch.analysis.lang.LanguageIdentifier. I hope this helps. We used this technique (without the guessing) quite effectively at the Library of Congress recently for a prototype application that needed to provide search functionality in 7 different languages. //Ed On Nov 12, 2007 1:56 AM, Dilip.TS [EMAIL PROTECTED] wrote: Hello, Does SOLR support I18N (with multiple language support)? Thanks in advance. Regards, Dilip TS
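As a small illustration of the per-language field approach described above, assuming the *_french dynamicField from Ed's example, an indexed document and a query might look like this; the document content, id, and URL are made-up example data, not from the thread:

    <add>
      <doc>
        <field name="id">book-1</field>
        <field name="title_french">Histoire de la Révolution française</field>
      </doc>
    </add>

    Query, when the UI knows the user is searching in French:
    http://localhost:8983/solr/select?q=title_french:histoire

Because title_french maps to the text_french field type, whatever analysis is configured for that type (stopwords, stemming) is applied to both the indexed text and the query term.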
RE: I18N with SOLR?
Hello, Also, can we have something like this? i.e. having multiple defaultSearchField entries in the schema.xml while searching for a keyword which is a combination of more than one language: <defaultSearchField>text</defaultSearchField> <defaultSearchField>text_french</defaultSearchField>... -Original Message- From: Dilip.TS [mailto:[EMAIL PROTECTED] Sent: Monday, November 19, 2007 11:29 AM To: solr-user@lucene.apache.org Subject: RE: I18N with SOLR? Hello, Does SOLR support searching for a keyword which is a combination of more than one language within the same search page?
Re: multiple delete by id in one delete command?
On Mon, 19 Nov 2007 16:53:17 +1100 climbingrose [EMAIL PROTECTED] wrote: The easiest solution I know is: <delete><query>id:1 OR id:2 OR ...</query></delete> If you know that all of these ids can be found by issuing a query, you can do delete by query: <delete><query>YOUR_DELETE_QUERY_HERE</query></delete> Thanks, so I'm not going nuts (at least not due to this :) ). I may just change the way I handle deletes... thanks, B