CommonGrams phrase query
Hi, I have made an index using CommonGrams. Now when I query "a b" and explain it, Solr turns it into +MultiPhraseQuery(Contents:"(a a_b) b"). Shouldn't it just be searching a_b? I am asking because even though I am using CommonGrams it's much slower than a normal index which just searches on "a b". Note: Both words are in the words list of CommonGrams. -- Regards, Salman Akram
Re: spell suggest response
Hi Grijesh, Though I use autosuggest I may not get the exact results; the order is not accurate. For example, if I type http://localhost:8080/solr/terms/?terms.fl=spell&terms.prefix=solr&terms.sort=index&terms.lower=solr&terms.upper.incl=true I get results like: solr solr.amp solr.datefield solr.p solr.pdf. But this may not lead to results as accurate as we get in spellchecking. I require suggestions for any word irrespective of whether it is correct or not. Is there anything to be changed in Solr to get suggestions like the ones we get when we type a wrong word in spellchecking? If so please let me know... Regards, satya
Re: spell suggest response
Hi Satya, In this example you are not using spellchecking. I am saying: use the spellcheck component together with the Terms component, so it will give you the spellcheck suggestions also. Then combine both the lists. - Thanx: Grijesh
Re: CommonGrams phrase query
Ok, sorry, it was my fault. I wasn't using CommonGramsQueryFilter for the query, I just had the Filter for indexing. The query seems fine now. On Mon, Jan 17, 2011 at 1:44 PM, Salman Akram salman.ak...@northbaysolutions.net wrote: [...] -- Regards, Salman Akram
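For reference, the fix above pairs a CommonGramsFilterFactory on the index side with a CommonGramsQueryFilterFactory on the query side. A minimal sketch (the field type name and words file name are illustrative, not from the thread):

<fieldType name="text_cg" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- emits unigrams plus common-word bigrams such as a_b -->
    <filter class="solr.CommonGramsFilterFactory" words="commonwords.txt" ignoreCase="true"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- emits only the bigram where both words are in the words list, so a phrase query on "a b" becomes a single a_b term -->
    <filter class="solr.CommonGramsQueryFilterFactory" words="commonwords.txt" ignoreCase="true"/>
  </analyzer>
</fieldType>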
sort problem
Hi guys, I use Solr with the UTF-8 charset and I have a sort problem. For example, I sort on a name field and the results look like: Article Banana Foo aviation brunch ... So my question is, how do I force Solr to ignore case in results? I would like to have results like: Article aviation Banana brunch Foo ... Thanks Philippe
Re: sort problem
Use a lowercase filter to lowercase your data at both index time and search time; that will make it case insensitive. - Thanx: Grijesh
Re: spell suggest response
Hi Grijesh, I added both the TermsComponent and the spellcheck component to the terms requesthandler. When I send a query like

http://localhost:8080/solr/terms?terms.fl=text&terms.prefix=java&rows=7&omitHeader=true&spellcheck=true&spellcheck.q=java&spellcheck.count=20

the result I get is

<response>
  <lst name="terms">
    <lst name="text">
      <int name="java">6</int>
      <int name="javabas">6</int>
      <int name="javas">6</int>
      <int name="javascript">6</int>
      <int name="javac">6</int>
      <int name="javax">6</int>
    </lst>
  </lst>
  <lst name="spellcheck">
    <lst name="suggestions"/>
  </lst>
</response>

When I send this

http://localhost:8080/solr/terms?terms.fl=text&terms.prefix=jawa&rows=5&omitHeader=true&spellcheck=true&spellcheck.q=jawa&spellcheck.count=20

I get the result

<response>
  <lst name="terms">
    <lst name="text"/>
  </lst>
  <lst name="spellcheck">
    <lst name="suggestions">
      <lst name="jawa">
        <int name="numFound">20</int>
        <int name="startOffset">0</int>
        <int name="endOffset">4</int>
        <arr name="suggestion">
          <str>java</str>
          <str>away</str>
          <str>jav</str>
          <str>jar</str>
          <str>ara</str>
          <str>apa</str>
          <str>ana</str>
          <str>ajax</str>
          ...
        </arr>
      </lst>
    </lst>
  </lst>
</response>

Now I need to know how to control the ordering of the terms. In the first query the result is in order, and I want only javax, javac, javascript but not javas, javabas. How can it be done?? Regards, satya
Re: sort problem
On 17/01/11 10:32, Grijesh wrote: [...] Thanks, so tell me if I'm wrong... I need to modify my schema.xml to add a lowercase filter and reindex my content?
Re: sort problem
Yes. On Mon, Jan 17, 2011 at 2:44 PM, Philippe VINCENT-ROYOL vincent.ro...@gmail.com wrote: [...] -- Regards, Salman Akram
latest patches and big picture of search grouping
I need to dive into search grouping / field collapsing again. I've seen there are lots of issues about it now. Can someone point me to the minimum patches I need to run this feature in trunk? I want to see the code of the most optimised version and what's being done in distributed search. I think I need these: https://issues.apache.org/jira/browse/SOLR-2068 https://issues.apache.org/jira/browse/SOLR-2205 https://issues.apache.org/jira/browse/SOLR-2066 But I'm not sure if I am missing anything else. By the way, I think the current implementation of group searching is totally different from what it was before, when you could choose normal or adjacent collapse. Can someone give me a quick big picture of the current implementation (I will trace the code anyway, but it's just to get an idea)? Is there still a double trip? Thanks in advance.
Re: exception obtaining write lock on startup
In that case why is there a separate lock factory, SingleInstanceLockFactory? On Fri, Dec 31, 2010 at 6:25 AM, Lance Norskog goks...@gmail.com wrote: This will not work. At all. You can only have one Solr core instance changing an index. On Thu, Dec 30, 2010 at 4:38 PM, Tri Nguyen tringuye...@yahoo.com wrote: Hi, I'm getting this exception when I have 2 cores as masters. It seems like one of the cores obtains a lock (file) and then the other tries to obtain the same one. However, the first one is not deleted. How do I fix this?

Dec 30, 2010 4:34:48 PM org.apache.solr.handler.ReplicationHandler inform
WARNING: Unable to get IndexCommit on startup
org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: NativeFSLock@..\webapps\solr\tnsolr\data\index\lucene-fe3fc928a4bbfeb55082e49b32a70c10-write.lock
  at org.apache.lucene.store.Lock.obtain(Lock.java:85)
  at org.apache.lucene.index.IndexWriter.init(IndexWriter.java:1565)
  at org.apache.lucene.index.IndexWriter.init(IndexWriter.java:1421)
  at org.apache.solr.update.SolrIndexWriter.init(SolrIndexWriter.java:191)
  at org.apache.solr.update.UpdateHandler.createMainIndexWriter(UpdateHandler.java:98)
  at org.apache.solr.update.DirectUpdateHandler2.openWriter(DirectUpdateHandler2.java:173)
  at org.apache.solr.update.DirectUpdateHandler2.forceOpenWriter(DirectUpdateHandler2.java:376)
  at org.apache.solr.handler.ReplicationHandler.inform(ReplicationHandler. [...]

Tri -- Lance Norskog goks...@gmail.com
Re: Single value vs multi value setting in tokenized field
No, I have both: a single field (for free-form text search) and individual fields (for directed search). I already duplicate the data and that's not a problem; disk space is cheap. What I wanted to know was whether it is best to make the single field multiValued=true or not. That is, should my 'content' field hold multiple values like:

  some description maybe a paragraph or two
  a product or service title
  tag1 tag2
  feature1 feature2

or would it be better to make it a concatenated, single-value field like:

  some description maybe a paragraph or two a product or service title tag1 tag2 feature1 feature2

My indexing seems to take longer than most; it takes about 2 1/2 hours to index 3.5 million records. I have a colleague who, in a separate project, is indexing 70 million records in about 4 hours, albeit with a much simpler schema. So I'm trying to see if this could be a factor in my indexing performance. I also wanted to know what impact, in general, not just in this situation, using a multiValued field versus a single-valued field has on search results. I would have thought that having to support a free-form text search and a field (directed) search would be a common problem, and was just looking for advice.
solrconfig.xml settings question
In the Wiki, in the book by Smiley and Pugh, and in the comments inside the solrconfig.xml file itself, the various settings are always discussed in the context of a blended-use Solr index. By that I mean it is assumed you are indexing and querying from the same Solr instance. However, if I have a master-slave setup I should be able to optimize the master for indexing data, and optimize the slave for querying the data. Does anyone have links to information that talks about this? I want to index as furiously as possible into one Solr instance without regard to the impact it will have on queries, and to query on another Solr instance that only has to worry about replication, but not constant add/update/delete/commit activity. I want my solrconfig settings to be as optimal as possible. Links, comments, references to previous forum threads, any and all feedback is appreciated. Thanks, Ken
Re: boilerpipe solr tika howto please
Thanks Ken, this is what I wanted to know. I'm not very familiar with this kind of modification; however, I will try to do it and ask you for information in case of need. regards, Arno On 14.01.2011 18:04, Ken Krugler wrote: Hi Arno, On Jan 14, 2011, at 3:57am, arnaud gaudinat wrote: Hello, I would like to use BoilerPipe (a very good program which cleans HTML content of surplus clutter). I saw that BoilerPipe is inside Tika 0.8 and so should be accessible from Solr, am I right? How can I activate BoilerPipe in Solr? Do I need to change solrconfig.xml (with org.apache.solr.handler.extraction.ExtractingRequestHandler)? Or do I need to modify some code inside Solr? I saw something like TikaCLI -F in the Tika forum (http://www.lucidimagination.com/search/document/242ce3a17f30f466/boilerpipe_integration) - is it the right way? You need to add the BoilerpipeContentHandler into Tika's content handler chain. Which I'm pretty sure means you'd need to modify Solr, e.g. (in trunk) the TikaEntityProcessor.getHtmlHandler() method. I'd try something like: return new BoilerpipeContentHandler(new ContentHandlerDecorator( [...] Though from a quick look at that code, I'm curious why it doesn't use BodyContentHandler, versus the current ContentHandlerDecorator. -- Ken Krugler http://bixolabs.com
Re: solrconfig.xml settings question
[...] Besides the caches described here http://search-lucene.com/m/DBdghoZPh01 , ramBufferSizeMB can be different on slave and master.
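To make that concrete, a sketch of how the two solrconfig.xml files might diverge (these are standard Solr settings; the values are illustrative, not recommendations).

On the master (indexing-heavy), spend memory on the indexing buffer:

<indexDefaults>
  <ramBufferSizeMB>256</ramBufferSizeMB>
  <mergeFactor>20</mergeFactor>
</indexDefaults>

On the slave (query-heavy), keep ramBufferSizeMB small and spend the memory on caches instead:

<query>
  <filterCache class="solr.FastLRUCache" size="16384" initialSize="4096" autowarmCount="1024"/>
  <documentCache class="solr.LRUCache" size="16384" initialSize="4096"/>
  <useColdSearcher>false</useColdSearcher>
</query>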
Clustering using Carrot2 clustering component
Dear All, Can anyone tell me how to use the Carrot2 clustering component to cluster search results? What are its dependencies? What kind of changes are required in solrconfig.xml or anywhere else? Thanks! Isha
FilterQuery reaching maxBooleanClauses, alternatives?
Hi List, we are sometimes reaching the maxBooleanClauses limit (which is 1024 per default). The query we use looks like: ?q=name:Stefan&fq=5 10 12 15 16 [...] where the values are IDs of users which the current user is allowed to see - so far, nothing special. Sometimes the filter query includes user IDs from a different type of user (let's say we have TypeA and TypeB) where TypeB contains more than 2k users. Then we hit the given limit. Now the question is: is it possible to enable a filter/function/feature in Solr which makes it possible that we don't need to send over all the user IDs of TypeB users? Just to tell Solr to include all TypeB users in the (given) filter query (or something in that direction)? If so, what's the name of this filter/function/feature? :) Don't hesitate to ask if my question/description is weird! Thanks Stefan
RE: sort problem
Haha, yes, you're not wrong. The field you are sorting on should be a fieldtype that has the lowercase filter applied. You'll probably have to re-index your data, unless you happen to already have such a field (via copyField, perhaps). Brad -----Original Message----- From: Salman Akram Sent: January-17-11 5:47 AM To: solr-user@lucene.apache.org Subject: Re: sort problem [...]
Re: Single value vs multi value setting in tokenized field
Functionally, the two options are equivalent, and I've never really heard of any speed difference. Assuming it's not that big a programming change, though, you probably want to just test... Do be aware of one subtle difference in the approaches, though. If the increment gap is != 1 then multiValued fields will NOT be functionally equivalent, because phrases won't match across boundaries quite the same way. Which is often desirable behavior but may not be in your situation. Best Erick On Mon, Jan 17, 2011 at 5:50 AM, kenf_nc ken.fos...@realestate.com wrote: [...]
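To make the subtle difference concrete, a sketch (the field and type names are illustrative): with positionIncrementGap=100, the last token of one value and the first token of the next value of a multiValued field are 100 positions apart, so a phrase query spanning two values won't match, while it would match the concatenated single-value variant.

<fieldType name="text_gap" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>
<!-- with values ["tag1 tag2", "feature1 feature2"], the phrase query "tag2 feature1" will not match -->
<field name="content" type="text_gap" indexed="true" stored="true" multiValued="true"/>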
Re: FilterQuery reaching maxBooleanClauses, alternatives?
You can index a field which holds the user type, e.g. UserType (possible values can be TypeA, TypeB and so on...), and then you can just do ?q=name:Stefan&fq=UserType:TypeB BTW you can even increase the size of maxBooleanClauses, but in this case that is definitely not a good idea. You would also hit the max size of an HTTP GET, so you would have to change it to POST. Better to handle it with a new field. On Mon, Jan 17, 2011 at 5:57 PM, Stefan Matheis matheis.ste...@googlemail.com wrote: [...] -- Regards, Salman Akram
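For completeness, a sketch of the extra field (the name is the one from the example query above; the string type keeps the value unanalyzed):

<field name="UserType" type="string" indexed="true" stored="false"/>

A nice side effect of this design: the filter stays constant per user type, so Solr can reuse the cached filter from its filterCache across requests instead of re-evaluating thousands of ID clauses.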
Re: sort problem
Note two things: 1) the lowercase filter is NOT applied to the STORED data, so the display will still have the original case although the sorting should be what you want. 2) you should NOT be sorting on a tokenized field. Use something like KeywordTokenizer followed by the lowercase filter. String types don't go through filters, as I remember. Best Erick On Mon, Jan 17, 2011 at 7:57 AM, Brad Dewar bde...@stfx.ca wrote: [...]
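A sketch of such a dedicated sort field (names are illustrative; the original name field stays as-is for display and copyField feeds the sort-only variant):

<fieldType name="alphaOnlySort" class="solr.TextField" sortMissingLast="true" omitNorms="true">
  <analyzer>
    <!-- keep the whole value as a single token, then lowercase it -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
<field name="name_sort" type="alphaOnlySort" indexed="true" stored="false"/>
<copyField source="name" dest="name_sort"/>

Sorting then uses &sort=name_sort asc while displaying the stored, original-case name field.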
Re: FilterQuery reaching maxBooleanClauses, alternatives?
Thanks Salman, talking with others about problems really helps. Adding another filter query is a bit too much - but combining both is working fine! Couldn't see the wood for the trees =) Thanks, Stefan On Mon, Jan 17, 2011 at 2:07 PM, Salman Akram salman.ak...@northbaysolutions.net wrote: [...]
Re: FilterQuery reaching maxBooleanClauses, alternatives?
You are welcome. By new field I meant: if you don't have a field for UserType already. On Mon, Jan 17, 2011 at 6:22 PM, Stefan Matheis matheis.ste...@googlemail.com wrote: [...] -- Regards, Salman Akram
Re: Tika Update, no Data
Hey! Thanks a lot, nice tip.. works fine.. But I have one problem too: indexing ZIPs. I tried: curl "http://192.168.105.66:8983/solr/update/extract?literal.id=zip&uprefix=attr_&commit=true" -F myfile@constellio_standalone-1.0.zip and I get: Warning: Illegally formatted input field! curl: option -F: is badly used here curl: try 'curl --help' or 'curl --manual' for more information service@joa-Desktop:~/Downloads$ Maybe you have an idea?
Re: Tika Update, no Data
Missing the = char between myfile and @filename.ext? On Mon, Jan 17, 2011 at 2:47 PM, Jörg Agatz joerg.ag...@googlemail.com wrote: [...]
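For the record, the corrected command (same values as in the thread, only the = added and the URL quoted so the shell keeps the & characters):

curl "http://192.168.105.66:8983/solr/update/extract?literal.id=zip&uprefix=attr_&commit=true" -F "myfile=@constellio_standalone-1.0.zip"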
Re: Tika Update, no Data
Ohh, you're right.. embarrassing! I have tried it and it works, but it seems it doesn't work perfectly: the txt documents inside the ZIP are not indexed, only the names of the documents inside the zip.. King
CommonGrams and SOLR-1604
Hi, I am trying to use CommonGrams with the SOLR-1604 patch but it doesn't seem to work. If I don't add {!complexphrase} it uses CommonGramsQueryFilterFactory and proper bi-grams are made, but of course it doesn't use this patch. If I add {!complexphrase} it simply does it the old way, i.e. ignores CommonGrams. Does anyone know how to combine both these features? Also, once they are combined (hopefully they will be), would phrase proximity search work fine? Thanks -- Regards, Salman Akram
resetting the statistics
Hi everybody, Is it possible to reset Solr statistics without restarting Solr or reloading cores? According to the thread here http://osdir.com/ml/solr-user.lucene.apache.org/2010-03/msg01078.html this was not possible in March 2010. I am wondering if something like this has been implemented in the meanwhile. Thanks, roxana
spellchecking even when the keyword is correct....
Hi All, can we get spellchecking results even when the keyword is spelled correctly? Spellchecking gives suggestions only for wrong keywords; can't we get similar and near words of the keyword even though spellcheck.q is correct.. As an example,

http://localhost:8080/solr/spellcheck?q=java&spellcheck=true&spellcheck.count=5

the result will be

1)
<response>
  <lst name="spellcheck">
    <lst name="suggestions"/>
  </lst>
</response>

Can we get the result as

2)
<response>
  <lst name="spellcheck">
    <lst name="suggestions">
      <str>javax</str>
      <str>javac</str>
      <str>javabean</str>
      <str>javascript</str>
    </lst>
  </lst>
</response>

NOTE: all the keywords in the 2nd result are in the index... Regards, satya
partitioning documents with fields
Hi, I'm crawling different intranets, so I developed a Nutch plugin to add a static field for each of these crawls. I now have my documents in Solr with their specific crawl field. If I search within Solr I can see my documents being returned with that field. The field definition in the schema is:

<field name="crawl" type="string" stored="true" indexed="true"/>

I'd like to put a checkbox in my websearch app to choose which partition to search in. So I thought I'd implement it by simply using:

/select?indent=on&version=2.2&q=crawl%3Avalue+AND+query

but nothing is returned. I also just tried crawl:value, which I'd expect to return all the documents from that crawl, but no results are sent back. As the field is indexed and stored, and I can see the documents owning that field in normal query results, what could I be missing? -- Claudio Martella, TIS innovation park, claudio.marte...@tis.bz.it
Re: partitioning documents with fields
String fields are unanalyzed, so case matters. Are you sure you're not using a different case? (Try KeywordTokenizer + LowercaseFilter if you want these normalized to, say, lower case.) If that isn't the problem, could we see the results if you add &debugQuery=on to your URL? That often helps diagnose the problem. Take a look at your solr/admin page, schema browser, to examine the actual contents of the crawl field and see if they're really what you expect. Best Erick On Mon, Jan 17, 2011 at 11:59 AM, Claudio Martella claudio.marte...@tis.bz.it wrote: [...]
Re: partitioning documents with fields
Thanks for your answer. Yes, the schema browser shows that the field contains the right values, as I expect. From debugQuery=on I see there must be some problem though:

<str name="rawquerystring">crawl:DIGITALDATA</str>
<str name="querystring">crawl:DIGITALDATA</str>
<str name="parsedquery">+DisjunctionMaxQuery((contentEN:crawl (digitaldata crawldigitaldata)^0.8 | title:crawl (digitaldata crawldigitaldata)^1.2 | url:crawl digitaldata^1.5 | contentDE:crawl (digitaldata crawldigitaldata)^0.8 | contentIT:crawl (digitald crawldigitald)^0.8 | anchor:crawl:DIGITALDATA^1.5)~0.1) DisjunctionMaxQuery((contentEN:crawl (digitaldata crawldigitaldata)^0.8 | title:crawl (digitaldata crawldigitaldata)^1.2 | url:crawl digitaldata^1.5 | contentDE:crawl (digitaldata crawldigitaldata)^0.8 | contentIT:crawl (digitald crawldigitald)^0.8 | anchor:crawl:DIGITALDATA^1.5)~0.1)</str>
<str name="parsedquery_toString">+(contentEN:crawl (digitaldata crawldigitaldata)^0.8 | title:crawl (digitaldata crawldigitaldata)^1.2 | url:crawl digitaldata^1.5 | contentDE:crawl (digitaldata crawldigitaldata)^0.8 | contentIT:crawl (digitald crawldigitald)^0.8 | anchor:crawl:DIGITALDATA^1.5)~0.1 (contentEN:crawl (digitaldata crawldigitaldata)^0.8 | title:crawl (digitaldata crawldigitaldata)^1.2 | url:crawl digitaldata^1.5 | contentDE:crawl (digitaldata crawldigitaldata)^0.8 | contentIT:crawl (digitald crawldigitald)^0.8 | anchor:crawl:DIGITALDATA^1.5)~0.1</str>

It looks like there's some problem with my dismax query handler. It doesn't recognize the query in the colon (fielded) format. Here's the handler definition:

<requestHandler name="/content" class="solr.SearchHandler" default="true">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="pf">title^1.2 anchor^1.5 url^1.5 contentEN^0.8 contentIT^0.8 contentDE^0.8</str>
    <str name="qf">title^1.2 anchor^1.5 url^1.5 contentEN^0.8 contentIT^0.8 contentDE^0.8</str>
    <float name="tie">0.1</float>
    <bool name="hl">true</bool>
    <str name="hl.fl">title url content anchor</str>
    <int name="hl.fragsize">150</int>
    <int name="hl.snippets">3</int>
    <bool name="hl.mergeContiguous">true</bool>
  </lst>
</requestHandler>

On 1/17/11 6:06 PM, Erick Erickson wrote: [...] -- Claudio Martella, TIS innovation park
Re: partitioning documents with fields
It looks like there's some problem with my dismax query handler. It doesn't recognize the query with the colon format. Here's the handler definition: It is expected behavior of dismax. You can append/use defType=lucene for colon format.
Re: what would cause large numbers of executeWithRetry INFO messages?
I am facing the exact same issue. Did you find out the root cause for this? Please let me know any information you have.
Re: partitioning documents with fields
As Ahmet says, this is what dismax does. You could also append a filter query (fq=crawl:DIGITALDATA) to your query. eDismax supports fielded queries, see: https://issues.apache.org/jira/browse/SOLR-1553 This is already in the trunk and 3.x code lines, I'm pretty sure. Best Erick On Mon, Jan 17, 2011 at 12:15 PM, Claudio Martella claudio.marte...@tis.bz.it wrote: [...]
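Both suggestions in request form (a sketch; the crawl value is the one from the thread):

/select?q=some+search+terms&fq=crawl:DIGITALDATA        (filter query; works with the dismax handler)
/select?q=crawl:DIGITALDATA&defType=lucene              (fielded query via the lucene query parser)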
RE: Spell Checking a multi word phrase
Camden, You may also want to be aware that there is a new feature added to Spell Check's collate functionality that guarantees the collations will return hits. It is also able to return more than one collation and tell you how many hits each one would result in if re-queried. This might do the same thing you're trying to do using shingles, but with more accuracy and less work. For info, look at spellcheck.collate, spellcheck.maxCollations, spellcheck.maxCollationTries and spellcheck.collateExtendedResults on the component's wiki page: http://wiki.apache.org/solr/SpellCheckComponent#spellcheck.collate This feature is committed to 3.x and 4.x and is available as a patch for 1.4.1 (here: https://issues.apache.org/jira/browse/SOLR-2010). James Dyer, Ingram Content Group

-----Original Message----- From: Camden Daily Sent: Monday, January 17, 2011 1:01 PM To: solr-user@lucene.apache.org Subject: Spell Checking a multi word phrase

Hello all, I'm pretty new to Solr and trying to set up a spell checker that can handle entire phrases. My goal would be to have something that could offer a suggestion of "united states" for a query of "untied stats". I have a very large index, and I've worked a bit with creating shingles for the spelling index. The problem I'm running into now is that the SpellCheckComponent always tokenizes the query that I pass to it. For example, a query like this:

http://localhost:8080/solr/spell?q=untied\stats&spellcheck=true&debugQuery=on

The debug information shows me that the parsed query is PhraseQuery(text:"untied stats"), but I receive the spelling suggestions for "untied" and "stats" separately. From what I understand, this is not a case where I would want to collate; I simply want the entire phrase treated as one token. I found the following post after much searching that suggests setting up a custom QueryConverter: http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200810.mbox/%3c1224516331.3820.119.ca...@localhost.localdomain.tld%3E Does anyone know if that would be required? I had hoped to avoid Java code entirely with Solr (I haven't used Java in a very long time), but if I do need to set up the 'MultiWordSpellingQueryConvert' class, would anyone be able to give me some tips on exactly how I would add that functionality to Solr? Relevant configs below:

solrconfig.xml:

<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">default</str>
    <str name="field">spellShingle</str>
    <str name="spellcheckIndexDir">./spellShingle</str>
    <str name="queryAnalyzerFieldType">textSpellShingle</str>
    <str name="buildOnOptimize">true</str>
  </lst>
</searchComponent>

schema.xml:

<fieldType name="textSpellShingle" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="true"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

(I had thought setting the KeywordTokenizer for the query analyzer would keep it from being tokenized, but it doesn't seem to make any difference.) -Camden Daily
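For illustration, a request exercising the collate parameters described above might look like this (a sketch; the counts are arbitrary):

http://localhost:8080/solr/spell?q=untied\stats&spellcheck=true&spellcheck.collate=true&spellcheck.maxCollations=3&spellcheck.maxCollationTries=10&spellcheck.collateExtendedResults=true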
RE: spellchecking even the key is true....
Add spellcheck.onlyMorePopular=true to your query and I think it'll do what you want. See http://wiki.apache.org/solr/SpellCheckComponent#spellcheck.onlyMorePopular for more info. One caveat: if you use spellcheck.collate, this will likely result in useless, nonsensical collations most of the time. James Dyer, Ingram Content Group -----Original Message----- From: satya swaroop Sent: Monday, January 17, 2011 10:32 AM To: solr-user@lucene.apache.org [...]
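Applied to the example from the question, that would be (sketch):

http://localhost:8080/solr/spellcheck?q=java&spellcheck=true&spellcheck.count=5&spellcheck.onlyMorePopular=true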
Re: Spell Checking a multi word phrase
James, Thank you, but I'm not sure that will work for my needs. I'm very interested in contextual spell checking. Take for example the author "stephenie meyer". "stephenie" is a far less popular spelling than "stephanie", but in this context it's the correct option. I feel like shingles with an untokenized query string would be able to catch this, but I can't find many examples of people attempting this. On Mon, Jan 17, 2011 at 2:19 PM, Dyer, James james.d...@ingrambook.com wrote: [...] -Camden Daily
RE: solrj http client 4
Hi Stevo, Thanks for reviewing the Maven POMs in LUCENE-2657 - I appreciate it. "In those poms, not all modules have explicit version and groupId which is a bad practice." Really? According to the POM best practices section in Sonatype's Maven book http://www.sonatype.com/books/mvnref-book/reference/pom-relationships-sect-pom-best-practice.html, inheriting version and groupId is standard and acceptable. However, since the Lucene/Solr source tree contains two groupIds (org.apache.lucene and org.apache.solr), I agree that all modules should have an explicit groupId, and you're right: several of the aggregator POMs don't have an explicit groupId. I'll fix this. But I don't think it's a bad practice to inherit the version from the parent POM. All Lucene and Solr modules have synchronized versions - it doesn't make sense for them to be specified independently of the whole project. "Also some parent references contain invalid default (../pom.xml) relativePath - path to their parent pom.xml." AFAICT, the default relativePath concept no longer exists (as of Maven 2.2+). That is, the parent POM resolution method uses the explicit relativePath if specified, then the local repository -- ../pom.xml is never used unless explicitly specified. (I don't know this for a fact; I just found that I had to mvn install before parent POM changes became visible to child POMs, even when the parent POM location was in the parent directory.) That said, I agree it would be useful to have explicit relativePaths - I'll add them. "Paths to build directories look suspicious to me. The lucene-bdb module references the missing library com.sleepycat:berkeleydb:jar:4.7.25 - I see lib/db-4.7.25.jar; if it's supposed to be installed in a local repository then a pom would be handy." Run mvn -N -P bootstrap install from the top level to install non-mavenized dependencies into your local repository. "Wiki page http://wiki.apache.org/solr/HowToContribute references this http://markmail.org/message/yb5qgeamosvdscao mail but the files (.classpath) in the archives attached to that email are very outdated. The eclipse target in the base ant build script generates .classpath and .settings, so it seems the mentioned wiki page is outdated too." I agree, this should be changed. Go for it! Steve
RE: Spell Checking a multi word phrase
Camden, Have you seen Smiley & Pugh's Solr book? They describe something very similar to what you're trying to do on p180ff. The difference seems to be that they use a field that only has a couple of terms, so they don't bother with shingles. The book makes a big point about using spellcheck.q in this case in order to get the analysis right. I'm not sure if this is the solution but I thought I'd mention it. I never tried spell checking this way because it seemed very limited and possibly quite expensive. James Dyer, Ingram Content Group -----Original Message----- From: Camden Daily Sent: Monday, January 17, 2011 1:41 PM To: solr-user@lucene.apache.org Subject: Re: Spell Checking a multi word phrase [...]
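In request form, the spellcheck.q approach from the book would look roughly like this (a sketch): q carries the real query, while spellcheck.q is run through the spellcheck field's own query analyzer - the KeywordTokenizer from the config above - so the phrase reaches the spell checker as a single token:

http://localhost:8080/solr/spell?q=untied+stats&spellcheck=true&spellcheck.q=untied+stats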
what is the diff between katta and solrcloud?
Are their goals fundamentally different at all, or are they just different approaches to solving the same problem (sharding)? Can someone give a technical review? Thanks, --Sean
Does field collapsing (with facet) reduce performance?
Just wanted to know how efficient field collapsing is. And if there is a performance penalty, how big is it likely to be? I'm interested in using field collapsing with faceting. Thanks.
Any way to query by offset?
Say I do a query that matches 4000 documents. Is there a query syntax or parser that would allow me to say retrieve offsets 1000, 2000, 3000? I would prefer to not do multiple starts and limit 1's. Thanks in advance. Steve
Re: Any way to query by offset?
Have you seen the start and rows parameters? If they don't work, perhaps you could explain what you need that they don't provide. Best Erick On Mon, Jan 17, 2011 at 4:58 PM, 5 Diamond IT i...@smallbusinessconsultingexperts.com wrote: Say I do a query that matches 4000 documents. Is there a query syntax or parser that would allow me to say retrieve offsets 1000, 2000, 3000? I would prefer to not do multiple starts and limit 1's. Thanks in advance. Steve
Re: Any way to query by offset?
I think Steve wants the 1000th, 2000th and 3000th documents from the query, and since there's no way to get them in a single request, he's constrained to executing three queries with rows=1 and start set to 1000, 2000 and 3000 respectively.

Have you seen the start and rows parameters? If they don't work, perhaps you could explain what you need that they don't provide. Best Erick On Mon, Jan 17, 2011 at 4:58 PM, 5 Diamond IT i...@smallbusinessconsultingexperts.com wrote: Say I do a query that matches 4000 documents. Is there a query syntax or parser that would allow me to say retrieve offsets 1000, 2000, 3000? I would prefer to not do multiple starts and limit 1's. Thanks in advance. Steve
Re: Any way to query by offset?
I want to start at rows 1000, 2000, and 3000 and retrieve those 3 rows ONLY from the result set of whatever search was used. Yes, I can do 3 queries with start=1000 and rows=1, etc., but I want ONE query to get those 3 rows from the result set. It's the poor man's way of doing price buckets the way I want them to be. So, what I need that they do not provide is the ability to pull those 3 rows out of the result set in one query. I was hoping for a function, a parser that supported this perhaps, some hidden field I am not aware of that I could simply match on, any trick that would work.

On Jan 17, 2011, at 6:13 PM, Erick Erickson wrote: Have you seen the start and rows parameters? If they don't work, perhaps you could explain what you need that they don't provide. Best Erick On Mon, Jan 17, 2011 at 4:58 PM, 5 Diamond IT i...@smallbusinessconsultingexperts.com wrote: Say I do a query that matches 4000 documents. Is there a query syntax or parser that would allow me to say retrieve offsets 1000, 2000, 3000? I would prefer to not do multiple starts and limit 1's. Thanks in advance. Steve
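For illustration, the three-request fallback looks like this (host, port, query and sort here are placeholders; the sort must be identical across the three calls so the offsets refer to the same ordering):

    http://localhost:8983/solr/select?q=*:*&sort=price+asc&start=1000&rows=1
    http://localhost:8983/solr/select?q=*:*&sort=price+asc&start=2000&rows=1
    http://localhost:8983/solr/select?q=*:*&sort=price+asc&start=3000&rows=1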
Re: Does field collapsing (with facet) reduce performance?
There is always CPU and RAM involved for every nice component you use. Just how big the penalty is depends completely on your hardware, index and type of query, and under heavy load the numbers will change. Since we don't know your situation and it's hard to predict without benchmarks, you should really do the tests yourself.

Just wanted to know how efficient field collapsing is. And if there is a performance penalty, how big is it likely to be? I'm interested in using field collapsing with faceting. Thanks.
Re: Is deduplication possible during Tika extract?
In my opinion it should work for every update handler. If you're really sure your configuration is fine and it still doesn't work, you might have to file an issue. Your configuration looks alright, but don't forget you've configured overwriteDupes=false!

Hello, here is an excerpt of my solrconfig.xml:

    <requestHandler name="/update/extract"
                    class="org.apache.solr.handler.extraction.ExtractingRequestHandler"
                    startup="lazy">
      <lst name="defaults">
        <str name="update.processor">dedupe</str>
        <!-- All the main content goes into "text"... if you need to return
             the extracted text or do highlighting, use a stored field. -->
        <str name="fmap.content">text</str>
        <str name="lowernames">true</str>
        <str name="uprefix">ignored_</str>
        <!-- capture link hrefs but ignore div attributes -->
        <str name="captureAttr">true</str>
        <str name="fmap.a">links</str>
        <str name="fmap.div">ignored_</str>
      </lst>
    </requestHandler>

and

    <updateRequestProcessorChain name="dedupe">
      <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
        <bool name="enabled">true</bool>
        <str name="signatureField">signature</str>
        <bool name="overwriteDupes">false</bool>
        <str name="fields">text</str>
        <str name="signatureClass">org.apache.solr.update.processor.TextProfileSignature</str>
      </processor>
      <processor class="solr.LogUpdateProcessorFactory"/>
      <processor class="solr.RunUpdateProcessorFactory"/>
    </updateRequestProcessorChain>

Deduplication works when I use only /update but not when Solr does an extract with Tika! Is deduplication possible during Tika extract? Thanks in advance, Arno
NRT
How is NRT doing, being used in production? Which Solr is it in? And is there built in Spatial in that version? How is Solr 4.x doing? Dennis Gearon Signature Warning It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others’ mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501tag=nl.e036' EARTH has a Right To Life, otherwise we all die.
Re: Does field collapsing (with facet) reduce performance?
I understand that the specific figures differ for everybody. I just wanted to see if anyone who has used this feature could share their experience. A ballpark figure -- e.g. 50% slowdown or 10 times slower -- would be helpful.

--- On Mon, 1/17/11, Markus Jelsma markus.jel...@openindex.io wrote: From: Markus Jelsma markus.jel...@openindex.io Subject: Re: Does field collapsing (with facet) reduce performance? To: solr-user@lucene.apache.org Cc: Andy angelf...@yahoo.com Date: Monday, January 17, 2011, 7:27 PM There is always CPU and RAM involved for every nice component you use. Just how big the penalty is depends completely on your hardware, index and type of query, and under heavy load the numbers will change. Since we don't know your situation and it's hard to predict without benchmarks, you should really do the tests yourself. Just wanted to know how efficient field collapsing is. And if there is a performance penalty, how big is it likely to be? I'm interested in using field collapsing with faceting. Thanks.
Re: Spell Checking a multi word phrase
James,

Thanks, the spellcheck.q was exactly what I needed to be using!

-Camden

On Mon, Jan 17, 2011 at 3:54 PM, Dyer, James james.d...@ingrambook.com wrote: Camden, Have you seen Smiley & Pugh's Solr book? They describe something very similar to what you're trying to do on p180ff. The difference seems to be that they use a field that only has a couple of terms, so they don't bother with shingles. The book makes a big point about using spellcheck.q in this case in order to get the analysis right. I'm not sure if this is the solution, but I thought I'd mention it. I never tried spell checking this way because it seemed very limited and possibly quite expensive. James Dyer E-Commerce Systems Ingram Content Group (615) 213-4311
Re: Multi-word exact keyword case-insensitive search suggestions
No other way around to fit this requirement?

On Sat, Jan 15, 2011 at 10:01 AM, Chamnap Chhorn chamnapchh...@gmail.com wrote:

Ahh, thanks guys for helping me! For Adam's solution, it doesn't work for me. Here are my field, fieldType, and Solr query:

    <fieldType name="text_keyword" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.ShingleFilterFactory" maxShingleSize="4"
                outputUnigrams="true" outputUnigramIfNoNgram="false"/>
      </analyzer>
    </fieldType>

    <field name="keyphrase" type="text_keyword" indexed="true" stored="false" multiValued="true"/>

http://localhost:8081/solr/select?q=printing%20house&qf=keyphrase&debugQuery=on&defType=dismax

    <str name="parsedquery">+((DisjunctionMaxQuery((keyphrase:smart)) DisjunctionMaxQuery((keyphrase:mobile)))~2) ()</str>
    <str name="parsedquery_toString">+(((keyphrase:smart) (keyphrase:mobile))~2) ()</str>
    <lst name="explain"/>

The result is not found.

For Erick's solution, it works for me. However, I can't put a filter query, since it's part of full-text search. If I put fq, it would just return documents that match the query exactly. I want to show documents that match exactly at the top, and below them the documents that match partially. The problem is that when the user searches one word (e.g. "printing" out of the keyword "printing house"), that document is also included in the search results. The other problem is that if the user searches in the reverse order (e.g. "house printing"), it's also found.

Cheers

On Sat, Jan 15, 2011 at 3:31 AM, Erick Erickson erickerick...@gmail.com wrote:

This might work: define your field to use WhitespaceTokenizer and LowerCaseFilterFactory. Use a filter query referencing this field. If you wanted the words to appear in their exact order, you could just define the pf field in your dismax.

Best,
Erick

On Thu, Jan 13, 2011 at 8:01 PM, Estrada Groups estrada.adam.gro...@gmail.com wrote:

Ahhh... the fun of open source software ;-). Requires a ton of trial and error! I found what worked for me and figured it was worth passing it along. If you don't mind... when you sort everything out on your end, please post results for the rest of us to take a gander at. Cheers, Adam

On Jan 13, 2011, at 9:08 PM, Chamnap Chhorn chamnapchh...@gmail.com wrote:

Thanks for your reply. However, it doesn't work for my case at all. I think it's a problem with the query parser or something else. It forces me to put double quotes around the search query in order to get results found.

    <str name="rawquerystring">sim 010</str>
    <str name="querystring">sim 010</str>
    <str name="parsedquery">+DisjunctionMaxQuery((keyphrase:sim 010)) ()</str>
    <str name="parsedquery_toString">+(keyphrase:sim 010) ()</str>

    <str name="rawquerystring">smart mobile</str>
    <str name="querystring">smart mobile</str>
    <str name="parsedquery">+((DisjunctionMaxQuery((keyphrase:smart)) DisjunctionMaxQuery((keyphrase:mobile)))~2) ()</str>
    <str name="parsedquery_toString">+(((keyphrase:smart) (keyphrase:mobile))~2) ()</str>

The intent here is to do a full-text search, part of which is to search the keyword field, so I can't put quotes on it.

On Thu, Jan 13, 2011 at 10:30 PM, Adam Estrada estrada.adam.gro...@gmail.com wrote: Hi, the following seems to work pretty well.
    <fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.ShingleFilterFactory" maxShingleSize="4"
                outputUnigrams="true" outputUnigramIfNoNgram="false"/>
      </analyzer>
    </fieldType>

    <!-- A text field that uses WordDelimiterFilter to enable splitting and
         matching of words on case-change, alpha numeric boundaries, and
         non-alphanumeric chars, so that a query of "wifi" or "wi fi" could
         match a document containing "Wi-Fi". Synonyms and stopwords are
         customized by external files, and stemming is enabled. The attribute
         autoGeneratePhraseQueries="true" (the default) causes words that get
         split to form phrase queries. For example, WordDelimiterFilter
         splitting text:pdp-11 will cause the parser to generate
         text:"pdp 11" rather than (text:PDP OR text:11). NOTE:
         autoGeneratePhraseQueries="true" tends to not work well for non
         whitespace delimited languages. -->
    <fieldType name="text" class="solr.TextField" positionIncrementGap="100"
               autoGeneratePhraseQueries="true">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <!-- in this example, we will only use synonyms at query time
        <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt"
                ignoreCase="true" expand="false"/>
        -->
        <!-- Case insensitive stop word removal. add enablePositionIncrements=true in both the index and query analyzers to leave a
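To make Erick's earlier suggestion in this thread concrete, here is a minimal sketch of a whitespace-tokenized, lowercased field you could target with a filter query or as a dismax pf field. The field and type names are placeholders, not anything from the thread:

    <fieldType name="text_exact_ci" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <!-- split on whitespace only, then lowercase: case-insensitive but
             otherwise exact tokens, preserving word order for phrase matching -->
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
    <field name="keyphrase_exact" type="text_exact_ci" indexed="true" stored="false" multiValued="true"/>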
Re: NRT
How is NRT doing, being used in production?

It works and there are not any lingering bugs, as it's been available for quite a while.

Which Solr is it in?

Per-segment field cache is used transparently by Solr; IndexWriter.getReader is what's not used yet. I'm not sure where per-segment faceting is at.

And is there built in Spatial in that version?

Spatial is independent of NRT?

On Mon, Jan 17, 2011 at 4:56 PM, Dennis Gearon gear...@sbcglobal.net wrote: How is NRT doing, being used in production? Which Solr is it in? And is there built in Spatial in that version? How is Solr 4.x doing? Dennis Gearon Signature Warning It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others’ mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036' EARTH has a Right To Life, otherwise we all die.
Re: Solr: using to index large folders recursively containing lots of different documents, and querying over the web
Solr itself does all three things. There is no need for Nutch - that is needed for crawling web sites, not file systems (as the original question specifies). Solr operates as a web service, running in any Java servlet container. Detecting changes to files is more tricky: there is no Windows implementation of a real-time file update monitor available for Solr, so you would have to implement that yourself. Otherwise you can poll the file system and re-index altered files.

On Fri, Jan 14, 2011 at 4:54 AM, Markus Jelsma markus.jel...@openindex.io wrote: Nutch can crawl the file system as well. Nutch 1.x can also provide search, but this is delegated to Solr in Nutch 2.x. Solr can provide the search and Nutch can provide Solr with content from your intranet. On Friday 14 January 2011 13:17:52 Cathy Hemsley wrote: Hi, Thanks for suggesting this. However, I'm not sure a 'crawler' will work, as the various pages are not necessarily linked (it's complicated: basically our intranet is a dynamic and managed collection of independently published web sites, and users find information using categorisation and/or text searching), so we need something that will index all the files in a given folder, rather than follow links like a crawler. Can Nutch do this? As well as the other requirements below? Regards Cathy On 14 January 2011 12:09, Markus Jelsma markus.jel...@openindex.io wrote: Please visit the Nutch project. It is a powerful crawler and can integrate with Solr. http://nutch.apache.org/

Hi Solr users, I hope you can help. We are migrating our intranet web site management system to Windows 2008 and need a replacement for Index Server to do the text searching. I am trying to establish if Lucene and Solr is a feasible replacement, but I cannot find the answers to these questions: 1. Can Solr be set up to recursively index a folder containing an indeterminate and variable large number of subfolders, containing files of all types: XML, HTML, PDF, DOC, spreadsheets, powerpoint presentations, text files etc. If so, how? 2. Can Solr be queried over the web and return a list of files that match a search query entered by a user, and also return the abstracts for these files, as well as 'hit highlighting'. If so, how? 3. Can Solr be run as a service (like Index Server) that automatically detects changes to the files within the indexed folder and updates the index? If so, how? Thanks for your help Cathy Hemsley

-- Markus Jelsma - CTO - Openindex http://www.linkedin.com/in/markus17 050-8536620 / 06-50258350

-- Lance Norskog goks...@gmail.com
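On question 1, recursive indexing is usually done by a small client that walks the folder tree and posts each file to the ExtractingRequestHandler (Solr Cell). A minimal SolrJ sketch, assuming the Solr 1.4-era client API; the URL and the use of the file path as a unique id are assumptions for illustration:

    // Sketch only: walks a folder tree and posts every file to /update/extract.
    import java.io.File;

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

    public class FolderIndexer {
      public static void main(String[] args) throws Exception {
        SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
        walk(new File(args[0]), solr);
        solr.commit();
      }

      static void walk(File f, SolrServer solr) throws Exception {
        if (f.isDirectory()) {
          File[] children = f.listFiles();
          if (children == null) return;          // unreadable directory
          for (File child : children) {
            walk(child, solr);
          }
        } else {
          ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
          req.addFile(f);                                   // Tika sniffs the content type
          req.setParam("literal.id", f.getAbsolutePath());  // hypothetical id scheme
          solr.request(req);
        }
      }
    }

For question 3, the same walker could run on a schedule, skipping files whose last-modified timestamp predates the previous run.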
just got 'the book' already have a question
First of all, seems like a good book, Solr-14-Enterprise-Search-Server.pdf

Question: is it possible to choose locale at search time? So if my customer is querying across cultural/national/linguistic boundaries and I have the data for him in different languages in the same index, can I sort based on his language?

Dennis Gearon Signature Warning It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others’ mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036' EARTH has a Right To Life, otherwise we all die.
Carrot2 clustering Component
Hi, please tell me how I can get the libraries and plugins for the Carrot2 clustering component in Solr 1.4. Tell me the site from where I can get them. Thanks! Isha
Carrot2 clustering component
Hi, I am not able to understand the Carrot2 clustering component from http://wiki.apache.org/solr/ClusteringComponent. Please provide me more detailed information if someone has already worked on this: how to run it and use it during a search query. Thanks! Isha
Re: Carrot2 clustering component
Isha, You'll get more and better help if you provide more details about what you have done, what you have tried, what isn't working, what errors or behaviour you are seeing, etc. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: Isha Garg isha.g...@orkash.com To: solr-user@lucene.apache.org Sent: Tue, January 18, 2011 12:38:03 AM Subject: Carrot2 clustering component Hi, I am not able to understand the caarot2 clustering component from http://wiki.apache.org/solr/ClusteringComponent please provide me more detailed information if someone had already worked on this. How to run this and use this during search query. Thanks! Isha
explicit field type descriptions
Is there any tabular data anywhere on ALL field types and ALL options? For example, I've looked everywhere in the last hour, and I don't see anywhere on the Solr site, Google, or in the 1.4 manual where it says whether a copyField 'directive' can be made required=true. Dennis Gearon Signature Warning It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others’ mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036' EARTH has a Right To Life, otherwise we all die.
Getting started with writing parser
How do I write a parser program that will convert log files into XML? -- View this message in context: http://lucene.472066.n3.nabble.com/Getting-started-with-writing-parser-tp2278092p2278092.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Not storing, but highlighting from document sentences
On 01/12/2011 12:02 PM, Otis Gospodnetic wrote:

Hello, I'm indexing some content (articles) whose text I cannot store in its original form for copyright reasons. So I can index the content, but cannot store it. However, I need snippets and search term highlighting. Any way to accomplish this elegantly? Or even not so elegantly? Here is one idea:

* Create 2 indices: a main index for indexing (but not storing) the original content, and a secondary index for storing individual sentences from the original article.

How about storing the sentences in the same index, in a separate field but with random ordering -- would that be ok?

Tarjei

* That is, before indexing an article, split it into sentences. Then index the article in the main index, and index+store each sentence in the secondary index. So for each doc in the main index there will be multiple docs in the secondary index with individual sentences. Each sentence doc includes an ID of the parent document.

* Then run queries against the main index, and pull individual sentences from the secondary index for snippet+highlight purposes.

The problem I see with this approach (and there may be other ones that I am not seeing yet) is with queries like "foo AND bar". In this case "foo" may be a match from sentence #1, and "bar" may be a match from sentence #7. Or maybe "foo" is a match in sentence #1, and "bar" is a match in multiple sentences: #7 and #10 and #23. Regardless, when a query is run against the main index, you don't know where the match was, so you don't know which sentences to go get from the secondary index. Does anyone have any suggestions for how to handle this?

Thanks, Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/

-- Regards / Med vennlig hilsen Tarjei Huse Mobil: 920 63 413
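Either variant (a secondary sentence index, or Tarjei's shuffled sentence field) needs the article split into sentences at index time. A minimal sketch using the JDK's BreakIterator; the locale choice and the trimming are assumptions, and real-world text may need a smarter splitter:

    // Splits text into sentences with the JDK's built-in sentence iterator.
    import java.text.BreakIterator;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Locale;

    public class SentenceSplitter {
      public static List<String> split(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
        it.setText(text);
        List<String> sentences = new ArrayList<String>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
          String sentence = text.substring(start, end).trim();
          if (sentence.length() > 0) {
            sentences.add(sentence);
          }
        }
        return sentences;
      }
    }

Each returned sentence would then be indexed either as its own document in the secondary index (carrying the parent document's ID) or appended, shuffled, to the separate stored field.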
Re: Carrot2 clustering component
On Tuesday 18 January 2011 11:12 AM, Otis Gospodnetic wrote: Isha, You'll get more and better help if you provide more details about what you have done, what you have tried, what isn't working, what errors or behaviour you are seeing, etc. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: Isha Garg isha.g...@orkash.com To: solr-user@lucene.apache.org Sent: Tue, January 18, 2011 12:38:03 AM Subject: Carrot2 clustering component Hi, I am not able to understand the Carrot2 clustering component from http://wiki.apache.org/solr/ClusteringComponent. Please provide me more detailed information if someone has already worked on this: how to run it and use it during a search query. Thanks! Isha

I had downloaded some jar files compatible with Solr 1.4, including:

    carrot2-core-3.4.2.jar
    guava-r05.jar
    hppc-0.3.1.jar
    jackson-core-asl-1.5.2.jar
    jackson-mapper-asl-1.5.2.jar
    log4j-1.2.14.jar
    mahout-collections-0.3.jar
    mahout-math-0.3.jar
    simple-xml-2.3.5.jar

and placed them at contrib/clustering/lib. Then I changed solrconfig.xml as follows:

    <requestHandler name="standard" default="true">
      <!-- default values for query parameters -->
      <lst name="defaults">
        <str name="echoParams">explicit</str>
        <!--
        <int name="rows">10</int>
        <str name="fl">*</str>
        <str name="version">2.1</str>
        -->
        <!-- <bool name="clustering">true</bool> -->
        <str name="clustering.engine">default</str>
        <bool name="clustering.results">true</bool>
        <!-- The title field -->
        <str name="carrot.title">headin</str>
        <str name="carrot.url">id</str>
        <!-- The field to cluster on -->
        <str name="carrot.snippet">text</str>
        <!-- produce summaries -->
        <bool name="carrot.produceSummary">true</bool>
        <!-- the maximum number of labels per cluster -->
        <!-- <int name="carrot.numDescriptions">5</int> -->
        <!-- produce sub clusters -->
        <bool name="carrot.outputSubClusters">false</bool>
      </lst>
      <arr name="last-components">
        <str>clustering</str>
      </arr>
    </requestHandler>

    <searchComponent name="clustering">
      <!-- Declare an engine -->
      <lst name="engine">
        <!-- The name, only one can be named "default" -->
        <str name="name">default</str>
        <str name="carrot.algorithm">org.carrot2.clustering.lingo.LingoClusteringAlgorithm</str>
        <str name="LingoClusteringAlgorithm.desiredClusterCountBase">20</str>
      </lst>
      <lst name="engine">
        <str name="name">stc</str>
        <str name="carrot.algorithm">org.carrot2.clustering.stc.STCClusteringAlgorithm</str>
      </lst>
    </searchComponent>

And then I ran Solr using the command:

    java -Dsolr.clustering.enabled=true -jar start.jar

Now can you tell me where I am wrong? What else should I do?
Re: Carrot2 clustering component
Isha,

Next, you need to run the actual search so Carrot2 has some search results to cluster.

Otis
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/

- Original Message From: Isha Garg isha.g...@orkash.com To: solr-user@lucene.apache.org Sent: Tue, January 18, 2011 1:54:39 AM Subject: Re: Carrot2 clustering component Now can you tell me where I am wrong? What else should I do?
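With the configuration quoted above, such a search would look something like the following; the port assumes the stock Jetty example, and the index must already contain documents with the configured title/snippet fields:

    http://localhost:8983/solr/select?q=solr&rows=20&clustering=true

The clustering component then appends a clusters section to the response, after the normal result list.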
Re: NRT
Hi, How is NRT doing, being used in production? Which Solr is it in? Unless I missed it, I don't think there is true NRT in Solr just yet. And is there built in Spatial in that version? How is Solr 4.x doing? Well :) 3 ways to know this sort of stuff: * follow the dev list - high volume * subscribe to Sematext Blog - we publish monthly Solr Digests * check JIRA to see how many issues remain to be fixed Otis -- Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/
Re: just got 'the book' already have a question
Hi,

Don't think so. If you search across multiple languages and sort, I think the sort is based on UTF8 order.

Otis
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/

- Original Message From: Dennis Gearon gear...@sbcglobal.net To: solr-user@lucene.apache.org Sent: Mon, January 17, 2011 11:10:21 PM Subject: just got 'the book' already have a question First of all, seems like a good book, Solr-14-Enterprise-Search-Server.pdf Question: is it possible to choose locale at search time? So if my customer is querying across cultural/national/linguistic boundaries and I have the data for him in different languages in the same index, can I sort based on his language? Dennis Gearon Signature Warning It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others’ mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036' EARTH has a Right To Life, otherwise we all die.
Re: just got 'the book' already have a question
I could be wrong, have a look at http://search-lucene.com/?q=locale+sort&fc_project=Solr plus: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.CollationKeyFilterFactory

Otis
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/

- Original Message From: Otis Gospodnetic otis_gospodne...@yahoo.com To: solr-user@lucene.apache.org Sent: Tue, January 18, 2011 2:17:02 AM Subject: Re: just got 'the book' already have a question Hi, Don't think so. If you search across multiple languages and sort, I think the sort is based on UTF8 order. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: Dennis Gearon gear...@sbcglobal.net To: solr-user@lucene.apache.org Sent: Mon, January 17, 2011 11:10:21 PM Subject: just got 'the book' already have a question First of all, seems like a good book, Solr-14-Enterprise-Search-Server.pdf Question: is it possible to choose locale at search time? So if my customer is querying across cultural/national/linguistic boundaries and I have the data for him in different languages in the same index, can I sort based on his language? Dennis Gearon Signature Warning It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others’ mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036' EARTH has a Right To Life, otherwise we all die.
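Following the second link, a locale-aware sort is typically done with one collated field per language, roughly like this. The field and type names are placeholders, and availability of CollationKeyFilterFactory depends on the Solr version, so check the wiki page:

    <fieldType name="text_sort_fr" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <!-- emits locale-sensitive collation keys used only for sorting -->
        <filter class="solr.CollationKeyFilterFactory" language="fr" strength="primary"/>
      </analyzer>
    </fieldType>
    <field name="title_sort_fr" type="text_sort_fr" indexed="true" stored="false"/>

At query time you would then pick the sort field matching the user's language, e.g. sort=title_sort_fr+asc.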
Re: Not storing, but highlighting from document sentences
Hi Tarjei, :) Yeah, that is the solution we are going with, actually. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: Tarjei Huse tar...@scanmine.com To: solr-user@lucene.apache.org Sent: Tue, January 18, 2011 1:33:44 AM Subject: Re: Not storing, but highlighting from document sentences On 01/12/2011 12:02 PM, Otis Gospodnetic wrote: Hello, I'm indexing some content (articles) whose text I cannot store in its original form for copyright reason. So I can index the content, but cannot store it. However, I need snippets and search term highlighting. Any way to accomplish this elegantly? Or even not so elegantly? Here is one idea: * Create 2 indices: main index for indexing (but not storing) the original content, the secondary index for storing individual sentences from the original article. How about storing the sentences in the same index in a separate field but with random ordering, would that be ok? Tarjei * That is, before indexing an article, split it into sentences. Then index the article in the main index, and index+store each sentence in the secondary index. So for each doc in the main index there will be multiple docs in the secondary index with individual sentences. Each sentence doc includes an ID of the parent document. * Then run queries against the main index, and pull individual sentences from the secondary index for snippet+highlight purposes. The problem I see with this approach (and there may be other ones that I am not seeing yet) is with queries like foo AND bar. In this case foo may be a match from sentence #1, and bar may be a match from sentence #7. Or maybe foo is a match in sentence #1, and bar is a match in multiple sentences: #7 and #10 and #23. Regardless, when a query is run against the main index, you don't know where the match was, so you don't know which sentences to go get from the secondary index. Does anyone have any suggestions for how to handle this? Thanks, Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ -- Regards / Med vennlig hilsen Tarjei Huse Mobil: 920 63 413
Re: what is the diff between katta and solrcloud?
Sean, First 2 things that come to mind: * Katta keeps shards on HDFS and they then get deployed to regular servers/FS * SolrCloud doesn't involve HDFS at all. * Katta is a Lucene-level system * SolrCloud is a Solr-level system Both make heavy use of ZooKeeper. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: Sean Bigdatafun sean.bigdata...@gmail.com To: solr-user@lucene.apache.org Sent: Mon, January 17, 2011 4:06:59 PM Subject: what is the diff between katta and solrcloud? Are their goal fudanmentally different at all or just different approaches to solve the same problem (sharding)? Can someone give a technical review? Thanks, --Sean
Re: explicit field type descriptions
On Tue, Jan 18, 2011 at 11:55 AM, Dennis Gearon gear...@sbcglobal.net wrote: Is there any tabular data anywhere on ALL field types and ALL options? There is this: http://search.lucidimagination.com/search/document/CDRG_ch04_4.4.2 Not sure if it meets your needs. For example, I've looked everywhere in the last hour, and I don't see anywhere on the Solr site, Google, or in the 1.4 manual where it says whether a copyField 'directive' can be made required=true. [...] Sorry, I am having trouble understanding your goal here. Surely it suffices to have required="true" on the source field of the copyField. Regards, Gora
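In schema.xml terms, that would be along the following lines (field names invented for illustration); copyField itself only routes content from one field to another, so the presence check sits on the source field:

    <field name="title" type="text" indexed="true" stored="true" required="true"/>
    <field name="title_s" type="string" indexed="true" stored="false"/>
    <!-- every document must supply title, so title_s is always populated -->
    <copyField source="title" dest="title_s"/>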
Re: Getting started with writing parser
On Tue, Jan 18, 2011 at 11:59 AM, Dinesh mdineshkuma...@karunya.edu.in wrote: How do I write a parser program that will convert log files into XML? [...] There is no point in starting multiple threads on this issue, hoping that someone will somehow solve your problem. You have been given the following: * Links that should help you get started, including an example of someone indexing Solr's own logs. * Some ideas on how to proceed. * Requests to try the above suggestions out, and to ask specific questions when you run into issues. * A suggestion to contact a local expert in Solr. * Multiple requests for a sample of your log files. Please show some signs that you have tried the above suggestions. Otherwise, I am afraid that it will be difficult, if not impossible, for people on this list to help you out. Regards, Gora
Re: what is the diff between katta and solrcloud?
Otis, Any pointer to an architecture view of either system? Thanks, Sean On Mon, Jan 17, 2011 at 11:27 PM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote: Sean, First 2 things that come to mind: * Katta keeps shards on HDFS and they then get deployed to regular servers/FS * SolrCloud doesn't involve HDFS at all. * Katta is a Lucene-level system * SolrCloud is a Solr-level system Both make heavy use of ZooKeeper. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: Sean Bigdatafun sean.bigdata...@gmail.com To: solr-user@lucene.apache.org Sent: Mon, January 17, 2011 4:06:59 PM Subject: what is the diff between katta and solrcloud? Are their goal fudanmentally different at all or just different approaches to solve the same problem (sharding)? Can someone give a technical review? Thanks, --Sean -- --Sean