Re: Extending Solr Highlighter to pull information from external source

2011-06-20 Thread François Schiettecatte
Mike I would be very interested in the answer to that question too. My hunch is that the answer is no too. I have a few text databases that range from 200MB to about 60GB with which I could run some tests. I will have some downtime in early July and will post results. From what I can tell the

Re: Searching in Traditional / Simplified Chinese Record

2011-06-20 Thread François Schiettecatte
Wayne I am not sure what you mean by 'changing the record'. One option would be to implement something like the synonyms filter to generate the TC for SC when you index the document, which would index both the TC and the SC in the same location. That way your users would be able to search with

Re: Include synonys in solr

2011-06-28 Thread François Schiettecatte
Well you need to find word lists and/or a thesaurus. This is one place to start: http://wordlist.sourceforge.net/ I used the US/UK english word list for my synonyms for an index I have because it contains both US and UK english terms, the list lacks some medical terms though so we just

Re: Removing duplicate documents from search results

2011-06-28 Thread François Schiettecatte
Create a hash from the url and use that as the unique key, md5 or sha1 would probably be good enough. Cheers François On Jun 28, 2011, at 7:29 AM, Mohammad Shariq wrote: > I also have the problem of duplicate docs. > I am indexing news articles, Every news article will have the source URL, > I
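
A minimal sketch of the suggestion above, assuming the source URL is available as a string at indexing time. The function name `url_to_unique_key` and the whitespace trimming are illustrative assumptions, not something from the thread; the digest would go in whatever field the schema declares as the unique key.

```python
import hashlib


def url_to_unique_key(url: str) -> str:
    """Return a hex SHA-1 digest of the URL for use as a unique key value.

    Hypothetical helper: trimming whitespace is an assumption about what
    counts as "the same URL"; adjust the normalization to your data.
    """
    normalized = url.strip()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    print(url_to_unique_key("http://example.com/news/article-1"))
```

Because the same URL always hashes to the same key, re-adding an article with that key replaces the earlier document instead of creating a duplicate.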

Re: Removing duplicate documents from search results

2011-06-28 Thread François Schiettecatte
> Since I am using SOLR as index engine Only and using Riak(key-value > storage) as storage engine, I dont want to do the overwrite on duplicate. > I just need to discard the duplicates. > > > > 2011/6/28 François Schiettecatte > >> Create a hash from the url an

Re: Removing duplicate documents from search results

2011-06-28 Thread François Schiettecatte
le <http://www.google.com/profiles/pranny> > > > 2011/6/28 François Schiettecatte > >> Maybe there is a way to get Solr to reject documents that already exist in >> the index but I doubt it, maybe someone else can chime in here. You >> could do a search for

Re: Removing duplicate documents from search results

2011-06-28 Thread François Schiettecatte
work on it, as there are some other low hanging fruits I've to >>> capture. Will share my thoughts soon. >>> >>> >>> *Pranav Prakash* >>> >>> "temet nosce" >>> >>> Twitter <http://twitter.com/pranavprakash>

Re: Include synonys in solr

2011-06-28 Thread François Schiettecatte
wrote: > Thanks François Schiettecatte, information you provided is very helpful. > i need to know one more thing, i downloaded one of the given dictionary but > it contains many files, do i need to add all this files data in to > synonyms.text ?? > > - > Thanks & Regard

Re: filters effect on search results

2011-06-29 Thread François Schiettecatte
Indeed, I find the Porter stemmer to be too 'aggressive' for my taste, I prefer the EnglishMinimalStemFilterFactory, with the caveat that it depends on your data set. Cheers François On Jun 29, 2011, at 6:21 AM, Ahmet Arslan wrote: >> Hi, when i query for "elegant" in >> solr i get results fo

Re: Wildcard search not working if full word is queried

2011-06-30 Thread François Schiettecatte
I would run that word through the analyzer, I suspect that the word 'teste' is being stemmed to 'test' in the index, at least that is the first place I would check. François On Jun 30, 2011, at 2:21 PM, Celso Pinto wrote: > Hi everyone, > > I'm having some trouble figuring out why a query wit

Re: Wildcard search not working if full word is queried

2011-07-01 Thread François Schiettecatte
Celso Pinto wrote: >> Hi François, >> >> it is indeed being stemmed, thanks a lot for the heads up. It appears >> that stemming is also configured for the query so it should work just >> the same, no? >> >> Thanks again. >> >> Regards,

Re: performance variation with respect to the index size

2011-07-08 Thread François Schiettecatte
Hi I don't think that anyone has run such benchmarks, in fact this topic came up two weeks ago and I volunteered some time to do that because I have some spare time this week, so I am going to run some benchmarks this weekend and report back. The machine I have to do this is a Core i7 960, 24GB,

Re: Result list order in case of ties

2011-07-12 Thread François Schiettecatte
You just need to provide a second sort field along the lines of: sort=score desc, author desc François On Jul 12, 2011, at 6:13 AM, Lox wrote: > Hi, > > In the case where two or more documents are returned with the same score, is > there a way to tell Solr to sort them alphabetically?
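
A small sketch of how the tie-breaking sort above might be built into a request query string. The helper name `build_query` and the `q` value are assumptions for illustration; only the `sort=score desc, author desc` value comes from the reply.

```python
from urllib.parse import urlencode


def build_query(q: str, tie_breaker: str = "author desc") -> str:
    """Build a URL-encoded query string that sorts by score, then a tie-breaker.

    Hypothetical helper: the tie-breaker field must exist in your schema
    and be sortable.
    """
    params = {"q": q, "sort": f"score desc, {tie_breaker}"}
    return urlencode(params)


if __name__ == "__main__":
    # Spaces become '+' and the comma becomes '%2C' in the encoded form.
    print(build_query("solr"))
```

The secondary sort only takes effect for documents whose scores are exactly equal, so the relevance ordering is otherwise unchanged.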

Re: Wildcard

2011-07-13 Thread François Schiettecatte
http://lucene.apache.org/java/2_9_1/queryparsersyntax.html http://wiki.apache.org/solr/SolrQuerySyntax François On Jul 13, 2011, at 1:29 PM, GAURAV PAREEK wrote: > Hello, > > What are wildcards we can use with the SOLR ? > > Regards, > Gaurav

Re: - character in search query

2011-07-14 Thread François Schiettecatte
Easy, the hyphen is out on its own (with spaces on either side) and is probably getting removed from the search by the tokenizer. Check your analysis. François On Jul 14, 2011, at 6:05 AM, roySolr wrote: > It looks like it's still not working. > > I send this to SOLR: q=arsenal \- london > >

Re: How to find whether solr server is running or not

2011-07-19 Thread François Schiettecatte
I think anything but a 200 OK means it is dead like the proverbial parrot :) François On Jul 19, 2011, at 7:42 AM, Romi wrote: > But the problem is when solr server is not runing > *"http://host:port/solr/admin/ping"* > > will not give me any json response > then how will i get the status :(
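
The rule of thumb above (anything other than a 200 OK counts as down) could be sketched as follows. The function name `solr_is_alive` and the timeout value are assumptions; the ping path is the one quoted in the thread, with host and port left for you to fill in.

```python
import http.client
import urllib.request


def solr_is_alive(ping_url: str, timeout: float = 2.0) -> bool:
    """Return True only if the ping URL answers with HTTP 200.

    Hypothetical helper: a refused connection, a timeout, or an HTTP
    error status are all treated as "not running".
    """
    try:
        with urllib.request.urlopen(ping_url, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # URLError and HTTPError are subclasses of OSError, so a server
        # that is not running (no response at all) ends up here.
        return False


if __name__ == "__main__":
    print(solr_is_alive("http://localhost:8983/solr/admin/ping"))
```

This way the caller never has to parse a response body: the absence of any response is itself the "dead" signal.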

Re: POST VS GET and NON English Characters

2011-07-20 Thread François Schiettecatte
You need to do something like this in the ./conf/tomcat server.xml file: See 'URIEncoding' in http://tomcat.apache.org/tomcat-7.0-doc/config/http.html Note that this will assume that the encoding of the data is in utf-8 if (and ONLY if) the charset parameter is not set in the HTTP request

Re: problem searching on non standard characters

2011-07-22 Thread François Schiettecatte
Check your analyzers to make sure that these characters are not getting stripped out in the tokenization process, the url for 3.3 is somewhere along the lines of: http://localhost/solr/admin/analysis.jsp?highlight=on And you should indeed be searching on "\#test". François On Jul 2

Re: problem searching on non standard characters

2011-07-22 Thread François Schiettecatte
Adding to my previous reply, I just did a quick check on the 'text_en' and 'text_en_splitting' field types and they both strip leading '#'. Cheers François On Jul 22, 2011, at 10:49 AM, Shawn Heisey wrote: > On 7/22/2011 8:34 AM, Jason Toy wrote: >> How does one search for words with character

Re: Spellcheck compounded words

2011-07-26 Thread François Schiettecatte
FWIW, here is the process I follow to create a log4j aware version of the apache solr war file and the corresponding log4j.properties files. Have fun :) François ## # # Log4J configuration for SOLR # # http://wiki.apache.org/solr/Sol

Re: Spellcheck compounded words

2011-07-26 Thread François Schiettecatte
I get slf4j-log4j12-1.6.1.jar from http://www.slf4j.org/dist/slf4j-1.6.1.tar.gz, it is what interfaces slf4j to log4j, you will also need to add log4j-1.2.16.jar to WEB-INF/lib. François On Jul 26, 2011, at 3:40 PM, O. Klein wrote: > > François Schiettecatte wrote: >>

Re: performance variation with respect to the index size

2011-07-26 Thread François Schiettecatte
Note that the Qtime in the response packet is the search, exclusive of > assembling the response so that's probably a good number to measure. > > Best > Erick > > On Fri, Jul 8, 2011 at 8:01 AM, jame vaalet wrote: >> i would prefer every setting to be in its defa

Re: schema.xml changes, need re-indexing ?

2011-07-27 Thread François Schiettecatte
I have not seen this mentioned anywhere, but I found a useful 'trick' to restart solr without having to restart tomcat. All you need to do is 'touch' the solr.xml in the solr.home directory. It can take a few seconds but solr will restart and reload any config. Cheers François On Jul 27, 201

Re: Solr can not index "F**K"!

2011-07-31 Thread François Schiettecatte
That seems a little far fetched, have you checked your analysis? François On Jul 31, 2011, at 4:58 PM, randohi wrote: > One of our clients (a hot girl!) brought this to our attention: > In this document there are many f* words: > > http://sec.gov/Archives/edgar/data/1474227/00014742271032/

Re: Solr can not index "F**K"!

2011-07-31 Thread François Schiettecatte
Indeed, the analysis will show if the term is a stop word, the term gets removed by the stop filter, turning on verbose output shows that. François On Jul 31, 2011, at 6:27 PM, Shashi Kant wrote: > Check your Stop words list > On Jul 31, 2011 6:25 PM, "François Schiettecatte"

Re: Solr 3.3 crashes after ~18 hours?

2011-08-02 Thread François Schiettecatte
Assuming you are running on Linux, you might want to check /var/log/messages too (the location might vary), I think the kernel logs forced process termination there. I recall that the kernel usually picks the process consuming the most memory, there may be other factors involved too. Franç

Re: SolrServer instances

2011-08-26 Thread François Schiettecatte
Sounds to me that you are looking for HTTP Persistent Connections (connection keep-alive as opposed to close), and a singleton object. This would be outside SOLR per se. A few caveats though, I am not sure if tomcat supports keep-alive, and I am not sure how SOLR deals with multiple requests co

Re: Error while decoding %DC (Ü) from URL - results in ?

2011-08-27 Thread François Schiettecatte
Merlin, Ü encodes to two bytes in utf-8 (%C3%9C), and one in iso-8859-1 (%DC) so it looks like there is a charset mismatch somewhere. Cheers François On Aug 27, 2011, at 6:34 AM, Merlin Morgenstern wrote: > Hello, > > I am having problems with searches that are issued from spiders that
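
The mismatch described above can be reproduced with the standard library alone. This is only an illustration of the two encodings, not a statement about where the mismatch sits in the poster's setup.

```python
from urllib.parse import quote, unquote

# Percent-encoding of U+00DC under the two charsets mentioned in the reply.
utf8_form = quote("Ü")                            # UTF-8 is the default
latin1_form = quote("Ü", encoding="iso-8859-1")

# Decoding the ISO-8859-1 form as UTF-8 fails: %DC alone is not valid UTF-8,
# so the decoder substitutes the replacement character U+FFFD.
decoded_as_utf8 = unquote(latin1_form)
decoded_as_latin1 = unquote(latin1_form, encoding="iso-8859-1")

if __name__ == "__main__":
    print(utf8_form, latin1_form)
    print(repr(decoded_as_utf8), repr(decoded_as_latin1))
```

A client that sends `%DC` to a server expecting UTF-8 therefore yields an undecodable byte, which typically shows up as a replacement character or a question mark.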

Re: difference between stored="false" and stored="true" ?

2012-06-30 Thread François Schiettecatte
Giovanni, stored="true" means the data is stored in the index and can be returned with the search results (see the 'fl' parameter). This is independent of indexed="true", which means that you can store but not index a field: Best regards François On Jun 30, 2012, at 9:57 AM, Giovanni Gherdovich wrote:

Re: Can't find solr.xml

2012-07-11 Thread François Schiettecatte
On Jul 11, 2012, at 2:52 PM, Shawn Heisey wrote: > On 7/2/2012 2:33 AM, Nabeel Sulieman wrote: >> Argh! (and hooray!) >> >> I started from scratch again, following the wiki instructions. I did only >> one thing differently; put my data directory in /opt instead of /home/dev. >> And now it works!

Re: The way to customize ranking?

2012-08-23 Thread François Schiettecatte
I would create two indices, one with your content and one with your ads. This approach would allow you to precisely control how many ads you pull back and how you merge them into the results, and you would be able to control schemas, boosting, defaults fields, etc for each index independently.

Re: recommended SSD

2012-08-23 Thread François Schiettecatte
You should check this at pcper.com: http://pcper.com/ssd-decoder http://pcper.com/content/SSD-Decoder-popup Specs for a wide range of SSDs. Best regards François On Aug 23, 2012, at 5:35 PM, Peyman Faratin wrote: > Hi > > Is there a SSD brand and spec that the community re

Re: Solr and Tomcat - problem with unicode characters

2012-08-28 Thread François Schiettecatte
What is probably going on is that the response is not being interpreted as UTF-8 but as some other encoding. What are you using to display the response? François On Aug 28, 2012, at 8:08 AM, zehoss wrote: > Hi, > at the beginning I would like to sorry for my english. I hope my message > will

Re: MMapDirectory, demand paging, lazy evaluation, ramfs and the much maligned RAMDirectory (oh my!)

2012-10-24 Thread François Schiettecatte
Aaron The best way to make sure the index is cached by the OS is to just cat it on startup: cat `find /path/to/solr/index` > /dev/null Just make sure your index is smaller than RAM otherwise data will be rotated out. Memory mapping is built on the virtual memory system, and I suspect
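
For setups where shelling out to `cat` is inconvenient, roughly the same warm-up can be sketched in Python by reading every file under the index directory and discarding the bytes. The function name `warm_page_cache` and the chunk size are assumptions; whether the pages stay cached is up to the OS and depends on free RAM, as noted above.

```python
import os


def warm_page_cache(index_dir: str, chunk_size: int = 1 << 20) -> int:
    """Read every file under index_dir once and return the bytes read.

    Hypothetical helper: the data is thrown away; the point is only to
    make the OS pull the files into its page cache.
    """
    total = 0
    for root, _dirs, files in os.walk(index_dir):
        for name in files:
            with open(os.path.join(root, name), "rb") as fh:
                while True:
                    chunk = fh.read(chunk_size)
                    if not chunk:
                        break
                    total += len(chunk)
    return total


if __name__ == "__main__":
    print(warm_page_cache("/path/to/solr/index"))
```

The returned byte count is a quick way to compare the index size against available memory before relying on the cache.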

Re: Is leading wildcard search turned on by default in Solr 3.6.1?

2012-11-12 Thread François Schiettecatte
John You can still use leading wildcards even if you don't have the ReversedWildcardFilterFactory in your analysis but it means you will be scanning the entire dictionary when the search is run which can be a performance issue. If you do use ReversedWildcardFilterFactory you won't have that perf

Re: Is leading wildcard search turned on by default in Solr 3.6.1?

2012-11-12 Thread François Schiettecatte
I suspect it is just part of the wildcard handling, maybe someone can chime in here, you may need to catch this before it gets to SOLR. François On Nov 12, 2012, at 5:44 PM, johnmu...@aol.com wrote: > Thanks for the quick response. > > > So, I do not want to use ReversedWildcardFilterFactory,
