Re: Highlighting in large text fields
Hi Mike,

Thanks for the quick help. I just added a call to Highlighter.setMaxDocBytesToAnalyze() to my local copy of HighlightingUtil.java and it worked all right. It would be great to have the limit for the docBytesToAnalyze configurable in solrconfig.xml. (But it's out of scope for me to implement this right now.)

--Christian

-Original Message-
From: Mike Klaas [mailto:[EMAIL PROTECTED]
Sent: Monday, June 25, 2007 19:34
To: solr-user@lucene.apache.org
Subject: Re: Highlighting in large text fields

On 25-Jun-07, at 4:59 AM, Burkamp, Christian wrote:

Hi list,

Highlighting does not work for words that are not located near the beginning of a text field. In my index the whole text is stored in a text field for highlighting purposes. This field is just stored but not indexed. The maxFieldLength was set to 10. The document content can be retrieved from the index without any problem, but for some terms highlighting does not return anything. This is the case for all words from position 9162 on. When I try to highlight the whole text (hl.fragsize=0) with some common word as query, it returns the highlighted content but just the first 9162 words. The rest is omitted. Any idea what might be going wrong? 9162 does not seem to be a standard limit for IT systems.

The Lucene highlighter by default only processes the first 50kB of text. This is probably something that should be made configurable. I'll add it to the future features.

-Mike
Problem with surrogate characters in utf-8
Hi all,

I have a problem after updating to Solr 1.2. I'm using the bundled Jetty that comes with the latest Solr release. Some of the contents that are stored in my index contain characters from the Unicode private use section above 0x10. (They are used by some proprietary software and the text extraction does not throw them out.)

In contrast to Solr 1.1, the current release returns these characters coded as a sequence of two surrogate characters. Could this result from some UTF-16 conversion that is taking place somewhere in the system? In fact, a look into the index with Luke suggests that Lucene is storing its data in UTF-16 encoding. The code point 0x100058 is stored as the two surrogate characters 0xDBC0 and 0xDC58. This is the same behaviour in Solr 1.1 and 1.2. But while Solr 1.1 puts the surrogates together to form one 4-byte UTF-8 character in the result, Solr 1.2 returns the UTF-8 codes for the two surrogate characters that I see using Luke. Unfortunately this results in invalid UTF-8 encoded text that (for example) cannot be displayed by Internet Explorer. A request like http://localhost:8983/solr/select?q=*:* results in an error message from the browser.

This is easy to reproduce if someone would like to debug it. I have attached a valid UTF-8 encoded XML document (utf.xml) that contains the 4-byte encoded code point 0x100058. It can be indexed with post.jar. Sending the request http://localhost:8983/solr/select?q=*:* via Internet Explorer then results in an error.

I tried the new Solr 1.2 war file with the old example distribution (Solr 1.1 and Jetty 5.1). Surprisingly enough, this does not reveal the problem. So the whole story might even be a Jetty issue. Any ideas?

-- Christian

utf.xml:

<?xml version="1.0" encoding="UTF-8"?>
<add>
  <doc>
    <field name="id">UTF8TEST</field>
    <field name="name">abcdefgôhijklmnop</field>
  </doc>
</add>
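The surrogate arithmetic described above is easy to verify. The following is a small illustration (not Solr code, just the encoding math) showing how code point 0x100058 maps to the surrogate pair 0xDBC0/0xDC58 and what its correct single 4-byte UTF-8 form looks like:

```python
# Illustration of the UTF-16 surrogate-pair math for U+100058
# (not Solr code; just the encoding arithmetic from the report).
cp = 0x100058

# Supplementary code points are split into a high and a low surrogate.
v = cp - 0x10000
high = 0xD800 + (v >> 10)    # high surrogate: 0xDBC0
low = 0xDC00 + (v & 0x3FF)   # low surrogate:  0xDC58

# The correct UTF-8 form is a single 4-byte sequence, not two
# 3-byte sequences encoding the individual surrogates.
utf8 = chr(cp).encode("utf-8")
print(hex(high), hex(low), utf8.hex())  # 0xdbc0 0xdc58 f4808198
```

Encoding each surrogate separately as UTF-8 (the Solr 1.2 behaviour described here) yields six bytes that no strict UTF-8 decoder will accept, which matches the Internet Explorer failure.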
Re: SOLR Indexing/Querying
Hi there,

It looks a lot like Solr's standard WordDelimiterFilter (see the sample schema.xml) does what you need. It splits on alphabetical-to-numeric boundaries and on the various kinds of intra-word delimiters like '-', '_' or '.'. You can decide whether the parts are put together again in addition to the split-up tokens; control this with the parameters catenateWords, catenateNumbers and catenateAll. Good documentation on this topic is found on the wiki: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#head-1c9b83870ca7890cd73b193cefed83c283339089

-- Christian

-Original Message-
From: Frans Flippo [mailto:[EMAIL PROTECTED]
Sent: Thursday, May 31, 2007 11:27
To: solr-user@lucene.apache.org
Subject: Re: SOLR Indexing/Querying

I think if you add a field that has an analyzer that creates tokens on alpha/digit/punctuation boundaries, that should go a long way. Use that both at index and search time. For example:

* 3555LHP becomes 3555 LHP. Searching for D3555 becomes D OR 3555, so it matches on token 3555 from 3555LHP.
* t14240 becomes t 14240. Searching for t14240-ss becomes t OR 14240 OR ss, matching 14240 from t14240.

Similarly for your other examples. If this proves to be too broad, you may need to define some stricter rules, but you could use this for starters. I think you will have to write your own analyzer, as it doesn't look like any of the analyzers available in Solr/Lucene do exactly what you need. But that's relatively straightforward. Just start with the code from one of the existing analyzers (e.g. KeywordAnalyzer).

Good luck,
Frans

On 5/31/07, realw5 [EMAIL PROTECTED] wrote:

Hey guys, I need some guidance in regards to a problem we are having with our Solr index. Below is a list of terms our customers search for, which are failing or not returning the complete set. The second side of the list is the product id/keyword we want it to match.
Can you give me some direction on how this can be done (or let me know if it can't) with index/query analyzers. Any help is much appreciated!

Dan

---
Keyword Typed In / We want it to find
D3555 / 3555LHP
D460160-BN / D460160
D460160BN / D460160
Dd454557 / D454557
84200ORB / 84200
84200-ORB / 84200
T13420-SCH / T13420
t14240-ss / t14240

--
View this message in context: http://www.nabble.com/SOLR-Indexing-Querying-tf3843221.html#a10883456
Sent from the Solr - User mailing list archive at Nabble.com.
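The alpha/digit boundary splitting discussed in this thread can be sketched with a simple regular expression. This is only an illustration of the matching idea, not Solr's actual WordDelimiterFilter or a custom analyzer:

```python
import re

def split_parts(token):
    """Split a token on letter/digit boundaries; delimiters such as
    '-' or '_' fall away because they match neither class.
    Illustrative only -- not Solr's WordDelimiterFilter."""
    return re.findall(r"[A-Za-z]+|[0-9]+", token)

def parts_overlap(query, indexed):
    """With OR semantics, a query term matches an indexed term
    when the two share at least one part."""
    return bool(set(split_parts(query)) & set(split_parts(indexed)))

print(split_parts("D460160-BN"))          # ['D', '460160', 'BN']
print(parts_overlap("D3555", "3555LHP"))  # True (shared part '3555')
```

Note that pure OR matching is broad: any term containing a lone "D" part would also overlap with "D3555", which is why the thread suggests stricter rules (or catenate options) if this proves too loose.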
Re: UTF-8 2-byte vs 4-byte encodings
Gereon,

The four bytes do not look like a valid UTF-8 encoded character. 4-byte characters in UTF-8 start with the bit pattern 11110 in the first byte. (For reference see the excellent Wikipedia article on UTF-8 encoding.) Your problem looks like someone interpreted your valid 2-byte UTF-8 encoded character as two single-byte characters in some fancy encoding. This happens if you send XML updates to Solr via HTTP without setting the encoding properly. It is not sufficient to set the encoding in the XML; you need an additional HTTP header to set the encoding (Content-Type: text/xml; charset=UTF-8).

--Christian

-Original Message-
From: Gereon Steffens [mailto:[EMAIL PROTECTED]
Sent: Wednesday, May 2, 2007 09:59
To: solr-user@lucene.apache.org
Subject: UTF-8 2-byte vs 4-byte encodings

Hi,

I have a question regarding UTF-8 encodings, illustrated by the utf8-example.xml file. This file contains raw, unescaped UTF-8 characters, for example the e-acute character, represented as the two bytes 0xC3 0xA9. When this file is added to Solr and retrieved later, the XML output contains a four-byte representation of that character, namely 0xC2 0x83 0xC2 0xA9. If, on the other hand, the input data contains this same character as an entity (#A9;), the output contains the two-byte encoded representation 0xC3 0xA9. Why is that so, and is there a way to always get characters like these out of Solr as their two-byte representations? The reason I'm asking is that I often have to deal with CDATA sections in my input files that contain raw (two-byte) UTF-8 characters that can't be encoded as entities.

Thanks,
Gereon
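The failure mode Christian describes -- a valid 2-byte UTF-8 sequence misread byte-by-byte in a single-byte encoding and encoded again -- is easy to reproduce. Here is a small illustration, assuming Latin-1 as the intermediate encoding (which may differ from what actually happened in Gereon's setup):

```python
# 'é' (e-acute, U+00E9) is two bytes in UTF-8:
good = "é".encode("utf-8")
assert good == b"\xc3\xa9"

# If those two bytes are misread as two Latin-1 characters and
# re-encoded as UTF-8, each byte becomes its own 2-byte sequence:
mangled = good.decode("latin-1").encode("utf-8")
print(mangled)  # b'\xc3\x83\xc2\xa9' - four bytes instead of two
```

The exact mangled bytes depend on which single-byte encoding was assumed in between, but the doubling from two bytes to four is the signature of this kind of double encoding.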
Re: Help with Setup
Hi,

You can use curl with a file if you put the @ char in front of its name (otherwise curl expects the data on the command line):

curl http://localhost:8080/solr/update --data-binary @articles.xml

-Original Message-
From: Sean Bowman [mailto:[EMAIL PROTECTED]
Sent: Thursday, April 26, 2007 23:32
To: solr-user@lucene.apache.org
Subject: Re: Help with Setup

Try:

curl http://localhost:8080/solr/update --data-binary '<add><doc><field name="id">2008</field><field name="storyText">The Rain in Spain Falls Mainly In The Plain</field></doc></add>'

And see if that works. I don't think curl lets you put a filename in for the --data-binary parameter. It has to be the actual data, though something like this might also work:

curl http://localhost:8080/solr/update --data-binary `cat articles.xml`

Those are backticks, not apostrophes.

On 4/26/07, Ryan McKinley [EMAIL PROTECTED] wrote:

paladin:/data/solr mtorgler1$ curl http://localhost:8080/solr/update --data-binary articles.xml
<result status="1">org.xmlpull.v1.XmlPullParserException: only whitespace content allowed before start tag and not a (position: START_DOCUMENT seen a... @1:1) at org.xmlpull.mxp1.MXParser.parseProlog(MXParser.java:1519) at org.xmlpull.mxp1.MXParser.nextImpl(MXParser.java:1395)

My guess is you have some funny character at the start of the document. I have seen funny chars show up when I edit a UTF-8 file and save it as ASCII. If you don't see it in your normal editor, try a different one. If that does not help, start with the working example and modify it a little bit at a time...

ryan
Avoiding caching of special filter queries
Hi,

I'm using filter queries to implement document-level security with Solr. The caching mechanism for filters separate from queries comes in handy, and the system performs well once all the filters for the users of the system are stored in the cache. However, I'm storing the full document content in the index for the purpose of highlighting. In addition to the standard snippet highlighting, I would like to offer a feature that displays the highlighted full document content. I can add a filter query to select just the needed document by ID, but this filter would go into the filter cache as well, possibly throwing out some of the other useful filters. Is there a way to get the single document with highlighting info but without polluting the filter cache?

-- Christian
Re: Avoiding caching of special filter queries
Hi Erik,

No, what I need to do is q=my funny query&fq=user:erik&fq=id:docId&hl=on ... This is because the StandardRequestHandler needs the original query to do proper highlighting. The user gets his paginated result page with his next 10 hits. He can then select one document for highlighting. Then I just repeat the last request with an additional filter query to select this one document, and add the highlighting parameters.

-- Christian

-Original Message-
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Friday, April 20, 2007 15:43
To: solr-user@lucene.apache.org
Subject: Re: Avoiding caching of special filter queries

On Apr 20, 2007, at 7:11 AM, Burkamp, Christian wrote:

I'm using filter queries to implement document-level security with Solr. The caching mechanism for filters separate from queries comes in handy, and the system performs well once all the filters for the users of the system are stored in the cache. However, I'm storing the full document content in the index for the purpose of highlighting. In addition to the standard snippet highlighting I would like to offer a feature that displays the highlighted full document content. I can add a filter query to select just the needed document by ID, but this filter would go into the filter cache as well, possibly throwing out some of the other useful filters. Is there a way to get the single document with highlighting info but without polluting the filter cache?

Correct me if I'm wrong, but here's my understanding... q=id:docId&fq=user:erik is what you'd want to do. q=id:docId won't go into the filter cache, but rather the query cache, and the document itself goes into the document cache. So you won't risk bumping things out of the filter cache by using queries.

Erik
Re: solr performance
I do agree; there's probably no need to go to the index directly. My current Solr test server has more than 5M documents and a size of about 60GB. I still index at 13 docs per second, and this still includes filtering of the documents. (If you have your content ready in XML format, performance will be even better.) It seems to me that indexing performance does not drop as the index grows. Optimizing the index does, however, take huge amounts of time for large indexes.

--Christian

-Original Message-
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Tuesday, February 20, 2007 11:43
To: solr-user@lucene.apache.org
Subject: Re: solr performance

You could build your index using Lucene directly and then point a Solr instance at it once it's built. My suspicion is that the overhead of forming a document as an XML string and posting it to Solr via HTTP won't be that much different from indexing with Lucene directly. My largest Solr index is currently at 1.4M documents, and it takes a max of 3ms to add a document (according to Solr's console), most of them 1ms. My single-threaded indexer is indexing around 1000 documents per minute, but I think I can get this number even faster by parallelizing the indexer. I'm curious what rates others are indexing at?

Erik

On Feb 20, 2007, at 2:21 AM, Jack L wrote:

Hello, I have a question about Solr's performance of accepting inserts and indexing. If I have 10 million documents that I'd like to index, I suppose it will take some time to submit them to Solr. Is there any faster way to do this than through the web interface?

--
Best regards,
Jack

__
Do You Yahoo!? Tired of spam? Yahoo! Mail has the best spam protection around http://mail.yahoo.com
Re: highlight search keywords on html page
I was thinking about the same thing. It shouldn't be too difficult to subclass SolrRequestHandler and build a special HighlightingRequestHandler that uses the built-in highlighting utils to do the job. I wonder if it's possible to get access to the HTTP request body inside a SolrRequestHandler subclass. (The raw text to be highlighted would have to be passed to Solr as the body of an HTTP request.) Storing the raw text in the Solr index is a reasonable solution for small indexes only.

--Christian

-Original Message-
From: Chris Hostetter [mailto:[EMAIL PROTECTED]
Sent: Monday, February 19, 2007 03:00
To: solr-user@lucene.apache.org
Subject: Re: highlight search keywords on html page

: When a user performs a search, I will return a list of links containing
: highlighted fragments from pageContent. If a link is clicked, I want to
: return the associated raw html back to the user AND have search keywords
: in it highlighted, just like a Google cached page.

I'm not really sure that Solr can help you in this case ... it only knows about the data you give it -- if you want it to highlight the raw html of the entire page, then you're going to need to store the raw html of the entire page in the index. You can still highlight pageContent with heavy fragmentation on your main search page where you list multiple results, and then when a user picks one, redo the search with an fq restricting to the doc they picked and hl.fl=rawHtml and hl.fragsize=0, so you'll get the whole thing highlighted without fragmentation.

-Hoss
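The follow-up request Hoss describes can be sketched as a parameter set. The query text and document id below are hypothetical placeholders; only the hl parameters are taken from his description:

```python
from urllib.parse import urlencode

# Re-run the original search restricted to the picked document,
# asking for the whole rawHtml field highlighted in one piece.
params = urlencode({
    "q": "original query terms",  # the same query the user ran
    "fq": "id:doc42",             # hypothetical picked document
    "hl": "on",
    "hl.fl": "rawHtml",           # stored field holding the raw page
    "hl.fragsize": 0,             # 0 = no fragmentation
})
print(params)
```

The resulting string would be appended to the select URL; because the document restriction is an fq on a unique id, the hit list collapses to the one page the user clicked.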
Re: performance testing practices
Hi there,

I am working on some performance numbers too, as part of my evaluation of Solr. I'm planning to replace a legacy search engine and have to find out if this is possible with Solr. I have loaded 1.1 million documents into Solr by now. Indexing speed is not a big concern for me; I got about 17 documents per second while my indexing client is still only a Python prototype with a very slow filtering engine based on the Windows IFilter interface.

I'm measuring the search performance with a Python client that continually queries Solr. It grabs a random word from the results and uses it for the next search. For every search request, the time from sending the request until receiving the response is taken. Every query uses one word as search text and one word as filter query text. Highlighting is on.

Some first results, with Solr loaded with 112 documents:

- Max queries per second: 14.5
- Average request duration with only 1 client: 0.08 s

My criterion of 90% of requests completing in less than 1 second is met with a maximum of 10 parallel clients. I expect to serve at least 300 users with one system like this. (Measured on a single-CPU Pentium 4 3GHz, 2GB RAM, internal standard ATA drive.)

The next step will be to increase the number of documents until I reach the point where no request is completed in less than 1 second. (From this point on, no amount of replication can bring me back to production performance.)

I have a few questions, too:

- What size is the largest known Solr server?
- What number of documents do you think can be handled by Solr?
- Solr is using only one Lucene index. There has been a thread about this before, but it was more related to bringing together different Lucene indexes under one Solr server. I potentially need a solution for up to 500 million documents. I believe this will not work without splitting the index. What do you think?
- Does anybody have their own performance numbers they would share?
- Solr was running under Jetty for my performance tests.
What container is best suited for high performance?

Thanks a lot for the inspiring talk going on on this mailing list.

Christian

-Original Message-
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Monday, February 5, 2007 11:23
To: solr-user@lucene.apache.org
Subject: performance testing practices

This week I'm going to be incrementally loading up to 3.7M records into Solr, in 50k chunks. I'd like to capture some performance numbers after each chunk to see how she holds up. What numbers are folks capturing? What techniques are you using to capture numbers? I'm not looking for anything elaborate, as the goal is really to see how faceting fares as more data is loaded. We've got some ugly data in our initial experiment, so the faceting concerns me.

Thanks,
Erik
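As a rough sketch of the kind of measurement loop Christian describes earlier in this thread: the fetch callable below stands in for the actual HTTP round trip to Solr, and the word-picking is simplified, so this is an illustration of the method rather than the real benchmark client.

```python
import random
import time

def timed_search(fetch, query, filter_word):
    """Time one search request. fetch(query, filter_word) performs
    the HTTP round trip and returns the response text."""
    start = time.perf_counter()
    body = fetch(query, filter_word)
    return body, time.perf_counter() - start

def run_bench(fetch, seed_word, requests=100):
    """Issue a chain of requests, picking a random word from each
    response as the next query term, and report the duration at the
    90th percentile (matching the 90%-under-1-second criterion)."""
    durations, word = [], seed_word
    for _ in range(requests):
        body, elapsed = timed_search(fetch, word, word)
        durations.append(elapsed)
        words = body.split()
        if words:
            word = random.choice(words)
    durations.sort()
    return durations[int(0.9 * len(durations))]
```

Chaining the next query off a random word from the previous response keeps the term distribution roughly realistic while defeating simple result caching, which is presumably why the original client works this way.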