Re: Indexing very large files.
Yonik Seeley wrote: On 9/5/07, Brian Carmalt [EMAIL PROTECTED] wrote: I've been trying to index a 300MB file to solr 1.2. I keep getting out of memory heap errors. 300MB of what... a single 300MB document? Or does that file represent multiple documents in XML or CSV format? -Yonik Hello Yonik, Thank you for your fast reply. It is one large document. If it was made up of smaller docs, I would split it up and index them separately. Can Solr be made to handle such large docs? Thanks, Brian
Re: Indexing very large files.
Hello again, I run Solr on Tomcat under Windows and use the Tomcat monitor to start the service. I have set the minimum heap size to 512 MB and the maximum to 1024 MB. The system has 2 GB of RAM. The error that I get after sending approximately 300 MB is: java.lang.OutOfMemoryError: Java heap space at org.xmlpull.mxp1.MXParser.fillBuf(MXParser.java:2947) at org.xmlpull.mxp1.MXParser.more(MXParser.java:3026) at org.xmlpull.mxp1.MXParser.nextImpl(MXParser.java:1384) at org.xmlpull.mxp1.MXParser.next(MXParser.java:1093) at org.xmlpull.mxp1.MXParser.nextText(MXParser.java:1058) at org.apache.solr.handler.XmlUpdateRequestHandler.readDoc(XmlUpdateRequestHandler.java:332) at org.apache.solr.handler.XmlUpdateRequestHandler.update(XmlUpdateRequestHandler.java:162) at org.apache.solr.handler.XmlUpdateRequestHandler.handleRequestBody(XmlUpdateRequestHandler.java:84) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:77) at org.apache.solr.core.SolrCore.execute(SolrCore.java:658) at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:191) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:159) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206) at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:230) at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175) at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128) at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:104) at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109) at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:261) at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:844) at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:581) at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447) at java.lang.Thread.run(Thread.java:619) After sleeping on the problem I see that it does not directly stem from Solr, but from the module org.xmlpull.mxp1.MXParser. Hmmm. I'm open to suggestions and ideas. First, is this doable? If yes, will I have to modify the code to save the file to disk and then read it back in order to index it in chunks? Or can I get it working on a stock Solr install? Thanks, Brian Norberto Meijome wrote: On Wed, 05 Sep 2007 17:18:09 +0200 Brian Carmalt [EMAIL PROTECTED] wrote: I've been trying to index a 300MB file to solr 1.2. I keep getting out of memory heap errors. Even on an empty index with one gig of VM memory it still won't work. Hi Brian, VM != heap memory. VM = OS memory; heap memory = memory made available by the Java VM to the Java process. Heap memory errors are hardly ever an issue of the app itself (other than, of course, with bad programming... but that doesn't seem to be the issue here so far) [EMAIL PROTECTED] [Thu Sep 6 14:59:21 2007] /usr/home/betom $ java -X [...] -Xms<size> set initial Java heap size -Xmx<size> set maximum Java heap size -Xss<size> set java thread stack size [...] For example, start Solr as: java -Xms64m -Xmx512m -jar start.jar YMMV with respect to the actual values you use. Good luck, B _ {Beto|Norberto|Numard} Meijome Windows caters to everyone as though they are idiots. UNIX makes no such assumption.
It assumes you know what you are doing, and presents the challenge of figuring it out for yourself if you don't. I speak for myself, not my employer. Contents may be hot. Slippery when wet. Reading disclaimers makes you go blind. Writing them is worse. You have been Warned.
Re: Indexing very large files.
On Thu, 2007-09-06 at 08:55 +0200, Brian Carmalt wrote: Hello again, I run Solr on Tomcat under Windows and use the Tomcat monitor to start the service. I have set the minimum heap size to 512 MB and the maximum to 1024 MB. The system has 2 GB of RAM. The error that I get after sending approximately 300 MB is: java.lang.OutOfMemoryError: Java heap space at org.xmlpull.mxp1.MXParser.fillBuf(MXParser.java:2947) at org.xmlpull.mxp1.MXParser.more(MXParser.java:3026) at org.xmlpull.mxp1.MXParser.nextImpl(MXParser.java:1384) at org.xmlpull.mxp1.MXParser.next(MXParser.java:1093) at org.xmlpull.mxp1.MXParser.nextText(MXParser.java:1058) at org.apache.solr.handler.XmlUpdateRequestHandler.readDoc(XmlUpdateRequestHandler.java:332) at org.apache.solr.handler.XmlUpdateRequestHandler.update(XmlUpdateRequestHandler.java:162) at org.apache.solr.handler.XmlUpdateRequestHandler.handleRequestBody(XmlUpdateRequestHandler.java:84) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:77) at org.apache.solr.core.SolrCore.execute(SolrCore.java:658) at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:191) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:159) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206) at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:230) at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175) at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128) at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:104) at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109) at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:261) at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:844) at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:581) at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447) at java.lang.Thread.run(Thread.java:619) After sleeping on the problem I see that it does not directly stem from Solr, but from the module org.xmlpull.mxp1.MXParser. Hmmm. I'm open to suggestions and ideas. Which version of Solr are you using? http://svn.apache.org/viewvc/lucene/solr/trunk/src/java/org/apache/solr/handler/XmlUpdateRequestHandler.java?view=markup The trunk version of the XmlUpdateRequestHandler is now based on StAX. You may want to try whether that works better. Please try and report back. salu2 -- Thorsten Scherler thorsten.at.apache.org Open Source Java consulting, training and solutions
Tagging using SOLR
Dear all, We are running an application built using SOLR. We are now trying to build a tagging system using the existing SOLR indexed field called tag_keywords; this field has different keywords separated by commas. Please give suggestions on how we can build a tagging system using this field. Thanks, Mohandoss.
Re: Indexing very large files.
Moin Thorsten, I am using Solr 1.2.0. I'll try the svn version out and see if that helps. Thanks, Brian Which version of Solr are you using? http://svn.apache.org/viewvc/lucene/solr/trunk/src/java/org/apache/solr/handler/XmlUpdateRequestHandler.java?view=markup The trunk version of the XmlUpdateRequestHandler is now based on StAX. You may want to try whether that works better. Please try and report back. salu2
solr.py problems with german Umlaute
Hi all, I try to add/update documents with the Python solr.py API. Everything works fine so far, but if I try to add a document which contains German umlauts (ö, ä, ü, ...) I get errors. Maybe someone has an idea how I could convert my data? Should I post this to JIRA? Thanks for help. Btw: I have no sitecustomize.py. This is my script:
--
from solr import *
title="Übersicht"
kw = {'id':'12','title':title,'system':'plone','url':'http://www.google.de'}
c = SolrConnection('http://192.168.2.13:8080/solr')
c.add_many([kw,])
c.commit()
--
This is the error:
File "t.py", line 5, in ? c.add_many([kw,])
File "/usr/local/lib/python2.4/site-packages/solr.py", line 596, in add_many self.__add(lst, doc)
File "/usr/local/lib/python2.4/site-packages/solr.py", line 710, in __add lst.append('<field name="%s">%s</field>' % (
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
Re: Indexing very large files.
On Thu, 2007-09-06 at 11:26 +0200, Brian Carmalt wrote: Hello again, I checked out the Solr source and built the 1.3-dev version and then I tried to index the same file to the new server. I do get a different exception trace, but the result is the same. java.lang.OutOfMemoryError: Java heap space at java.util.Arrays.copyOf(Arrays.java:2882) at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:100) It seems that you are reaching the limits because of the StringBuilder. Did you try to raise the memory to the max, like: java -Xms1536m -Xmx1788m -jar start.jar Anyway, you will have to look into SolrInputDocument readDoc(XMLStreamReader parser) throws XMLStreamException { ... StringBuilder text = new StringBuilder(); ... case XMLStreamConstants.CHARACTERS: text.append( parser.getText() ); break; ... The problem is that the text object is bigger than the heap; maybe invoking garbage collection beforehand will help. salu2 -- Thorsten Scherler thorsten.at.apache.org Open Source Java consulting, training and solutions
Re: Tagging using SOLR
On Sep 6, 2007, at 3:29 AM, Doss wrote: We are running an application built using SOLR. We are now trying to build a tagging system using the existing SOLR indexed field called tag_keywords; this field has different keywords separated by commas. Please give suggestions on how we can build a tagging system using this field. There is also a wiki page with some brainstorming on how to implement tagging within Solr: http://wiki.apache.org/solr/UserTagDesign It's easy enough to have a tag_keywords field, but updating a single tag_keywords field is not so straightforward without sending the entire document to Solr every time it is tagged. See SOLR-139's extensive comments and patches to see what you're getting into. Erik
Re: Replication broken.. no helpful errors?
The snapinstaller script opens a new searcher by calling commit. From the attached debug output it looks like that actually worked: + /opt/solr/bin/commit + [[ 0 != 0 ]] + logExit ended 0 Try running the /opt/solr/bin/commit directly with the -V option. Bill On 9/5/07, Matthew Runo [EMAIL PROTECTED] wrote: If it helps anyone, this index is around a gig in size. ++ | Matthew Runo | Zappos Development | [EMAIL PROTECTED] | 702-943-7833 ++ On Sep 5, 2007, at 3:14 PM, Matthew Runo wrote: It seems that the scripts cannot open new searchers at the end of the process, for some reason. Here's a message from cron, but I'm not sure what to make of it... It looks like the files properly copied over, but failed the install. I removed the temp* directory, but still SOLR could not launch a new searcher. I don't see any activity in catalina.out though... started by tomcat5 command: /opt/solr/bin/snappuller -M search1 -P 18080 -D /opt/solr/ data -S /opt/solr/logs -d /opt/solr/data -v pulling snapshot temp-snapshot.20070905150504 receiving file list ... done deleting segments_1ine deleting _164h_1.del deleting _164h.tis deleting _164h.tii deleting _164h.prx deleting _164h.nrm deleting _164h.frq deleting _164h.fnm deleting _164h.fdx deleting _164h.fdt deleting _164g_1.del deleting _164g.tis deleting _164g.tii deleting _164g.prx deleting _164g.nrm deleting _164g.frq deleting _164g.fnm deleting _164g.fdx deleting _164g.fdt deleting _164f_1.del deleting _164f.tis deleting _164f.tii deleting _164f.prx deleting _164f.nrm deleting _164f.frq deleting _164f.fnm deleting _164f.fdx deleting _164f.fdt deleting _164e_1.del deleting _164e.tis deleting _164e.tii deleting _164e.prx deleting _164e.nrm deleting _164e.frq deleting _164e.fnm deleting _164e.fdx deleting _164e.fdt deleting _164d_1.del deleting _164d.tis deleting _164d.tii deleting _164d.prx deleting _164d.nrm deleting _164d.frq deleting _164d.fnm deleting _164d.fdx deleting _164d.fdt deleting _164c_1.del deleting _164c.tis deleting _164c.tii deleting _164c.prx deleting _164c.nrm deleting _164c.frq deleting _164c.fnm deleting _164c.fdx deleting _164c.fdt deleting _164b_1.del deleting _164b.tis deleting _164b.tii deleting _164b.prx deleting _164b.nrm deleting _164b.frq deleting _164b.fnm deleting _164b.fdx deleting _164b.fdt deleting _164a_1.del deleting _164a.tis deleting _164a.tii deleting _164a.prx deleting _164a.nrm deleting _164a.frq deleting _164a.fnm deleting _164a.fdx deleting _164a.fdt deleting _163z_3.del deleting _163z.tis deleting _163z.tii deleting _163z.prx deleting _163z.nrm deleting _163z.frq deleting _163z.fnm deleting _163z.fdx deleting _163z.fdt deleting _163o_3.del deleting _163o.tis deleting _163o.tii deleting _163o.prx deleting _163o.nrm deleting _163o.frq deleting _163o.fnm deleting _163o.fdx deleting _163o.fdt deleting _163d_4.del deleting _163d.tis deleting _163d.tii deleting _163d.prx deleting _163d.nrm deleting _163d.frq deleting _163d.fnm deleting _163d.fdx deleting _163d.fdt deleting _1632_6.del deleting _1632.tis deleting _1632.tii deleting _1632.prx deleting _1632.nrm deleting _1632.frq deleting _1632.fnm deleting _1632.fdx deleting _1632.fdt deleting _162r_7.del deleting _162r.tis deleting _162r.tii deleting _162r.prx deleting _162r.nrm deleting _162r.frq deleting _162r.fnm deleting _162r.fdx deleting _162r.fdt deleting _162g_d.del deleting _162g.tis deleting _162g.tii deleting _162g.prx deleting _162g.nrm deleting _162g.frq deleting _162g.fnm deleting _162g.fdx deleting _162g.fdt deleting _1625_m.del deleting _1625.tis deleting 
_1625.tii deleting _1625.prx deleting _1625.nrm deleting _1625.frq deleting _1625.fnm deleting _1625.fdx deleting _1625.fdt deleting _161u_w.del deleting _161u.tis deleting _161u.tii deleting _161u.prx deleting _161u.nrm deleting _161u.frq deleting _161u.fnm deleting _161u.fdx deleting _161u.fdt deleting _161j_16.del ./ _161j_17.del _164m.fdt _164m.fdx _164m.fnm _164m.frq _164m.nrm _164m.prx _164m.tii _164m.tis _164m_1.del _164x.fdt _164x.fdx _164x.fnm _164x.frq _164x.nrm _164x.prx _164x.tii _164x.tis _164x_1.del segments.gen segments_1inv sent 516 bytes received 105864302 bytes 30247090.86 bytes/sec total size is 966107226 speedup is 9.13 + [[ -z search1 ]] + [[ -z /opt/solr/logs ]] + fixUser -M search1 -S /opt/solr/logs -d /opt/solr/data -V + [[ -z tomcat5 ]] ++ whoami + [[ tomcat5 != tomcat5 ]] ++ who -m ++ cut '-d ' -f1 ++ sed '-es/^.*!//' + oldwhoami= + [[ '' == '' ]] +++ pgrep -g0 snapinstaller
RSS syndication Plugin
Hi all, I am curious whether somebody has written an RSS plugin for Solr. The idea is to provide an RSS syndication link for the current search. It should be really easy to implement since it would be just a transformation from solrXml to RSS, which can easily be done with a simple XSL. Has somebody already done this? salu2 -- Thorsten Scherler thorsten.at.apache.org Open Source Java consulting, training and solutions
Re: RSS syndication Plugin
perhaps: https://issues.apache.org/jira/browse/SOLR-208 in http://svn.apache.org/repos/asf/lucene/solr/trunk/example/solr/conf/xslt/ check: example_atom.xsl example_rss.xsl Thorsten Scherler wrote: Hi all, I am curious whether somebody has written a rss plugin for solr. The idea is to provide a rss syndication link for the current search. It should be really easy to implement since it would be just a transformation solrXml - RSS which easily can be done with a simple xsl. Has somebody already done this? salu2
Re: Distribution Information?
That is very strange. Even if there is something wrong with the config or code, the static HTML contained in distributiondump.jsp should show up. Are you using the latest version of the JSP? There has been a recent fix: http://issues.apache.org/jira/browse/SOLR-333 Bill On 9/5/07, Matthew Runo [EMAIL PROTECTED] wrote: When I load the distrobutiondump.jsp, there is no output in my catalina.out file. ++ | Matthew Runo | Zappos Development | [EMAIL PROTECTED] | 702-943-7833 ++ On Sep 5, 2007, at 1:55 PM, Matthew Runo wrote: Not that I've noticed. I'll do a more careful grep soon here - I just got back from a long weekend. ++ | Matthew Runo | Zappos Development | [EMAIL PROTECTED] | 702-943-7833 ++ On Aug 31, 2007, at 6:12 PM, Bill Au wrote: Are there any error message in your appserver log files? Bill On 8/31/07, Matthew Runo [EMAIL PROTECTED] wrote: Hello! /solr/admin/distributiondump.jsp This server is set up as a master server, and other servers use the replication scripts to pull updates from it every few minutes. My distribution information screen is blank.. and I couldn't find any information on fixing this in the wiki. Any chance someone would be able to explain how to get this page working, or what I'm doing wrong? ++ | Matthew Runo | Zappos Development | [EMAIL PROTECTED] | 702-943-7833 ++
update servlet not working
Hi, We have the example Solr installed with Jetty. We are able to navigate to the solr/admin page, but when we try to POST an XML document via the command line, there is a fatal error. It seems that the solr/update servlet isn't running, giving an HTTP 400 error. Does anyone have any clue what is going on? Thanks in advance! -- cheers, ben
Re: update servlet not working
: We are able to navigate to the solr/admin page, but when we try to : POST an xml document via the command line, there is a fatal error. It : seems that the solr/update servlet isnt running, giving a http 400 : error. a 400 could mean a lot of things ... what is the full HTTP response you get back from Solr? what kinds of Stack traces show up in the Jetty log output? -Hoss
Re: Distribution Information?
Well, I do get... Distribution Info Master Server No distribution info present ... But there appears to be no information filled in. ++ | Matthew Runo | Zappos Development | [EMAIL PROTECTED] | 702-943-7833 ++ On Sep 6, 2007, at 6:09 AM, Bill Au wrote: That is very strange. Even if there is something wrong with the config or code, the static HTML contained in distributiondump.jsp should show up. Are you using the latest version of the JSP? There has been a recent fix: http://issues.apache.org/jira/browse/SOLR-333 Bill On 9/5/07, Matthew Runo [EMAIL PROTECTED] wrote: When I load the distrobutiondump.jsp, there is no output in my catalina.out file. ++ | Matthew Runo | Zappos Development | [EMAIL PROTECTED] | 702-943-7833 ++ On Sep 5, 2007, at 1:55 PM, Matthew Runo wrote: Not that I've noticed. I'll do a more careful grep soon here - I just got back from a long weekend. ++ | Matthew Runo | Zappos Development | [EMAIL PROTECTED] | 702-943-7833 ++ On Aug 31, 2007, at 6:12 PM, Bill Au wrote: Are there any error message in your appserver log files? Bill On 8/31/07, Matthew Runo [EMAIL PROTECTED] wrote: Hello! /solr/admin/distributiondump.jsp This server is set up as a master server, and other servers use the replication scripts to pull updates from it every few minutes. My distribution information screen is blank.. and I couldn't find any information on fixing this in the wiki. Any chance someone would be able to explain how to get this page working, or what I'm doing wrong? ++ | Matthew Runo | Zappos Development | [EMAIL PROTECTED] | 702-943-7833 ++
RE: Indexing very large files.
Now I'm curious: what is the use case for documents this large? Thanks, Lance Norskog
Re: Replication broken.. no helpful errors?
The thing is that a new searcher is not opened if I look in the stats.jsp page. The index version never changes. When I run.. sudo /opt/solr/bin/commit -V -u tomcat5 ..I get a new searcher opened, but even though it (in theory) installed the new index, I see no docs in there. During the snapinstaller... + echo 2007/09/06 11:43:49 command: /opt/solr/bin/snapinstaller -M search1 -S /opt/solr/logs -d /opt/solr/data -V -u tomcat5 + [[ -n '' ]] ++ ls /opt/solr/data ++ grep 'snapshot\.' ++ grep -v wip ++ sort -r ++ head -1 + name=temp-snapshot.20070905150504 + trap 'echo caught INT/TERM, exiting now but partial installation may have already occured;/bin/rm -rf ${data_dir/index.tmp$$;logExit aborted 13' INT TERM + [[ temp-snapshot.20070905150504 == '' ]] + name=/opt/solr/data/temp-snapshot.20070905150504 ++ cat /opt/solr/logs/snapshot.current ...it would seem that snappuller might not be properly setting the directory name - or should it be temp-*? I had replication working for a few weeks, and then it broke, and has been down since. We're going live with this project in about a week, and I really need to get this going before then =p ++ | Matthew Runo | Zappos Development | [EMAIL PROTECTED] | 702-943-7833 ++ On Sep 6, 2007, at 6:01 AM, Bill Au wrote: The snapinstaller script opens a new searcher by calling commit. From the attached debug output it looks like that actually worked: + /opt/solr/bin/commit + [[ 0 != 0 ]] + logExit ended 0 Try running the /opt/solr/bin/commit directly with the -V option. Bill On 9/5/07, Matthew Runo [EMAIL PROTECTED] wrote: If it helps anyone, this index is around a gig in size. ++ | Matthew Runo | Zappos Development | [EMAIL PROTECTED] | 702-943-7833 ++ On Sep 5, 2007, at 3:14 PM, Matthew Runo wrote: It seems that the scripts cannot open new searchers at the end of the process, for some reason. Here's a message from cron, but I'm not sure what to make of it... It looks like the files properly copied over, but failed the install. I removed the temp* directory, but still SOLR could not launch a new searcher. I don't see any activity in catalina.out though... started by tomcat5 command: /opt/solr/bin/snappuller -M search1 -P 18080 -D /opt/solr/ data -S /opt/solr/logs -d /opt/solr/data -v pulling snapshot temp-snapshot.20070905150504 receiving file list ... 
done deleting segments_1ine deleting _164h_1.del deleting _164h.tis deleting _164h.tii deleting _164h.prx deleting _164h.nrm deleting _164h.frq deleting _164h.fnm deleting _164h.fdx deleting _164h.fdt deleting _164g_1.del deleting _164g.tis deleting _164g.tii deleting _164g.prx deleting _164g.nrm deleting _164g.frq deleting _164g.fnm deleting _164g.fdx deleting _164g.fdt deleting _164f_1.del deleting _164f.tis deleting _164f.tii deleting _164f.prx deleting _164f.nrm deleting _164f.frq deleting _164f.fnm deleting _164f.fdx deleting _164f.fdt deleting _164e_1.del deleting _164e.tis deleting _164e.tii deleting _164e.prx deleting _164e.nrm deleting _164e.frq deleting _164e.fnm deleting _164e.fdx deleting _164e.fdt deleting _164d_1.del deleting _164d.tis deleting _164d.tii deleting _164d.prx deleting _164d.nrm deleting _164d.frq deleting _164d.fnm deleting _164d.fdx deleting _164d.fdt deleting _164c_1.del deleting _164c.tis deleting _164c.tii deleting _164c.prx deleting _164c.nrm deleting _164c.frq deleting _164c.fnm deleting _164c.fdx deleting _164c.fdt deleting _164b_1.del deleting _164b.tis deleting _164b.tii deleting _164b.prx deleting _164b.nrm deleting _164b.frq deleting _164b.fnm deleting _164b.fdx deleting _164b.fdt deleting _164a_1.del deleting _164a.tis deleting _164a.tii deleting _164a.prx deleting _164a.nrm deleting _164a.frq deleting _164a.fnm deleting _164a.fdx deleting _164a.fdt deleting _163z_3.del deleting _163z.tis deleting _163z.tii deleting _163z.prx deleting _163z.nrm deleting _163z.frq deleting _163z.fnm deleting _163z.fdx deleting _163z.fdt deleting _163o_3.del deleting _163o.tis deleting _163o.tii deleting _163o.prx deleting _163o.nrm deleting _163o.frq deleting _163o.fnm deleting _163o.fdx deleting _163o.fdt deleting _163d_4.del deleting _163d.tis deleting _163d.tii deleting _163d.prx deleting _163d.nrm deleting _163d.frq deleting _163d.fnm deleting _163d.fdx deleting _163d.fdt deleting _1632_6.del deleting _1632.tis deleting _1632.tii deleting _1632.prx deleting _1632.nrm deleting _1632.frq deleting _1632.fnm deleting _1632.fdx deleting _1632.fdt deleting _162r_7.del deleting _162r.tis deleting _162r.tii deleting _162r.prx deleting _162r.nrm deleting _162r.frq deleting _162r.fnm deleting _162r.fdx deleting _162r.fdt deleting _162g_d.del deleting _162g.tis deleting _162g.tii deleting _162g.prx deleting _162g.nrm deleting
Re: updates on the server
On a related note, it'd be great if we could set up a series of transformations to be done on data when it comes into the index, before being indexed. I guess a custom tokenizer might be the best way to do this though..? ie: -Post -Data is cleaned up, properly escaped, etc -Then data is passed to whatever tokenizer we want to use. ++ | Matthew Runo | Zappos Development | [EMAIL PROTECTED] | 702-943-7833 ++ On Sep 3, 2007, at 7:10 AM, Erik Hatcher wrote: On Sep 3, 2007, at 12:22 AM, James O'Rourke wrote: Is there a way to pass the solr server a set of documents without all the fields present and only update the fields that are provided leaving the remaining document fields intact or do I need to pull those documents over the wire myself and do the update manual and then add them back to the index? With Solr currently you cannot update a specific field, you have to re-send the entire document to replace the existing one. However, preliminary support for such capability has been contributed here: http://issues.apache.org/jira/browse/SOLR-139 - this is not in its final form, so this is to use at your own risk given the caveats listed in that issue about concurrency. I'm currently using the patch I posted to that issue in a production environment and its working fine thus far, but it will change in at least core ways and likely request parameter and formatting ways before making its debut in Solr's trunk. Erik
RE: solr.py problems with german Umlaute
I researched this problem before. The problem I found is that Python strings are not Unicode by default. You have to do something to make them Unicode. Here are the links I found: http://www.reportlab.com/i18n/python_unicode_tutorial.html http://evanjones.ca/python-utf8.html http://jjinux.blogspot.com/2006/04/python-protecting-utf-8-strings-from.html We do the utf-8 encode & submit, and so our strings are badly encoded and stored. We are seeing the problem shown by Marc-Andre Lemburg in the reportlab.com link: an e with a forward (acute) accent becomes some Japanese character. -Original Message- From: news [mailto:[EMAIL PROTECTED] On Behalf Of Christian Klinger Sent: Thursday, September 06, 2007 2:55 AM To: solr-user@lucene.apache.org Subject: solr.py problems with german Umlaute Hi all, I try to add/update documents with the Python solr.py API. Everything works fine so far, but if I try to add a document which contains German umlauts (ö, ä, ü, ...) I get errors. Maybe someone has an idea how I could convert my data? Should I post this to JIRA? Thanks for help. Btw: I have no sitecustomize.py. This is my script: -- from solr import * title="Übersicht" kw = {'id':'12','title':title,'system':'plone','url':'http://www.google.de'} c = SolrConnection('http://192.168.2.13:8080/solr') c.add_many([kw,]) c.commit() -- This is the error: File "t.py", line 5, in ? c.add_many([kw,]) File "/usr/local/lib/python2.4/site-packages/solr.py", line 596, in add_many self.__add(lst, doc) File "/usr/local/lib/python2.4/site-packages/solr.py", line 710, in __add lst.append('<field name="%s">%s</field>' % ( UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
Re: solr.py problems with german Umlaute
On 9/6/07, Brian Carmalt [EMAIL PROTECTED] wrote: Try it with title.encode('utf-8'). As in: kw = {'id':'12','title':title.encode('utf-8'),'system':'plone','url':'http://www.google.de'} It seems like the client library should be responsible for encoding, not the user. So try changing title="Übersicht" into a unicode string via title=u"Übersicht" and that should hopefully get your test program working. If it doesn't, it's probably a solr.py bug and should be fixed there. -Yonik
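For illustration, a minimal sketch of the kind of normalization a client library could do before building the update XML; the helper name and the utf-8 assumption are hypothetical and not part of solr.py:
--
# Hypothetical helper (not part of solr.py): coerce field values to unicode
# before they are formatted into the <field> XML, assuming byte strings are utf-8.
def to_unicode(value, encoding='utf-8'):
    if isinstance(value, str):
        return value.decode(encoding)
    return unicode(value)

# Both byte-string and unicode input end up as the same unicode object.
print to_unicode('\xc3\x9cbersicht')  # u'\xdcbersicht'
print to_unicode(u'\xdcbersicht')     # u'\xdcbersicht'
--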
solr/home
Hi, I recently upgraded to Solr 1.2. I've set it up through Tomcat using context fragment files. I deploy using the Tomcat web manager. In the context fragment I set the environment variable solr/home. This used to work as expected: the solr/home value pointed to the directory where data, conf, etc. live. Now this value doesn't get used and instead Tomcat creates a new directory called solr (and solr/data) in the same directory where the context fragment file is located. It's not really a problem in this particular instance. I like the idea of it defaulting to solr in the same location as the context fragment file, as long as I can depend on it always working like that. It is a little puzzling as to why the value in my environment setting doesn't work, though. Has anyone else experienced this behavior? Matt
Re: update servlet not working
I don't use the Java client, but when I switched to 1.2, I'd get that message when I forgot to add the content type header, as described in CHANGES.txt: 9. The example solrconfig.xml maps /update to XmlUpdateRequestHandler using the new request dispatcher (SOLR-104). This requires posted content to have a valid contentType: curl -H 'Content-type:text/xml; charset=utf-8' The response format matches that of /select and returns standard error codes. To enable solr1.1 style /update, do not map /update to any handler in solrconfig.xml (ryan) But your request log shows a GET; it should be a POST, I would think. I'd double check the parameters on post.jar On 9/6/07, Benjamin Li [EMAIL PROTECTED] wrote: oops, sorry, it says missing content stream. As far as logs go: I have a request log, but didn't find anything with stack traces. Where is it? We're using the example one packaged with Solr. GET /solr/update HTTP/1.1 400 1401 Just to make sure, I typed java -jar post.jar solrfile.xml Thanks! On 9/6/07, Chris Hostetter [EMAIL PROTECTED] wrote: : We are able to navigate to the solr/admin page, but when we try to : POST an xml document via the command line, there is a fatal error. It : seems that the solr/update servlet isn't running, giving an HTTP 400 : error. a 400 could mean a lot of things ... what is the full HTTP response you get back from Solr? what kinds of stack traces show up in the Jetty log output? -Hoss -- cheers, ben
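For illustration, a rough sketch of posting an update from Python with the content type that Solr 1.2 requires; the host, port and document below are placeholders, adjust them to your setup:
--
import urllib2

# Placeholder document and URL; the Content-type header is the important part,
# without it the Solr 1.2 request dispatcher answers with HTTP 400.
doc = '<add><doc><field name="id">1</field></doc></add>'
req = urllib2.Request('http://localhost:8983/solr/update', data=doc,
                      headers={'Content-type': 'text/xml; charset=utf-8'})
print urllib2.urlopen(req).read()

# Commit so the document becomes searchable.
commit = urllib2.Request('http://localhost:8983/solr/update', data='<commit/>',
                         headers={'Content-type': 'text/xml; charset=utf-8'})
print urllib2.urlopen(commit).read()
--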
Re: Replication broken.. no helpful errors?
On 9/6/07, Matthew Runo [EMAIL PROTECTED] wrote: The thing is that a new searcher is not opened if I look in the stats.jsp page. The index version never changes. The index version is read from the index... hence if the Lucene index doesn't change (even if a new snapshot was taken), the version won't change even if a new searcher was opened. Is the problem on the master side now, since it looks like the slave is pulling a temp-snapshot? -Yonik
Re: solr/home
Here you go:
<Context docBase="/usr/local/lib/solr.war" debug="0" crossContext="true">
  <Environment name="solr/home" type="java.lang.String" value="/usr/local/projects/my_app/current/solr-home" />
</Context>
This is the same file I'm putting into the Tomcat manager XML Configuration file URL form input. Matt On Sep 6, 2007, at 3:25 PM, Tom Hill wrote: It works for me. (fragments with solr 1.2 on tomcat 5.5.20) Could you post your fragment file? Tom On 9/6/07, Matt Mitchell [EMAIL PROTECTED] wrote: Hi, I recently upgraded to Solr 1.2. I've set it up through Tomcat using context fragment files. I deploy using the Tomcat web manager. In the context fragment I set the environment variable solr/home. This used to work as expected: the solr/home value pointed to the directory where data, conf, etc. live. Now this value doesn't get used and instead Tomcat creates a new directory called solr (and solr/data) in the same directory where the context fragment file is located. It's not really a problem in this particular instance. I like the idea of it defaulting to solr in the same location as the context fragment file, as long as I can depend on it always working like that. It is a little puzzling as to why the value in my environment setting doesn't work, though. Has anyone else experienced this behavior? Matt
Re: solr.py problems with german Umlaute
On 6-Sep-07, at 12:13 PM, Yonik Seeley wrote: On 9/6/07, Brian Carmalt [EMAIL PROTECTED] wrote: Try it with title.encode('utf-8'). As in: kw = {'id':'12','title':title.encode('utf-8'),'system':'plone','url':'http://www.google.de'} It seems like the client library should be responsible for encoding, not the user. So try changing title="Übersicht" into a unicode string via title=u"Übersicht" and that should hopefully get your test program working. If it doesn't, it's probably a solr.py bug and should be fixed there. It may or may not, depending on the vagaries of the encoding in his text editor. What Python gets when you enter u'é' is the byte sequence corresponding to the encoding of your editor. For instance, my terminal is set to utf-8 and when I type in é it is equivalent to entering the bytes C3 A9:
In [5]: 'é'
Out[5]: '\xc3\xa9'
Prepending u does not work, because you are telling Python that you want these two bytes as unicode characters. Note that this could be fixed by setting Python's default encoding to match.
In [1]: u'é'
Out[1]: u'\xc3\xa9'
In [11]: print u'é'
é
The proper thing to do is to interpret the byte sequence given the proper encoding:
'é'.decode('utf-8')
Out[3]: u'\xe9'
or enter the desired unicode character directly:
u'\u00e9'
u'\xe9'
print u'\u00e9'
é
This is less complicated in the usual case of reading data from a file, because the encoding should be known (terminal encoding issues are much trickier). Use codecs.open() to get a unicode-output text stream. -Mike
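For illustration, a small sketch along those lines that reuses the SolrConnection calls from the original script; the file name is a placeholder, and whether passing unicode values through unchanged works depends on how solr.py builds its XML internally:
--
# -*- coding: utf-8 -*-
# Read utf-8 data as unicode with codecs.open() and hand unicode values to solr.py.
import codecs
from solr import SolrConnection

f = codecs.open('titles.txt', 'r', 'utf-8')  # placeholder file, assumed utf-8 encoded
title = f.readline().strip()                 # already a unicode object, e.g. u'\xdcbersicht'
f.close()

c = SolrConnection('http://192.168.2.13:8080/solr')
c.add_many([{'id': '12', 'title': title, 'system': 'plone',
             'url': 'http://www.google.de'}])
c.commit()
--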
searching where a value is not null?
Hi all. I'm trying to construct a query that in pseudo-code would read like this: field != '' I'm finding it difficult to write this as a solr query, though. Stuff like: NOT field:() doesn't seem to do the trick. any ideas? dw
Re: searching where a value is not null?
On 9/6/07, David Whalen [EMAIL PROTECTED] wrote: Hi all. I'm trying to construct a query that in pseudo-code would read like this: field != '' I'm finding it difficult to write this as a solr query, though. Stuff like: NOT field:() doesn't seem to do the trick. any ideas? perhaps field:[* TO *] -Yonik
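For illustration, a hedged sketch of using that as a filter query from Python; the host, port and field name are placeholders:
--
# Restrict results to documents where 'myfield' has some value, using the
# open-ended range query suggested above as a filter query.
import urllib, urllib2

params = urllib.urlencode({'q': 'solr', 'fq': 'myfield:[* TO *]', 'rows': '10'})
print urllib2.urlopen('http://localhost:8983/solr/select?' + params).read()
--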
Slow response
I am pretty new to Solr and this is my first post to this list so please forgive me if I make any glaring errors. Here's my problem. When I do a search using the Solr admin interface for a term that I know does not exist in my index the QTime is about 1ms. However, if I add facets to the search the response takes more than 20 seconds (and sometimes longer) to return. Here is the slow URL - /select?qf=AUTHOR_t+SUBJECT_t+TITLE_t&wt=xml&f.AUTHOR_facet.facet.sort=true&f.FORMAT_t.facet.limit=25&start=0&facet=true&facet.mincount=1&q=frak&f.FORMAT_t.facet.mincount=1&f.ITYPE_facet.facet.mincount=1&f.SUBJECT_facet.facet.limit=25&facet.field=AUTHOR_facet&facet.field=FORMAT_t&facet.field=LANGUAGE_t&facet.field=PUBDATE_t&facet.field=SUBJECT_facet&facet.field=AGENCY_facet&facet.field=ITYPE_facet&f.AGENCY_facet.facet.sort=true&f.AGENCY_facet.facet.limit=-1&rows=10&f.ITYPE_facet.facet.limit=-1&f.ITYPE_facet.facet.sort=true&f.AUTHOR_facet.facet.limit=25&f.LANGUAGE_t.facet.sort=true&f.PUBDATE_t.facet.limit=-1&f.AGENCY_facet.facet.mincount=1&f.AUTHOR_facet.facet.mincount=1&fl=*&fl=score&qt=dismax&version=2.2&f.SUBJECT_facet.facet.sort=true&f.SUBJECT_facet.facet.mincount=1&f.PUBDATE_t.facet.sort=false&f.FORMAT_t.facet.sort=true&f.LANGUAGE_t.facet.limit=25&f.LANGUAGE_t.facet.mincount=1&f.PUBDATE_t.facet.mincount=1 I am pretty sure I can't be the first to ask this question but I can't seem to find anything online with the answer. Thanks for your help. Aaron
Non-HTTP Indexing
Dear Solr Users: Is it possible to index documents directly without going through any XML/HTTP bridge? I have a large collection (10^7 documents, some very large) and indexing speed is a concern. Thanks! --Renaud
RE: Non-HTTP Indexing
There are a couple of choices, see: http://wiki.apache.org/solr/SolJava - Daniel -Original Message- From: Renaud Waldura [mailto:[EMAIL PROTECTED] Sent: Thursday, September 06, 2007 2:21 PM To: solr-user@lucene.apache.org Subject: Non-HTTP Indexing Dear Solr Users: Is it possible to index documents directly without going through any XML/HTTP bridge? I have a large collection (10^7 documents, some very large) and indexing speed is a concern. Thanks! --Renaud
RE: Slow response
Thank you for your response, this does shed some light on the subject. Our basic question was why we were seeing slower responses the smaller our result set got. Currently we are searching about 1.2 million documents with the source document about 2KB, but we do duplicate some of the data. I bumped up my filterCache to 5 million and the 2nd search I did for a non-indexed term came back in 2.1 seconds, so that is much improved. I am a little concerned about having this value so high, but this is our problem and we will play with it. I do have a few follow-up questions. First, in regard to the filterCache: once a single search has been done and facets requested, as long as new facets aren't requested and the size is large enough, the filters will remain in the cache, correct? Also, you mention that faceting is more a function of the number of terms in the field. The 2 fields causing our problems are Authors and Subjects. If we divided up the data that made these facets into more specific fields (Primary author, secondary author, etc.) would this perform better? So the number of facet fields would increase but the unique terms for a given facet should be less. Thanks again for all your help. Aaron -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Yonik Seeley Sent: Thursday, September 06, 2007 4:17 PM To: solr-user@lucene.apache.org Subject: Re: Slow response On 9/6/07, Aaron Hammond [EMAIL PROTECTED] wrote: I am pretty new to Solr and this is my first post to this list so please forgive me if I make any glaring errors. Here's my problem. When I do a search using the Solr admin interface for a term that I know does not exist in my index the QTime is about 1ms. However, if I add facets to the search the response takes more than 20 seconds (and sometimes longer) to return. Here is the slow URL - Faceting on multi-value fields is more a function of the number of terms in the field (and their distribution) rather than the number of hits for a query. That said, perhaps faceting should be able to bail out if there are no hits. Is your question more about why faceting takes so long in general, or why it takes so long if there are no results? If you haven't, try optimizing your index for faceting in general. How many docs do you have in your index? As a side note, the way multi-valued faceting currently works, it's actually normally faster if the query returns a large number of hits. -Yonik
Re: Slow response
On 6-Sep-07, at 3:16 PM, Aaron Hammond wrote: Thank you for your response, this does shed some light on the subject. Our basic question was why we were seeing slower responses the smaller our result set got. Currently we are searching about 1.2 million documents with the source document about 2KB, but we do duplicate some of the data. I bumped up my filterCache to 5 million and the 2nd search I did for a non-indexed term came back in 2.1 seconds, so that is much improved. I am a little concerned about having this value so high, but this is our problem and we will play with it. I do have a few follow-up questions. First, in regard to the filterCache: once a single search has been done and facets requested, as long as new facets aren't requested and the size is large enough, the filters will remain in the cache, correct? Also, you mention that faceting is more a function of the number of terms in the field. The 2 fields causing our problems are Authors and Subjects. If we divided up the data that made these facets into more specific fields (Primary author, secondary author, etc.) would this perform better? So the number of facet fields would increase but the unique terms for a given facet should be less. There are essentially two facet computation strategies: 1. cached bitsets: a bitset for each term is generated and intersected with the query result bitset. This is more general and performs well up to a few thousand terms. 2. field enumeration: cache the field contents, and generate counts using this data. Relatively independent of #unique terms, but requires at most a single facet value per field per document. So, if you factor author into Primary author/Secondary author, where each is guaranteed to only have one value per doc, this could greatly accelerate your faceting. There are probably fewer unique subjects, so strategy 1 is likely fine. To use strategy 2, just make sure that multivalued=false is set for those fields in schema.xml -Mike
Re: Slow response
On 6-Sep-07, at 3:25 PM, Mike Klaas wrote: There are essentially two facet computation strategies: 1. cached bitsets: a bitset for each term is generated and intersected with the query result bitset. This is more general and performs well up to a few thousand terms. 2. field enumeration: cache the field contents, and generate counts using this data. Relatively independent of #unique terms, but requires at most a single facet value per field per document. So, if you factor author into Primary author/Secondary author, where each is guaranteed to only have one value per doc, this could greatly accelerate your faceting. There are probably fewer unique subjects, so strategy 1 is likely fine. To use strategy 2, just make sure that multivalued=false is set for those fields in schema.xml I forgot to mention that strategy 2 also requires a single token for each doc (see http://wiki.apache.org/solr/FAQ#head-14f9f2d84fb2cd1ff389f97f19acdb6ca55e4cd3) -Mike
caching query result
Hi, I am wondering: is there any way of CACHING FACET SEARCH results? I have 13 million documents and facet by state (50). If there is a mechanism to cache them, I may get results back faster. Thanks, Jae
removing a field from the relevance calculation
Hi, I'm having trouble getting a field of type SortableFloatField to not weigh into the relevancy score returned for a document. <fieldtype name="sfloat" class="solr.SortableFloatField" sortMissingLast="true" omitNorms="true"/> So far I've tried boosting the field to 0.0 at index time using this field type, and also implemented a custom Similarity implementation that overrode lengthNorm(String fieldname, int numTerms) after converting the field to a text field. Nothing I do seems to affect the behavior that when the value of the field in question changes, the score of the document changes along with it. The field does need to be both indexed and stored. There is a requirement to be able to sort by that field, and it must be returned in the document when searching. Am I going about this the wrong way? Regards, Bart Smyth IMPORTANT: This e-mail, including any attachments, may contain private or confidential information. If you think you may not be the intended recipient, or if you have received this e-mail in error, please contact the sender immediately and delete all copies of this e-mail. If you are not the intended recipient, you must not reproduce any part of this e-mail or disclose its contents to any other party. This email represents the views of the individual sender, which do not necessarily reflect those of education.au limited except where the sender expressly states otherwise. It is your responsibility to scan this email and any files transmitted with it for viruses or any other defects. education.au limited will not be liable for any loss, damage or consequence caused directly or indirectly by this email.
Re: updates on the server
On Sep 6, 2007, at 2:56 PM, Matthew Runo wrote: On a related note, it'd be great if we could set up a series of transformations to be done on data when it comes into the index, before being indexed. I guess a custom tokenizer might be the best way to do this though..? ie: -Post -Data is cleaned up, properly escaped, etc -Then data is passed to whatever tokenizer we want to use. Solr should do more work on the data indexing side, to allow clients to more easily hand documents to it and modify them. XML isn't necessarily the prettiest way, and we see other formats being supported with the CSV and rich document indexing. A custom tokenizer or token filter make great sense in the single field sense of data transformation, but parsing some request data into multiple fields must be done at a higher level. Erik
Re: caching query result
On 9/6/07, Jae Joo [EMAIL PROTECTED] wrote: I have 13 millions and have facets by states (50). If there is a mechasim to chche, I may get faster result back. How fast are you getting results back with standard field faceting (facet.field=state)?
Question on use of wildcard to field name at query
Hi all. A wildcard can be used in a dynamic field definition when adding documents to the index, but it cannot be used in a field name when specifying a query. I want to use a wildcard to specify the field name at query time. Please suggest a good way to do this. The following illustrates what I mean.
--document add
<add>
<doc>
<field name="id">0</field>
<field name="name00">hoge hoge</field>
<field name="name01">hogesaru</field>
<field name="name02">saru</field>
<field name="name03">saru saru</field>
</doc>
<doc>
<field name="id">1</field>
<field name="name04">hage hage</field>
<field name="name10">hagesaru</field>
<field name="name12">hoge</field>
</doc>
</add>
--schema.xml
<dynamicField name="name*" type="text_ws" indexed="true" stored="true"/>
--result of query
/select/?q=name0?:hoge result: doc 0
/select/?q=name*:hoge result: doc 0, doc 1
/select/?q=name1?:hoge result: doc 1
Thanks, -- Toru Matsuzawa