Re: indexing txt file
but you need to index the text inside these files, right? You need to read the text from the file and include it as a field in the XML (of course this field must be defined in the schema). You can do it using a script and then post the XML to Solr. What amount/rate of generated text files are you thinking about?

On Tue, Apr 14, 2009 at 7:07 PM, Alex Vu alex.v...@gmail.com wrote: I just want to be able to index my text file, and other files that carry the same format but with different IP addresses, ports, etc. I will have the traffic flow running in real-time. Do you think Solr will be able to index a bunch of my text files in real time?

On Tue, Apr 14, 2009 at 9:35 AM, Alejandro Gonzalez alejandrogonzalezd...@gmail.com wrote: and I'm not sure I understand what you are trying to do, but maybe you should define a text field and fill it with the text in each file for indexing the text in them, or maybe a path to that file if that's what you want.

On Tue, Apr 14, 2009 at 6:28 PM, Shalin Shekhar Mangar shalinman...@gmail.com wrote: On Tue, Apr 14, 2009 at 9:44 PM, Alex Vu alex.v...@gmail.com wrote: *schema file is * [the same schema.xsd quoted in full in the next message] Can someone please show me where do I put these files? I'm aware that the schema.xsd file goes into the directory conf. What about my xml file, and txt file?

Alex, the Solr schema is not the usual XML Schema (xsd). It is an xml file which describes the fields, their analyzers, tokenizers, copyFields, default search field etc. Look into the example schema supplied by Solr (inside the example/solr/conf directory) and modify it according to your needs. -- Regards, Shalin Shekhar Mangar.
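For illustration, the fields section of a Solr schema.xml (not an .xsd) for records like these might look roughly as follows. This is only a sketch: the field names mirror the packet attributes quoted later in the thread, while the types and the uniqueKey choice are assumptions, not anything from this thread:

  <fields>
    <field name="id" type="string" indexed="true" stored="true" required="true"/>
    <field name="sourceIp" type="string" indexed="true" stored="true"/>
    <field name="destinationIp" type="string" indexed="true" stored="true"/>
    <field name="sourcePort" type="string" indexed="true" stored="true"/>
    <field name="destinationPort" type="string" indexed="true" stored="true"/>
    <field name="packets" type="slong" indexed="true" stored="true"/>
    <field name="bytes" type="slong" indexed="true" stored="true"/>
  </fields>
  <uniqueKey>id</uniqueKey>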
Re: indexing txt file
Hi all, I'm trying to use solr1.3 and trying to index a text file. I wrote a schema.xsd and an xml file.

Just to make sure I understand things: do you just have one of these text files, containing many reports? Or do you have many of these text files, each containing one report? Also, is the report a single line that has been wrapped for email? Fergus.

*The content of my text file is *

#src dst proto ok sport dport pkts bytes flows first latest
192.168.220.135 26.147.238.146 6 1 32839 80 6 463 1 1237333861.465764000 1237333861.664701000

*schema file is *

<?xml version="1.0" encoding="UTF-8"?>
<!-- W3C Schema generated by XMLSpy v2009 sp1 (http://www.altova.com) -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="networkTraffic">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="packet" maxOccurs="unbounded">
          <xs:complexType>
            <xs:attribute name="terminationTimestamp" type="xs:string" use="required"/>
            <xs:attribute name="sourcePort" type="xs:string" use="required"/>
            <xs:attribute name="sourceIp" type="xs:string" use="required"/>
            <xs:attribute name="protocolPortNumber" type="xs:string" use="required"/>
            <xs:attribute name="packets" type="xs:string" use="required"/>
            <xs:attribute name="ok" type="xs:string" use="required"/>
            <xs:attribute name="initialTimestamp" type="xs:string" use="required"/>
            <xs:attribute name="flows" type="xs:string" use="required"/>
            <xs:attribute name="destinatoinIp" type="xs:string" use="required"/>
            <xs:attribute name="destinationPort" type="xs:string" use="required"/>
            <xs:attribute name="bytes" type="xs:string" use="required"/>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>

*and my xml file is *

<?xml version="1.0" encoding="UTF-8"?>
<networkTraffic xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="C:\DOCUME~1\tpham\Desktop\networkTraffic.xsd">
  <packet sourceIp="192.168.54.23" destinatoinIp="192.168.0.1" protocolPortNumber="6" ok="1" sourcePort="32439" destinationPort="80" packets="6" bytes="463" flows="1" initialTimestamp="1237963861.465764000" terminationTimestamp="1237963861.664701000"/>
  <packet sourceIp="192.168.56.23" destinatoinIp="192.168.0.1" protocolPortNumber="17" ok="1" sourcePort="32439" destinationPort="80" packets="6" bytes="463" flows="1" initialTimestamp="1237963861.465764000" terminationTimestamp="1237963861.664701000"/>
  <packet sourceIp="192.168.74.23" destinatoinIp="192.168.0.1" protocolPortNumber="6" ok="1" sourcePort="32139" destinationPort="80" packets="6" bytes="463" flows="1" initialTimestamp="1237963861.465764000" terminationTimestamp="1237963861.664701000"/>
  <packet sourceIp="192.168.54.123" destinatoinIp="192.168.0.1" protocolPortNumber="6" ok="1" sourcePort="32839" destinationPort="80" packets="6" bytes="463" flows="1" initialTimestamp="1237963861.465764000" terminationTimestamp="1237963861.664701000"/>
  <packet sourceIp="192.168.14.23" destinatoinIp="192.168.0.1" protocolPortNumber="17" ok="1" sourcePort="32839" destinationPort="80" packets="6" bytes="463" flows="1" initialTimestamp="1237963861.465764000" terminationTimestamp="1237963861.664701000"/>
  <packet sourceIp="192.168.5.23" destinatoinIp="192.168.0.1" protocolPortNumber="17" ok="1" sourcePort="32439" destinationPort="80" packets="6" bytes="463" flows="1" initialTimestamp="1237963861.465764000" terminationTimestamp="1237963861.664701000"/>
  <packet sourceIp="192.168.15.23" destinatoinIp="192.168.0.1" protocolPortNumber="6" ok="1" sourcePort="36839" destinationPort="80" packets="6" bytes="463" flows="1" initialTimestamp="1237963861.465764000" terminationTimestamp="1237963861.664701000"/>
  <packet sourceIp="192.168.24.23" destinatoinIp="192.168.0.1" protocolPortNumber="6" ok="1" sourcePort="32839" destinationPort="80" packets="6" bytes="463" flows="1" initialTimestamp="1237963861.465764000" terminationTimestamp="1237963861.664701000"/>
</networkTraffic>

Can someone please show me where do I put these files? I'm aware that the schema.xsd file goes into the directory conf. What about my xml file, and txt file? Thank you, Alex

On Tue, Apr 14, 2009 at 12:37 AM, Alejandro Gonzalez alejandrogonzalezd...@gmail.com wrote: you should construct the xml containing the fields defined in your schema.xml and give them the values from the text files. For example, if you have a schema defining two fields, title and text, you should construct an xml with a field title and its value and another called text containing the body of your doc. Then you can post it to the Solr you have deployed and make a commit and it's done. It's possible to construct an xml defining more than just a doc:

<add>
  <doc>
    <field name="title">doc1 title</field>
    <field name="text">doc1 text</field>
  </doc>
  .
  .
  .
  <doc>
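Filling out that sketch with the packet fields from the message above (a hypothetical example: the field names would have to exist in schema.xml, and the id value is an assumption), a complete update message posted to the /update handler would look something like:

<add>
  <doc>
    <field name="id">flow-1</field>
    <field name="sourceIp">192.168.54.23</field>
    <field name="destinationIp">192.168.0.1</field>
    <field name="sourcePort">32439</field>
    <field name="destinationPort">80</field>
    <field name="packets">6</field>
    <field name="bytes">463</field>
  </doc>
</add>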
Re: Disable logging in SOLR
Bill Au schrieb: Have you tried setting logging level to OFF from Solr's admin GUI: http://wiki.apache.org/solr/SolrAdminGUI Thanks for the hint! But after I restart my Tomcat it's all reset to the default? :-( Greets -Ralf-
Re: indexing txt file
On Tue, Apr 14, 2009 at 10:37 PM, Alex Vu alex.v...@gmail.com wrote: I just want to be able to index my text file, and other files that carry the same format but with different IP addresses, ports, etc.

Alex, Solr consumes XML (in a specific format) and CSV. It can consume plain text through the ExtractingRequestHandler. It can also index DBs and other XML formats. You can write a Java program, parse your text file, and use the Solrj client to send data to Solr. You could also write a program in any language you want, convert those text files to CSV or XML, and post them to Solr. http://wiki.apache.org/solr/UpdateXmlMessages http://wiki.apache.org/solr/UpdateCSV http://wiki.apache.org/solr/Solrj

I will have the traffic flow running in real-time. Do you think Solr will be able to index a bunch of my text files in real time?

I don't think Solr is very suitable for this task. You can add the files to Solr at any time but you won't be able to search on them immediately. You should batch the commits (you can also use the maxDocs/maxTime properties in the autoCommit section in solrconfig.xml). -- Regards, Shalin Shekhar Mangar.
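To make the "write a Java program, parse your text file, use Solrj" route concrete, here is a minimal sketch. Everything in it is an assumption, not code from this thread: it presumes Solrj 1.3+ on the classpath, the column order of the sample flow file quoted earlier, and matching field names in schema.xml.

import java.io.BufferedReader;
import java.io.FileReader;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class FlowIndexer {
  public static void main(String[] args) throws Exception {
    CommonsHttpSolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
    BufferedReader in = new BufferedReader(new FileReader(args[0]));
    int count = 0;
    String line;
    while ((line = in.readLine()) != null) {
      if (line.startsWith("#") || line.trim().length() == 0) continue; // skip header/blank lines
      // assumed columns: src dst proto ok sport dport pkts bytes flows first latest
      String[] f = line.trim().split("\\s+");
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", f[0] + ":" + f[4] + ":" + f[10]); // hypothetical unique key
      doc.addField("sourceIp", f[0]);
      doc.addField("destinationIp", f[1]);
      doc.addField("sourcePort", f[4]);
      doc.addField("destinationPort", f[5]);
      doc.addField("packets", f[6]);
      doc.addField("bytes", f[7]);
      server.add(doc);
      if (++count % 50000 == 0) server.commit(); // batch commits rather than per-document
    }
    server.commit();
    in.close();
  }
}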
Maven repositories
Hi, does anyone know the location of the maven snapshot repositories for solr 1.4-SNAPSHOT? Thanks -- Gustavo Lopes
Re: Maven repositories
On Wed, Apr 15, 2009 at 3:30 PM, Gustavo Lopes galo...@mediacapital.pt wrote: Hi, does anyone know the location of the maven snapshot repositories for solr 1.4-SNAPSHOT?

http://people.apache.org/repo/m2-snapshot-repository/org/apache/solr/

Disclaimer - un-released artifacts built from trunk. Use them at your own risk. -- Regards, Shalin Shekhar Mangar.
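A sketch of how that repository might be wired into a pom.xml; the solr-core artifactId is an assumption about which artifact is wanted:

<repositories>
  <repository>
    <id>apache-snapshots</id>
    <url>http://people.apache.org/repo/m2-snapshot-repository</url>
    <snapshots><enabled>true</enabled></snapshots>
  </repository>
</repositories>
<dependencies>
  <dependency>
    <groupId>org.apache.solr</groupId>
    <artifactId>solr-core</artifactId>
    <version>1.4-SNAPSHOT</version>
  </dependency>
</dependencies>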
Re: Disable logging in SOLR
Yes, restarting Tomcat will reset things back to the default. But you should be able to configure Tomcat to disable Solr logging, since Solr uses JDK logging. Bill

On Wed, Apr 15, 2009 at 4:51 AM, Kraus, Ralf | pixelhouse GmbH r...@pixelhouse.de wrote: Bill Au schrieb: Have you tried setting logging level to OFF from Solr's admin GUI: http://wiki.apache.org/solr/SolrAdminGUI Thanks for the hint! But after I restart my Tomcat it's all reset to the default? :-( Greets -Ralf-
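Since Tomcat's JULI reads $CATALINA_HOME/conf/logging.properties at startup, a per-package level set there should survive restarts. A sketch (SEVERE is just one choice of threshold, not a recommendation from this thread):

# quiet Solr down to errors only
org.apache.solr.level = SEVERE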
Re: Disable logging in SOLR
Kraus, Ralf | pixelhouse GmbH wrote: Hi, is there a way to disable all logging output in SOLR? I mean output text like: INFO: [core_de] webapp=/solr path=/update params={wt=json} status=0 QTime=3736 greets -Ralf-

You probably do not want to totally disable logging in Solr. More likely, you're looking to make Solr less chatty by not logging at the INFO level. Solr is a bit chatty by default, mostly, I think, because that can be very useful and is often worth the likely very small performance hit of all the extra logging. At the least, though, I think you want to leave Severe/Error logging on in most cases, and possibly WARN. It's easy enough to change the logging levels, though. Solr 1.3 uses java.util.logging and Solr 1.4 uses SLF4J, defaulting to java.util.logging. So you can either change the system-level properties file in your JDK folder, or you can use a param at startup:

-Djava.util.logging.config.file=/path/to/my/logging.properties

Then set up a props file. Here is an example from the wiki:

# Default global logging level:
.level= INFO

# Write to a file:
handlers= java.util.logging.FileHandler

# Write log messages in XML format:
java.util.logging.FileHandler.formatter = java.util.logging.XMLFormatter

# Log to the current working directory, with log files named solrxxx.log
java.util.logging.FileHandler.pattern = ./solr%u.log

-- - Mark http://www.lucidimagination.com
Re: Disable logging in SOLR
Mark Miller schrieb: [...] It's easy enough to change the logging levels, though. Solr 1.3 uses java.util.logging and Solr 1.4 uses SLF4J, defaulting to java.util.logging. So you can either change the system-level properties file in your JDK folder, or you can use a param at startup:

-Djava.util.logging.config.file=/path/to/my/logging.properties

That's exactly the way I chose yesterday ;-) Thx Greets -Ralf-
Re: [solr-user] Upgrade from 1.2 to 1.3 gives 3x slowdown
On Apr 2, 2009, at 9:23 AM, Fergus McMenemie wrote: Grant, I should note, however, that the speed difference you are seeing may not be as pronounced as it appears. If I recall, during ApacheCon I commented on how long it takes to shut down your Solr instance when exiting it. That time it takes is in fact Solr doing the work that was put off by not committing earlier and having all those deletes pile up.

I am confused about "work that was put off" vs committing. My script was doing a commit right after the CSV import, and you are right about the massive times required to shut Tomcat down. But in my tests the time taken to do the commit was under a second, yet I had to allow 300 secs for Tomcat shutdown. Also I don't have any duplicates. So what sort of work was being done at shutdown that was not being done by a commit?

Optimise! The work being done is addressing the deletes, AIUI, but of course there are other things happening during shutdown, too.

There are no deletes to do. It was a clean index to begin with and there were no duplicates.

How long is the shutdown if you do a commit first and then a shutdown?

Still very long, sometimes 300 sec. My script always did a commit!

At any rate, I don't know that there is a satisfying answer to the larger issue due to things like the fsync stuff, which is an overall win for Lucene/Solr despite being slower. Have you tried running the tests on other machines (non-Mac)?

Nope. Although next week I will have a real PC running Vista, so I could try it there. I think we should knock this on the head and move on. I rarely need to index this content and I can take the performance hit, and of course your workaround provides a good speed up. Regards Fergus. -- === Fergus McMenemie Email:fer...@twig.me.uk Techmore Ltd Phone:(UK) 07721 376021 Unix/Mac/Intranets Analyst Programmer ===
looking at the results of a distributed search using shards.
Hi, Having all kinds of fun with distributed search using shards :-) I have 30K documents indexed using DIH into one index. Another index contains documents indexed using solr-cell. I am using shards to search across both indexes. I am trying to format the results returned from Solr such that the source document can be linked to, and to do so I think I need to know which shard a particular result came from. Is this a FAQ? Regards -- === Fergus McMenemie Email:fer...@twig.me.uk Techmore Ltd Phone:(UK) 07721 376021 Unix/Mac/Intranets Analyst Programmer ===
Re: solr 1.3 + tomcat 5.5
From the log it seems like there is a solr.xml inside /var/lib/tomcat5/webapps/ which Tomcat is trying to deploy and failing. Very strange. You should remove that file and see if that fixes it.

On Tue, Apr 14, 2009 at 11:35 PM, andrysha nihuhoid nihuh...@gmail.com wrote: Hi, got a problem setting up solr + tomcat. Tomcat 5.5 + Apache Solr 1.3.0 + CentOS 5.3. I'm not familiar with Java at all, so sorry if it's a dumb question. Here is what I did: placed solr.war in the webapps folder, changed solr home to /etc/solr, copied the contents of the solr distribution example folder to /etc/solr. Tomcat starts successfully and I can even access the admin interface, but the following errors appear in catalina.out every 10 seconds:

Apr 14, 2009 1:30:14 PM org.apache.catalina.startup.HostConfig deployDescriptor
SEVERE: Error deploying configuration descriptor var#lib#tomcat5#webapps#solr.xml
Apr 14, 2009 1:30:14 PM org.apache.catalina.startup.HostConfig deployDescriptor
SEVERE: Error deploying configuration descriptor etc#solr#.xml
[the same pair repeats every 10 seconds]

Googled about 3 hours. Tried to set write permissions for all on /etc, /etc/solr and /var/lib/tomcat5/webapps; tried to create an empty file named solr.xml in /etc and /etc/solr; tried to copy solrconfig.xml to /etc/ and /etc/solr. -- Regards, Shalin Shekhar Mangar.
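For comparison, a context descriptor Tomcat can deploy - placed under conf/Catalina/localhost/solr.xml rather than in webapps - usually looks like the following. The paths here are assumptions based on the layout described above, not tested config:

<Context docBase="/var/lib/tomcat5/solr.war" debug="0" crossContext="true">
  <Environment name="solr/home" type="java.lang.String" value="/etc/solr" override="true"/>
</Context>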
Re: Distinct terms in facet field
On Wed, Apr 15, 2009 at 1:13 AM, Harsch, Timothy J. (ARC-SC)[PEROT SYSTEMS] timothy.j.har...@nasa.gov wrote: How could I get a count of distinct terms for a given query? For example: the wiki page http://wiki.apache.org/solr/SimpleFacetParameters has a section "Facet Fields with No Zeros" which shows the query:

http://localhost:8983/solr/select?q=ipod&rows=0&facet=true&facet.limit=-1&facet.field=cat&facet.mincount=1&facet.field=inStock

and returns results where the inStock field has two facet counts (false is 3, and true is 1). But what I would want to know is how many distinct values were found (in this case it would be 2: true and false). I realize I could count the number of terms returned, but if the set were large that would be non-performant. Is there a better way?

To do this with facets, you'd need to return all of them. The other way of doing this is by making a request to /admin/luke?fl=inStock which will return the number of unique terms in that field. http://wiki.apache.org/solr/LukeRequestHandler You can also index the number of unique values in a field as a separate field. -- Regards, Shalin Shekhar Mangar.
Re: Index Replication or Distributed Search ?
On Wed, Apr 15, 2009 at 5:07 AM, ramanathan ramanat...@youinweb-inc.com wrote: Hi, can someone provide practical advice on how large a Solr search index can be for good performance on a consumer-facing media website?

The right answer is that it depends :) It depends on the number of documents, the size of a document, the number of unique terms, the kind of queries, the frequency of updates, etc.

Is it good or bad to think about Distributed Search and dividing the index at an early stage of development?

If your index can fit into a single box with acceptable response times, then this is the simplest way to get started. If not, then you'll need to use Distributed Search. Note that many installations use distributed search *and* replication together to handle large traffic. These are some good resources on this topic: http://wiki.apache.org/solr/LargeIndexes http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Scaling-Lucene-and-Solr Ask away if you have specific questions. -- Regards, Shalin Shekhar Mangar.
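For reference, a distributed search in Solr is an ordinary query with a shards parameter listing the cores whose results should be merged; a sketch with placeholder host names:

http://host1:8983/solr/select?q=ipod&shards=host1:8983/solr,host2:8983/solr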
Using CSV for indexing ... Remote Streaming disabled
Hi, I'm trying to use CSV (Solr 1.4, 03/29 build) for indexing, following the wiki (http://wiki.apache.org/solr/UpdateCSV). I've updated solrconfig.xml to have these lines:

<requestDispatcher handleSelect="true">
  <requestParsers enableRemoteStreaming="true" multipartUploadLimitInKB="20480" />
  ...
</requestDispatcher>

<requestHandler name="/update/csv" class="solr.CSVRequestHandler" startup="lazy" />

When I try to upload the csv,

curl 'http://localhost:8080/solr/20090414_1/update/csv?commit=true&separator=%09&escape=%5c&stream.file=/Users/opal/temp/afterchat/data/csv/1239759267339.csv'

I get the following response (an Apache Tomcat 6.0.18 error page):

HTTP Status 400 - Remote Streaming is disabled.
type: Status report
message: Remote Streaming is disabled.
description: The request sent by the client was syntactically incorrect (Remote Streaming is disabled.).

Why is it complaining about remote streaming if it's already enabled? Is there anything I'm missing? Thanks, -vivek
Re: looking at the results of a distributed search using shards.
On Apr 15, 2009, at 11:18 AM, Fergus McMenemie wrote: Hi, Having all kinds of fun with distributed search using shards :-) I have 30K documents indexed using DIH into one index. Another index contains documents indexed using solr-cell. I am using shards to search across both indexes. I am trying to format the results returned from Solr such that the source document can be linked to, and to do so I think I need to know which shard a particular result came from. Is this a FAQ?

+1, assuming you mean to add it as a FAQ and aren't asking if it already is one.

Regards -- === Fergus McMenemie Email:fer...@twig.me.uk Techmore Ltd Phone:(UK) 07721 376021 Unix/Mac/Intranets Analyst Programmer ===

-- Grant Ingersoll http://www.lucidimagination.com/ Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene: http://www.lucidimagination.com/search
Commits taking too long
Hi, I've an index where I commit every 50K records (using Solrj). Usually this commit takes 20 sec to complete, but every now and then the commit takes way too long - from 10 min to 30 min. I see more delays as the index size continues to grow - once it gets over 5G I start seeing long commit cycles more frequently. See this for example:

Apr 15, 2009 12:04:13 AM org.apache.solr.update.DirectUpdateHandler2 commit
INFO: start commit(optimize=false,waitFlush=false,waitSearcher=false)
Apr 15, 2009 12:39:58 AM org.apache.solr.core.SolrDeletionPolicy onCommit
INFO: SolrDeletionPolicy.onCommit: commits:num=2
commit{dir=/Users/vivek/demo/afterchat/solr/multicore/20090414_1/data/index,segFN=segments_fq,version=1239747075391,generation=566,filenames=[_19m.cfs, _jm.cfs, _1bk.cfs, _193.cfx, _19z.cfs, ... well over a hundred .cfs/.cfx segment files ..., _1al.cfs, _19w.cfs]
commit{dir=/Users/vivek/demo/afterchat/solr/multicore/20090414_1/data/index,segFN=segments_fr,version=1239747075392,generation=567,filenames=[_jm.cfs, _1bo.cfs, _xn.cfs, segments_fr, _8e.cfs, _gt.cfs, _18v.cfs, _uu.cfs, _10g.cfs, _2s.cfs, _5l.cfs, _162.cfs, _p8.cfs, _139.cfs, _s1.cfs, _mf.cfs, _b7.cfs, _e0.cfs]
Apr 15, 2009 12:39:58 AM org.apache.solr.core.SolrDeletionPolicy updateCommits
INFO: last commit = 1239747075392

Here are my default index settings:

<indexDefaults>
  <!-- Values here affect all index writers and act as a default unless overridden. -->
  <useCompoundFile>true</useCompoundFile>
  <mergeFactor>100</mergeFactor>
  <!-- <maxBufferedDocs>1</maxBufferedDocs> -->
  <ramBufferSizeMB>64</ramBufferSizeMB>
  <maxMergeDocs>2147483647</maxMergeDocs>
  <maxFieldLength>1</maxFieldLength>
  <writeLockTimeout>1000</writeLockTimeout>
  <commitLockTimeout>1</commitLockTimeout>
  <lockType>single</lockType>
</indexDefaults>

What am I doing wrong here? What's causing these delays? Thanks, -vivek
Re: indexing txt file
what amount/rate of generated text files are you thinking about?

I have 1TB worth of text files coming in every couple of minutes in real-time. In about 10 minutes I will have 4TB worth of text files.

Do you just have one of these text files, containing many reports? Do you have many of these text files each containing one report? Also, is the report a single line, that has been wrapped for email?

These files rotate every hour. Each text file contains many reports, and it is not wrapped for email. Is there an effective way to use Solr to have it consistently index my text files? Please note that these files all have the same format.

On Wed, Apr 15, 2009 at 1:58 AM, Shalin Shekhar Mangar shalinman...@gmail.com wrote: [...] You should batch the commits (you can also use the maxDocs/maxTime properties in the autoCommit section in solrconfig.xml). -- Regards, Shalin Shekhar Mangar.
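For reference, the autoCommit section Shalin mentions lives inside the updateHandler element of solrconfig.xml. A sketch; the numbers are placeholders, not recommendations:

<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxDocs>100000</maxDocs> <!-- commit after this many pending docs -->
    <maxTime>60000</maxTime>  <!-- or after this many milliseconds -->
  </autoCommit>
</updateHandler>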
Re: Commits taking too long
vivek sar wrote: Hi, I've an index where I commit every 50K records (using Solrj). Usually this commit takes 20 sec to complete, but every now and then the commit takes way too long - from 10 min to 30 min. I see more delays as the index size continues to grow - once it gets over 5G I start seeing long commit cycles more frequently. [...] What am I doing wrong here? What's causing these delays? Thanks, -vivek

Probably merging. With a mergeFactor of 100, you will merge less often, but then you will hit points where you have to do a bunch more merging. Committing waits for these merges to finish. That would be my first guess. A mergeFactor of, say, 10 would merge more often (only 10 segments per log level before they get merged up to the next level), but not run into points where it has as many segments to merge. -- - Mark http://www.lucidimagination.com
Re: looking at the results of a distributed search using shards.
On Apr 15, 2009, at 11:18 AM, Fergus McMenemie wrote: [...] I am trying to format the results returned from Solr such that the source document can be linked to, and to do so I think I need to know which shard a particular result came from. Is this a FAQ?

+1, assuming you mean to add it as a FAQ and aren't asking if it already is one.

I was asking: how do I find out which shard a result came from? But I felt it must be a FAQ! Again, I am wondering if there is an established best practice covering this sort of thing, before I go and roll my own :-)

Fergus.

-- Grant Ingersoll http://www.lucidimagination.com/ Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene: http://www.lucidimagination.com/search

-- === Fergus McMenemie Email:fer...@twig.me.uk Techmore Ltd Phone:(UK) 07721 376021 Unix/Mac/Intranets Analyst Programmer ===
Re: [solr-user] Upgrade from 1.2 to 1.3 gives 3x slowdown
The work being done is addressing the deletes, AIUI, but of course there are other things happening during shutdown, too.

There are no deletes to do. It was a clean index to begin with and there were no duplicates.

I have not followed this thread, so forgive me if this has already been suggested. If you know that there are no duplicates, have you tried indexing with allowDups=true? It will not change the fsync cost, but it may reduce some other checking times. ryan
Re: looking at the results of a distributed search using shards.
Ain't a FAQ, but could be. Look at JIRA and search for Brian, who made the same request a few months ago. I've often wondered if we could add info about the source shard, as well as whether a hit came from cache or not. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message From: Fergus McMenemie fer...@twig.me.uk To: solr-user@lucene.apache.org Sent: Wednesday, April 15, 2009 11:18:21 AM Subject: looking at the results of a distributed search using shards. Hi, Having all kinds of fun with distributed search using shards :-) I have 30K documents indexed using DIH into one index. Another index contains documents indexed using solr-cell. I am using shards to search across both indexes. I am trying to format the results returned from Solr such that the source document can be linked to, and to do so I think I need to know which shard a particular result came from. Is this a FAQ? Regards -- === Fergus McMenemie Email:fer...@twig.me.uk Techmore Ltd Phone:(UK) 07721 376021 Unix/Mac/Intranets Analyst Programmer ===
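Until something like that is built in, one workaround (an assumption sketched here, not a Solr 1.3 feature) is to index a constant marker field in each shard and request it in the field list:

<!-- in each shard's schema.xml -->
<field name="shard_name" type="string" indexed="false" stored="true"/>

Each indexer writes its own constant value (say "dih-index" or "solrcell-index") into shard_name, and a distributed query such as

http://host1:8983/solr/select?q=foo&shards=host1:8983/solr,host2:8983/solr&fl=id,score,shard_name

returns the marker alongside every hit.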
Re: DataImporter : Java heap space
I think there is a bug in the 1.4 daily builds of the data import handler which is causing the batchSize parameter to be ignored. This was probably introduced with more recent patches to resolve variables. The affected code is in JdbcDataSource.java:

String bsz = initProps.getProperty("batchSize");
if (bsz != null) {
  bsz = (String) context.getVariableResolver().resolve(bsz);
  try {
    batchSize = Integer.parseInt(bsz);
    if (batchSize == -1)
      batchSize = Integer.MIN_VALUE;
  } catch (NumberFormatException e) {
    LOG.warn("Invalid batch size: " + bsz);
  }
}

The call to context.getVariableResolver().resolve(bsz) is returning null, leading to a NumberFormatException and the batchSize never being set to Integer.MIN_VALUE. MySql won't use streaming result sets in this case, which can lead to the OOM we're seeing. If your log file contains an entry like mine does, you're being affected by this bug too:

Apr 15, 2009 1:21:58 PM org.apache.solr.handler.dataimport.JdbcDataSource init
WARNING: Invalid batch size: null

-Bryan

On Apr 13, 2009, at Apr 13, 11:48 PM, Noble Paul നോബിള് नोब्ळ् wrote: DIH streams 1 row at a time. DIH is just a component in Solr; Solr indexing also takes a lot of memory. On Tue, Apr 14, 2009 at 12:02 PM, Mani Kumar manikumarchau...@gmail.com wrote: Yes it's throwing the same OOM error and from the same place... yes I will try increasing the size... Just curious: how does this dataimport work? Does it load the whole table into memory? Is there any estimate of how much memory it needs to create an index for 1GB of data? thx mani On Tue, Apr 14, 2009 at 11:48 AM, Shalin Shekhar Mangar shalinman...@gmail.com wrote: On Tue, Apr 14, 2009 at 11:36 AM, Mani Kumar manikumarchau...@gmail.com wrote: Hi Shalin: yes I tried with the batchSize=-1 parameter as well. Here is the config I tried with:

<dataConfig>
  <dataSource type="JdbcDataSource" batchSize="-1" name="sp" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/mydb_development" user="root" password="**" />

I hope I have used the batchSize parameter @ the right place. Yes, that is correct. Did it still throw OOM from the same place? I'd suggest you increase the heap and see what works for you. Also try -server on the jvm. -- Regards, Shalin Shekhar Mangar. -- --Noble Paul
Re: Question on StreamingUpdateSolrServer
Quick comment - why so shy with the number of open file descriptors? On some nothing-special machines from several years ago I had this limit set to 30K+ - here, for example: http://www.simpy.com/user/otis :) Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message From: vivek sar vivex...@gmail.com To: solr-user@lucene.apache.org Sent: Tuesday, April 14, 2009 3:12:41 AM Subject: Re: Question on StreamingUpdateSolrServer The machine's ulimit is set to 9000 and the OS has an upper limit of 12000 on files. What would explain this? Has anyone tried Solr with 25 cores on the same Solr instance? Thanks, -vivek

2009/4/13 Noble Paul നോബിള് नोब्ळ् : On Tue, Apr 14, 2009 at 7:14 AM, vivek sar wrote: Some more updates. As I mentioned earlier, we are using multi-core Solr (up to 65 cores in one Solr instance, with each core 10G). This was opening around 3000 file descriptors (lsof). I removed some cores and after some trial and error I found that at 25 cores the system seems to work fine (around 1400 file descriptors). Tomcat is responsive even when the indexing is happening in Solr (for 25 cores). But as soon as it goes to 26 cores, Tomcat becomes unresponsive again. The puzzling thing is: if I stop indexing I can search on even 65 cores, but while indexing is happening it seems to support only up to 25 cores. 1) Is there a limit on the number of cores a Solr instance can handle? 2) Does Solr do anything to the existing cores while indexing? I'm writing to only one core at a time.

There is no hard limit (it is Integer.MAX_VALUE). But in reality your mileage depends on your hardware and the no. of file handles the OS can open.

We are struggling to find why Tomcat stops responding on a high number of cores while indexing is in progress. Any help is very much appreciated. Thanks, -vivek

On Mon, Apr 13, 2009 at 10:52 AM, vivek sar wrote: Here is some more information about my setup:

Solr - v1.4 (nightly build 03/29/09)
Servlet Container - Tomcat 6.0.18
JVM - 1.6.0 (64 bit)
OS - Mac OS X Server 10.5.6

Hardware Overview:
Processor Name: Quad-Core Intel Xeon
Processor Speed: 3 GHz
Number Of Processors: 2
Total Number Of Cores: 8
L2 Cache (per processor): 12 MB
Memory: 20 GB
Bus Speed: 1.6 GHz

JVM Parameters (for Solr):
export CATALINA_OPTS="-server -Xms6044m -Xmx6044m -DSOLR_APP -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:gc.log -Dsun.rmi.dgc.client.gcInterval=360 -Dsun.rmi.dgc.server.gcInterval=360"

Other:
lsof|grep solr|wc -l
2493
ulimit -a
open files (-n) 9000
Tomcat: connectionTimeout=2, maxThreads=100
Total Solr cores on same instance - 65
useCompoundFile - true

The tests I ran:

While the Indexer is running:
1) Go to http://juum19.co.com:8080/solr - returns a blank page (no error in catalina.out)
2) Try telnet juum19.co.com 8080 - returns with "Connection closed by foreign host"

Stop the Indexer program (Tomcat is still running with Solr):
3) Go to http://juum19.co.com:8080/solr - works ok, shows the list of all the Solr cores
4) Try telnet - able to telnet fine

5) Now comment out all the caches in solrconfig.xml. Try the same tests, but Tomcat still doesn't respond. Is there a way to stop the auto-warmer? I commented out the caches in the solrconfig.xml but still see the following log:

INFO: autowarming result for searc...@3aba3830 main fieldValueCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
INFO: Closing searc...@175dc1e2 main fieldValueCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0} filterCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0} queryResultCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0} documentCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}

6) Change the Indexer frequency so it runs every 2 min (instead of all the time). I noticed that once the commit is done, I'm able to run my searches. During the commit and auto-warming period I just get a blank page.
7) Changed from Solrj to XML update - I still get the blank page whenever an update/commit is happening.

Apr 13, 2009 6:46:18
Re: Question on StreamingUpdateSolrServer
One more thing. I don't think this was mentioned, but you can:

- optimize your indices
- use the compound index format

That will lower the number of open file handles. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message From: vivek sar vivex...@gmail.com To: solr-user@lucene.apache.org Sent: Friday, April 10, 2009 5:59:37 PM Subject: Re: Question on StreamingUpdateSolrServer I also noticed that the Solr app has over 6000 file handles open - "lsof | grep solr | wc -l" shows 6455. I've 10 cores (using multi-core) managed by the same Solr instance. As soon as I start up Tomcat the open file count goes up to 6400. Few questions: 1) Why is Solr holding on to all the segments from all the cores - is it because of the auto-warmer? 2) How can I reduce the open file count? 3) Is there a way to stop the auto-warmer? 4) Could this be related to Tomcat returning a blank page for every request? Any ideas? Thanks, -vivek

On Fri, Apr 10, 2009 at 1:48 PM, vivek sar wrote: Hi, I was using CommonsHttpSolrServer for indexing, but having two threads writing (10K batches) at the same time was throwing "ProtocolException: Unbuffered entity enclosing request can not be repeated." I switched to StreamingUpdateSolrServer (using addBeans) and I don't see the problem anymore. The speed is very fast - getting around 25k/sec (single thread), but I'm facing another problem. When the indexer using StreamingUpdateSolrServer is running I'm not able to send any url request from the browser to the Solr web app. I just get a blank page. I can't even get to the admin interface. I'm also not able to shut down the Tomcat running the Solr webapp when the indexer is running. I've to first stop the indexer app and then stop Tomcat. I don't have this problem when using CommonsHttpSolrServer. Here is how I'm creating it: server = new StreamingUpdateSolrServer(url, 1000, 3); I simply call server.addBeans(...) on it. Is there anything else I need to do to make use of StreamingUpdateSolrServer? Why does Tomcat become unresponsive when the indexer using StreamingUpdateSolrServer is running (though indexing happens fine)? Thanks, -vivek
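A minimal sketch of the first suggestion: issuing an optimize from Solrj, which merges segments down and so reduces the number of open files per core. The core URL is a placeholder:

import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class OptimizeCore {
  public static void main(String[] args) throws Exception {
    CommonsHttpSolrServer server = new CommonsHttpSolrServer("http://localhost:8080/solr/core0");
    server.optimize(); // blocks until the merge completes
  }
}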
StreamingUpdateSolrServer and DIH
Hey there, I have been reading about StreamingUpdateSolrServer but can't figure out exactly how it works: "More efficient index construction over http with solrj. If you're doing it, this is a fantastic performance improvement. Adding a StreamingUpdateSolrServer that writes update commands to an open HTTP connection. If you are using solrj for bulk update requests you should consider switching to this implementation. However, note that the error handling is not immediate as it is with the standard SolrServer." Is there any way to use it in DataImportHandler? Thanks in advance
Re: Question on StreamingUpdateSolrServer
Thanks Otis. I did increase the number of file descriptors to 22K, but I still get this problem. I've noticed the following so far:

1) As soon as I get to around 1140 index segments (this is the total over multiple cores) I start seeing this problem.
2) When the problem starts, occasionally the index request (solrserver.commit) also fails with the following error: java.net.SocketException: Connection reset
3) Whenever the commit fails, I'm able to access Solr by the browser (http://ets11.co.com/solr). If the commit is successful and going on, I get a blank page on Firefox. Even telnet to 8080 fails with "Connection closed by foreign host".

It does seem like there is some resource issue, as it happens only once we reach a breaking point (too many index segment files) - lsof at this point usually shows around 1400, but my ulimit is much higher than that. I already use the compound format for index files. I can also run optimize occasionally (though not preferred, as it blocks the whole index cycle for a long time). I do want to find out what resource limitation is causing this; it has to do something with the indexer committing records when there are a large number of segment files. Any other ideas? Thanks, -vivek

On Wed, Apr 15, 2009 at 3:10 PM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote: One more thing. I don't think this was mentioned, but you can: optimize your indices, and use the compound index format. That will lower the number of open file handles. Otis [...]
Re: StreamingUpdateSolrServer and DIH
On Thu, Apr 16, 2009 at 3:45 AM, Marc Sturlese marc.sturl...@gmail.com wrote: Hey there, I have been reading about StreamingUpdateSolrServer but can't figure out exactly how it works [...]

StreamingUpdateSolrServer tries to optimize use of the HTTP connection by posting multiple add commands in the same request. It also allows you to do the same task in multiple threads.

"Adding a StreamingUpdateSolrServer that writes update commands to an open HTTP connection. If you are using solrj for bulk update requests you should consider switching to this implementation. However, note that the error handling is not immediate as it is with the standard SolrServer."

Yeah, true. CommonsHttpSolrServer has an add(Iterator<SolrInputDocument>) method which is efficient (but does the update in the calling thread), and you get to know about errors immediately.

Is there any way to use it in DataImportHandler?

DIH and StreamingUpdateSolrServer? No, I cannot imagine a way. -- --Noble Paul
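A sketch of bulk updates through StreamingUpdateSolrServer outside DIH. The queue size (1000) and thread count (3) follow the values quoted elsewhere in this digest; the URL, loop, and field names are placeholders:

import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BulkUpdate {
  public static void main(String[] args) throws Exception {
    StreamingUpdateSolrServer server =
        new StreamingUpdateSolrServer("http://localhost:8983/solr", 1000, 3);
    for (int i = 0; i < 100000; i++) {
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", "doc" + i);
      server.add(doc); // queued; sent over open HTTP connections by worker threads
    }
    server.commit(); // note: errors may surface late, as the thread above warns
  }
}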
Re: DataImporter : Java heap space
Hi Bryan, Thanks a lot. It is invoking the wrong method. It should have been:

bsz = context.getVariableResolver().replaceTokens(bsz);

It was a silly mistake. --Noble

On Thu, Apr 16, 2009 at 2:13 AM, Bryan Talbot btal...@aeriagames.com wrote: I think there is a bug in the 1.4 daily builds of the data import handler which is causing the batchSize parameter to be ignored. This was probably introduced with more recent patches to resolve variables. The affected code is in JdbcDataSource.java [...] -Bryan
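Putting Bryan's snippet and the one-line fix together, the corrected block in JdbcDataSource.java would presumably read:

String bsz = initProps.getProperty("batchSize");
if (bsz != null) {
  bsz = context.getVariableResolver().replaceTokens(bsz); // was: resolve(bsz), which returned null
  try {
    batchSize = Integer.parseInt(bsz);
    if (batchSize == -1)
      batchSize = Integer.MIN_VALUE;
  } catch (NumberFormatException e) {
    LOG.warn("Invalid batch size: " + bsz);
  }
}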
want to Unsubscribe from Solr Mailing List
Hi, I wish to unsubscribe from the list. My email address is neha_bhard...@peristent.co.in Thanks for all the help and support.

Thanks and Regards, Neha Bhardwaj | Software Engineer | Persistent Systems Limited | bhard...@persistent.co.in | Cell: +91 9272383082 | Tel: +91 (20) 3023 5257 Innovation in software product design, development and delivery - www.persistentsys.com

DISCLAIMER == This e-mail may contain privileged and confidential information which is the property of Persistent Systems Ltd. It is intended only for the use of the individual or entity to which it is addressed. If you are not the intended recipient, you are not authorized to read, retain, copy, print, distribute or use this message. If you have received this communication in error, please notify the sender and delete all copies of this message. Persistent Systems Ltd. does not accept any liability for virus infected mails.
Re: DataImporter : Java heap space
Aah, Bryan you got it... Thanks! Noble: so I can hope that it'll be fixed soon :) Thank you for fixing it... please let me know when it's done. Thanks! Mani Kumar

2009/4/16 Noble Paul നോബിള് नोब्ळ् noble.p...@gmail.com: Hi Bryan, Thanks a lot. It is invoking the wrong method. It should have been bsz = context.getVariableResolver().replaceTokens(bsz); It was a silly mistake. --Noble [...]
Re: want to Unsubscribe from Solr Mailing List
Dear Lady, this information is available on the http://lucene.apache.org/solr/mailing_lists.html page. Thank you for unsubscribing! -Mani

On Thu, Apr 16, 2009 at 10:16 AM, Neha Bhardwaj neha_bhard...@persistent.co.in wrote: Hi, I wish to unsubscribe from the list. My email address is neha_bhard...@peristent.co.in Thanks for all the help and support. [...]