Re: indexing txt file

2009-04-15 Thread Alejandro Gonzalez
But you need to index the text inside these files, right? You need to read
the text from each file and include it in a field in the XML (of course this
field must be defined in the schema). You can do that with a script and then
post the XML to Solr.

what amount/rate of generated text files are you thinking about?

On Tue, Apr 14, 2009 at 7:07 PM, Alex Vu alex.v...@gmail.com wrote:

 I just want to be able to index my text file, and other files that carry
 the same format but with different IP addresses, ports, etc.

  I will have the traffic flow running in real-time.  Do you think Solr will
 be able to index a bunch of my text files in real time?

 On Tue, Apr 14, 2009 at 9:35 AM, Alejandro Gonzalez 
 alejandrogonzalezd...@gmail.com wrote:

  And I'm not sure I understand what you are trying to do, but maybe you
  should define a text field and fill it with the text of each file in order
  to index it, or maybe store a path to that file if that's what you
  want.
 
  On Tue, Apr 14, 2009 at 6:28 PM, Shalin Shekhar Mangar 
  shalinman...@gmail.com wrote:
 
   On Tue, Apr 14, 2009 at 9:44 PM, Alex Vu alex.v...@gmail.com wrote:
  
   
*schema file is *
<?xml version="1.0" encoding="UTF-8"?>
<!--W3C Schema generated by XMLSpy v2009 sp1 (http://www.altova.com)-->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="networkTraffic">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="packet" maxOccurs="unbounded">
          <xs:complexType>
            <xs:attribute name="terminationTimestamp" type="xs:string" use="required"/>
            <xs:attribute name="sourcePort" type="xs:string" use="required"/>
            <xs:attribute name="sourceIp" type="xs:string" use="required"/>
            <xs:attribute name="protocolPortNumber" type="xs:string" use="required"/>
            <xs:attribute name="packets" type="xs:string" use="required"/>
            <xs:attribute name="ok" type="xs:string" use="required"/>
            <xs:attribute name="initialTimestamp" type="xs:string" use="required"/>
            <xs:attribute name="flows" type="xs:string" use="required"/>
            <xs:attribute name="destinatoinIp" type="xs:string" use="required"/>
            <xs:attribute name="destinationPort" type="xs:string" use="required"/>
            <xs:attribute name="bytes" type="xs:string" use="required"/>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
   
   
Can someone please show me where I should put these files?  I'm aware that
the schema.xsd file goes into the conf directory. What about my xml file and
txt file?
   
  
   Alex, the Solr schema is not the usual XML Schema (xsd). It is an XML file
   which describes the fields, their analyzers, tokenizers, copyFields,
   default search field, etc.
  
   Look into the example schema supplied by Solr (inside the
   example/solr/conf directory) and modify it according to your needs.
  
   --
   Regards,
   Shalin Shekhar Mangar.
  
 



Re: indexing txt file

2009-04-15 Thread Fergus McMenemie
Hi all,
I'm trying to use Solr 1.3 to index a text file.  I wrote a
schema.xsd and an xml file.

Just to make sure I understand things

Do you just have one of these text files, containing many reports?
   Or
Do you have many of these text files each containing one report?

Also, is the report a single line, that has been wrapped for email?

Fergus.


*The content of my text file is *
#src             dst             proto  ok  sport  dport  pkts  bytes  flows  first                  latest
192.168.220.135  26.147.238.146  6      1   32839  80     6     463    1      1237333861.465764000   1237333861.664701000

*schema file is *
<?xml version="1.0" encoding="UTF-8"?>
<!--W3C Schema generated by XMLSpy v2009 sp1 (http://www.altova.com)-->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="networkTraffic">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="packet" maxOccurs="unbounded">
          <xs:complexType>
            <xs:attribute name="terminationTimestamp" type="xs:string" use="required"/>
            <xs:attribute name="sourcePort" type="xs:string" use="required"/>
            <xs:attribute name="sourceIp" type="xs:string" use="required"/>
            <xs:attribute name="protocolPortNumber" type="xs:string" use="required"/>
            <xs:attribute name="packets" type="xs:string" use="required"/>
            <xs:attribute name="ok" type="xs:string" use="required"/>
            <xs:attribute name="initialTimestamp" type="xs:string" use="required"/>
            <xs:attribute name="flows" type="xs:string" use="required"/>
            <xs:attribute name="destinatoinIp" type="xs:string" use="required"/>
            <xs:attribute name="destinationPort" type="xs:string" use="required"/>
            <xs:attribute name="bytes" type="xs:string" use="required"/>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>


*and my xml file is *

<?xml version="1.0" encoding="UTF-8"?>
<networkTraffic xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:noNamespaceSchemaLocation="C:\DOCUME~1\tpham\Desktop\networkTraffic.xsd">
  <packet sourceIp="192.168.54.23" destinatoinIp="192.168.0.1"
    protocolPortNumber="6" ok="1" sourcePort="32439" destinationPort="80"
    packets="6" bytes="463" flows="1" initialTimestamp="1237963861.465764000"
    terminationTimestamp="1237963861.664701000"/>
  <packet sourceIp="192.168.56.23" destinatoinIp="192.168.0.1"
    protocolPortNumber="17" ok="1" sourcePort="32439" destinationPort="80"
    packets="6" bytes="463" flows="1" initialTimestamp="1237963861.465764000"
    terminationTimestamp="1237963861.664701000"/>
  <packet sourceIp="192.168.74.23" destinatoinIp="192.168.0.1"
    protocolPortNumber="6" ok="1" sourcePort="32139" destinationPort="80"
    packets="6" bytes="463" flows="1" initialTimestamp="1237963861.465764000"
    terminationTimestamp="1237963861.664701000"/>
  <packet sourceIp="192.168.54.123" destinatoinIp="192.168.0.1"
    protocolPortNumber="6" ok="1" sourcePort="32839" destinationPort="80"
    packets="6" bytes="463" flows="1" initialTimestamp="1237963861.465764000"
    terminationTimestamp="1237963861.664701000"/>
  <packet sourceIp="192.168.14.23" destinatoinIp="192.168.0.1"
    protocolPortNumber="17" ok="1" sourcePort="32839" destinationPort="80"
    packets="6" bytes="463" flows="1" initialTimestamp="1237963861.465764000"
    terminationTimestamp="1237963861.664701000"/>
  <packet sourceIp="192.168.5.23" destinatoinIp="192.168.0.1"
    protocolPortNumber="17" ok="1" sourcePort="32439" destinationPort="80"
    packets="6" bytes="463" flows="1" initialTimestamp="1237963861.465764000"
    terminationTimestamp="1237963861.664701000"/>
  <packet sourceIp="192.168.15.23" destinatoinIp="192.168.0.1"
    protocolPortNumber="6" ok="1" sourcePort="36839" destinationPort="80"
    packets="6" bytes="463" flows="1" initialTimestamp="1237963861.465764000"
    terminationTimestamp="1237963861.664701000"/>
  <packet sourceIp="192.168.24.23" destinatoinIp="192.168.0.1"
    protocolPortNumber="6" ok="1" sourcePort="32839" destinationPort="80"
    packets="6" bytes="463" flows="1" initialTimestamp="1237963861.465764000"
    terminationTimestamp="1237963861.664701000"/>
</networkTraffic>



Can someone please show me where I should put these files?  I'm aware that the
schema.xsd file goes into the conf directory. What about my xml file and
txt file?

Thank you,
Alex


On Tue, Apr 14, 2009 at 12:37 AM, Alejandro Gonzalez 
alejandrogonzalezd...@gmail.com wrote:

 You should construct the XML containing the fields defined in your
 schema.xml and give them the values from the text files. For example, if you
 have a schema defining two fields, "title" and "text", you should construct
 an XML document with a field "title" and its value and another called "text"
 containing the body of your doc. Then you can post it to the Solr you have
 deployed and make a commit, and it's done. It's also possible to construct an
 XML file defining more than just one doc:


 <add>
 <doc>
 <field name="title">doc1 title</field>
 <field name="text">doc1 text</field>
 </doc>
 .
 .
 .
 <doc>
 

Re: Disable logging in SOLR

2009-04-15 Thread Kraus, Ralf | pixelhouse GmbH

Bill Au schrieb:

Have you tried setting logging level to OFF from Solr's admin GUI:
http://wiki.apache.org/solr/SolrAdminGUI
  

Thx for the hint!

But after I restart my Tomcat it's all reset to default? :-(

Greets -Ralf-


Re: indexing txt file

2009-04-15 Thread Shalin Shekhar Mangar
On Tue, Apr 14, 2009 at 10:37 PM, Alex Vu alex.v...@gmail.com wrote:

 I just want to be able to index my text file, and other files that carry
 the same format but with different IP addresses, ports, etc.


Alex, Solr consumes XML (in a specific format) and CSV. It can consume plain
text through the ExtractingRequestHandler. It can also index DBs and other XML formats.

You can write a java program, parse your text file, and use Solrj client to
send data to Solr. You could also write a program in any language you want
and convert those text files to CSV or XML and post them to Solr.

http://wiki.apache.org/solr/UpdateXmlMessages
http://wiki.apache.org/solr/UpdateCSV
http://wiki.apache.org/solr/Solrj
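
For illustration, a minimal SolrJ sketch of that approach. The Solr URL, the
unique-key choice, and the field names below are assumptions for the example
and would have to be defined in your schema.xml:

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class FlowLineIndexer {
    public static void main(String[] args) throws Exception {
        // Assumed Solr URL; point it at your own deployment.
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

        // One tab-separated flow record, as in the text file quoted above.
        String line = "192.168.220.135\t26.147.238.146\t6\t1\t32839\t80\t6\t463\t1"
                    + "\t1237333861.465764000\t1237333861.664701000";
        String[] c = line.split("\t");

        // Hypothetical field names; each must exist in schema.xml.
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", c[0] + ":" + c[4] + ":" + c[9]); // some unique key
        doc.addField("sourceIp", c[0]);
        doc.addField("destinationIp", c[1]);
        doc.addField("protocol", c[2]);
        doc.addField("sourcePort", c[4]);
        doc.addField("destinationPort", c[5]);
        doc.addField("bytes", c[7]);

        server.add(doc);
        server.commit(); // in practice, commit per batch rather than per document
    }
}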



  I will have the traffic flow running in real-time.  Do you think Solr will
 be able to index a bunch of my text files in real time?


I don't think Solr is very suitable for this task. You can add the files to
Solr at any time but you won't be able to search on them immediately. You
should batch the commits (you can also use the maxDocs/maxTime properties in
the autoCommit section in solrconfig.xml)
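
A rough SolrJ sketch of what batching the commits might look like (the batch
size of 1000 is an arbitrary assumption; autoCommit in solrconfig.xml is the
configuration-only alternative):

import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BatchedAdder {
    // Adds documents in batches and commits once per batch instead of per document.
    public static void addAll(SolrServer server, Iterable<SolrInputDocument> docs)
            throws Exception {
        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
        for (SolrInputDocument doc : docs) {
            batch.add(doc);
            if (batch.size() >= 1000) {   // assumed batch size
                server.add(batch);
                server.commit();
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            server.add(batch);
            server.commit();
        }
    }
}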

-- 
Regards,
Shalin Shekhar Mangar.


Maven repositories

2009-04-15 Thread Gustavo Lopes
Hi, does anyone know the location of the maven snapshot repositories for 
solr 1.4-SNAPSHOT?


Thanks

--
Gustavo Lopes




Re: Maven repositories

2009-04-15 Thread Shalin Shekhar Mangar
On Wed, Apr 15, 2009 at 3:30 PM, Gustavo Lopes galo...@mediacapital.ptwrote:

 Hi, does anyone know the location of the maven snapshot repositories for
 solr 1.4-SNAPSHOT?


http://people.apache.org/repo/m2-snapshot-repository/org/apache/solr/

Disclaimer - Unreleased artifacts built from trunk. Use them at your own
risk.

-- 
Regards,
Shalin Shekhar Mangar.


Re: Disable logging in SOLR

2009-04-15 Thread Bill Au
Yes, restarting Tomcat will reset things back to default.  But you should be
able to configure Tomcat to disable Solr logging since Solr uses JDK
logging.

Bill

On Wed, Apr 15, 2009 at 4:51 AM, Kraus, Ralf | pixelhouse GmbH 
r...@pixelhouse.de wrote:

 Bill Au schrieb:

 Have you tried setting logging level to OFF from Solr's admin GUI:
 http://wiki.apache.org/solr/SolrAdminGUI


 thx 4 the hint !

 But after I restart my tomcat its all reseted to default ? :-(

 Greets -Ralf-



Re: Disable logging in SOLR

2009-04-15 Thread Mark Miller

Kraus, Ralf | pixelhouse GmbH wrote:

Hi,

is there a way to disable all logging output in SOLR ?
I mean the output text like :

INFO: [core_de] webapp=/solr path=/update params={wt=json} status=0 
QTime=3736


greets -Ralf-

You probably do not want to totally disable logging in Solr. More
likely, you're looking to make Solr less chatty by not logging at the INFO
level. Solr is a bit chatty by default, mostly, I think, because that can
be very useful and is often worth the likely very small performance hit
of all the extra logging. At the least, though, I think you want to leave
Severe/Error logging on in most cases, and possibly WARN.


It's easy enough to change the logging levels though. Solr 1.3 uses
java.util.logging and Solr 1.4 uses SLF4J, defaulting to java.util.logging.


So you can either change the system-level properties file in your JDK
folder, or you can use a param at startup:
-Djava.util.logging.config.file=/path/to/my/logging.properties


Then set up a props file. Here is an example from the wiki:


# Default global logging level:
.level= INFO

# Write to a file:
handlers= java.util.logging.FileHandler

# Write log messages in XML format:
java.util.logging.FileHandler.formatter = java.util.logging.XMLFormatter 


# Log to the current working directory, with log files named solrxxx.log
java.util.logging.FileHandler.pattern = ./solr%u.log
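
For completeness, the same effect can be sketched programmatically with the
JDK logging API. This assumes you can run a line of code inside the same JVM
(for example from a context listener); the usual route is still the
logging.properties file above:

import java.util.logging.Level;
import java.util.logging.Logger;

public class QuietSolrLogging {
    public static void main(String[] args) {
        // Raise the threshold so INFO request lines are suppressed
        // while WARNING and SEVERE messages still get through.
        Logger.getLogger("").setLevel(Level.WARNING);                 // root logger
        Logger.getLogger("org.apache.solr").setLevel(Level.WARNING);  // Solr loggers
    }
}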



--
- Mark

http://www.lucidimagination.com





Re: Disable logging in SOLR

2009-04-15 Thread Kraus, Ralf | pixelhouse GmbH

Mark Miller schrieb:

Kraus, Ralf | pixelhouse GmbH wrote:

Hi,

is there a way to disable all logging output in SOLR ?
I mean the output text like :

INFO: [core_de] webapp=/solr path=/update params={wt=json} status=0 
QTime=3736


greets -Ralf-

You probably do not want to totally disable logging in Solr. More 
likely, your looking to make Solr less chatty by not logging the INFO 
level. Solr is a bit chatty by default, mostly I think, because that 
can be very useful and is often worth the likely very small 
performance hit of all the extra logging. At the least though, I think 
you want to leave Severe/Error logging on in most cases, and possibly 
WARN.


Its easy enough to change the logging levels though. Solr 1.3 uses 
java.util.logging and Solr 1.4 uses SLF4J defaulting to 
java.util.logging.


So you can either change the system level properties file in your JDK 
folder, or you can use a param at startup: 
-Djava.util.logging.config.file=/path/to/my/logging.properties

That's exactly the way I chose yesterday ;-)

Thx

Greets -Ralf-


Re: [solr-user] Upgrade from 1.2 to 1.3 gives 3x slowdown

2009-04-15 Thread Fergus McMenemie
On Apr 2, 2009, at 9:23 AM, Fergus McMenemie wrote:

 Grant,



 I should note, however, that the speed difference you are seeing may
 not be as pronounced as it appears.  If I recall during ApacheCon, I
 commented on how long it takes to shutdown your Solr instance when
 exiting it.  That time it takes is in fact Solr doing the work that
 was put off by not committing earlier and having all those deletes
 pile up.

 I am confused about work that was put off vs committing. My script
 was doing a commit right after the CSV import, and you are right
 about the massive times required to shut Tomcat down. But in my tests
 the time taken to do the commit was under a second, yet I had to allow
 300 secs for Tomcat shutdown. Also I don't have any duplicates. So
 what sort of work was being done at shutdown that was not being done
 by a commit? Optimise!


The work being done is addressing the deletes, AIUI, but of course  
there are other things happening during shutdown, too.
There are no deletes to do. It was a clean index to begin with
and there were no duplicates.

How long is the shutdown if you do a commit first and then a shutdown?
Still very long, sometimes 300sec. My script always did a commit!

At any rate, I don't know that there is a satisfying answer to the
larger issue due to things like the fsync stuff, which is an
overall win for Lucene/Solr despite it being slower.  Have you
tried running the tests on other machines (non-Mac)?
Nope. Although next week I will have a real PC running Vista, so
I could try it there.

I think we should knock this on the head and move on. I rarely
need to index this content and I can take the performance hit,
and of course your workaround provides a good speed-up.

Regards Fergus.
-- 

===
Fergus McMenemie   Email:fer...@twig.me.uk
Techmore Ltd   Phone:(UK) 07721 376021

Unix/Mac/Intranets Analyst Programmer
===


looking at the results of a distributed search using shards.

2009-04-15 Thread Fergus McMenemie
Hi,

Having all kinds of fun with distributed search using shards:-)

I have 30K documents indexed using DIH into one index. Another
index contains documents indexed using solr-cell. I am using shards
to search across both indexes.

I am trying to format the results returned from Solr such that the
source document can be linked to, and to do so I think I need to
know which shard a particular result came from. Is this a FAQ?

Regards
-- 

===
Fergus McMenemie   Email:fer...@twig.me.uk
Techmore Ltd   Phone:(UK) 07721 376021

Unix/Mac/Intranets Analyst Programmer
===


Re: solr 1.3 + tomcat 5.5

2009-04-15 Thread Shalin Shekhar Mangar
From the log it seems like there is a solr.xml inside
var/lib/tomcat5/webapps/ which Tomcat is trying to deploy and failing. Very
strange. You should remove that file and see if that fixes it.


On Tue, Apr 14, 2009 at 11:35 PM, andrysha nihuhoid nihuh...@gmail.comwrote:

 Hi, got a problem setting up Solr + Tomcat:
 Tomcat 5.5 + Apache Solr 1.3.0 + CentOS 5.3
 I'm not familiar with Java at all, so sorry if it's a dumb question.
 Here is what I did:
 placed solr.war in the webapps folder
 changed solr home to /etc/solr
 copied the contents of the Solr distribution example folder to /etc/solr

 Tomcat starts successfully and I can even access the admin interface, but the
 following error appears in catalina.out every 10 seconds:
 SEVERE: Error deploying configuration descriptor
 var#lib#tomcat5#webapps#solr.xml
 Apr 14, 2009 1:30:14 PM org.apache.catalina.startup.HostConfig
 deployDescriptor
 SEVERE: Error deploying configuration descriptor etc#solr#.xml
 Apr 14, 2009 1:30:24 PM org.apache.catalina.startup.HostConfig
 deployDescriptor
 SEVERE: Error deploying configuration descriptor
 var#lib#tomcat5#webapps#solr.xml
 Apr 14, 2009 1:30:24 PM org.apache.catalina.startup.HostConfig
 deployDescriptor
 SEVERE: Error deploying configuration descriptor etc#solr#.xml
 Apr 14, 2009 1:30:34 PM org.apache.catalina.startup.HostConfig
 deployDescriptor
 SEVERE: Error deploying configuration descriptor
 var#lib#tomcat5#webapps#solr.xml
 Apr 14, 2009 1:30:34 PM org.apache.catalina.startup.HostConfig
 deployDescriptor
 SEVERE: Error deploying configuration descriptor etc#solr#.xml
 Apr 14, 2009 1:30:44 PM org.apache.catalina.startup.HostConfig
 deployDescriptor
 SEVERE: Error deploying configuration descriptor
 var#lib#tomcat5#webapps#solr.xml
 Apr 14, 2009 1:30:44 PM org.apache.catalina.startup.HostConfig
 deployDescriptor
 SEVERE: Error deploying configuration descriptor etc#solr#.xml
 Apr 14, 2009 1:30:54 PM org.apache.catalina.startup.HostConfig
 deployDescriptor
 SEVERE: Error deploying configuration descriptor
 var#lib#tomcat5#webapps#solr.xml
 Apr 14, 2009 1:30:54 PM org.apache.catalina.startup.HostConfig
 deployDescriptor
 SEVERE: Error deploying configuration descriptor etc#solr#.xml


 Googled for about 3 hours.

 tried to allow write permissions for all on /etc, /etc/solr, /var/
 lib/tomcat5/webapps
 tried to create an empty file named solr.xml in /etc, /etc/solr
 tried to copy solrconfig.xml to /etc/ and /etc/solr




-- 
Regards,
Shalin Shekhar Mangar.


Re: Distinct terms in facet field

2009-04-15 Thread Shalin Shekhar Mangar
On Wed, Apr 15, 2009 at 1:13 AM, Harsch, Timothy J. (ARC-SC)[PEROT SYSTEMS]
timothy.j.har...@nasa.gov wrote:

 How could I get a count of distinct terms for a given query?  For example:
 The Wiki page
 http://wiki.apache.org/solr/SimpleFacetParameters
 has a section Facet Fields with No Zeros
 which shows the query:

 http://localhost:8983/solr/select?q=ipodrows=0facet=truefacet.limit=-1facet.field=catfacet.mincount=1facet.field=inStock
 and returns results where the inStock field has two facet counts (false is
 3, and true is 1)

 But what I would want to know is how many distinct values were found (in
 this case it would be 2: true and false).  I realize I could count the
 number of terms returned, but if the set were large that would be
 non-performant.  Is there a better way?


To do this with facets, you'd need to return all of them. The other way of
doing this is by making a request to /admin/luke?fl=inStock which will
return the number of unique terms in that field.

http://wiki.apache.org/solr/LukeRequestHandler

You can also index the number of unique values in a field as a separate
field.

-- 
Regards,
Shalin Shekhar Mangar.


Re: Index Replication or Distributed Search ?

2009-04-15 Thread Shalin Shekhar Mangar
On Wed, Apr 15, 2009 at 5:07 AM, ramanathan ramanat...@youinweb-inc.comwrote:


 Hi,

  Can someone provide practical advice on how large a Solr search index can
  be for good performance on a consumer-facing media website?


The right answer is that it depends :)

It depends on the number of documents, size of a document, number of unique
terms, kind of queries, frequency of updates etc.



  Is it good or bad to think about Distributed Search and dividing the index
  at an early stage of development?


If your index can fit into a single box with acceptable response times, then
this is the simplest way to get started. If not, then you'll need to use
Distributed Search. Note that many installations use distributed search *and*
replication together to handle large traffic.

These are some good resources on this topic:

http://wiki.apache.org/solr/LargeIndexes
http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Scaling-Lucene-and-Solr

Ask away if you have specific questions.
-- 
Regards,
Shalin Shekhar Mangar.


Using CSV for indexing ... Remote Streaming disabled

2009-04-15 Thread vivek sar
Hi,

  I'm trying to use CSV (Solr 1.4, 03/29 build) for indexing, following the wiki
(http://wiki.apache.org/solr/UpdateCSV). I've updated
solrconfig.xml to have these lines:

<requestDispatcher handleSelect="true">
  <requestParsers enableRemoteStreaming="true"
      multipartUploadLimitInKB="20480" />
  ...
</requestDispatcher>

   <requestHandler name="/update/csv" class="solr.CSVRequestHandler"
       startup="lazy" />

When I try to upload the csv,

  curl 
'http://localhost:8080/solr/20090414_1/update/csv?commit=trueseparator=%09escape=%5cstream.file=/Users/opal/temp/afterchat/data/csv/1239759267339.csv'

I get following response,

</head><body><h1>HTTP Status 400 - Remote Streaming is
disabled.</h1><HR size="1" noshade="noshade"><p><b>type</b> Status
report</p><p><b>message</b> <u>Remote Streaming is
disabled.</u></p><p><b>description</b> <u>The request sent by the
client was syntactically incorrect (Remote Streaming is
disabled.).</u></p><HR size="1" noshade="noshade"><h3>Apache
Tomcat/6.0.18</h3></body></html>

Why is it complaining about the remote streaming if it's already
enabled? Is there anything I'm missing?

Thanks,
-vivek


Re: looking at the results of a distributed search using shards.

2009-04-15 Thread Grant Ingersoll


On Apr 15, 2009, at 11:18 AM, Fergus McMenemie wrote:


Hi,

Having all kinds of fun with distributed search using shards:-)

I have 30K documents indexed using DIH into an index. Another
index contain documents indexed using solr-cell. I am using shards
to search across both indexes.

I am trying to format the results returned from solr such the
source document can be linked to, and to do so I think I need to
know which shard a particular result came from. Is this a FAQ?


+1, assuming you mean to add it as a FAQ and aren't asking if it  
already is one.





Regards
--

===
Fergus McMenemie   Email:fer...@twig.me.uk
Techmore Ltd   Phone:(UK) 07721 376021

Unix/Mac/Intranets Analyst Programmer
===


--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:

http://www.lucidimagination.com/search



Commits taking too long

2009-04-15 Thread vivek sar
Hi,

  I have an index where I commit every 50K records (using Solrj). Usually
this commit takes 20 sec to complete, but every now and then the commit
takes way too long - from 10 min to 30 min. I see more delays as the
index size continues to grow - once it gets over 5G I start seeing
long commit cycles more frequently. See this for example:

Apr 15, 2009 12:04:13 AM org.apache.solr.update.DirectUpdateHandler2 commit
INFO: start commit(optimize=false,waitFlush=false,waitSearcher=false)
Apr 15, 2009 12:39:58 AM org.apache.solr.core.SolrDeletionPolicy onCommit
INFO: SolrDeletionPolicy.onCommit: commits:num=2

commit{dir=/Users/vivek/demo/afterchat/solr/multicore/20090414_1/data/index,segFN=segments_fq,version=1239747075391,generation=566,filenames=[_19m.cfs,
_jm.cfs, _1bk.cfs, _193.cfx, _19z.cfs, _1b8.cfs, _1bf.cfs, _10g.cfs, _
2s.cfs, _1bf.cfx, _18x.cfx, _19c.cfx, _193.cfs, _18x.cfs, _1b7.cfs,
_1aw.cfs, _1aq.cfs, _1bi.cfx, _1a6.cfs, _19l.cfs, _1ad.cfs, _1a6.cfx,
_1as.cfs, _19l.cfx, _1aa.cfs, _1an.cfs, _19d.cfs, _1a3.cfx, _1a3.cfs,
_19g.cfs, _b7.cfs, _19
e.cfs, _19b.cfs, _1ab.cfs, _1b3.cfx, _19j.cfs, _190.cfs, _uu.cfs,
_1b3.cfs, _1ak.cfs, _19p.cfs, _195.cfs, _194.cfs, _19i.cfx, _199.cfs,
_19i.cfs, _19o.cfx, _196.cfs, _199.cfx, _196.cfx, _19o.cfs, _190.cfx,
_xn.cfs, _1b0.cfx, _1at.
cfs, _1av.cfs, _1ao.cfs, _1a9.cfx, _1b0.cfs, _5l.cfs, _1ao.cfx,
_1ap.cfs, _1b6.cfx, _19a.cfs, _139.cfs, _1a1.cfs, _s1.cfs, _1b6.cfs,
_1a9.cfs, _197.cfs, _1bd.cfs, _19n.cfs, _1au.cfx, _1au.cfs, _1a5.cfs,
_1be.cfs, segments_fq, _1b4.cfs, _gt.cfs, _1ag.cfs, _18z.cfs,
_162.cfs, _1a4.cfs, _198.cfs, _19x.cfs, _1ah.cfs, _1ai.cfs, _19q.cfs,
_1a7.cfs, _1ae.cfs, _19h.cfs, _19x.cfx, _1a2.cfs, _1bj.cfs, _1bb.cfs,
_1b1.cfs, _1ai.cfx, _19r.cfs, _18y.cfs, _19u.cfx, _1a8.
cfs, _19u.cfs, _1aj.cfs, _19r.cfx, _1ac.cfs, _1az.cfs, _1ac.cfx,
_19y.cfs, _1bc.cfx, _19s.cfs, _1ar.cfs, _1al.cfx, _1bg.cfs, _18v.cfs,
_1ar.cfx, _1bc.cfs, _1a0.cfx, _1b2.cfs, _1af.cfs, _1bi.cfs, _1af.cfx,
_19f.cfs, _1a0.cfs, _1bh.cfs, _19f.cfx, _19c.cfs, _e0.cfs, _1ax.cfx,
_1b5.cfs, _191.cfs, _18w.cfs, _19t.cfs, _8e.cfs, _19v.cfs, _192.cfs,
_1b9.cfs, _1ay.cfs, _p8.cfs, _19k.cfs, _1b9.cfx, _1ax.cfs, _1am.cfs,
_1ba.cfs, _mf.cfs, _1al.cfs, _19w.cfs]

commit{dir=/Users/vivek/demo/afterchat/solr/multicore/20090414_1/data/index,segFN=segments_fr,version=1239747075392,generation=567,filenames=[_jm.cfs,
_1bo.cfs, _xn.cfs, segments_fr, _8e.cfs, _gt.cfs, _18v.cfs, _uu.cfs,
_1
0g.cfs, _2s.cfs, _5l.cfs, _162.cfs, _p8.cfs, _139.cfs, _s1.cfs,
_mf.cfs, _b7.cfs, _e0.cfs]
Apr 15, 2009 12:39:58 AM org.apache.solr.core.SolrDeletionPolicy updateCommits
INFO: last commit = 1239747075392

Here is my default index settings,

 <indexDefaults>
   <!-- Values here affect all index writers and act as a default
        unless overridden. -->
   <useCompoundFile>true</useCompoundFile>
   <mergeFactor>100</mergeFactor>
   <!-- <maxBufferedDocs>1</maxBufferedDocs> -->
   <ramBufferSizeMB>64</ramBufferSizeMB>
   <maxMergeDocs>2147483647</maxMergeDocs>
   <maxFieldLength>1</maxFieldLength>
   <writeLockTimeout>1000</writeLockTimeout>
   <commitLockTimeout>1</commitLockTimeout>
   <lockType>single</lockType>
 </indexDefaults>

What am I doing wrong here? What's causing these delays?

Thanks,
-vivek


Re: indexing txt file

2009-04-15 Thread Alex Vu
  what amount/rate of generated text files are you thinking about?

I have 1TB worth of text files coming in every couple of minutes in
real-time.  In about 10 minutes I will have 4TB worth of text files.

  Do you just have one of these text files, containing many reports?
  Do you have many of these text files each containing one report?
  Also, is the report a single line, that has been wrapped for email?

These files rotate every hour.  Each text file contains many
reports, and they are not wrapped for email.

Is there an effective way to use Solr to have it consistently index my text
files?

Please note that these files all have the same format.



On Wed, Apr 15, 2009 at 1:58 AM, Shalin Shekhar Mangar 
shalinman...@gmail.com wrote:

 On Tue, Apr 14, 2009 at 10:37 PM, Alex Vu alex.v...@gmail.com wrote:

  I just want to be able to index my text file, and other files that
 carries
  the same format but with different IP address, ports, ect.
 

 Alex, Solr consumes XML (in a specifc format) and CSV. It can consume plain
 text through ExtractIonHandler. It can index DBs, other XML formats.

 You can write a java program, parse your text file, and use Solrj client to
 send data to Solr. You could also write a program in any language you want
 and convert those text files to CSV or XML and post them to Solr.

 http://wiki.apache.org/solr/UpdateXmlMessages
 http://wiki.apache.org/solr/UpdateCSV
 http://wiki.apache.org/solr/Solrj


 
   I will have the traffic flow running in real-time.  Do you think Solr
 will
  be able to index a bunch of my text files in real time?
 

 I don't think Solr is very suitable for this task. You can add the files to
 Solr at any time but you won't be able to search on them immediately. You
 should batch the commits (you can also use the maxDocs/maxTime properties
 in
 the autoCommit section in solrconfig.xml)

 --
 Regards,
 Shalin Shekhar Mangar.



Re: Commits taking too long

2009-04-15 Thread Mark Miller

vivek sar wrote:

Hi,

  I've index where I commit every 50K records (using Solrj). Usually
this commit takes 20sec to complete, but every now and then the commit
takes way too long - from 10 min to 30 min. I see more delays as the
index size continues to grow - once it gets over 5G I start seeing
long commit cycles more frequently. See this for ex.,

Apr 15, 2009 12:04:13 AM org.apache.solr.update.DirectUpdateHandler2 commit
INFO: start commit(optimize=false,waitFlush=false,waitSearcher=false)
Apr 15, 2009 12:39:58 AM org.apache.solr.core.SolrDeletionPolicy onCommit
INFO: SolrDeletionPolicy.onCommit: commits:num=2

commit{dir=/Users/vivek/demo/afterchat/solr/multicore/20090414_1/data/index,segFN=segments_fq,version=1239747075391,generation=566,filenames=[_19m.cfs,
_jm.cfs, _1bk.cfs, _193.cfx, _19z.cfs, _1b8.cfs, _1bf.cfs, _10g.cfs, _
2s.cfs, _1bf.cfx, _18x.cfx, _19c.cfx, _193.cfs, _18x.cfs, _1b7.cfs,
_1aw.cfs, _1aq.cfs, _1bi.cfx, _1a6.cfs, _19l.cfs, _1ad.cfs, _1a6.cfx,
_1as.cfs, _19l.cfx, _1aa.cfs, _1an.cfs, _19d.cfs, _1a3.cfx, _1a3.cfs,
_19g.cfs, _b7.cfs, _19
e.cfs, _19b.cfs, _1ab.cfs, _1b3.cfx, _19j.cfs, _190.cfs, _uu.cfs,
_1b3.cfs, _1ak.cfs, _19p.cfs, _195.cfs, _194.cfs, _19i.cfx, _199.cfs,
_19i.cfs, _19o.cfx, _196.cfs, _199.cfx, _196.cfx, _19o.cfs, _190.cfx,
_xn.cfs, _1b0.cfx, _1at.
cfs, _1av.cfs, _1ao.cfs, _1a9.cfx, _1b0.cfs, _5l.cfs, _1ao.cfx,
_1ap.cfs, _1b6.cfx, _19a.cfs, _139.cfs, _1a1.cfs, _s1.cfs, _1b6.cfs,
_1a9.cfs, _197.cfs, _1bd.cfs, _19n.cfs, _1au.cfx, _1au.cfs, _1a5.cfs,
_1be.cfs, segments_fq, _1b4.cfs, _gt.cfs, _1ag.cfs, _18z.cfs,
_162.cfs, _1a4.cfs, _198.cfs, _19x.cfs, _1ah.cfs, _1ai.cfs, _19q.cfs,
_1a7.cfs, _1ae.cfs, _19h.cfs, _19x.cfx, _1a2.cfs, _1bj.cfs, _1bb.cfs,
_1b1.cfs, _1ai.cfx, _19r.cfs, _18y.cfs, _19u.cfx, _1a8.
cfs, _19u.cfs, _1aj.cfs, _19r.cfx, _1ac.cfs, _1az.cfs, _1ac.cfx,
_19y.cfs, _1bc.cfx, _19s.cfs, _1ar.cfs, _1al.cfx, _1bg.cfs, _18v.cfs,
_1ar.cfx, _1bc.cfs, _1a0.cfx, _1b2.cfs, _1af.cfs, _1bi.cfs, _1af.cfx,
_19f.cfs, _1a0.cfs, _1bh.cfs, _19f.cfx, _19c.cfs, _e0.cfs, _1ax.cfx,
_1b5.cfs, _191.cfs, _18w.cfs, _19t.cfs, _8e.cfs, _19v.cfs, _192.cfs,
_1b9.cfs, _1ay.cfs, _p8.cfs, _19k.cfs, _1b9.cfx, _1ax.cfs, _1am.cfs,
_1ba.cfs, _mf.cfs, _1al.cfs, _19w.cfs]

commit{dir=/Users/vivek/demo/afterchat/solr/multicore/20090414_1/data/index,segFN=segments_fr,version=1239747075392,generation=567,filenames=[_jm.cfs,
_1bo.cfs, _xn.cfs, segments_fr, _8e.cfs, _gt.cfs, _18v.cfs, _uu.cfs,
_1
0g.cfs, _2s.cfs, _5l.cfs, _162.cfs, _p8.cfs, _139.cfs, _s1.cfs,
_mf.cfs, _b7.cfs, _e0.cfs]
Apr 15, 2009 12:39:58 AM org.apache.solr.core.SolrDeletionPolicy updateCommits
INFO: last commit = 1239747075392

Here is my default index settings,

 <indexDefaults>
   <!-- Values here affect all index writers and act as a default
        unless overridden. -->
   <useCompoundFile>true</useCompoundFile>
   <mergeFactor>100</mergeFactor>
   <!-- <maxBufferedDocs>1</maxBufferedDocs> -->
   <ramBufferSizeMB>64</ramBufferSizeMB>
   <maxMergeDocs>2147483647</maxMergeDocs>
   <maxFieldLength>1</maxFieldLength>
   <writeLockTimeout>1000</writeLockTimeout>
   <commitLockTimeout>1</commitLockTimeout>
   <lockType>single</lockType>
 </indexDefaults>

What am I doing wrong here? What's causing these delays?

Thanks,
-vivek
  
Probably merging. With a mergeFactor of 100, you will merge less often,
but then you will hit points where you have to do a bunch more merging.
Committing waits for these merges to finish. That would be my first
guess. A mergeFactor of, say, 10 would merge more often (only 10 segments
per log level before they get merged up to the next level), but not run
into points where it has as many segments to merge.


--
- Mark

http://www.lucidimagination.com





Re: looking at the results of a distributed search using shards.

2009-04-15 Thread Fergus McMenemie
On Apr 15, 2009, at 11:18 AM, Fergus McMenemie wrote:

 Hi,

 Having all kinds of fun with distributed search using shards:-)

 I have 30K documents indexed using DIH into an index. Another
 index contain documents indexed using solr-cell. I am using shards
 to search across both indexes.

 I am trying to format the results returned from solr such the
 source document can be linked to, and to do so I think I need to
 know which shard a particular result came from. Is this a FAQ?

+1, assuming you mean to add it as a FAQ and aren't asking if it  
already is one.

I was asking: how do I find out which shard a result came from?
But I felt it must be a FAQ! Again... I am wondering if there is
established best practice covering this sort of thing, before I
go and roll my own :-)


Fergus.

--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search

-- 

===
Fergus McMenemie   Email:fer...@twig.me.uk
Techmore Ltd   Phone:(UK) 07721 376021

Unix/Mac/Intranets Analyst Programmer
===


Re: [solr-user] Upgrade from 1.2 to 1.3 gives 3x slowdown

2009-04-15 Thread Ryan McKinley


The work being done is addressing the deletes, AIUI, but of course
there are other things happening during shutdown, too.

There are no deletes to do. It was a clean index to begin with
and there were no duplicates.



I have not followed this thread, so forgive me if this has already  
been suggested


If you know that there are not any duplicates, have you tried indexing  
with allowDups=true?


It will not change the fsync cost, but it may reduce some other  
checking times.


ryan


Re: looking at the results of a distributed search using shards.

2009-04-15 Thread Otis Gospodnetic

Ain't a FAQ, but could be.  Look at JIRA and search for Brian, who made the 
same request a few months ago.
I've often wondered if we could add info about the source shard, as well as 
whether a hit came from cache or not.

 Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message 
 From: Fergus McMenemie fer...@twig.me.uk
 To: solr-user@lucene.apache.org
 Sent: Wednesday, April 15, 2009 11:18:21 AM
 Subject: looking at the results of a distributed search using shards.
 
 Hi,
 
 Having all kinds of fun with distributed search using shards:-)
 
 I have 30K documents indexed using DIH into an index. Another
 index contain documents indexed using solr-cell. I am using shards
 to search across both indexes.
 
 I am trying to format the results returned from solr such the
 source document can be linked to, and to do so I think I need to
 know which shard a particular result came from. Is this a FAQ?
 
 Regards
 -- 
 
 ===
 Fergus McMenemie   Email:fer...@twig.me.uk
 Techmore Ltd   Phone:(UK) 07721 376021
 
 Unix/Mac/Intranets Analyst Programmer
 ===



Re: DataImporter : Java heap space

2009-04-15 Thread Bryan Talbot
I think there is a bug in the 1.4 daily builds of data import handler  
which is causing the batchSize parameter to be ignored.  This was  
probably introduced with more recent patches to resolve variables.


The affected code is in JdbcDataSource.java

String bsz = initProps.getProperty("batchSize");
if (bsz != null) {
  bsz = (String) context.getVariableResolver().resolve(bsz);
  try {
    batchSize = Integer.parseInt(bsz);
    if (batchSize == -1)
      batchSize = Integer.MIN_VALUE;
  } catch (NumberFormatException e) {
    LOG.warn("Invalid batch size: " + bsz);
  }
}


The call to context.getVariableResolver().resolve(bsz) is returning  
null, leading to a NumberFormatException and the batchSize never being  
set to Integer.MIN_VALUE.  MySql won't use streaming result sets in  
this case which can lead to the OOM we're seeing.



If your log file contains this entry like mine does, you're being  
affected by this bug too.


Apr 15, 2009 1:21:58 PM  
org.apache.solr.handler.dataimport.JdbcDataSource init

WARNING: Invalid batch size: null



-Bryan




On Apr 13, 2009, at Apr 13, 11:48 PM, Noble Paul നോബിള്‍  
नोब्ळ् wrote:



DIH streams 1 row at a time.

DIH is just a component in Solr. Solr indexing also takes a lot of  
memory


On Tue, Apr 14, 2009 at 12:02 PM, Mani Kumar manikumarchau...@gmail.com 
 wrote:

Yes, it's throwing the same OOM error and from the same place...
Yes, I will try increasing the size ... just curious: how does this
dataimport work?

Does it load the whole table into memory?

Is there any estimate of how much memory it needs to create an index for 1GB
of data?

thx
mani
thx
mani

On Tue, Apr 14, 2009 at 11:48 AM, Shalin Shekhar Mangar 
shalinman...@gmail.com wrote:


On Tue, Apr 14, 2009 at 11:36 AM, Mani Kumar manikumarchau...@gmail.com

wrote:



Hi Shalin:
yes i tried with batchSize=-1 parameter as well

here the config i tried with

<dataConfig>

    <dataSource type="JdbcDataSource" batchSize="-1" name="sp"
        driver="com.mysql.jdbc.Driver"
        url="jdbc:mysql://localhost/mydb_development"
        user="root" password="**" />


I hope i have used batchSize parameter @ right place.



Yes that is correct. Did it still throw OOM from the same place?

I'd suggest you increase the heap and see what works for you. Also  
try

-server on the jvm.

--
Regards,
Shalin Shekhar Mangar.







--
--Noble Paul




Re: Question on StreamingUpdateSolrServer

2009-04-15 Thread Otis Gospodnetic

Quick comment - why so shy with number of open file descriptors?  On some 
nothing-special machines from several years ago I had this limit set to 30K+ - 
here, for example: http://www.simpy.com/user/otis :)


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message 
 From: vivek sar vivex...@gmail.com
 To: solr-user@lucene.apache.org
 Sent: Tuesday, April 14, 2009 3:12:41 AM
 Subject: Re: Question on StreamingUpdateSolrServer
 
 The machine's ulimit is set to 9000 and the OS has upper limit of
 12000 on files. What would explain this? Has anyone tried Solr with 25
 cores on the same Solr instance?
 
 Thanks,
 -vivek
 
 2009/4/13 Noble Paul നോബിള്‍  नोब्ळ् :
  On Tue, Apr 14, 2009 at 7:14 AM, vivek sar wrote:
  Some more update. As I mentioned earlier we are using multi-core Solr
  (up to 65 cores in one Solr instance with each core 10G). This was
  opening around 3000 file descriptors (lsof). I removed some cores and
  after some trial and error I found at 25 cores system seems to work
  fine (around 1400 file descriptors). Tomcat is responsive even when
  the indexing is happening at Solr (for 25 cores). But, as soon as it
  goes to 26 cores the Tomcat becomes unresponsive again. The puzzling
  thing is if I stop indexing I can search on even 65 cores, but while
  indexing is happening it seems to support only up to 25 cores.
 
  1) Is there a limit on number of cores a Solr instance can handle?
  2) Does Solr do anything to the existing cores while indexing? I'm
  writing to only one core at a time.
  There is no hard limit (it is Integer.MAX_VALUE) . But inreality your
  mileage depends on your hardware and no:of file handles the OS can
  open
 
  We are struggling to find why Tomcat stops responding on high number
  of cores while indexing is in-progress. Any help is very much
  appreciated.
 
  Thanks,
  -vivek
 
  On Mon, Apr 13, 2009 at 10:52 AM, vivek sar wrote:
  Here is some more information about my setup,
 
  Solr - v1.4 (nightly build 03/29/09)
  Servlet Container - Tomcat 6.0.18
  JVM - 1.6.0 (64 bit)
  OS -  Mac OS X Server 10.5.6
 
  Hardware Overview:
 
  Processor Name: Quad-Core Intel Xeon
  Processor Speed: 3 GHz
  Number Of Processors: 2
  Total Number Of Cores: 8
  L2 Cache (per processor): 12 MB
  Memory: 20 GB
  Bus Speed: 1.6 GHz
 
  JVM Parameters (for Solr):
 
  export CATALINA_OPTS=-server -Xms6044m -Xmx6044m -DSOLR_APP
  -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:gc.log
  -Dsun.rmi.dgc.client.gcInterval=360
  -Dsun.rmi.dgc.server.gcInterval=360
 
  Other:
 
  lsof|grep solr|wc -l
 2493
 
  ulimit -an
   open files  (-n) 9000
 
  Tomcat
 
connectionTimeout=2
maxThreads=100 /
 
  Total Solr cores on same instance - 65
 
  useCompoundFile - true
 
  The tests I ran,
 
  While Indexer is running
  1)  Go to http://juum19.co.com:8080/solr;- returns blank page (no
  error in the catalina.out)
 
  2) Try telnet juum19.co.com 8080  - returns with Connection closed
  by foreign host
 
  Stop the Indexer Program (Tomcat is still running with Solr)
 
  3)  Go to http://juum19.co.com:8080/solr;  - works ok, shows the list
  of all the Solr cores
 
  4) Try telnet - able to Telnet fine
 
  5)  Now comment out all the caches in solrconfig.xml. Try same tests,
  but the Tomcat still doesn't response.
 
  Is there a way to stop the auto-warmer. I commented out the caches in
  the solrconfig.xml but still see the following log,
 
  INFO: autowarming result for searc...@3aba3830 main
  
 fieldValueCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
 
  INFO: Closing searc...@175dc1e2
  main   
  
 fieldValueCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
  
 filterCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
  
 queryResultCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
  
 documentCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
 
 
  6) Change the Indexer frequency so it runs every 2 min (instead of all
  the time). I noticed once the commit is done, I'm able to run my
  searches. During commit and auto-warming period I just get blank page.
 
   7) Changed from Solrj to XML update -  I still get the blank page
  whenever update/commit is happening.
 
  Apr 13, 2009 6:46:18 

Re: Question on StreamingUpdateSolrServer

2009-04-15 Thread Otis Gospodnetic

One more thing.  I don't think this was mentioned, but you can:
- optimize your indices
- use compound index format

That will lower the number of open file handles.
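
A minimal SolrJ sketch of triggering an optimize on one core (the core URL is
an assumption); <useCompoundFile>true</useCompoundFile> in solrconfig.xml
covers the second point:

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class OptimizeOneCore {
    public static void main(String[] args) throws Exception {
        // Merging segments down via optimize reduces the number of index
        // files the core keeps open.
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8080/solr/core0");
        server.optimize();
    }
}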

 Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message 
 From: vivek sar vivex...@gmail.com
 To: solr-user@lucene.apache.org
 Sent: Friday, April 10, 2009 5:59:37 PM
 Subject: Re: Question on StreamingUpdateSolrServer
 
 I also noticed that the Solr app has over 6000 file handles open -
 
 lsof | grep solr | wc -l   - shows 6455
 
 I've 10 cores (using multi-core) managed by the same Solr instance. As
 soon as start up the Tomcat the open file count goes up to 6400.  Few
 questions,
 
 1) Why is Solr holding on to all the segments from all the cores - is
 it because of auto-warmer?
 2) How can I reduce the open file count?
 3) Is there a way to stop the auto-warmer?
 4) Could this be related to Tomcat returning blank page for every request?
 
 Any ideas?
 
 Thanks,
 -vivek
 
 On Fri, Apr 10, 2009 at 1:48 PM, vivek sar wrote:
  Hi,
 
   I was using CommonsHttpSolrServer for indexing, but having two
  threads writing (10K batches) at the same time was throwing,
 
   ProtocolException: Unbuffered entity enclosing request can not be 
  repeated. 
 
 
  I switched to StreamingUpdateSolrServer (using addBeans) and I don't
  see the problem anymore. The speed is very fast - getting around
  25k/sec (single thread), but I'm facing another problem. When the
  indexer using StreamingUpdateSolrServer is running I'm not able to
  send any url request from browser to Solr web app. I just get blank
  page. I can't even get to the admin interface. I'm also not able to
  shutdown the Tomcat running the Solr webapp when the Indexer is
  running. I've to first stop the Indexer app and then stop the Tomcat.
  I don't have this problem when using CommonsHttpSolrServer.
 
  Here is how I'm creating it,
 
  server = new StreamingUpdateSolrServer(url, 1000,3);
 
  I simply call server.addBeans(...) on it. Is there anything else I
  need to do to make use of StreamingUpdateSolrServer? Why does Tomcat
  become unresponsive  when Indexer using StreamingUpdateSolrServer is
  running (though, indexing happens fine)?
 
  Thanks,
  -vivek
 



StreamingUpdateSolrServer and DIH

2009-04-15 Thread Marc Sturlese

Hey there,
I have been reading about StreamingUpdateSolrServer but can't catch exactly
how it works:

More efficient index construction over http with solrj. If your doing it,
this is a fantastic performance improvement.

Adding a StreamingUpdateSolrServer that writes update commands to an open
HTTP connection. If you are using solrj for bulk update requests you should
consider switching to this implementation. However, note that the error
handling is not immediate as it is with the standard SolrServer.

Is there any way to use it in DataImportHandler?
Thanks in advance
-- 
View this message in context: 
http://www.nabble.com/StreamingUpdateSolrServer-and-DIH-tp23068057p23068057.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Question on StreamingUpdateSolrServer

2009-04-15 Thread vivek sar
Thanks Otis.

I did increase the number of file descriptors to 22K, but I still get
this problem. I've noticed the following so far:

1) As soon as I get to around 1140 index segments (this is the total over
multiple cores) I start seeing this problem.
2) When the problem starts, occasionally the index request
(solrserver.commit) also fails with the following error:
  java.net.SocketException: Connection reset
3) Whenever the commit fails, I'm able to access Solr from the browser
(http://ets11.co.com/solr). If the commit is successful and in progress I
get a blank page in Firefox. Even telnet to 8080 fails with
"Connection closed by foreign host".

It does seem like there is some resource issue, as it happens only once
we reach a breaking point (too many index segment files) - lsof at
this point usually shows around 1400, but my ulimit is much higher than
that.

I already use the compound format for index files. I can also run optimize
occasionally (though not preferred as it blocks the whole index cycle
for a long time). I do want to find out what resource limitation is
causing this; it has something to do with the Indexer
committing records when there are a large number of segment files.

Any other ideas?

Thanks,
-vivek

On Wed, Apr 15, 2009 at 3:10 PM, Otis Gospodnetic
otis_gospodne...@yahoo.com wrote:

 One more thing.  I don't think this was mentioned, but you can:
 - optimize your indices
 - use compound index format

 That will lower the number of open file handles.

  Otis
 --
 Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



 - Original Message 
 From: vivek sar vivex...@gmail.com
 To: solr-user@lucene.apache.org
 Sent: Friday, April 10, 2009 5:59:37 PM
 Subject: Re: Question on StreamingUpdateSolrServer

 I also noticed that the Solr app has over 6000 file handles open -

     lsof | grep solr | wc -l   - shows 6455

 I've 10 cores (using multi-core) managed by the same Solr instance. As
 soon as start up the Tomcat the open file count goes up to 6400.  Few
 questions,

 1) Why is Solr holding on to all the segments from all the cores - is
 it because of auto-warmer?
 2) How can I reduce the open file count?
 3) Is there a way to stop the auto-warmer?
 4) Could this be related to Tomcat returning blank page for every request?

 Any ideas?

 Thanks,
 -vivek

 On Fri, Apr 10, 2009 at 1:48 PM, vivek sar wrote:
  Hi,
 
   I was using CommonsHttpSolrServer for indexing, but having two
  threads writing (10K batches) at the same time was throwing,
 
   ProtocolException: Unbuffered entity enclosing request can not be 
  repeated.
 
 
  I switched to StreamingUpdateSolrServer (using addBeans) and I don't
  see the problem anymore. The speed is very fast - getting around
  25k/sec (single thread), but I'm facing another problem. When the
  indexer using StreamingUpdateSolrServer is running I'm not able to
  send any url request from browser to Solr web app. I just get blank
  page. I can't even get to the admin interface. I'm also not able to
  shutdown the Tomcat running the Solr webapp when the Indexer is
  running. I've to first stop the Indexer app and then stop the Tomcat.
  I don't have this problem when using CommonsHttpSolrServer.
 
  Here is how I'm creating it,
 
  server = new StreamingUpdateSolrServer(url, 1000,3);
 
  I simply call server.addBeans(...) on it. Is there anything else I
  need to do to make use of StreamingUpdateSolrServer? Why does Tomcat
  become unresponsive  when Indexer using StreamingUpdateSolrServer is
  running (though, indexing happens fine)?
 
  Thanks,
  -vivek
 




Re: StreamingUpdateSolrServer and DIH

2009-04-15 Thread Noble Paul നോബിള്‍ नोब्ळ्
On Thu, Apr 16, 2009 at 3:45 AM, Marc Sturlese marc.sturl...@gmail.com wrote:

 Hey there,
 I have been reading about StreamingUpdateSolrServer but can't catch exactly
 how it works:

 More efficient index construction over http with solrj. If your doing it,
 this is a fantastic performance improvement.
StreamingUpdateSolrServer tries to optimize use of the HTTP connection
by posting multiple add commands in the same request. It also
allows you to do the same task in multiple threads.
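
For instance, a minimal sketch of that usage (the URL and field names below
are assumptions):

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class StreamingAddSketch {
    public static void main(String[] args) throws Exception {
        // Queue of 1000 update commands, drained by 3 background threads
        // over an HTTP connection that is kept open.
        SolrServer server =
            new StreamingUpdateSolrServer("http://localhost:8080/solr", 1000, 3);

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc1");           // hypothetical fields
        doc.addField("title", "doc1 title");
        server.add(doc);                      // queued; errors surface later, not here

        server.commit();
    }
}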

 Adding a StreamingUpdateSolrServer that writes update commands to an open
 HTTP connection. If you are using solrj for bulk update requests you should
 consider switching to this implementation. However, note that the error
 handling is not immediate as it is with the standard SolrServer.
Yeah, true. CommonsHttpSolrServer has an add(Iterator<SolrInputDocument>)
method which is efficient (but does the update in the calling thread),
and you get to know about errors immediately.

 Is there any way to use it in DataImportHandler?
DIH and StreamingUpdateSolrServer?
No, I cannot imagine a way.
 Thanks in advance
 --
 View this message in context: 
 http://www.nabble.com/StreamingUpdateSolrServer-and-DIH-tp23068057p23068057.html
 Sent from the Solr - User mailing list archive at Nabble.com.





-- 
--Noble Paul


Re: DataImporter : Java heap space

2009-04-15 Thread Noble Paul നോബിള്‍ नोब्ळ्
Hi Bryan,
Thanks a lot. It is invoking the wrong method

it should have been
bsz = context.getVariableResolver().replaceTokens(bsz);

it was a silly mistake

--Noble
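
For reference, a sketch of the JdbcDataSource block with that one-line change
applied (not the actual committed patch):

String bsz = initProps.getProperty("batchSize");
if (bsz != null) {
  // replaceTokens instead of resolve, so a literal "-1" passes through unchanged
  bsz = context.getVariableResolver().replaceTokens(bsz);
  try {
    batchSize = Integer.parseInt(bsz);
    if (batchSize == -1)
      batchSize = Integer.MIN_VALUE;
  } catch (NumberFormatException e) {
    LOG.warn("Invalid batch size: " + bsz);
  }
}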

On Thu, Apr 16, 2009 at 2:13 AM, Bryan Talbot btal...@aeriagames.com wrote:
 I think there is a bug in the 1.4 daily builds of data import handler which
 is causing the batchSize parameter to be ignored.  This was probably
 introduced with more recent patches to resolve variables.

 The affected code is in JdbcDataSource.java

    String bsz = initProps.getProperty("batchSize");
    if (bsz != null) {
      bsz = (String) context.getVariableResolver().resolve(bsz);
      try {
        batchSize = Integer.parseInt(bsz);
        if (batchSize == -1)
          batchSize = Integer.MIN_VALUE;
      } catch (NumberFormatException e) {
        LOG.warn("Invalid batch size: " + bsz);
      }
    }


 The call to context.getVariableResolver().resolve(bsz) is returning null,
 leading to a NumberFormatException and the batchSize never being set to
 Integer.MIN_VALUE.  MySql won't use streaming result sets in this case which
 can lead to the OOM we're seeing.


 If your log file contains this entry like mine does, you're being affected
 by this bug too.

 Apr 15, 2009 1:21:58 PM org.apache.solr.handler.dataimport.JdbcDataSource
 init
 WARNING: Invalid batch size: null



 -Bryan




 On Apr 13, 2009, at Apr 13, 11:48 PM, Noble Paul നോബിള്‍ नोब्ळ् wrote:

 DIH streams 1 row at a time.

 DIH is just a component in Solr. Solr indexing also takes a lot of memory

 On Tue, Apr 14, 2009 at 12:02 PM, Mani Kumar manikumarchau...@gmail.com
 wrote:

 Yes its throwing the same OOM error and from same place...
 yes i will try increasing the size ... just curious : how this dataimport
 works?

 Does it loads the whole table into memory?

 Is there any estimate about how much memory it needs to create index for
 1GB
 of data.

 thx
 mani

 On Tue, Apr 14, 2009 at 11:48 AM, Shalin Shekhar Mangar 
 shalinman...@gmail.com wrote:

 On Tue, Apr 14, 2009 at 11:36 AM, Mani Kumar manikumarchau...@gmail.com

 wrote:

 Hi Shalin:
 yes i tried with batchSize=-1 parameter as well

 here the config i tried with

 dataConfig

   dataSource type=JdbcDataSource batchSize=-1 name=sp
 driver=com.mysql.jdbc.Driver
 url=jdbc:mysql://localhost/mydb_development
 user=root password=** /


 I hope i have used batchSize parameter @ right place.


 Yes that is correct. Did it still throw OOM from the same place?

 I'd suggest you increase the heap and see what works for you. Also try
 -server on the jvm.

 --
 Regards,
 Shalin Shekhar Mangar.





 --
 --Noble Paul





-- 
--Noble Paul


want to Unsubscribe from Solr Mailing List

2009-04-15 Thread Neha Bhardwaj
Hi,

I wish to unsubscribe from list .

 

My email address is neha_bhard...@peristent.co.in

 

 

Thanks for all the help and support.

 

Thanks and Regards,

Neha Bhardwaj| Software Engineer| Persistent Systems Limited

neha_bhard...@persistent.co.in
| Cell: +91 9272383082 | Tel: +91 (20) 3023 5257

Innovation in software product design, development and delivery-
www.persistentsys.com

 

 


DISCLAIMER
==
This e-mail may contain privileged and confidential information which is the 
property of Persistent Systems Ltd. It is intended only for the use of the 
individual or entity to which it is addressed. If you are not the intended 
recipient, you are not authorized to read, retain, copy, print, distribute or 
use this message. If you have received this communication in error, please 
notify the sender and delete all copies of this message. Persistent Systems 
Ltd. does not accept any liability for virus infected mails.


Re: DataImporter : Java heap space

2009-04-15 Thread Mani Kumar
Aah, Bryan you got it ... Thanks!
Noble: so i can hope that it'll be fixed soon :) thank you for fixing it ...
please lemme know when its done..


Thanks!
Mani Kumar
2009/4/16 Noble Paul നോബിള്‍ नोब्ळ् noble.p...@gmail.com

 Hi Bryan,
 Thanks a lot. It is invoking the wrong method

 it should have been
 bsz = context.getVariableResolver().replaceTokens(bsz);

 it was a silly mistake

 --Noble

 On Thu, Apr 16, 2009 at 2:13 AM, Bryan Talbot btal...@aeriagames.com
 wrote:
  I think there is a bug in the 1.4 daily builds of data import handler
 which
  is causing the batchSize parameter to be ignored.  This was probably
  introduced with more recent patches to resolve variables.
 
  The affected code is in JdbcDataSource.java
 
 String bsz = initProps.getProperty(batchSize);
 if (bsz != null) {
   bsz = (String) context.getVariableResolver().resolve(bsz);
   try {
 batchSize = Integer.parseInt(bsz);
 if (batchSize == -1)
   batchSize = Integer.MIN_VALUE;
   } catch (NumberFormatException e) {
 LOG.warn(Invalid batch size:  + bsz);
   }
 }
 
 
  The call to context.getVariableResolver().resolve(bsz) is returning null,
  leading to a NumberFormatException and the batchSize never being set to
  Integer.MIN_VALUE.  MySql won't use streaming result sets in this case
 which
  can lead to the OOM we're seeing.
 
 
  If your log file contains this entry like mine does, you're being
 affected
  by this bug too.
 
  Apr 15, 2009 1:21:58 PM org.apache.solr.handler.dataimport.JdbcDataSource
  init
  WARNING: Invalid batch size: null
 
 
 
  -Bryan
 
 
 
 
  On Apr 13, 2009, at Apr 13, 11:48 PM, Noble Paul നോബിള്‍ नोब्ळ् wrote:
 
  DIH streams 1 row at a time.
 
  DIH is just a component in Solr. Solr indexing also takes a lot of
 memory
 
  On Tue, Apr 14, 2009 at 12:02 PM, Mani Kumar 
 manikumarchau...@gmail.com
  wrote:
 
  Yes its throwing the same OOM error and from same place...
  yes i will try increasing the size ... just curious : how this
 dataimport
  works?
 
  Does it loads the whole table into memory?
 
  Is there any estimate about how much memory it needs to create index
 for
  1GB
  of data.
 
  thx
  mani
 
  On Tue, Apr 14, 2009 at 11:48 AM, Shalin Shekhar Mangar 
  shalinman...@gmail.com wrote:
 
  On Tue, Apr 14, 2009 at 11:36 AM, Mani Kumar 
 manikumarchau...@gmail.com
 
  wrote:
 
  Hi Shalin:
  yes i tried with batchSize=-1 parameter as well
 
  here the config i tried with
 
  dataConfig
 
dataSource type=JdbcDataSource batchSize=-1 name=sp
  driver=com.mysql.jdbc.Driver
  url=jdbc:mysql://localhost/mydb_development
  user=root password=** /
 
 
  I hope i have used batchSize parameter @ right place.
 
 
  Yes that is correct. Did it still throw OOM from the same place?
 
  I'd suggest you increase the heap and see what works for you. Also try
  -server on the jvm.
 
  --
  Regards,
  Shalin Shekhar Mangar.
 
 
 
 
 
  --
  --Noble Paul
 
 



 --
 --Noble Paul



Re: want to Unsubscribe from Solr Mailing List

2009-04-15 Thread Mani Kumar
Dear Lady,
this information is available on the
http://lucene.apache.org/solr/mailing_lists.html page.

Thank you for unsubscribing!

-Mani

On Thu, Apr 16, 2009 at 10:16 AM, Neha Bhardwaj 
neha_bhard...@persistent.co.in wrote:

 Hi,

 I wish to unsubscribe from list .



 My email address is neha_bhard...@peristent.co.in





 Thanks for all the help and support.



 Thanks and Regards,

 Neha Bhardwaj| Software Engineer| Persistent Systems Limited

 Neha 
 mailto:neha%20bhard...@persistent.co.inneha%2520bhard...@persistent.co.in%20
  bhard...@persistent.co.in
 | Cell: +91 9272383082| Tel: +91 (20) 3023 5257

 Innovation in software product design, development and delivery-
 www.persistentsys.com






 DISCLAIMER
 ==
 This e-mail may contain privileged and confidential information which is
 the property of Persistent Systems Ltd. It is intended only for the use of
 the individual or entity to which it is addressed. If you are not the
 intended recipient, you are not authorized to read, retain, copy, print,
 distribute or use this message. If you have received this communication in
 error, please notify the sender and delete all copies of this message.
 Persistent Systems Ltd. does not accept any liability for virus infected
 mails.