Re: How are multivalued fields used?

2008-10-13 Thread Brian Carmalt
Hello Gene, 

On Monday, 13.10.2008, at 23:32 +1300, ristretto.rb wrote:
 How does one use this field type?
 Forums, the wiki, and Lucene in Action are all coming up empty.
 If there's a doc somewhere please point me there.
 
 I use pysolr to index. But, that's not a requirement.
 
 I'm not sure how one adds multiple values to a document. 

You add multiple fields with the same name to a document, say keywords.

<doc>
  <field name="keywords">solr</field>
  <field name="keywords">lucene</field>
  <field name="keywords">search</field>
</doc>

If you have configured the field keywords in your schema to be multiValued,
then Solr will collect the values of each of those fields in your document
into the field keywords. When you ask Solr to return the value of the field
keywords, you get a comma-separated list of the keywords:
solr,lucene,search

  And once added,
 if you want to remove one
  how do you specify?
You cannot remove a field from a document; documents are read-only. You
have to reindex the document with the same unique id and the new
information.

   Based on
 http://wiki.apache.org/solr/FieldOptionsByUseCase, it says to use
 it to add multiple values, maintaining order. Is the order for
 indexing/searching or for storing/returning?
Not sure, but I would assume that the values are stored in the field in
the order that the fields are specified in the document. 

Brian



Re: Searching with Wildcards

2008-09-25 Thread Brian Carmalt
Hello all, 

Sorry I have taken so long to get back to Erik's reply. I used the
technique of inserting a ? before the * to get a prototype working.
However, if 1.3 does not support this anymore, then I really need to
look into alternatives.

What would be the scope of the work to implement Erik's suggestion? I
would have to ask my boss, but I think we would then contribute the code
back to Solr.
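
For a rough idea of the scope: Erik's suggestion (quoted further down) boils
down to a small QParserPlugin whose parse() hands the query string straight to
Lucene's QueryParser. A minimal sketch against the 1.3 plugin API -- the class
name, the hard-coded default field "text" and the registration snippet at the
end are assumptions for illustration, not working code from this thread:

import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.search.QParser;
import org.apache.solr.search.QParserPlugin;

public class LuceneQueryParserPlugin extends QParserPlugin {
  public void init(NamedList args) {}

  public QParser createParser(final String qstr, final SolrParams localParams,
                              final SolrParams params, final SolrQueryRequest req) {
    return new QParser(qstr, localParams, params, req) {
      public Query parse() throws ParseException {
        // Hand the raw query string to Lucene's QueryParser so wildcard
        // terms stay ordinary Prefix/WildcardQuery objects that the
        // highlighter can still see.
        QueryParser parser = new QueryParser("text", // assumed default field
            req.getSchema().getQueryAnalyzer());
        return parser.parse(qstr);
      }
    };
  }
}

// Registered in solrconfig.xml with something along the lines of
//   <queryParser name="lucene-direct" class="LuceneQueryParserPlugin"/>
// and selected per request with defType=lucene-direct.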

This should probably be continued on solr-dev, right? 

Brian

On Wednesday, 17.09.2008, at 17:19 -0400, Mark Miller wrote:
 Alas no, the queryparser now uses an unhighlightable constantscore 
 query. I'd personally like to make it work at the Lucene level, but not 
 sure how that's going to proceed. The tradeoff is that you won't have max 
 boolean clause issues and wildcard searches should be faster. It is a 
 bummer though.
 
 - Mark
 
 dojolava wrote:
  Hi,
 
  I have another question on the wildcard problem:
 
  In the previous Solr releases there was a workaround to highlight wildcard
  queries using the StandardRequestHandler by adding a ? in between: e.g.
  hou?* would highlight house.
  But this is not working anymore. Is there maybe another workaround? ;-)
 
  Regards,
  Mathis
 
 
  On Tue, Sep 2, 2008 at 2:15 PM, Erik Hatcher [EMAIL PROTECTED]wrote:
 

  Probably your best bet is to create a new QParser(Plugin) that uses
  Lucene's QueryParser directly.  We probably should have that available
  anyway in the core, just so folks coming from Lucene Java have the same
  QueryParser.
 
 Erik
 
 
  On Sep 2, 2008, at 7:11 AM, Brian Carmalt wrote:
 
   Hello all,
  
   I need to get wildcard searches with highlighting up and running. I'd
   like to get it to work with a DisMax handler, but I'll settle for
   starting with the StandardRequestHandler. I've been reading some of
   the past mails on wildcard searches and SOLR-195. It seems I need to
   change the default behavior for wildcards from a PrefixFilter to a
   PrefixQuery.
   I know that I will have to deal with TooManyClauses exceptions, but I
   want to play around with it.
 
  I have read that this can only be done by modifying the code, but I
  can't seem to find the correct section. Can someone point me in the
  right direction? Thanks.
 
  - Brian
 

  
 

 



Re: Not enough space

2008-09-25 Thread Brian Carmalt
Search Google for "swap file linux" (or use your distro name instead of linux).

There is tons of info out there.

 
On Thursday, 25.09.2008, at 02:07 -0700, sunnyfr wrote:
 Hi,
 I obviously have the same error; I just don't know how you add swap space.
 Thanks a lot,
 
 
 Yonik Seeley wrote:
  
  On 7/5/07, Xuesong Luo [EMAIL PROTECTED] wrote:
  Thanks, Chris and Yonik. You are right. I remember the heap size was
  over 500m when I got the Not enough space error message.
  Is there a best practice to avoid this kind of problem?
  
  add more swap space.
  
  -Yonik
  
  
 



Re: How to copy a solr index to another index with a different schema collapsing stored data?

2008-09-17 Thread Brian Carmalt
It wouldn't be that bad to merge the index externally and then reindex
the results, if it is as simple as your example. Search for id:[1 TO *]
with an fq for the category, and increment the slice of the results you
need to process until you have covered all of the docs in the category.
Request the content field, extract it from the XML responses, and
save it somewhere. When you have all the info, reindex it.
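
For what it's worth, a rough sketch of that export loop with SolrJ; the server
URL and the field names id, category and content are assumptions taken from the
example, and pysolr or plain HTTP would work just as well:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class CategoryContentExporter {
  public static void main(String[] args) throws Exception {
    CommonsHttpSolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
    StringBuilder merged = new StringBuilder();
    int rows = 500;
    for (int start = 0; ; start += rows) {
      SolrQuery q = new SolrQuery("id:[1 TO *]");
      q.addFilterQuery("category:A");  // one pass per category
      q.setFields("id", "content");    // only the stored fields we need
      q.setStart(start);
      q.setRows(rows);
      QueryResponse rsp = server.query(q);
      if (rsp.getResults().isEmpty()) {
        break;                         // walked through every slice
      }
      for (SolrDocument doc : rsp.getResults()) {
        merged.append(doc.getFieldValue("content")).append('\n');
      }
    }
    // merged now holds the concatenated content for category A;
    // index it as the single collapsed document in the new schema.
  }
}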

On Wednesday, 17.09.2008, at 10:00 -0400, Erick Erickson wrote:
 You *might* be able to reconstruct enough of the original documents
 from your indexes to create another without recrawling. I know Luke
 can reconstruct documents from an index, but for unstored data it's
 slow and may be lossy.
 
 But it may suit your needs given how long it takes to make your index
 in the first place.
 
 Best
 Erick
 
 On Tue, Sep 16, 2008 at 9:14 PM, Gene Campbell [EMAIL PROTECTED] wrote:
 
  I was pretty sure you'd say that.  But it means a lot that you take the
  time to confirm it.  Thanks Otis.
 
  I don't want to give details, but we crawl for our data, and we don't
  save it in a DB or on disk.  It goes from download to index.  Was a
  good idea at the time, when we thought our designs were done evolving.
   :)
 
  cheers
  gene
 
 
  On Wed, Sep 17, 2008 at 12:51 PM, Otis Gospodnetic
  [EMAIL PROTECTED] wrote:
   You can't copy+merge+flatten indices like that.  Reindexing would be the
  easiest.  Indexing taking weeks sounds suspicious.  How much data are you
  reindexing and how big are your indices?
  
   Otis
   --
   Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
  
  
  
   - Original Message 
   From: ristretto.rb [EMAIL PROTECTED]
   To: solr-user@lucene.apache.org
   Sent: Tuesday, September 16, 2008 8:14:16 PM
   Subject: How to copy a solr index to another index with a different
  schema collapsing stored data?
  
   Is it possible to copy stored index data from one index to another, but
   concatenate it as you go?
  
   Suppose 2 categories A and B both with 20 docs, for a total of 40 docs
   in the index.  The index has a stored field for the content from the
   docs.
  
   I want a new index with only two docs in it, one for A and one for B.
   And it would have a stored field that is the sum of all the stored
   data for the 20 docs of A and of B respectively.
  
   So then a query on this index will give me a relevant list of
   categories?
  
   Perhaps there's a solr query to get that data out, and then I can
   handle concatenating it, and then indexing it in the new index.
  
   I'm hoping I don't have to reindex all this data from scratch?  It has
   taken weeks!
  
   thanks
   gene
  
  
 



Re: too many open files

2008-07-15 Thread Brian Carmalt
On Monday, 14.07.2008, at 09:50 -0400, Yonik Seeley wrote:
 Solr uses reference counting on IndexReaders to close them ASAP (since
 relying on gc can lead to running out of file descriptors).
 

How do you force them to close ASAP? I use File and FileOutputStream
objects; I close the output streams and then call delete on the files. I
still have problems with too many open files. After a while I get
exceptions saying that I cannot open any new files. After this the threads
stop working, and a day later the files are still open and marked for
deletion. I have to kill the server to get it running again, or call
System.gc() periodically.

How do I force the VM to release the files?

This happens under Red Hat with a 2.4 kernel and under Debian Etch with
a 2.6 kernel.
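
For reference, the pattern in question -- close the stream in a finally block
before deleting, so the descriptor is released immediately instead of whenever
the object gets finalized -- looks roughly like this in plain Java (not
Solr-specific; the file name and data are placeholders):

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

public class TempFileWriter {
  // Write data to a file and delete it, closing the stream in a finally
  // block so the file descriptor is released immediately rather than
  // whenever the garbage collector happens to finalize the stream.
  public static void writeAndDelete(File file, byte[] data) throws IOException {
    FileOutputStream out = new FileOutputStream(file);
    try {
      out.write(data);
      out.flush();
    } finally {
      // On Linux a deleted-but-open file lingers ("marked for deletion")
      // until its last descriptor is closed, so close before deleting.
      out.close();
    }
    if (!file.delete()) {
      file.deleteOnExit(); // fallback if something else still holds the file
    }
  }
}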

Thanks,

Brian
 -Yonik
 
 On Mon, Jul 14, 2008 at 9:15 AM, Brian Carmalt [EMAIL PROTECTED] wrote:
  Hello,
 
  I have a similar problem, not with Solr, but in Java. From what I have
  found, it is a usage and OS problem: it comes from using too many files and
  the time it takes the OS to reclaim the fds. I found the recommendation
  that System.gc() should be called periodically. It works for me. It may not
  be the most elegant, but it works.
 
  Brian.
 
  On Monday, 14.07.2008, at 11:14 +0200, Alexey Shakov wrote:
  Now we have set the limit to ~1 files,
  but this is not the solution - the number of open files increases
  permanently.
  Sooner or later, this limit will be exhausted.
 
 
   Fuad Efendi wrote:
   Have you tried [ulimit -n 65536]? I don't think it relates to files
   marked for deletion...
   ==
   http://www.linkedin.com/in/liferay
  
  
    Sooner or later, the system crashes with the message "Too many open files"
  
  
 
 
 
 
 



Re: too many open files

2008-07-14 Thread Brian Carmalt
Hello, 

I have a similar problem, not with Solr, but in Java. From what I have
found, it is a usage and OS problem: it comes from using too many files and
the time it takes the OS to reclaim the fds. I found the recommendation
that System.gc() should be called periodically. It works for me. It may not
be the most elegant, but it works.
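
A minimal sketch of the periodic call with a ScheduledExecutorService; the
five-minute interval is an arbitrary assumption, and System.gc() is only a hint
to the VM, so closing streams explicitly remains the cleaner fix:

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class PeriodicGc {
  public static ScheduledExecutorService start() {
    ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    scheduler.scheduleAtFixedRate(new Runnable() {
      public void run() {
        // Hint to the VM to collect; finalized streams then give their
        // file descriptors back to the OS.
        System.gc();
      }
    }, 5, 5, TimeUnit.MINUTES);
    return scheduler;
  }
}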

Brian.  

On Monday, 14.07.2008, at 11:14 +0200, Alexey Shakov wrote:
 Now we have set the limit to ~1 files,
 but this is not the solution - the number of open files increases 
 permanently.
 Sooner or later, this limit will be exhausted.
 
 
 Fuad Efendi wrote:
  Have you tried [ulimit -n 65536]? I don't think it relates to files 
  marked for deletion...
  ==
  http://www.linkedin.com/in/liferay
 
 
   Sooner or later, the system crashes with the message "Too many open files"
 
 
 
 
 



Re: How to debug ?

2008-06-25 Thread Brian Carmalt
Hello Beto,

There is a plugin for Jetty: http://webtide.com/eclipse. Add this as
an update site and let Eclipse install the plugin for you. You can then
start the Jetty server from Eclipse and debug it.

Brian. 

On Wednesday, 25.06.2008, at 12:48 +1000, Norberto Meijome wrote:
 On Tue, 24 Jun 2008 19:17:58 -0700
 Ryan McKinley [EMAIL PROTECTED] wrote:
 
  also, check the LukeRequestHandler
  
  if there is a document you think *should* match, you can see what  
  tokens it has actually indexed...
 
 right, I will look into that a bit more. 
 
 I am actually using the lukeall.jar (0.8.1, linked against lucene 2.4) to look
 into what got indexed, but I am a bit wary of how what I select in the
 'analyzer' drop-down option in Luke actually affects what I see.
 
 B
 
 _
 {Beto|Norberto|Numard} Meijome
 
 Web2.0 is outsourced R&D from Web1.0 companies.
The Reverend
 
 I speak for myself, not my employer. Contents may be hot. Slippery when wet.
 Reading disclaimers makes you go blind. Writing them is worse. You have been
 Warned.



Problem with searching using the DisMaxHandler

2008-06-19 Thread Brian Carmalt
Hello all, 

I have defined a DisMax handler. It should search in the following
fields: content1, content2, and id (the doc uid). I would like to be able to
specify a query like the following:
(search terms) AND (id1 OR id2 .. idn)
My intent is to retrieve only the docs in which hits for the search
terms occur and which have one of the specified ids.

Unfortunately, I get no document matches.

Can anyone shed some light on what I am doing wrong?

Thanks, 
Brian



Re: AW: My First Solr

2008-06-13 Thread Brian Carmalt
Can you see whether the document update is successful? When you start Solr with
java -jar start.jar for the example, Solr will list the document ids
of the docs that you are adding and tell you how long the update took.

A simple but brute-force method to find out whether a document has been
committed is to stop the server and then restart it.

You can also use the solr/admin/stats.jsp page to see if the docs are
there. 

After looking at your query in the results you posted, I would bet that
you are not specifying a search field. Try searching for anwendung:KIS,
or for id:[1 TO *] to see all the docs in your index.

Brian

On Friday, 13.06.2008, at 07:40 +0200, Thomas Lauer wrote:
 i have tested:
 SimplePostTool: version 1.2
 SimplePostTool: WARNING: Make sure your XML documents are encoded in
 UTF-8, other encodings are not currently supported
 SimplePostTool: POSTing files to http://localhost:8983/solr/update..
 SimplePostTool: POSTing file import_sample.xml
 SimplePostTool: COMMITting Solr index changes.. 



Re: My First Solr

2008-06-13 Thread Brian Carmalt
http://wiki.apache.org/solr/DisMaxRequestHandler

In solrconfig.xml there are example configurations for the DisMax.

Sorry I told you the wrong name, not enough coffee this morning.  

Brian.


On Friday, 13.06.2008, at 09:40 +0200, Thomas Lauer wrote:



Re: AW: My First Solr

2008-06-13 Thread Brian Carmalt
No, you do not have to reindex. You do have to restart the server.

The bf entry lists fields that are not in your document: popularity and
price. Delete the bf entry; you do not need it unless you want to use
boost functions.


Brian

On Friday, 13.06.2008, at 10:36 +0200, Thomas Lauer wrote:
 ok,
 
 my dismax
 
   <requestHandler name="dismax" class="solr.DisMaxRequestHandler">
     <lst name="defaults">
      <str name="echoParams">explicit</str>
      <float name="tie">0.01</float>
      <str name="qf">
        beschreibung^0.5 ordner^1.0 register^1.2 Benutzer^1.5 guid^10.0 mandant^1.1
      </str>
      <str name="pf">
        beschreibung^0.2 ordner^1.1 register^1.5 manu^1.4 manu_exact^1.9
      </str>
      <str name="bf">
        ord(poplarity)^0.5 recip(rord(price),1,1000,1000)^0.3
      </str>
      <str name="fl">
        guid,beschreibung,mandant,Benutzer
      </str>
      <str name="mm">
        2&lt;-1 5&lt;-2 6&lt;90%
      </str>
      <int name="ps">100</int>
      <str name="q.alt">*:*</str>
     </lst>
   </requestHandler>
 
 Do I have to reindex?
 
 I search with this URL:
 http://localhost:8983/solr/select?indent=on&version=2.2&q=bonow&start=0&rows=10&fl=*%2Cscore&qt=dismax&wt=standard&explainOther=&hl.fl=
 
 The response is:
 HTTP Status 400 - undefined field text
 
 type Status report
 message undefined field text
 description The request sent by the client was syntactically incorrect 
 (undefined field text).
 
 
 Regards Thomas
 
 
 -----Original Message-----
 From: Brian Carmalt [mailto:[EMAIL PROTECTED]
 Sent: Friday, June 13, 2008 09:50
 To: solr-user@lucene.apache.org
 Subject: Re: My First Solr
 
 http://wiki.apache.org/solr/DisMaxRequestHandler
 
 In solrconfig.xml there are example configurations for the DisMax.
 
 Sorry I told you the wrong name, not enough coffee this morning.
 
 Brian.
 
 
 On Friday, 13.06.2008, at 09:40 +0200, Thomas Lauer wrote:
 
 
 
 



Searching across many fields

2008-06-05 Thread Brian Carmalt
Hello All, 

We are thinking about a totally dynamic indexing schema, where the only
field that is known to be in the index is the ID field. This means that in
order to search the index, the names of the fields we want to search in
must be specified.

q=title:solr+content:solr+summary:solr and so on. 

This works well when the number of fields is small, but what are the
performance ramifications when the number of fields is more than 1000? 
Is this a serious performance killer? If yes, what would we need to
counteract it: more RAM or faster CPUs? Or both?

Is it better to copy all fields to a content field and then always
search there? This works, but then it is hard to boost specific field
values, and that is what we want to do.

Any advice or experience in this area is appreciated.


Thanks, 

Brian




Re: exception while feeding converted text from pdf

2008-05-15 Thread Brian Carmalt
Hello Cam,

Are you writing your XML by hand, as in with no XML writer? That can cause
problems. In your exception it says 'latitude 59 &'; the & should have been
converted to '&amp;' (I think). If you can use Java 6, there is an
XMLStreamWriter in javax.xml.stream that does automatic special-character
escaping. This can simplify writing simple XML.

Unfortunately the stream writer does not filter out invalid XML
characters, so I will point you to a helpful website: 
http://cse-mjmcl.cse.bris.ac.uk/blog/2007/02/14/1171465494443.html
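
A minimal sketch of writing a Solr add document with XMLStreamWriter;
writeCharacters() escapes &, < and > automatically. The file name and field
values are made up for illustration:

import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamWriter;

public class SolrAddXmlWriter {
  public static void main(String[] args) throws Exception {
    Writer out = new OutputStreamWriter(new FileOutputStream("add.xml"), "UTF-8");
    XMLStreamWriter xml = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
    xml.writeStartDocument("UTF-8", "1.0");
    xml.writeStartElement("add");
    xml.writeStartElement("doc");
    xml.writeStartElement("field");
    xml.writeAttribute("name", "content");
    // writeCharacters() turns & into &amp;, < into &lt; and so on,
    // so text like "latitude 59 & ..." no longer breaks the update handler.
    xml.writeCharacters("latitude 59 & longitude 18");
    xml.writeEndElement(); // field
    xml.writeEndElement(); // doc
    xml.writeEndElement(); // add
    xml.writeEndDocument();
    xml.close();
    out.close();
  }
}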


Hope this helps.

Brian

On Wednesday, 14.05.2008, at 19:23 +0300, Cam Bazz wrote:
 Hello,
 
 I made a simple Java program to convert my pdfs to text, and then to an xml
 file.
 I am getting a strange exception. I think the converted files have some
 errors. Should I encode the txt string that I extract from the pdfs in a
 special way?
 
 Best,
 -C.B.
 
 SEVERE: org.xmlpull.v1.XmlPullParserException: entity reference names can not
 start with character ' ' (position: START_TAG seen
 ...ay\n  latitude 59 ...
 @80:64)
 at org.xmlpull.mxp1.MXParser.parseEntityRef(MXParser.java:2212)
 at org.xmlpull.mxp1.MXParser.nextImpl(MXParser.java:1275)
 at org.xmlpull.mxp1.MXParser.next(MXParser.java:1093)
 at org.xmlpull.mxp1.MXParser.nextText(MXParser.java:1058)
 at
 org.apache.solr.handler.XmlUpdateRequestHandler.readDoc(XmlUpdateRequestHandler.java:332)
 at
 org.apache.solr.handler.XmlUpdateRequestHandler.update(XmlUpdateRequestHandler.java:162)
 at
 org.apache.solr.handler.XmlUpdateRequestHandler.handleRequestBody(XmlUpdateRequestHandler.java:84)
 at
 org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:77)
 at org.apache.solr.core.SolrCore.execute(SolrCore.java:658)
 at
 org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:191)
 at
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:159)
 at
 org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
 at
 org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
 at
 org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
 at
 org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
 at
 org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
 at
 org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
 at
 org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
 at
 org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
 at
 org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
 at org.mortbay.jetty.Server.handle(Server.java:285)
 at
 org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
 at
 org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:835)
 at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:641)
 at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:208)
 at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
 at
 org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
 at
 org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)



Re: How to effectively search inside fields that should be indexed without changing them.

2007-12-14 Thread Brian Carmalt

Hello Otis,

The example I provided was a simplified one. The real use case is that we
will have to dynamically adapt to field values, and we have no idea what
form they will have. So unfortunately, a custom tokenizer will not work. I
changed the n-gram values to min=max=2, and I can match sub-terms inside
the fields that are analyzed with the NGramTokenizer. But I haven't had
the time to test it completely.

Can you quickly outline why n-grams are not a good solution for my problem?

Thanks, Brian

Otis Gospodnetic wrote:

Brian,

This is not really a job for n-grams.  It sounds like you'll want to write a 
custom Tokenizer that has knowledge about this particular pattern, knows how to 
split input like the one in your example, and produces multiple tokens out of 
it.  For the natural-language part you can probably get away with one of the 
existing tokenizers/analyzers/factories.  For the first part you'll likely want 
to extract (W+)0+ -- 1 or more letters followed by 1 or more zeros -- as one 
token, and then 0+(D+) -- 1 or more zeros followed by 1 or more digits.
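
The splitting Otis describes could be prototyped with a plain regex before
wrapping it in a real Lucene Tokenizer. A small sketch, assuming the
letters-zeros-digits pattern of the example title:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TitleSplitter {
  // Assumed pattern: letters, zero padding, then the numeric part,
  // e.g. "ABC0001231-This is an important doc.pdf" -> "ABC" and "1231".
  private static final Pattern CODE = Pattern.compile("^([A-Za-z]+)0*([1-9][0-9]*)");

  public static void main(String[] args) {
    Matcher m = CODE.matcher("ABC0001231-This is an important doc.pdf");
    if (m.find()) {
      System.out.println(m.group(1)); // ABC  -> one token
      System.out.println(m.group(2)); // 1231 -> another token
    }
  }
}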

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message 
From: Brian Carmalt [EMAIL PROTECTED]
To: solr-user@lucene.apache.org
Sent: Tuesday, December 11, 2007 9:17:32 AM
 Subject: How to effectively search inside fields that should be indexed without 
 changing them.

Hello all,

 The titles of our docs have the form ABC0001231-This is an important 
doc.pdf. I would like to be able to
search for 'important', or '1231',  or 'ABC000*', or 'This is an 
 important doc' in the title field. I looked at the NGramTokenizer and 
 tried to use it.
 In the index it doesn't seem to work; I cannot get any hits. The 
analysis tool on the admin pages shows me that the
ngram tokenizing works by highlighting the matches between the indexed 
value and a query. I have set the

min and max ngram size to 2 and 6, with side equal to left.

Can anyone recommend a procedure that will allow me to search as stated
 
above?


I would also like to find out more about how to use the NgramTokenizer,
 
but have found little in the form of

documentation. Anyone know about any good sources?

Thanks,

Brian




  




How to effectively search inside fields that should be indexed without changing them.

2007-12-11 Thread Brian Carmalt

Hello all,

The titles of our docs have the form ABC0001231-This is an important 
doc.pdf. I would like to be able to
search for 'important', or '1231',  or 'ABC000*', or 'This is an 
important doc' in the title field. I looked at the NGramTokenizer and 
tried to use it.
In the index it doesn't seem to work; I cannot get any hits. The 
analysis tool on the admin pages shows me that the
ngram tokenizing works by highlighting the matches between the indexed 
value and a query. I have set the

min and max ngram size to 2 and 6, with side equal to left.

Can anyone recommend a procedure that will allow me to search as stated 
above?


I would also like to find out more about how to use the NgramTokenizer, 
but have found little in the form of

documentation. Anyone know about any good sources?

Thanks,

Brian


Re: out of heap space, every day

2007-12-04 Thread Brian Carmalt

Hello,

I am also fighting heap exhaustion, though during the indexing 
step. I was able to minimize, but not fix, the problem
by setting the thread stack size to 64k with -Xss64k. The minimum size 
is OS-specific, but the VM will tell
you if you set the size too small. You can try it; it may help.

Brian

Brian Whitman wrote:
This maybe more of a general java q than a solr one, but I'm a bit 
confused.


We have a largish solr index, about 8M documents, the data dir is 
about 70G. We're getting about 500K new docs a week, as well as about 
1 query/second.


Recently (when we crossed about the 6M threshold) resin has been 
stopping with the following:


/usr/local/resin/log/stdout.log:[12:08:21.749] [28304] HTTP/1.1 500 
Java heap space
/usr/local/resin/log/stdout.log:[12:08:21.749] 
java.lang.OutOfMemoryError: Java heap space


Only a restart of resin will get it going again, and then it'll crash 
again within 24 hours.


It's a 4GB machine and we run it with args=-J-mx2500m -J-ms2000m We 
can't really raise this any higher on the machine.


Are there 'native' memory requirements for solr as a function of index 
size? Does a 70GB index require some minimum amount of wired RAM? Or 
is there some mis-configuration w/ resin or solr or my system? I don't 
really know Java well but it seems strange that the VM can't page RAM 
out to disk or really do something else beside stopping the server.




Re: Weird memory error.

2007-11-20 Thread Brian Carmalt

Can you recommend one? I am not familiar with how to profile under Java.

Yonik Seeley wrote:

Can you try a profiler to see where the memory is being used?
-Yonik

On Nov 20, 2007 11:16 AM, Brian Carmalt [EMAIL PROTECTED] wrote:
  

Hello all,

I started looking into the scalability of solr, and have started getting
weird  results.
I am getting the following error:

Exception in thread btpool0-3 java.lang.OutOfMemoryError: unable to
create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:574)
at
org.mortbay.thread.BoundedThreadPool.newThread(BoundedThreadPool.java:377)
at
org.mortbay.thread.BoundedThreadPool.dispatch(BoundedThreadPool.java:94)
at
org.mortbay.jetty.bio.SocketConnector$Connection.dispatch(SocketConnector.java:187)
at
org.mortbay.jetty.bio.SocketConnector.accept(SocketConnector.java:101)
at
org.mortbay.jetty.AbstractConnector$Acceptor.run(AbstractConnector.java:516)
at
org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)

This only occurs when I send docs to the server in batches of around 10
as separate processes.
If I send them serially, the heap grows up to 1200M with no errors.

When I observe the VM during its operation, it doesn't seem to run out
of memory.  The VM starts
with 1024M and can allocate up to 1800M. I start getting the error
listed above when the memory
usage is right around 1 G. I have been using the Jconsole program on
windows to observe the
jetty server by using the com.sun.management.jmxremote* functions on the
server side. The number of threads
is always around 30, and jetty can create up to 250, so I don't think
that's the problem. I can't really imagine that
the monitoring process is using the other 800M of the allowable heap
memory, but it could be.
But the problem occurs without monitoring, even when the VM heap is set
to 1500M.

Does anyone have an idea as to why this error is occurring?

Thanks,
Brian




  




Weird memory error.

2007-11-20 Thread Brian Carmalt

Hello all,

I started looking into the scalability of solr, and have started getting 
weird  results.

I am getting the following error:

Exception in thread btpool0-3 java.lang.OutOfMemoryError: unable to 
create new native thread

   at java.lang.Thread.start0(Native Method)
   at java.lang.Thread.start(Thread.java:574)
   at 
org.mortbay.thread.BoundedThreadPool.newThread(BoundedThreadPool.java:377)
   at 
org.mortbay.thread.BoundedThreadPool.dispatch(BoundedThreadPool.java:94)
   at 
org.mortbay.jetty.bio.SocketConnector$Connection.dispatch(SocketConnector.java:187)
   at 
org.mortbay.jetty.bio.SocketConnector.accept(SocketConnector.java:101)
   at 
org.mortbay.jetty.AbstractConnector$Acceptor.run(AbstractConnector.java:516)
   at 
org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)


This only occurs when I send docs to the server in batches of around 10 
as separate processes.

If I send them serially, the heap grows up to 1200M with no errors.

When I observe the VM during its operation, it doesn't seem to run out 
of memory.  The VM starts
with 1024M and can allocate up to 1800M. I start getting the error 
listed above when the memory
usage is right around 1 G. I have been using the Jconsole program on 
windows to observe the
jetty server by using the com.sun.management.jmxremote* functions on the 
server side. The number of threads
is always around 30, and jetty can create up to 250, so I don't think 
that's the problem. I can't really imagine that
the monitoring process is using the other 800M of the allowable heap 
memory, but it could be.
But the problem occurs without monitoring, even when the VM heap is set 
to 1500M.


Does anyone have an idea as to why this error is occurring?

Thanks,
Brian


Re: [jira] Commented: (SOLR-380) There's no way to convert search results into page-level hits of a structured document.

2007-10-31 Thread Brian Carmalt


There is more to consider here.  Lucene now supports payloads, 
additional metadata on terms that can be leveraged with custom 
queries.  I've not yet tinkered with them myself, but my understanding 
is that they would be useful (and in fact designed in part) for 
representing structured documents.  It would behoove us to investigate 
how payloads might be leveraged for your needs here, such that a 
single field could represent an entire document, with payloads 
representing the hierarchical structure.  This will require 
specialized Analyzer and Query subclasses be created to take advantage 
of payloads.  The Lucene community itself is just now starting to 
exploit this new feature, so there isn't a lot out there on it yet, 
but I think it holds great promise for these purposes.


Erik



Hello Erik,

Could you elaborate on how payloads could be used to represent a 
structured doc?


Thanks, Brian


Searching dynamic fields

2007-10-15 Thread Brian Carmalt

Hello all,

Is there a way to search dynamicFields without having to specify the
name of the field in a query?
Example: I have indexed a doc with the field name myDoc_text_en, and I
have a dynamic field *_text_en which maps to a type of text_en. How can I
search this field without knowing its specific name? Can I search
according to field type? I have looked at the
DisMaxRequestHandler, which might work, but it doesn't accept wildcard
field names or field types.


I'm using 1.3.
Thanks in advance.

Brian



Re: Querying for an id with a colon in it

2007-10-15 Thread Brian Carmalt

Robert Young schrieb:

Hi,

If my unique identifier is called guid and one of the ids in it is,
for example, article:123, how can I query for that article id? I
have tried a number of ways but I always either get no results or an
error. It seems to be to do with having the colon in the id value.

eg.
?q=guid:article:123 - error
?q=guid:article:123 - error
?q=guid:article%3A123 - error

Any ideas?
Cheers
Rob

  

Try it with a \: (a backslash-escaped colon). That's what the Lucene
Query Parser Syntax page says.
It doesn't cause an error, but I don't know if it will provide the 
results you want.
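
For example, building the escaped query string in Java (the guid value is the
one from the question):

public class ColonEscape {
  public static void main(String[] args) {
    String id = "article:123";
    // Escape the colon so the query parser treats it as a literal character
    // instead of a field:value separator.
    String query = "guid:" + id.replace(":", "\\:");
    System.out.println(query); // prints guid:article\:123
  }
}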


Brian


Re: Searching dynamic fields

2007-10-15 Thread Brian Carmalt

Hello Erik,

A field copy implies a doubling of the data in the index, right? Or 
should I not store or index the
dynamic field, and instead copy it to another field and then let that be 
indexed and stored?


Another possibility would be to search all fields, but that doesn't seem 
to be possible. Or am I missing something?


Thanks, Brian.

Erik Hatcher wrote:
Brian - you can copyField all *_en fields to a common contents_en 
field, for example, and then search contents_en:(for whatever).


You cannot currently search by field type, though that is an 
interesting possible feature.


I would like to see Solr support wildcarded field names in request 
parameters, but we're not there yet.


Erik


On Oct 15, 2007, at 9:32 AM, Brian Carmalt wrote:


Hello all,

Is there a way to search dynamicFields without having to specify the
name of the field in a query?
Example: I have indexed a doc with the field name myDoc_text_en, and I 
have a dynamic field
*_text_en which maps to a type of text_en. How can I search this 
field without knowing its
specific name?  Can I search according to field type? I have looked 
at the
DisMaxRequestHandler, which might work, but it doesn't accept 
wildcard field names or field types.


I'm using 1.3.
Thanks in advance.

Brian







Re: Indexing very large files.

2007-09-07 Thread Brian Carmalt

Lance Norskog wrote:

Now I'm curious: what is the use case for documents this large?

Thanks,

Lance Norskog


  
It is an edge case, but could become relevant for us. I was told to 
explore the possibilities, and that's what I'm doing. :)


Since I haven't heard any suggestions as to how to do this with a stock 
Solr install, other than increasing the VM memory, I'll assume it will have to 
be done

with a custom solution.

Thanks for the answers and the interest.

Brian


Re: Indexing very large files.

2007-09-06 Thread Brian Carmalt

Yonik Seeley wrote:

On 9/5/07, Brian Carmalt [EMAIL PROTECTED] wrote:
  

I've been trying to index a 300MB file with Solr 1.2. I keep getting out of
memory heap errors.



300MB of what... a single 300MB document?  Or does that file represent
multiple documents in XML or CSV format?

-Yonik
  

Hello Yonik,

Thank you for your fast reply.  It is one large document. If it was made up
of smaller docs, I would split it up and index them separately.

Can Solr be made to handle such large docs?

Thanks, Brian


Re: Indexing very large files.

2007-09-06 Thread Brian Carmalt

Hello again,

I run Solr on Tomcat under Windows and use the Tomcat monitor to start 
the service. I have set the minimum heap
size to 512MB and the maximum to 1024MB. The system has 2 GB of 
RAM. The error that I get after sending
approximately 300 MB is:

java.lang.OutOfMemoryError: Java heap space
   at org.xmlpull.mxp1.MXParser.fillBuf(MXParser.java:2947)
   at org.xmlpull.mxp1.MXParser.more(MXParser.java:3026)
   at org.xmlpull.mxp1.MXParser.nextImpl(MXParser.java:1384)
   at org.xmlpull.mxp1.MXParser.next(MXParser.java:1093)
   at org.xmlpull.mxp1.MXParser.nextText(MXParser.java:1058)
   at 
org.apache.solr.handler.XmlUpdateRequestHandler.readDoc(XmlUpdateRequestHandler.java:332)
   at 
org.apache.solr.handler.XmlUpdateRequestHandler.update(XmlUpdateRequestHandler.java:162)
   at 
org.apache.solr.handler.XmlUpdateRequestHandler.handleRequestBody(XmlUpdateRequestHandler.java:84)
   at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:77)

   at org.apache.solr.core.SolrCore.execute(SolrCore.java:658)
   at 
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:191)
   at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:159)
   at 
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
   at 
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
   at 
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:230)
   at 
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175)
   at 
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
   at 
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:104)
   at 
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
   at 
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:261)
   at 
org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:844)
   at 
org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:581)
   at 
org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447)

   at java.lang.Thread.run(Thread.java:619)

After sleeping on the problem I see that it does not directly stem from 
Solr, but from the module org.xmlpull.mxp1.MXParser. Hmmm. I'm open to
suggestions and ideas.

First, is this doable?
If yes, will I have to modify the code to save the file to disk and then 
read it back in order to index it in chunks?
Or can I get it working on a stock Solr install?

Thanks,

Brian

Norberto Meijome wrote:

On Wed, 05 Sep 2007 17:18:09 +0200
Brian Carmalt [EMAIL PROTECTED] wrote:

  
I've been trying to index a 300MB file with Solr 1.2. I keep getting out of 
memory heap errors.

Even on an empty index with one gig of VM memory it still won't work.



Hi Brian,

VM != heap memory.

VM = OS memory
heap memory = memory made available by the JavaVM to the Java process. Heap 
memory errors are hardly ever an issue of the app itself (other than, of course, 
with bad programming... but it doesn't seem to be the issue here so far).


[EMAIL PROTECTED] [Thu Sep  6 14:59:21 2007]
/usr/home/betom
$ java -X
[...]
-Xms<size>        set initial Java heap size
-Xmx<size>        set maximum Java heap size
-Xss<size>        set java thread stack size
[...]

For example, start solr as :
java  -Xms64m -Xmx512m   -jar start.jar

YMMV with respect to the actual values you use.

Good luck,
B
_
{Beto|Norberto|Numard} Meijome

Windows caters to everyone as though they are idiots. UNIX makes no such assumption. 
It assumes you know what you are doing, and presents the challenge of figuring it out for yourself if you don't.


I speak for myself, not my employer. Contents may be hot. Slippery when wet. 
Reading disclaimers makes you go blind. Writing them is worse. You have been 
Warned.

  




Re: Indexing very large files.

2007-09-06 Thread Brian Carmalt

Hello Thorsten,
I am using Solr 1.2.0. I'll try the svn version out and see if that helps.

Thanks,
Brian


 Which version of Solr do you use?

http://svn.apache.org/viewvc/lucene/solr/trunk/src/java/org/apache/solr/handler/XmlUpdateRequestHandler.java?view=markup

The trunk version of the XmlUpdateRequestHandler is now based on StAX.
 You may want to try it and see whether it works better.

Please try and report back.

salu2