Re: lucene nicking my memory ?

2008-12-04 Thread Khawaja Shams
Magnus, If you get a chance, can you try setting different -Xms and -Xmx values? For instance, try -Xms384M and -Xmx1024M. The "forced" GC [request] will almost always reduce the memory footprint simply because of the weak references that lucene leverages, but I bet subsequent queries are not as

Re: lucene nicking my memory ?

2008-12-04 Thread Khawaja Shams
Magnus, Please feel free to ignore my last email; I see that you had this setup earlier. As far as using up all the memory it can get its hands on, this is actually a good thing. This allows Lucene and other java applications to keep more things in cache when more memory is available. Also, if

Re: lucene nicking my memory ?

2008-12-04 Thread Magnus Rundberget
hmmm, Well in production (1024M heap), it seems that after a while (some hundred user queries) the memory starts reaching the max threshold and when it does at some point it becomes unresponsive. I'd rather it was slightly less performant (cleaning up memory more frequently) than freezing u

Re: Pdf in Lucene?

2008-12-04 Thread Kalani Ruwanpathirana
Hi, In my case I used PDFBox, just to extract the text from PDF document and then I created the Lucene document giving the extracted text. (I didn't use the PDFBox built in Lucene search engine). So I didn't get any incompatibility problems. This blog post shows the way. http://kalanir.blogspot.c

Re: lucene nicking my memory ?

2008-12-04 Thread Eric Bowman
I'm not sure I really understand what the problem is here. First of all, the VM will appear to consume most or all of the memory you give it. You really shouldn't worry about this, and it is misleading to look at what happens when you force a gc. I think there are really only 2 things that matter

Re: lucene nicking my memory ?

2008-12-04 Thread Michael McCandless
It's important to understand that the JRE using up all memory to the max you specified, and then doing GC, is entirely "normal" (if not desirable). This is just how Java works: when code runs it generates garbage, sometimes quite a bit (eg if you make a new IndexSearcher per query), and
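
To watch the behaviour described in this thread, it can help to log how much of the configured heap the JVM is actually holding. The following is a minimal sketch using only java.lang.Runtime; the class name HeapReport and its methods are hypothetical and not part of Lucene. It could be started with, for example, java -Xms384m -Xmx1024m HeapReport.

```java
// Sketch: report used, committed and maximum heap as seen by the running JVM.
final class HeapReport {
    /** Used heap in bytes: what the JVM has claimed minus the part of it that is free. */
    static long usedBytes(Runtime rt) {
        return rt.totalMemory() - rt.freeMemory();
    }

    /** One-line summary in megabytes; "max" reflects the -Xmx limit when one is set. */
    static String describe(Runtime rt) {
        long mb = 1024L * 1024L;
        return String.format("used=%dMB committed=%dMB max=%dMB",
                usedBytes(rt) / mb, rt.totalMemory() / mb, rt.maxMemory() / mb);
    }

    public static void main(String[] args) {
        System.out.println(describe(Runtime.getRuntime()));
    }
}
```

A "used" figure that climbs toward "max" and then drops after a collection is the normal pattern the replies describe. The "used" number by itself says little, because it includes garbage that has not been collected yet.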

Re: NPE inside org.apache.lucene.index.SegmentReader.getNorms

2008-12-04 Thread Michael McCandless
Mark Miller wrote: Sounds familiar. This may actually be in JIRA already. Maybe this is: https://issues.apache.org/jira/browse/LUCENE-689 ? I just marked it as fix version 2.9. Mike

Re: Indexing Names in Lucene -- Thomas = Tom, etc

2008-12-04 Thread Grant Ingersoll
I believe these lists exist out on the Internet, just google for something like "most common first names" or "common nicknames" (yields: http://www.cc.kyoto-su.ac.jp/~trobb/nicklist.html for instance) If you want to dig deeper, you might look into named entity recognition research, and a

RE: Pdf in Lucene?

2008-12-04 Thread tiziano bernardi
Thanks, very kind ... But I've tried that code and it does not work ... Could you send me a simple working class that uses it, please? Thanks > Date: Thu, 4 Dec 2008 15:19:26 +0530 > From: [EMAIL PROTECTED] > To: java-user@lucene.apache.org > Subject: Re: Pdf in Lucene? > > Hi, > > In my case I used PDFBox

Design guidance - search strategy

2008-12-04 Thread Ian Vink
I have documents with this simple schema in Lucene which I can not change. docid: (int) contents: (text) The user is given a list of 10,000 documents in a tree which they select to search, usually they select 5000 or so. I only want to search those 5000 documents. I have the 'id' fields. That is

Re: Design guidance - search strategy

2008-12-04 Thread Erick Erickson
It's generally a bad idea to iterate a Hits object. In fact, Hits is deprecated in recent versions of Lucene. The underlying problem is that the query is re-executed every 100 responses or so. First suggestion, create a Filter by iterating over your docid field and use that in your searches see se

Re: Pdf in Lucene?

2008-12-04 Thread Kalani Ruwanpathirana
Hi Tiziano, What is the error you got? I think you can get the text easily using the code shown below. FileInputStream fi = new FileInputStream(new File("sample.pdf")); PDFParser parser = new PDFParser(fi); parser.parse(); COSDocument cd = parser.getDocument(); PDFTextStripper stripper = new PD

RE: Pdf in Lucene?

2008-12-04 Thread tiziano bernardi
I entered your code inside a main. I imported the required libraries, but it gives me errors. First error: parser.parse(); Syntax error on token "parse", Identifier expected after this token Second error: cd.close(); Syntax error on token "close", Identifier expected after this token Third error: doc

Re: I would want to know more about the lucene implementation in C++

2008-12-04 Thread Otis Gospodnetic
There is CLucene. It's not a part of Apache, but lives on SourceForge, I think. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message > From: Ariel <[EMAIL PROTECTED]> > To: lucene user > Sent: Tuesday, December 2, 2008 2:13:08 PM > Subject: I wou

RE: NPE inside org.apache.lucene.index.SegmentReader.getNorms

2008-12-04 Thread Teruhiko Kurosaka
> Mark Miller wrote: > > > Sounds familiar. This may actually be in JIRA already. > > Maybe this is: > > https://issues.apache.org/jira/browse/LUCENE-689 > > ? > > I just marked it as fix version 2.9. > > Mike Not exactly. NPE was from SegmentReader in my case while NPE is from SpanOrQu

Slow queries with lots of hits

2008-12-04 Thread Tim Sturge
Hi all, I have an interesting problem with my query traffic. Most of the queries run in a fairly short amount of time (< 100ms) but a few take over 1000ms. These queries are predominantly those with a huge number of hits (>1 million hits in a >100 million document index). The time taken (as far as

Re: Slow queries with lots of hits

2008-12-04 Thread Erick Erickson
The problem here is how *could* a system return even the top 10,000 results without scoring them all? What if the millionth hit resulted in the very best match in the entire corpus? That said, sorting may well be the issue here rather than scoring. You can use a TopDocCollector to get the top N ma

Re: Slow queries with lots of hits

2008-12-04 Thread Tim Sturge
That makes sense. I should be more precise in that all I need is 100 of the 1 "reasonable" results. The concern I would have with a TopDocCollector is that this is biased towards the top of the index which translates for me into a bias for older documents. I'd prefer no age bias or a newer doc

Re: Slow queries with lots of hits

2008-12-04 Thread Erick Erickson
Huh? TopDocCollector isn't biased unless you suppose that you'll have many documents scoring *exactly* the same. You collect the top N scoring documents. Actually, I think this is all pretty much done for you with the Searcher.search(Query query, Filter filter, int n) method. You can pass null for
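
The point that collecting the top N is only "biased" when scores tie exactly can be illustrated with a small bounded-heap sketch in plain Java. This is not the Lucene TopDocCollector source; the names TopNSketch, Hit and topDocs are hypothetical, and the rule that an exact tie goes to the lower doc id is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

final class TopNSketch {
    /** A (doc, score) pair; a hypothetical stand-in for a scored hit. */
    static final class Hit {
        final int doc;
        final float score;
        Hit(int doc, float score) { this.doc = doc; this.score = score; }
    }

    // "Worst first" order: a lower score is worse; on an exact tie the higher doc id is worse.
    static final Comparator<Hit> WORST_FIRST = (a, b) -> {
        int byScore = Float.compare(a.score, b.score);
        return byScore != 0 ? byScore : Integer.compare(b.doc, a.doc);
    };

    /** Looks at every document's score once and keeps only the n best, best first. */
    static List<Integer> topDocs(float[] scoreByDoc, int n) {
        List<Integer> best = new ArrayList<>();
        if (n <= 0) return best;
        PriorityQueue<Hit> heap = new PriorityQueue<>(n, WORST_FIRST);
        for (int doc = 0; doc < scoreByDoc.length; doc++) {
            Hit hit = new Hit(doc, scoreByDoc[doc]);
            if (heap.size() < n) {
                heap.add(hit);
            } else if (WORST_FIRST.compare(hit, heap.peek()) > 0) {
                heap.poll();   // evict the current worst of the kept hits
                heap.add(hit);
            }
        }
        while (!heap.isEmpty()) best.add(heap.poll().doc); // worst to best
        Collections.reverse(best);                         // best to worst
        return best;
    }

    public static void main(String[] args) {
        // prints [1, 3, 2]
        System.out.println(topDocs(new float[] {0.1f, 0.9f, 0.5f, 0.9f, 0.3f}, 3));
    }
}
```

Every hit is still scored, which is why the cost grows with the number of matches. Position in the index only matters for docs 1 and 3, whose scores are identical.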

NIOFSDirectory

2008-12-04 Thread John Wang
Hi guys: We did some profiling and benchmarking: The thread contention on FSDirectory is gone, and for the set of queries we are running, performance improved by a factor of 5 (to be conservative). Great job, this is awesome, a simple change that made a huge difference. To get NIO

Re: Design guidance - search strategy

2008-12-04 Thread Ian Vink
So, let me get this straight. :) A Query tells Lucene what to search for. Then a Filter tells lucene what? I think I'm missing understanding about what a Filter is for. Ian On Thu, Dec 4, 2008 at 9:36 AM, Erick Erickson <[EMAIL PROTECTED]>wrote: > It's generally a bad idea to iterate a Hits

Re: NIOFSDirectory

2008-12-04 Thread Yonik Seeley
On Thu, Dec 4, 2008 at 4:11 PM, John Wang <[EMAIL PROTECTED]> wrote: > Hi guys: >We did some profiling and benchmarking: > >The thread contention on FSDirectory is gone, and for the set of queries > we are running, performance improved by a factor of 5 (to be conservative). > >Great job

Re: NIOFSDirectory

2008-12-04 Thread Glen Newton
Sorry, what version are we talking about? :-) thanks, Glen 2008/12/4 Yonik Seeley <[EMAIL PROTECTED]>: > On Thu, Dec 4, 2008 at 4:11 PM, John Wang <[EMAIL PROTECTED]> wrote: >> Hi guys: >>We did some profiling and benchmarking: >> >>The thread contention on FSDirectory is gone, and fo

Re: NIOFSDirectory

2008-12-04 Thread John Wang
version 2.4, sorry for not clarifying. Yonik, pardon my ignorance. I still don't get it. When instantiating NIOFSDirectory, how would I specify the path? I see only the empty constructor. With FSDirectory, you use the factory: getDirectory(File) -John On Thu, Dec 4, 2008 at 1:26 PM, Yonik Seeley

Re: NIOFSDirectory

2008-12-04 Thread Yonik Seeley
On Thu, Dec 4, 2008 at 4:32 PM, Glen Newton <[EMAIL PROTECTED]> wrote: > Sorry, what version are we talking about? :-) The current development version of Lucene allows you to directly instantiate FSDirectory subclasses. -Yonik > thanks, > > Glen > > 2008/12/4 Yonik Seeley <[EMAIL PROTECTED]>

Re: NIOFSDirectory

2008-12-04 Thread John Wang
That does not help. The File/path is not stored with the instance. It is in a map FSDirectory keeps statically. Should subclasses of FSDirectory be modifying the map? This is not a question about how to subclass or customize FSDirectory. This is more on how to use NIOFSDirectory class. I am hoping

Re: Design guidance - search strategy

2008-12-04 Thread Erick Erickson
See the class in the docs or Lucene In Action for more detail, but here's the short form. A Filter is a bitset where each bit's ordinal position stands for a document. I.e. bit 1 means doc id 1, bit 519 represents document 519 etc. When you pass a filter to one of the search routines that acc
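
The bitset idea can be sketched in plain Java with java.util.BitSet. This is an illustration of the concept only, not the Lucene Filter API: the names DocIdFilterSketch, allow and filterHits are hypothetical, and the sketch assumes the selected ids have already been mapped to Lucene's internal document numbers.

```java
import java.util.Arrays;
import java.util.BitSet;

final class DocIdFilterSketch {
    /** Builds a bitset of size maxDoc in which bit i is set when document i may be returned. */
    static BitSet allow(int maxDoc, int... docIds) {
        BitSet bits = new BitSet(maxDoc);
        for (int doc : docIds) {
            bits.set(doc);
        }
        return bits;
    }

    /** Keeps only the matching documents whose bit is set; order is preserved. */
    static int[] filterHits(int[] matchingDocs, BitSet allowed) {
        return Arrays.stream(matchingDocs).filter(allowed::get).toArray();
    }

    public static void main(String[] args) {
        BitSet allowed = allow(10, 1, 5, 7);
        // The query matched docs 0, 1, 2, 5 and 9; only 1 and 5 are allowed.
        System.out.println(Arrays.toString(filterHits(new int[] {0, 1, 2, 5, 9}, allowed)));
    }
}
```

Checking a bit costs the same whether 50 or 5000 documents are selected, which is what makes this approach attractive for large selections.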

Re: Slow queries with lots of hits

2008-12-04 Thread John Wang
Tim: How about implementing your own HitCollector and stop when you have collected 100 docs with score above certain threshold? BTW, are there lotsa concurrent searches? -John On Thu, Dec 4, 2008 at 12:52 PM, Tim Sturge <[EMAIL PROTECTED]> wrote: > That makes sense. I should be more p
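
The early-stopping idea might look roughly like this in plain Java. It is a sketch of the control flow only, not a Lucene HitCollector: the names EarlyStopSketch and collect are hypothetical, and the scores are assumed to arrive in doc id order.

```java
import java.util.ArrayList;
import java.util.List;

final class EarlyStopSketch {
    /**
     * Walks the scores in doc id order and stops as soon as `limit` documents
     * with a score strictly above `threshold` have been collected.
     */
    static List<Integer> collect(float[] scoreByDoc, float threshold, int limit) {
        List<Integer> docs = new ArrayList<>();
        for (int doc = 0; doc < scoreByDoc.length && docs.size() < limit; doc++) {
            if (scoreByDoc[doc] > threshold) {
                docs.add(doc);
            }
        }
        return docs;
    }

    public static void main(String[] args) {
        float[] scores = {0.2f, 0.8f, 0.9f, 0.1f, 0.7f, 0.95f};
        // prints [1, 2]; docs 4 and 5 also qualify but are never examined
        System.out.println(collect(scores, 0.5f, 2));
    }
}
```

Because the walk is in doc id order, a scheme like this favours documents near the start of the index, which is the age bias Tim raised earlier in the thread.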

Re: NIOFSDirectory

2008-12-04 Thread Yonik Seeley
Details in the bug: https://issues.apache.org/jira/browse/LUCENE-1451 Use this constructor to create an instance of NIOFSDirectory: /** Create a new NIOFSDirectory for the named location. * * @param path the path of the directory * @param lockFactory the lock factory to use, or null for

Re: NIOFSDirectory

2008-12-04 Thread Wouter Heijke
I had the same problem, only got it to work when I set the system property the way you do... UGLY! So if there is a solution like you ask for that use 2.4 I would be interested to know as well. Wouter > That does not help. The File/path is not stored with the instance. It is > in > a map FSDirect

Re: NIOFSDirectory

2008-12-04 Thread John Wang
Thanks! -John On Thu, Dec 4, 2008 at 2:16 PM, Yonik Seeley <[EMAIL PROTECTED]> wrote: > Details in the bug: > https://issues.apache.org/jira/browse/LUCENE-1451 > > Use this constructor to create an instance of NIOFSDirectory: > > /** Create a new NIOFSDirectory for the named location. > * > *

Re: NIOFSDirectory

2008-12-04 Thread Glen Newton
Am I missing something here? Why not use: IndexWriter writer = new IndexWriter(NIOFSDirectory.getDirectory(new File(filename)), analyzer, true); Another question: is NIOFSDirectory to be used with IndexWriter? If not, could someone explain? thanks, -glen 2008/12

Re: NIOFSDirectory

2008-12-04 Thread John Wang
NIOFSDirectory.getDirectory simply calls the static method on the parent class, FSDirectory.getDirectory, which returns an instance of FSDirectory. IMO: NIOFSDirectory solves concurrent read problems, generally you don't want concurrent writes. -John On Thu, Dec 4, 2008 at 2:44 PM, Glen Newton <

Suggestions for drill downs

2008-12-04 Thread Muralidharan V
We are evaluating lucene for a product search engine. One requirement is that we be able to suggest the top n brands(the ones with most products in the result set) for a given search term to further refine the search query. The brand is stored in a separate field and searches are performed against

Re: Suggestions for drill downs

2008-12-04 Thread John Wang
Easiest way to do this is using the FieldCache. It constructs a StringIndex object which gives you very fast lookup to the field value (index) given a docid. Create a parallel count array to the lookup array for the StringIndex. Running your HitCollector through it should be fast. Loading FieldCache may be ex
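
The counting scheme described here can be sketched with plain arrays. The two arrays below mirror the shape of a StringIndex (an ordinal per document plus a term per ordinal), but the sketch does not call Lucene; the names BrandCountSketch, countByOrdinal and topBrands are hypothetical, and the brand values are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class BrandCountSketch {
    /** order[doc] is the ordinal of that document's brand; one counter is kept per ordinal. */
    static int[] countByOrdinal(int[] order, int numTerms, int[] hitDocs) {
        int[] counts = new int[numTerms];
        for (int doc : hitDocs) {
            counts[order[doc]]++;
        }
        return counts;
    }

    /** Returns up to n brand names with a non-zero count, most frequent first. */
    static List<String> topBrands(String[] lookup, int[] counts, int n) {
        List<Integer> ords = new ArrayList<>();
        for (int ord = 0; ord < counts.length; ord++) {
            if (counts[ord] > 0) ords.add(ord);
        }
        // Higher count first; equal counts fall back to the lower ordinal.
        ords.sort((a, b) -> counts[a] != counts[b]
                ? Integer.compare(counts[b], counts[a])
                : Integer.compare(a, b));
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(n, ords.size()); i++) {
            top.add(lookup[ords.get(i)]);
        }
        return top;
    }

    public static void main(String[] args) {
        String[] lookup = {"acme", "bolt", "core"}; // ordinal -> brand (made-up values)
        int[] order = {0, 1, 1, 2, 1, 0};           // doc id -> ordinal
        int[] hits = {1, 2, 3, 4, 5};               // doc ids matched by the query
        int[] counts = countByOrdinal(order, lookup.length, hits);
        System.out.println(topBrands(lookup, counts, 2)); // prints [bolt, acme]
    }
}
```

The per-hit work is one array read and one increment, so the cost is dominated by loading the arrays once per reader rather than by the counting itself.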

SnowballAnalyzer and AlphaNumeric

2008-12-04 Thread samd
Where can I get the Lucene source for the Snowball implementation? I need to be able to search for words that are alphanumeric and this does not work with the current SnowballAnalyzer. If there is an alternative to this then that would be greatly appreciated. Thanks.

Re: Suggestions for drill downs

2008-12-04 Thread Muralidharan V
John, Using the FieldCache worked well. Thanks! -Murali On Thu, Dec 4, 2008 at 3:10 PM, John Wang <[EMAIL PROTECTED]> wrote: > Easiest way to do this is using the FieldCache. It constructs a StringIndex > object which gives you very fast lookup to the field value (index) given a > docid. C

Re: Design guidance - search strategy

2008-12-04 Thread Ian Vink
I bought your book :) Thanks, I will look into it. On Thu, Dec 4, 2008 at 6:12 PM, Erick Erickson <[EMAIL PROTECTED]>wrote: > See the class in the docs or Lucene In Action for more > detail, but here's the short form. > > A Filter is a bitset where each bit's ordinal position stands > for a d

Re: Suggestions for drill downs

2008-12-04 Thread John Wang
Glad to be of help. Understand that FieldCache lives in a map in the static memory and is keyed by an IndexReader. So if your reader updates often there might be an issue of cleaning the map. This is a question for the Luceners, when you call IndexReader.reopen, how is FieldCache updated? -John

Re: Design guidance - search strategy

2008-12-04 Thread Ian Vink
It works. For those using Lucene.NET here is an example of a Filter that takes a list of IDs for books: public class BookFilter: Filter { private readonly List bookIDs; public BookFilter(List bookIDsToSearch) { bookIDs = bookIDsToSearch; }

Re: Suggestions for drill downs

2008-12-04 Thread Jason Rutherglen
The field cache is completely reloaded. LUCENE-831 solves this by merging the field caches of the segments. For realtime search systems, merging the field caches is not desirable though. On Thu, Dec 4, 2008 at 6:45 PM, John Wang <[EMAIL PROTECTED]> wrote: > Glad to be of help. > Understand that

TopDocs

2008-12-04 Thread Ian Vink
I have this search which returns TopDocs TopDocs topDocs = searcher.Search(query, bookFilter, maxDocsToFind); How do I get the document object for the ScoreDoc? foreach (ScoreDoc scoreDoc in topDocs.scoreDocs) { ??Document myDoc = GetTheDocument(scoreDoc.doc); ?? }

Re: TopDocs

2008-12-04 Thread John Wang
searcher.doc(scoreDoc.doc); On Thu, Dec 4, 2008 at 6:59 PM, Ian Vink <[EMAIL PROTECTED]> wrote: > I have this search which returns TopDocs > TopDocs topDocs = searcher.Search(query, bookFilter, maxDocsToFind); > > > How do I get the document object for the ScoreDoc? > > foreach (ScoreDoc scoreDo

Re: Slow queries with lots of hits

2008-12-04 Thread Otis Gospodnetic
Tim (and we should move this to java-dev if it gains traction), Perhaps you can come up with a mechanism to perform scoring in two passes instead of one: - first pass is cheap and fast - second pass is more expensive and slower Currently, there is no choice - Lucene does 2). But perhaps you can

Cannot find gdata-server

2008-12-04 Thread Anees Haider
I have set up Lucene, test-run it and gone through the samples. Now I have been working on setting up the GData server by consulting the Getting Started guide "http://wiki.apache.org/lucene-java/GdataServer/GettingStarted". I have set up JDK, Ant and Tomcat as required. I have checked out the working copy of GD

Re: Slow queries with lots of hits

2008-12-04 Thread Karl Wettin
Hi Tim, is it possible that the slow queries contain terms that are very common in your index? If so you could replace those clauses with a filter. This would impact the score, as filters do not contribute to it, but if your query contains enough other clauses that should not be a problem.

Re: Cannot find gdata-server

2008-12-04 Thread Karl Wettin
Hello Anees, the Gdata server was phased out by 2.3. You can still get it from the 2.2 tag in the SVN: http://svn.apache.org/repos/asf/lucene/java/tags/lucene_2_2_0/ karl 5 dec 2008 kl. 07.13 skrev Anees Haider: I have setup lucene, test run it and go through samples. Now I have been w

Sorting documents without a query

2008-12-04 Thread Shivaraj Tenginakai
I have a usecase in which I have no search query, but still need to sort documents. For example, items need to be sorted by price, though the user has not yet selected any search criteria. What would be the best way to achieve this? Thanks and Regards, Shivaraj
