Magnus, if you get a chance, can you try setting different -Xms and -Xmx
values? For instance, try -Xms384M and -Xmx1024M.
The "forced" GC [request] will almost always reduce the memory footprint
simply because of the weak references that Lucene leverages, but I bet
subsequent queries are not as
Magnus, Please feel free to ignore my last email; I see that you had
this setup earlier. As far as using up all the memory it can get its hands
on, this is actually a good thing. This allows Lucene and other java
applications to keep more things in cache when more memory is available.
Also, if
hmmm,
Well, in production (1024M heap), it seems that after a while (some
hundred user queries) the memory starts reaching the max threshold, and
when it does, at some point the system becomes unresponsive.
I'd rather it was slightly less performant (cleaning up memory more
frequently) than freezing up
Hi,
In my case I used PDFBox just to extract the text from the PDF document,
and then I created the Lucene document from the extracted text. (I didn't
use PDFBox's built-in Lucene search integration.) So I didn't get any
incompatibility problems.
This blog post shows the way.
http://kalanir.blogspot.c
I'm not sure I really understand what the problem is here.
First of all, the VM will appear to consume most or all of the memory
you give it. You really shouldn't worry about this, and it is misleading
to look at what happens when you force a gc.
I think there are really only 2 things that matter
It's important to understand that the JRE using all the memory, up to the
max you specified, and then doing GC, is entirely "normal" (if not
desirable).
This is just how Java works: when code runs it generates garbage,
sometimes quite a bit (eg if you make a new IndexSearcher per query),
and
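For readers following along, here is a minimal sketch of the "don't create a new IndexSearcher per query" advice above. This is against the Lucene 2.x API of the time; the class name and the index path are assumptions of the sketch, not anything from the thread:

```java
import java.io.IOException;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;

public class SearcherHolder {
    // One shared searcher: a fresh IndexSearcher per query generates
    // needless garbage and throws away warmed-up caches.
    private static IndexSearcher searcher;

    public static synchronized IndexSearcher get() throws IOException {
        if (searcher == null) {
            searcher = new IndexSearcher("/path/to/index"); // assumed path
        }
        return searcher;
    }

    public static TopDocs search(Query q, int n) throws IOException {
        return get().search(q, null, n); // null filter = whole index
    }
}
```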
Mark Miller wrote:
Sounds familiar. This may actually be in JIRA already.
Maybe this is:
https://issues.apache.org/jira/browse/LUCENE-689
?
I just marked it as fix version 2.9.
Mike
I believe these lists exist out on the Internet; just google for
something like "most common first names" or "common
nicknames" (which yields http://www.cc.kyoto-su.ac.jp/~trobb/nicklist.html
for instance)
If you want to dig deeper, you might look into named entity
recognition research, and a
Thanks, very kind ...
I've tried that code, but it doesn't work ...
Could you send me a simple working class that uses it, please?
Thanks

> Date: Thu, 4 Dec 2008 15:19:26 +0530
> From: [EMAIL PROTECTED]
> To: java-user@lucene.apache.org
> Subject: Re: Pdf in Lucene?
>
> Hi,
>
> In my case I used PDFBox
I have documents with this simple schema in Lucene, which I cannot change.
docid: (int)
contents: (text)
The user is given a list of 10,000 documents in a tree which they select to
search, usually they select 5000 or so.
I only want to search those 5000 documents. I have the 'id' fields. That is
It's generally a bad idea to iterate a Hits object. In fact, Hits
is deprecated in recent versions of Lucene. The underlying
problem is that the query is re-executed every 100 responses
or so.
First suggestion: create a Filter by iterating over your
docid field and use that in your searches; see
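A sketch of that suggestion, using Lucene 2.4's classic Filter.bits() API (deprecated in favor of DocIdSet in later versions). The class name and the Set of external ids are assumptions of the sketch:

```java
import java.io.IOException;
import java.util.BitSet;
import java.util.Set;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.search.Filter;

// Restricts a search to an explicit set of ids stored in the
// "docid" field (the field name comes from the question above).
public class DocIdFilter extends Filter {
    private final Set<String> ids; // the ids the user selected

    public DocIdFilter(Set<String> ids) {
        this.ids = ids;
    }

    public BitSet bits(IndexReader reader) throws IOException {
        BitSet result = new BitSet(reader.maxDoc());
        TermDocs td = reader.termDocs();
        for (String id : ids) {
            // Seek to the term for this external id and set the bit
            // for every Lucene document carrying it.
            td.seek(new Term("docid", id));
            while (td.next()) {
                result.set(td.doc());
            }
        }
        td.close();
        return result;
    }
}
```

Pass the filter to one of the Searcher.search variants that accepts one.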
Hi Tiziano,
What is the error you got? I think you can get the text easily using the
code shown below.
FileInputStream fi = new FileInputStream(new File("sample.pdf"));
PDFParser parser = new PDFParser(fi);
parser.parse();
COSDocument cd = parser.getDocument();
PDFTextStripper stripper = new PDFTextStripper();
String text = stripper.getText(new PDDocument(cd));
cd.close();
I put your code inside a main method and imported the required
libraries, but I must be making a mistake somewhere.
First error:
parser.parse();
Syntax error on token "parse", Identifier expected after this token
Second error:
cd.close();
Syntax error on token "close", Identifier expected after this token
Third error:
doc
There is CLucene. It's not a part of Apache, but lives on SourceForge, I
think.
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
> From: Ariel <[EMAIL PROTECTED]>
> To: lucene user
> Sent: Tuesday, December 2, 2008 2:13:08 PM
> Subject: I wou
> Mark Miller wrote:
>
> > Sounds familiar. This may actually be in JIRA already.
>
> Maybe this is:
>
> https://issues.apache.org/jira/browse/LUCENE-689
>
> ?
>
> I just marked it as fix version 2.9.
>
> Mike
Not exactly.
NPE was from SegmentReader in my case while
NPE is from SpanOrQu
Hi all,
I have an interesting problem with my query traffic. Most of the queries run
in a fairly short amount of time (< 100ms) but a few take over 1000ms. These
queries are predominantly those with a huge number of hits (>1 million hits
in a >100 million document index). The time taken (as far as
The problem here is how *could* a system return even the top
10,000 results without scoring them all? What if the millionth
hit resulted in the very best match in the entire corpus?
That said, sorting may well be the issue here rather than scoring.
You can use a TopDocCollector to get the top N ma
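A sketch of the TopDocCollector approach (Lucene 2.4 API; the class and method names here are ours, not from the thread):

```java
import java.io.IOException;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocCollector;
import org.apache.lucene.search.TopDocs;

public class TopNSearch {
    // Collect only the 100 best-scoring hits; the millions of other
    // matches are still scored, but never kept around or sorted.
    public static TopDocs topHits(IndexSearcher searcher, Query query)
            throws IOException {
        TopDocCollector collector = new TopDocCollector(100);
        searcher.search(query, collector);
        return collector.topDocs();
    }
}
```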
That makes sense. I should be more precise in that all I need is 100 of the
>1 million "reasonable" results.
The concern I would have with a TopDocCollector is that this is biased
towards the top of the index which translates for me into a bias for older
documents. I'd prefer no age bias or a newer doc
Huh? TopDocCollector isn't biased unless you suppose that you'll
have many documents scoring *exactly* the same. You collect the
top N scoring documents.
Actually, I think this is all pretty much done for you with the
Searcher.search(Query query, Filter filter, int n) method. You
can pass null for
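For concreteness, a sketch of that three-argument search (Lucene 2.4 API; the "title" field and method name are assumptions):

```java
import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

public class SimpleTopN {
    public static void printTop(IndexSearcher searcher, Query query,
                                Filter filter) throws IOException {
        // filter may be null to search the whole index; the top-N
        // collection is done for you inside search().
        TopDocs top = searcher.search(query, filter, 100);
        for (ScoreDoc sd : top.scoreDocs) {
            Document d = searcher.doc(sd.doc); // load stored fields
            System.out.println(d.get("title") + " score=" + sd.score);
        }
    }
}
```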
Hi guys:
We did some profiling and benchmarking:
The thread contention on FSDirectory is gone, and for the set of queries
we are running, performance improved by a factor of 5 (to be conservative).
Great job - this is awesome; a simple change made a huge difference.
To get NIO
So, let me get this straight. :)
A Query tells Lucene what to search for. Then what does a Filter tell
Lucene? I think I'm missing some understanding of what a Filter is for.
Ian
On Thu, Dec 4, 2008 at 9:36 AM, Erick Erickson <[EMAIL PROTECTED]>wrote:
> It's generally a bad idea to iterate a Hits
On Thu, Dec 4, 2008 at 4:11 PM, John Wang <[EMAIL PROTECTED]> wrote:
> Hi guys:
>We did some profiling and benchmarking:
>
>The thread contention on FSDIrectory is gone, and for the set of queries
> we are running, performance improved by a factor of 5 (to be conservative).
>
>Great job
Sorry, what version are we talking about? :-)
thanks,
Glen
2008/12/4 Yonik Seeley <[EMAIL PROTECTED]>:
> On Thu, Dec 4, 2008 at 4:11 PM, John Wang <[EMAIL PROTECTED]> wrote:
>> Hi guys:
>>We did some profiling and benchmarking:
>>
>>The thread contention on FSDIrectory is gone, and fo
version 2.4, sorry for not clarifying.
Yonik, pardon my ignorance. I still don't get it. When instantiating
NIOFSDirectory, how would I specify the path? I see only the empty
constructor.
With FSDirectory, you use the factory: getDirectory(File)
-John
On Thu, Dec 4, 2008 at 1:26 PM, Yonik Seeley
On Thu, Dec 4, 2008 at 4:32 PM, Glen Newton <[EMAIL PROTECTED]> wrote:
> Sorry, what version are we talking about? :-)
The current development version of Lucene allows you to directly
instantiate FSDirectory subclasses.
-Yonik
> thanks,
>
> Glen
>
> 2008/12/4 Yonik Seeley <[EMAIL PROTECTED]>
That does not help. The File/path is not stored with the instance. It is in
a map FSDirectory keeps statically. Should subclasses of FSDirectory be
modifying the map?
This is not a question about how to subclass or customize FSDirectory. This
is more on how to use NIOFSDirectory class. I am hoping
See the class in the docs or Lucene In Action for more
detail, but here's the short form.
A Filter is a bitset where each bit's ordinal position stands
for a document. I.e. bit 1 means doc id 1, bit 519
represents document 519 etc.
When you pass a filter to one of the search routines that acc
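To illustrate the bit-to-document correspondence with plain java.util.BitSet (just the data structure, not Lucene's Filter class itself):

```java
import java.util.BitSet;

public class BitsetDemo {
    public static void main(String[] args) {
        // Each bit's ordinal position stands for a Lucene doc id:
        // setting bit 519 lets document 519 through the filter.
        BitSet allowed = new BitSet();
        allowed.set(1);
        allowed.set(519);
        System.out.println(allowed.get(519)); // prints "true"
        System.out.println(allowed.get(2));   // prints "false"
    }
}
```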
Tim:
How about implementing your own HitCollector and stopping when you have
collected 100 docs with scores above a certain threshold?
BTW, are there lotsa concurrent searches?
-John
On Thu, Dec 4, 2008 at 12:52 PM, Tim Sturge <[EMAIL PROTECTED]> wrote:
> That makes sense. I should be more p
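A sketch of John's suggestion against Lucene 2.4's HitCollector API. Note that HitCollector has no built-in early exit, so the stop-via-runtime-exception trick, and all the names here, are assumptions of this sketch:

```java
import org.apache.lucene.search.HitCollector;

public class ThresholdCollector extends HitCollector {
    // Thrown to abort the search loop once enough docs are collected.
    static class StopCollecting extends RuntimeException {}

    private final float threshold;
    private final int[] docs;
    private int count;

    public ThresholdCollector(float threshold, int howMany) {
        this.threshold = threshold;
        this.docs = new int[howMany];
    }

    public void collect(int doc, float score) {
        if (score >= threshold) {
            docs[count++] = doc;
            if (count == docs.length) {
                throw new StopCollecting(); // caller catches this
            }
        }
    }
}
```

The caller wraps searcher.search(query, collector) in a try/catch for StopCollecting and then reads the collected docs.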
Details in the bug:
https://issues.apache.org/jira/browse/LUCENE-1451
Use this constructor to create an instance of NIODirectory:
/** Create a new NIOFSDirectory for the named location.
*
* @param path the path of the directory
* @param lockFactory the lock factory to use, or null for
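Putting that constructor to use might look like this. Note this is against the development (post-2.4) version discussed above, not 2.4 itself; the path and method name are assumptions:

```java
import java.io.File;
import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.store.NIOFSDirectory;

public class NioOpen {
    public static IndexReader open(String path) throws IOException {
        // Constructor from LUCENE-1451: the directory path plus a
        // lock factory (null selects the default lock factory).
        NIOFSDirectory dir = new NIOFSDirectory(new File(path), null);
        return IndexReader.open(dir);
    }
}
```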
I had the same problem, only got it to work when I set the system property
the way you do... UGLY!
So if there is a solution like you ask for that use 2.4 I would be
interested to know as well.
Wouter
> That does not help. The File/path is not stored with the instance. It is
> in
> a map FSDirect
Thanks!
-John
On Thu, Dec 4, 2008 at 2:16 PM, Yonik Seeley <[EMAIL PROTECTED]> wrote:
> Details in the bug:
> https://issues.apache.org/jira/browse/LUCENE-1451
>
> Use this constructor to create an instance of NIODirectory:
>
> /** Create a new NIOFSDirectory for the named location.
> *
> *
Am I missing something here?
Why not use:
IndexWriter writer = new IndexWriter(NIOFSDirectory.getDirectory(new
File(filename)), analyzer, true);
Another question: is NIOFSDirectory to be used with IndexWriter? If
no, could someone explain?
thanks,
-glen
2008/12
NIOFSDirectory.getDirectory simply calls the static method on the parent
class, FSDirectory.getDirectory, which returns an instance of FSDirectory.
IMO: NIOFSDirectory solves concurrent read problems, generally you don't
want concurrent writes.
-John
On Thu, Dec 4, 2008 at 2:44 PM, Glen Newton <
We are evaluating lucene for a product search engine. One requirement is
that we be able to suggest the top n brands(the ones with most products in
the result set) for a given search term to further refine the search query.
The brand is stored in a separate field and searches are performed against
The easiest way to do this is with the FieldCache. It constructs a
StringIndex object which gives you a very fast lookup of the field value
(by index) given a docid. Create a count array parallel to the
StringIndex's lookup array and increment it from your HitCollector; that
should be fast.
Loading the FieldCache may be ex
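A sketch of that approach against the Lucene 2.4 FieldCache API. The "brand" field name comes from the question; the class name and reporting step are assumptions:

```java
import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.FieldCache;
import org.apache.lucene.search.HitCollector;

public class BrandCounter extends HitCollector {
    private final FieldCache.StringIndex brands;
    private final int[] counts;

    public BrandCounter(IndexReader reader) throws IOException {
        // order[doc] is the index of doc's brand within lookup[]
        brands = FieldCache.DEFAULT.getStringIndex(reader, "brand");
        counts = new int[brands.lookup.length];
    }

    public void collect(int doc, float score) {
        counts[brands.order[doc]]++; // one increment per hit
    }

    // After the search, scan counts[] for the n largest entries and
    // report brands.lookup[i] for each as the top brands.
}
```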
Where can I get the Lucene source for the Snowball implementation?
I need to be able to search for words that are alphanumeric, and this does
not work with the current SnowballAnalyzer.
If there is an alternative to this then that would be greatly appreciated.
Thanks.
John,
Using the FieldCache worked well. Thanks!
-Murali
On Thu, Dec 4, 2008 at 3:10 PM, John Wang <[EMAIL PROTECTED]> wrote:
> Easiest way to do this is using the FieldCache. It constructs a StringIndex
> object which gives you very fast lookup to the field value (index) given a
> docid. C
I bought your book :)
Thanks, I will look into it.
On Thu, Dec 4, 2008 at 6:12 PM, Erick Erickson <[EMAIL PROTECTED]>wrote:
> See the class in the docs or Lucene In Action for more
> detail, but here's the short form.
>
> A Filter is a bitset where each bit's ordinal position stands
> for a d
Glad to be of help.
Understand that the FieldCache lives in a static map keyed by
IndexReader, so if your reader is updated often there may be an issue
with cleaning up that map.
This is a question for the Luceners: when you call IndexReader.reopen, how
is the FieldCache updated?
-John
It works.
For those using Lucene.NET here is an example of a Filter that takes a list
of IDs for books:
public class BookFilter : Filter
{
private readonly List<int> bookIDs;
public BookFilter(List<int> bookIDsToSearch)
{
bookIDs = bookIDsToSearch;
}
The field cache is completely reloaded. LUCENE-831 solves this by merging
the field caches of the segments. For realtime search systems, merging the
field caches is not desirable though.
On Thu, Dec 4, 2008 at 6:45 PM, John Wang <[EMAIL PROTECTED]> wrote:
> Glad to be of help.
> Understand that
I have this search which returns TopDocs
TopDocs topDocs = searcher.Search(query, bookFilter, maxDocsToFind);
How do I get the document object for the ScoreDoc?
foreach (ScoreDoc scoreDoc in topDocs.scoreDocs)
{
??Document myDoc = GetTheDocument(scoreDoc.doc); ??
}
searcher.doc(scoreDoc.doc);
On Thu, Dec 4, 2008 at 6:59 PM, Ian Vink <[EMAIL PROTECTED]> wrote:
> I have this search which returns TopDocs
> TopDocs topDocs = searcher.Search(query, bookFilter, maxDocsToFind);
>
>
> How do I get the document object for the ScoreDoc?
>
> foreach (ScoreDoc scoreDo
Tim (and we should move this to java-dev if it gains traction),
Perhaps you can come up with a mechanism to perform scoring in two passes
instead of one:
1) first pass is cheap and fast
2) second pass is more expensive and slower
Currently, there is no choice - Lucene does 2). But perhaps you can
I have set up Lucene, test-run it, and gone through the samples.
Now I have been working on setting up the GData server by consulting the
Getting Started guide:
http://wiki.apache.org/lucene-java/GdataServer/GettingStarted
I have set up JDK, Ant and Tomcat as required. I have checked out the
working copy of GD
Hi Tim,
is it possible that the slow queries contain terms that are very
common in your index? If so, you could replace those clauses with a
filter. This would affect the score, since filters contribute nothing to
it, but if your query contains enough other clauses, that should not be a
problem.
Hello Anees,
the GData server was phased out by 2.3. You can still get it from the
2.2 tag in the SVN:
http://svn.apache.org/repos/asf/lucene/java/tags/lucene_2_2_0/
karl
On 5 Dec 2008, at 07:13, Anees Haider wrote:
I have set up Lucene, test-run it, and gone through the samples.
Now I have been w
I have a use case in which I have no search query, but still need to sort
documents. For example, items need to be sorted by price, though the user
has not yet selected any search criteria.
What would be the best way to achieve this?
Thanks and Regards,
Shivaraj