Re: Cache full text into memory

2010-07-14 Thread findbestopensource
You have two options:
1. Store the compressed text as part of a stored field in Solr.
2. Use external caching.
http://www.findbestopensource.com/tagged/distributed-caching
You could use Ehcache / Memcached / Membase.

The problem with external caching is that you need to synchronize deletions
and modifications yourself. Fetching the stored field from Solr is also faster.
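
A rough sketch of option 1 with the plain Lucene API (CompressionTools is in
Lucene 2.9+; the field names are just examples, and fullText / searcher /
scoreDoc are assumed to be in scope; Solr can store binary fields the same way):

import org.apache.lucene.document.CompressionTools;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

// Index time: index the text for search/highlighting, store it only compressed.
Document doc = new Document();
doc.add(new Field("text", fullText, Field.Store.NO, Field.Index.ANALYZED));
doc.add(new Field("text_zip", CompressionTools.compressString(fullText), Field.Store.YES));

// Query time, only for the top hits you actually highlight:
byte[] zipped = searcher.doc(scoreDoc.doc).getBinaryValue("text_zip");
String text = CompressionTools.decompressString(zipped); // throws DataFormatException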

Regards
Aditya
www.findbestopensource.com


On Wed, Jul 14, 2010 at 12:08 PM, Li Li  wrote:

> I want to cache full text into memory to improve performance.
> Full text is only used for highlighting in my application (but it's very
> time consuming: my avg query time is about 250ms, and I guess it would cost
> about 50ms if I just fetched the top 10 full texts. Things get worse when
> fetching more full text because on disk it is scattered everywhere for a
> query.). My full text per machine is about 200GB. The memory available for
> storing full text is about 10GB, so I want to compress it in memory.
> Suppose the compression ratio is 1:5; then I can load 1/4 of the full text
> into memory. I need a Cache component for it. Has anyone faced this problem
> before? I need some advice. Is it possible to use external tools such
> as MemCached? Thank you.
>


Re: ShingleFilter failing with more terms than index phrase

2010-07-14 Thread Ethan Collins
Hi Steve,

Thanks for your kind response. I checked PositionFilterFactory
(re-indexed as well) but that also didn't solve the problem. Interestingly,
the problem is not reproducible from Solr's Field Analysis page; it
manifests only when it's in a query.

I guess the subject of this post is not quite right: it's not that
ShingleFilter is failing, but that when using ShingleFilter, no score is
provided by the shingle field when I pass more terms than the indexed
terms. I observe this using debugQuery.

I had actually posted to solr-user but have received no response yet,
probably because the problem is not clear at first glance. However,
there's an example in the mail below for anyone interested to try it
out and check whether there's a problem. Let's see if I receive any
response.

-Ethan

On Tue, Jul 13, 2010 at 9:15 PM, Steven A Rowe  wrote:
> Hi Ethan,
>
> You'll probably get better answers about Solr specific stuff on the 
> solr-u...@a.l.o list.
>
> Check out PositionFilterFactory - it may address your issue:
>
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.PositionFilterFactory
>
> Steve
>
>> -Original Message-
>> From: Ethan Collins [mailto:collins.eth...@gmail.com]
>> Sent: Tuesday, July 13, 2010 3:42 AM
>> To: java-user@lucene.apache.org
>> Subject: ShingleFilter failing with more terms than index phrase
>>
>> I am using lucene 2.9.3 (via Solr 1.4.1) on windows and am trying to
>> understand ShingleFilter. I wrote the following code and find that if I
>> provide more words than the actual phrase indexed in the field, then the
>> search on that field fails (no score found with debugQuery=true).
>>
>> Here is an example to reproduce, with field names:
>> Id: 1
>> title_1: Nina Simone
>> title_2: I put a spell on you
>>
>> Query (dismax) with:
>> - “Nina Simone I put”  <- Fails i.e. no score shown from title_1 search
>> (using debugQuery)
>> - “Nina Simone” <- SUCCESS
>>
>> But, when I used Solr’s Field Analysis with the ‘shingle’ field (given
>> below) and tried “Nina Simone I put”, it succeeds. It’s only during the
>> query that no score is provided. I also checked ‘parsedquery’ and it shows
>> disjunctionMaxQuery issuing the string “Nina_Simone Simone_I I_put” to the
>> title_1 field.
>>
>> title_1 and title_2 fields are of type ‘shingle’, defined as:
>>
>>    <fieldType name="shingle" ... positionIncrementGap="100" indexed="true" stored="true">
>>        <analyzer type="index">
>>            <tokenizer .../>
>>            <filter .../>
>>            <filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="false"/>
>>        </analyzer>
>>        <analyzer type="query">
>>            <tokenizer .../>
>>            <filter .../>
>>            <filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="false"/>
>>        </analyzer>
>>    </fieldType>
>>
>> Note that I also have a catchall field which is text. I have qf set
>> to: 'id^2 catchall' and pf set to: 'title_1^1.5 title_2^1.2'
>>
>> If I am missing something or doing something wrong please let me know.
>>
>> -Ethan
>>
>> -
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Best practices for searcher memory usage?

2010-07-14 Thread Toke Eskildsen
On Tue, 2010-07-13 at 23:49 +0200, Christopher Condit wrote:
> * 20 million documents [...]
> * 140GB total index size
> * Optimized into a single segment

I take it that you do not have frequent updates? Have you tried to see
if you can get by with more segments without significant slowdown?

> The application will run with 10G of -Xmx but any less and it bails out. 
> It seems happier if we feed it 12GB. The searches are starting to bog 
> down a bit (5-10 seconds for some queries)...

10G sounds like a lot for that index. Two common memory-eaters are
sorting by field value and faceting. Could you describe what you're
doing in that regard?

Similarly, the 5-10 seconds for some queries seems very slow. Could you
give some examples of the queries that cause problems, together with
some examples of fast queries and how long they take to execute?


The standard silver bullet for an easy performance boost is to buy a couple
of consumer-grade SSDs and put them in the local machine. If you're
gearing up to use more machines, you might want to try this first.

Regards,
Toke


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: ShingleFilter failing with more terms than index phrase

2010-07-14 Thread Ethan Collins
Hi Steve,

Thanks, wrapping with PositionFilter actually made the search and
scoring work -- I had made a mistake while re-indexing last time.

Trying to analyze PositionFilter: I don't understand why the earlier
search for 'Nina Simone I put' failed, since at least the phrase 'Nina
Simone' should have matched against the title_1 field. Any clue?

I am also trying to understand the impact of PositionFilter on phrase
search quality and scoring. Unfortunately there is not much
literature/help to be found via Google.
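
In case it helps anyone searching the archives later, a minimal 'shingle'
field type along these lines would look roughly as follows. The tokenizer
and lowercase filter classes are my assumptions about a typical Solr 1.4
setup, not copied verbatim from my schema; the relevant part is the
PositionFilterFactory at the end of the query analyzer:

<fieldType name="shingle" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="false"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="false"/>
        <!-- flattens positions so the generated bigrams are treated as
             alternatives at one position instead of one long phrase -->
        <filter class="solr.PositionFilterFactory"/>
    </analyzer>
</fieldType>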

-Ethan

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Best practices for searcher memory usage?

2010-07-14 Thread Michael McCandless
You can also set the termsIndexDivisor when opening the IndexReader.
The terms index is an in-memory data structure and it can consume A LOT
of RAM when your index has many unique terms.
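
For example, a minimal sketch (assuming 'dir' is your index Directory; the
divisor of 4 is arbitrary):

// Load only every 4th indexed term into RAM: roughly a 4x smaller terms
// index, at the cost of slightly slower term lookups.
IndexReader reader = IndexReader.open(dir, null, true, 4); // deletionPolicy=null, readOnly=true, divisor=4
IndexSearcher searcher = new IndexSearcher(reader);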

Flex (only on Lucene's trunk / next major release (4.0)) has reduced
this RAM usage (as well as the RAM required when sorting by string
field with mostly ascii content) substantially -- see
http://chbits.blogspot.com/2010/07/lucenes-ram-usage-for-searching.html

Mike

On Tue, Jul 13, 2010 at 6:09 PM, Paul Libbrecht  wrote:
>
>
> Le 13-juil.-10 à 23:49, Christopher Condit a écrit :
>
>> * are there performance optimizations that I haven't thought of?
>
> The first and most important one I'd think of is get rid of NFS.
> You can happily do a local copy which might, even for 10 Gb take less than
> 30 seconds at server start.
>
> paul
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: ShingleFilter failing with more terms than index phrase

2010-07-14 Thread Ethan Collins
> Trying to analyze PositionFilter: didn't understand why earlier the
> search of 'Nina Simone I Put' failed since atleast the phrase 'Nina
> Simone' should have matched against title_0 field. Any clue?

Please note that I have configured the ShingleFilter to emit bigrams without
unigrams.

[Honestly, I am still struggling to understand why this worked and the
earlier setup didn't]

-Ethan

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: How to create a fuzzy suggest

2010-07-14 Thread Alexander Rothenberg
Hi,
I had a similar need to create something that acts not like a "filter"
or "tokenizer" but only inserts self-generated tokens into the token stream
(my purpose was to generate all kinds of word forms for German umlauts...).

The following code base helped me a lot when creating it:
http://207.44.206.178/message.jspa?messageID=91989#91991

The synonym filter also adds tokens into the token stream.
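
As a rough illustration of the pattern (not the exact code from the link
above), a filter that stacks a Soundex form on top of each token could look
like this -- it assumes Apache commons-codec for the Soundex encoding:

import java.io.IOException;
import org.apache.commons.codec.language.Soundex;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.util.AttributeSource;

// Emits each incoming token unchanged and then injects its Soundex form at
// the same position (positionIncrement = 0) -- the same trick the synonym
// filter uses to stack tokens.
public final class SoundexInjectFilter extends TokenFilter {
  private final TermAttribute termAtt = (TermAttribute) addAttribute(TermAttribute.class);
  private final PositionIncrementAttribute posAtt =
      (PositionIncrementAttribute) addAttribute(PositionIncrementAttribute.class);
  private final Soundex soundex = new Soundex();
  private AttributeSource.State pending; // token that still needs its soundex twin
  private String pendingCode;

  public SoundexInjectFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (pending != null) {
      restoreState(pending);            // reuse offsets/type of the original token
      pending = null;
      termAtt.setTermBuffer(pendingCode);
      posAtt.setPositionIncrement(0);   // stack on the same position
      return true;
    }
    if (!input.incrementToken()) {
      return false;
    }
    try {
      String code = soundex.encode(termAtt.term());
      if (code != null && code.length() > 0 && !code.equals(termAtt.term())) {
        pending = captureState();
        pendingCode = code;
      }
    } catch (IllegalArgumentException e) {
      // term contains characters soundex cannot map -- just pass it through
    }
    return true;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    pending = null;
    pendingCode = null;
  }
}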

regards, Alex



On Wednesday 14 July 2010 01:11:02 Kai Weingärtner wrote:
> Hello,
>
>
> I am trying to create a suggest search (search results are displayed while
> the user is entering the query) for names, but the search should also give
> results if the given name just sounds like an indexed name. However a
> perfect match should be ranked higher than a similar sounding match.
>
>
> I looked at the SpellChecker contrib, but this AFAIK cannot handle
> incomplete names (edge n-grams).
>
>
> So I came up with this idea and it would be great if anyone could tell me
> if that is sensible or if there is a better way:
>
>
> I create an analyzer to be run on the full names, which does the following
> - lowercase
> - build edge n-grams
> put these terms in the field (this would handle correctly spelled input)
>
>
> - run soundex on the n-grams
> put these soundexed n-grams in the field as well
>
>
> The incoming query will then also run through this analyzer with an
> or-default. So a correct spelling will match the normal n-grams plus the
> soundexed n-grams, leading to a good score. A misspelled name would still
> match the soundexed n-grams, leading to a somewhat lower score.
>
>
> My current problem is that I don't know how to duplicate the tokens in the
> analyzer so I can add them as normal n-grams and soundexed n-grams. I
> suppose the TeeSinkTokenFilter will get me there, but I could not figure
> out how to add all tokens back in one stream.
>
>
> To recap, my questions are: Could this approach work to create a "fuzzy
> suggest"? How do I use the TeeSinkTokenFilter to separate and recombine the
> tokenstream.
>
>
> I hope that was clear, thanks for your help!
>
>
>
> Kai
>
>
>
>
> Regelung im Bezug auf Paragraph 37a Absatz 4 HGB: WidasConcepts GmbH,
> Geschaeftsfuehrer: Thomas Widmann und Christian Kappert,
> Gerichtsstand Pforzheim, Registernummer: HRB 511442,
> Umsatzsteueridentifikationsnummer: DE205851091
>
> Diese E-Mail enthaelt vertrauliche und/oder rechtlich geschuetzte
> Informationen. Wenn Sie nicht der richtige Adressat sind oder diese E-Mail
> irrtuemlich erhalten haben, informieren Sie bitte sofort den Absender und
> vernichten Sie diese Mail. Das unerlaubte Kopieren sowie die unbefugte
> Weitergabe dieser Mail sind nicht gestattet.
>
> This e-mail may contain confidential and/or privileged information.
> If you are not the intended recipient (or have received this e-mail in
> error) please notify the sender immediately and destroy this e-mail.
> Any unauthorized copying, disclosure or distribution of the material in
> this e-mail is strictly forbidden.



-- 
Alexander Rothenberg
Fotofinder GmbH USt-IdNr. DE812854514
Software Entwicklung    Web: http://www.fotofinder.net/
Potsdamer Str. 96       Tel: +49 30 25792890
10785 Berlin            Fax: +49 30 257928999

Geschäftsführer: Ali Paczensky
Amtsgericht:     Berlin Charlottenburg (HRB 73099)
Sitz:            Berlin

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Best open source

2010-07-14 Thread findbestopensource
Hello all,

We have launched a new site, which provides the best open source products
and libraries across all categories. The site is powered by Solr search.
There are many open source products available in every category and it is
sometimes difficult to identify which is the best. The main problem in open
source is that there are many redundant products in a category and it is
impossible for a user to try them all. We identify the best.

As open source users, you are probably using many open source products and
libraries. It would be great if you could help us by adding information
about the open source products you use:
http://www.findbestopensource.com/addnew

http://www.findbestopensource.com/

Regards
Aditya


Out of memory problem in search

2010-07-14 Thread ilkay polat
Hello Friends;

Recently, I have a problem with Lucene search - a memory problem caused by the
index being so big. (I have indexed some kinds of information and the
index's size is more than 40 gigabytes.)

I search the Lucene index with
org.apache.lucene.search.Searcher.search(query, null, offset + limit, new
Sort(new SortField("time", SortField.LONG, true)));
(This returns the top (offset + limit) records.)

I search by range. For example, in the web page I first search the records
in the [0, 100] range, then the second page [100, 200], and so on.
I have nearly 200,000 records in total. When I go to the last page, which means
the records between 199,900 and 200,000, there is a memory problem (I have 4 GB
of RAM on the running machine) in the JVM (out of memory error).

Is there a way to overcome this memory problem? 

Thanks

--
ilkay POLAT   Software Engineer
TURKEY
 
  Gsm : (+90) 532 542 36 71
  E-mail : ilkay_po...@yahoo.com


  

Re: Out of memory problem in search

2010-07-14 Thread findbestopensource
Certainly it will. Either you need to increase your memory OR refine your
query. Even though you display paginated results, the first couple of pages
will display fine while the pages towards the end may cause problems. This is
because 200,000 objects are created and iterated, 199,900 objects are skipped
and the last 100 objects are returned. The memory is consumed in creating
these objects.

Regards
Aditya
www.findbestopensource.com



On Wed, Jul 14, 2010 at 4:14 PM, ilkay polat  wrote:

> Hello Friends;
>
> Recently, I have problem with lucene search - memory problem on the basis
> that indexed file is so big. (I have indexed some kinds of information and
> this indexed file's size is nearly more than 40 gigabyte. )
>
> I search the lucene indexed file with
> org.apache.lucene.search.Searcher.search(query, null, offset + limit, new
> Sort(new SortField("time", SortField.LONG, true)));
> (This provides to find (offset + limit) records to back.)
>
> I use searching by range. For example, in web page I firstly search records
> which are in [0, 100] range then second page [100, 200]
> I have nearly 200,000 records at all. When I go to last page which means
> records between 200,000 -100, 200,0, there is a memory problem(I have 4gb
> ram on running machine) in jvm( out of memory error).
>
> Is there a way to overcome this memory problem?
>
> Thanks
>
> --
> ilkay POLAT   Software Engineer
> TURKEY
>
>  Gsm : (+90) 532 542 36 71
>  E-mail : ilkay_po...@yahoo.com
>
>
>


RE: Out of memory problem in search

2010-07-14 Thread Uwe Schindler
Reverse the query sorting to display the last page.
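
A sketch of that trick, using the "time" field from the original post
(searcher and query are assumed to be in scope; the total count and the page
arithmetic are my own illustration):

// Pages near the end of the DESCENDING listing are shallow pages in
// ASCENDING order, so only a few hits have to be collected.
int total = 200000;                       // total matching records, known from a count
int offset = 199900, limit = 100;         // the requested deep page
int fromEnd = total - (offset + limit);   // 0 for the very last page

Sort ascending = new Sort(new SortField("time", SortField.LONG, false)); // reverse of the original sort
TopDocs top = searcher.search(query, null, fromEnd + limit, ascending);

ScoreDoc[] hits = top.scoreDocs;
for (int i = hits.length - 1; i >= fromEnd; i--) {   // walk backwards to restore descending order
    Document doc = searcher.doc(hits[i].doc);
    // render doc ...
}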

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de


> -Original Message-
> From: ilkay polat [mailto:ilkay_po...@yahoo.com]
> Sent: Wednesday, July 14, 2010 12:44 PM
> To: java-user@lucene.apache.org
> Subject: Out of memory problem in search
> 
> Hello Friends;
> 
> Recently, I have problem with lucene search - memory problem on the basis
> that indexed file is so big. (I have indexed some kinds of information and
> this indexed file's size is nearly more than 40 gigabyte. )
> 
> I search the lucene indexed file with
> org.apache.lucene.search.Searcher.search(query, null, offset + limit, new
> Sort(new SortField("time", SortField.LONG, true))); (This provides to find
> (offset + limit) records to back.)
> 
> I use searching by range. For example, in web page I firstly search records
> which are in [0, 100] range then second page [100, 200] I have nearly 200,000
> records at all. When I go to last page which means records between 200,000 -
> 100, 200,0, there is a memory problem(I have 4gb ram on running machine) in
> jvm( out of memory error).
> 
> Is there a way to overcome this memory problem?
> 
> Thanks
> 
> --
> ilkay POLAT   Software Engineer
> TURKEY
> 
>   Gsm : (+90) 532 542 36 71
>   E-mail : ilkay_po...@yahoo.com
> 
> 
> 


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: Out of memory problem in search

2010-07-14 Thread ilkay polat
Indeed, this is a good solution to that kind of problem. But the same problem
can occur in the future as more logs are added to the index.
For example, here 200,000 records cause the problem (these logs were collected
in 13 days).
With the reverse approach, the maximum search range becomes 100,000.
But if there are 400,000 records, the same problem will occur again (the
maximum search space is 200,000 again).
Is there another way that does not consume so much memory, or that uses a
bounded amount of memory and spends time instead of memory? This restriction
comes from our project's hardware limits (the hardware memory is 8GB at maximum).

--- On Wed, 7/14/10, Uwe Schindler  wrote:

From: Uwe Schindler 
Subject: RE: Out of memory problem in search
To: java-user@lucene.apache.org
Date: Wednesday, July 14, 2010, 3:25 PM

Reverse the query sorting to display the last page.

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de


> -Original Message-
> From: ilkay polat [mailto:ilkay_po...@yahoo.com]
> Sent: Wednesday, July 14, 2010 12:44 PM
> To: java-user@lucene.apache.org
> Subject: Out of memory problem in search
> 
> Hello Friends;
> 
> Recently, I have problem with lucene search - memory problem on the basis
> that indexed file is so big. (I have indexed some kinds of information and
> this indexed file's size is nearly more than 40 gigabyte. )
> 
> I search the lucene indexed file with
> org.apache.lucene.search.Searcher.search(query, null, offset + limit, new
> Sort(new SortField("time", SortField.LONG, true))); (This provides to find
> (offset + limit) records to back.)
> 
> I use searching by range. For example, in web page I firstly search records
> which are in [0, 100] range then second page [100, 200] I have nearly 200,000
> records at all. When I go to last page which means records between 200,000 -
> 100, 200,0, there is a memory problem(I have 4gb ram on running machine) in
> jvm( out of memory error).
> 
> Is there a way to overcome this memory problem?
> 
> Thanks
> 
> --
> ilkay POLAT   Software Engineer
> TURKEY
> 
>   Gsm : (+90) 532 542 36 71
>   E-mail : ilkay_po...@yahoo.com
> 
> 
> 


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org




  

Re: Out of memory problem in search

2010-07-14 Thread ilkay polat
Hi,
We have hardware restrictions (the maximum RAM is 8GB), so, unfortunately,
increasing memory is not an option for us right now.

Yes, as you said, the problem appears when going to the last pages of the
search screen, because the search method finds the top n records. In other
words, "searching for everything returns everything".

I am now researching whether there is a way in Lucene's search mechanism that
consumes time instead of memory. Any other ideas?

Thanks

--- On Wed, 7/14/10, findbestopensource  wrote:

From: findbestopensource 
Subject: Re: Out of memory problem in search
To: java-user@lucene.apache.org
Date: Wednesday, July 14, 2010, 2:59 PM

Certainly it will. Either you need to increase your memory OR refine your
query. Eventhough you display paginated result. The first couple of pages
will display fine and going towards last may face problem. This is because,
200,000 objects is created and iterated, 190,900 objects are skipped and
last100 objects are returned. The memory is consumed in creating these
objects.

Regards
Aditya
www.findbestopensource.com



On Wed, Jul 14, 2010 at 4:14 PM, ilkay polat  wrote:

> Hello Friends;
>
> Recently, I have problem with lucene search - memory problem on the basis
> that indexed file is so big. (I have indexed some kinds of information and
> this indexed file's size is nearly more than 40 gigabyte. )
>
> I search the lucene indexed file with
> org.apache.lucene.search.Searcher.search(query, null, offset + limit, new
> Sort(new SortField("time", SortField.LONG, true)));
> (This provides to find (offset + limit) records to back.)
>
> I use searching by range. For example, in web page I firstly search records
> which are in [0, 100] range then second page [100, 200]
> I have nearly 200,000 records at all. When I go to last page which means
> records between 200,000 -100, 200,0, there is a memory problem(I have 4gb
> ram on running machine) in jvm( out of memory error).
>
> Is there a way to overcome this memory problem?
>
> Thanks
>
> --
> ilkay POLAT   Software Engineer
> TURKEY
>
>  Gsm : (+90) 532 542 36 71
>  E-mail : ilkay_po...@yahoo.com
>
>
>



  

subset query :query filter or boolean query

2010-07-14 Thread suman.holani


Hi,

I have 4 query search fields.

Case 1: I use one search field to build a query filter and then use that
filter while searching on the other 3 fields, so as to reduce the subset of
documents being searched.

Case 2: I use all query parameters in one boolean query, so the whole index
will be searched.

Which of the two approaches will give better performance? Or is there any
other approach to do this?

Also, can we use a subset of documents for searching?

Let's say I have a hash map of

P1 - 1,2,3,4
P2 - 3,4,5
P3 - 7,5,3

Now I have documents in the lucene index stored as

1 - P1
2 - P1
3 - P1,P2,P3
4 - P1,P2
5 - P2,P3
7 - P3
..
..

When I search docs with P2 I get 3,4,5.

Now I want my search to be restricted to just docs 3,4,5, whereby I can
search only these docs for the further parameters.

1. How do I go about it?
2. Is there any other searching mechanism I should use, or is Lucene a good fit?
3. Should I also keep my hash map in a lucene index, and is there then a
method to link it to the other lucene index?

regards,

Suman
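
(One way to implement case 1 in Lucene is to express the first field as a
Filter, cache it, and search the remaining criteria against that filter only.
A sketch, with made-up field and term names:)

// The filter narrows the searched docs to those tagged with P2 (docs 3,4,5
// above); CachingWrapperFilter keeps the resulting bit set in memory for reuse.
Filter subset = new CachingWrapperFilter(
    new QueryWrapperFilter(new TermQuery(new Term("p", "P2"))));

BooleanQuery rest = new BooleanQuery();
rest.add(new TermQuery(new Term("field2", "foo")), BooleanClause.Occur.MUST);
rest.add(new TermQuery(new Term("field3", "bar")), BooleanClause.Occur.MUST);

// Only documents that pass the filter are ever scored.
TopDocs hits = searcher.search(rest, subset, 10);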

Re: Out of memory problem in search

2010-07-14 Thread ilkay polat
I am also confused about the memory management of Lucene.

Does this out-of-memory problem mainly arise from Reason 1 or Reason 2?

Reason 1: The problem comes from searching a big index (nearly 40 GB). If that
is the case, then even if only 100 records (a small number) were returned from
a search of a 60 GB index, the problem would arise again.
OR
Reason 2: The problem comes from finding so many records (nearly 200,000), i.e.
200,000 Java objects in the heap. If that is the case, then even if the index
were only 10 GB (a small size) but that many records were returned, the problem
would arise again.

Is there any document which describes the general memory management issues of
searching in Lucene?

Thanks

 
ilkay POLAT     Software Engineer   Gsm : (+90) 532 542 36 71
  E-mail : ilkay_po...@yahoo.com

--- On Wed, 7/14/10, ilkay polat  wrote:

From: ilkay polat 
Subject: Re: Out of memory problem in search
To: java-user@lucene.apache.org
Date: Wednesday, July 14, 2010, 3:54 PM

Hi,
We have hardware restrictions(Max RAM can be  8GB). So, unfortunately,  
increasing memory can not be option for us for today's situation. 

Yes, as you said that problem is faced when goes to last pages of search screen 
because of using search method which is find top n records. In other way, this 
is meaning "searching all the thinngs returns all". 

I am now researching whether there is a way which consumes time instead of 
memory in this search mechanism in lucene? Any other ideas? 

Thanks

--- On Wed, 7/14/10, findbestopensource  wrote:

From: findbestopensource 
Subject: Re: Out of memory problem in search
To: java-user@lucene.apache.org
Date: Wednesday, July 14, 2010, 2:59 PM

Certainly it will. Either you need to increase your memory OR refine your
query. Eventhough you display paginated result. The first couple of pages
will display fine and going towards last may face problem. This is because,
200,000 objects is created and iterated, 190,900 objects are skipped and
last100 objects are returned. The memory is consumed in creating these
objects.

Regards
Aditya
www.findbestopensource.com



On Wed, Jul 14, 2010 at 4:14 PM, ilkay polat  wrote:

> Hello Friends;
>
> Recently, I have problem with lucene search - memory problem on the basis
> that indexed file is so big. (I have indexed some kinds of information and
> this indexed file's size is nearly more than 40 gigabyte. )
>
> I search the lucene indexed file with
> org.apache.lucene.search.Searcher.search(query, null, offset + limit, new
> Sort(new SortField("time", SortField.LONG, true)));
> (This provides to find (offset + limit) records to back.)
>
> I use searching by range. For example, in web page I firstly search records
> which are in [0, 100] range then second page [100, 200]
> I have nearly 200,000 records at all. When I go to last page which means
> records between 200,000 -100, 200,0, there is a memory problem(I have 4gb
> ram on running machine) in jvm( out of memory error).
>
> Is there a way to overcome this memory problem?
>
> Thanks
>
> --
> ilkay POLAT   Software Engineer
> TURKEY
>
>  Gsm : (+90) 532 542 36 71
>  E-mail : ilkay_po...@yahoo.com
>
>
>



      


  

Re: Continuously iterate over documents in index

2010-07-14 Thread Max Lynch
> You could have a field within each doc say "Processed" and store a
> value Yes/No, next run a searcher query which should give you the
> collection of unprocessed ones.
>

That sounds like a reasonable idea, and I just realized that I could have
done that in a way specific to my application.  However, I already tried
doing something with a MatchAllDocsQuery with a custom collector and sort by
date.  I store the last date and time of a doc I processed and process only
newer ones.


Re: Continuously iterate over documents in index

2010-07-14 Thread Kiran Kumar
All,

Issue: I am unable to get the proper results after searching. I added the
sample code which I use in the application below.

If I use a *numHitPerPage* value of 1000 it gives the expected results,
e.g. the expected result is 32 docs and 32 docs are shown.
If instead I use *numHitPerPage* as 2^32-1 it does not give the expected results,
e.g. the expected result is 32 docs but only 29 docs are shown.

Sample code below:

StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);
QueryParser qp = new QueryParser(Version.LUCENE_CURRENT, defField, analyzer);
Query q = qp.parse(queryString);
// numHitPerPage is the maximum number of hits to collect
TopDocsCollector tdc = TopScoreDocCollector.create(numHitPerPage, true);
is.search(q, tdc);   // 'is' is the IndexSearcher
ScoreDoc[] noDocs = tdc.topDocs().scoreDocs;

Please let me know if any other way to search?

Thanks.
Kiran. M


RE: Best practices for searcher memory usage?

2010-07-14 Thread Christopher Condit
Hi Toke-
> > * 20 million documents [...]
> > * 140GB total index size
> > * Optimized into a single segment
> 
> I take it that you do not have frequent updates? Have you tried to see if you
> can get by with more segments without significant slowdown?

Correct - in fact there are no updates and no deletions. We index everything 
offline when necessary and just swap the new index in...
By more segments do you mean not call optimize() at index time?

> > The application will run with 10G of -Xmx but any less and it bails out.
> > It seems happier if we feed it 12GB. The searches are starting to bog
> > down a bit (5-10 seconds for some queries)...
> 
> 10G sounds like a lot for that index. Two common memory-eaters are sorting
> by field value and faceting. Could you describe what you're doing in that
> regard?

No faceting and no sorting (other than score) for this index...

> Similarly, the 5-10 seconds for some queries seems very slow. Could you give
> some examples on the queries that causes problems together with some
> examples of fast queries and how long they take to execute?

Typically just TermQueries or BooleanQueries: (Chip OR Nacho OR Foo) AND (Salsa 
OR Sauce) AND (This OR That)
The latter is most typical.

With a single keyword it will execute in < 1 second. In a case where there are 
10 clauses it becomes much slower (which I understand, just looking for ways to 
speed it up)...

Thanks,
-Chris


Re: Out of memory problem in search

2010-07-14 Thread Erick Erickson
This doesn't make sense to me. Are you saying that you only have 200,000
documents in your index? Because keeping a score for 200K documents should
consume a relatively trivial amount of memory. The fact that you're sorting
by time is a red flag, but it's only a long, so 200K documents shouldn't
strain memory due to sorting either. The critical thing here isn't
necessarily the size of your index, but the number of documents in that
index and the number of unique values you're sorting by. By the way, what
happens if you don't sort?

Since it doesn't make sense to me, that must mean I don't understand the
problem very thoroughly. Could you provide some index characteristics?
Saying it's 40G leaves a lot open to speculation. That could be 39G of
stored text which is mostly irrelevant for searching. Or it could be
entirely indexed, tokenized data which would be a different thing. How many
documents do you have in your index? What does your query look like?

You can get an idea of how much of your index holds indexed tokens by
NOT storing any of the fields, just indexing them (Field.Store.NO).

What version of Lucene are you using? How do you start your process? If you
start the application with java's default memory, that's not very much (64M
if memory serves). You may be using nowhere near your hardware limits. Try
specifying -Xmx512M and/or the -server option.

Best
Erick

On Wed, Jul 14, 2010 at 9:27 AM, ilkay polat  wrote:

> I have also  confused about the memory management of lucene.
>
> Where is this out of memory problem is mainly arised from Reason-1 or
> Reason-2 reason?
>
> Reason-1 : Problem is sourced from searching is done in big indexed file
> (nearly 40 GB) If there is 100(small number of records) records returned
> from search in 60 GB indexed file, problem will again arised.
> OR
> Reason-2 : Problem is sourced from finding so many records(nearly 200,000
> records), so in memory 200, 000 java object in heap? If file's sizeis 10
> GB(small file size ) but returned records are so many, problem will again
> arised.
>
> Is there any document which tells the general memory management issues in
> searching in lucene?
>
> Thanks
>
>
> ilkay POLAT Software Engineer   Gsm : (+90) 532 542 36 71
>   E-mail : ilkay_po...@yahoo.com
>
> --- On Wed, 7/14/10, ilkay polat  wrote:
>
> From: ilkay polat 
> Subject: Re: Out of memory problem in search
> To: java-user@lucene.apache.org
> Date: Wednesday, July 14, 2010, 3:54 PM
>
> Hi,
> We have hardware restrictions(Max RAM can be  8GB). So, unfortunately,
> increasing memory can not be option for us for today's situation.
>
> Yes, as you said that problem is faced when goes to last pages of search
> screen because of using search method which is find top n records. In other
> way, this is meaning "searching all the thinngs returns all".
>
> I am now researching whether there is a way which consumes time instead of
> memory in this search mechanism in lucene? Any other ideas?
>
> Thanks
>
> --- On Wed, 7/14/10, findbestopensource 
> wrote:
>
> From: findbestopensource 
> Subject: Re: Out of memory problem in search
> To: java-user@lucene.apache.org
> Date: Wednesday, July 14, 2010, 2:59 PM
>
> Certainly it will. Either you need to increase your memory OR refine your
> query. Eventhough you display paginated result. The first couple of pages
> will display fine and going towards last may face problem. This is because,
> 200,000 objects is created and iterated, 190,900 objects are skipped and
> last100 objects are returned. The memory is consumed in creating these
> objects.
>
> Regards
> Aditya
> www.findbestopensource.com
>
>
>
> On Wed, Jul 14, 2010 at 4:14 PM, ilkay polat 
> wrote:
>
> > Hello Friends;
> >
> > Recently, I have problem with lucene search - memory problem on the basis
> > that indexed file is so big. (I have indexed some kinds of information
> and
> > this indexed file's size is nearly more than 40 gigabyte. )
> >
> > I search the lucene indexed file with
> > org.apache.lucene.search.Searcher.search(query, null, offset + limit, new
> > Sort(new SortField("time", SortField.LONG, true)));
> > (This provides to find (offset + limit) records to back.)
> >
> > I use searching by range. For example, in web page I firstly search
> records
> > which are in [0, 100] range then second page [100, 200]
> > I have nearly 200,000 records at all. When I go to last page which means
> > records between 200,000 -100, 200,0, there is a memory problem(I have 4gb
> > ram on running machine) in jvm( out of memory error).
> >
> > Is there a way to overcome this memory problem?
> >
> > Thanks
> >
> > --
> > ilkay POLAT   Software Engineer
> > TURKEY
> >
> >  Gsm : (+90) 532 542 36 71
> >  E-mail : ilkay_po...@yahoo.com
> >
> >
> >
>
>
>
>
>
>
>
>


Re: Continuously iterate over documents in index

2010-07-14 Thread Erick Erickson
Kiran:
Please start a new thread when asking a new question. From Hossman's apache
page:

When starting a new discussion on a mailing list, please do not reply to
an existing message, instead start a fresh email.  Even if you change the
subject line of your email, other mail headers still track which thread
you replied to and your question is "hidden" in that thread and gets less
attention.   It makes following discussions in the mailing list archives
particularly difficult.
See Also:  http://en.wikipedia.org/wiki/User:DonDiego/Thread_hijacking


On Wed, Jul 14, 2010 at 10:56 AM, Kiran Kumar  wrote:

> All,
>
> Issue: Unable to get the proper results after searching. I added sample
> code
> which I used in the application.
>
> If I used *numHitPerPage* value as 1000 its giving expected results.
> ex: The expected results is 32 docs but showing 32 docs
> Instead If I use *numHitPerPage* as 2^32-1 its not giving expected results.
> ex: The expected results is 32 docs but showing only 29 docs.
>
> Sample code below:
>
>
> StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);
>  QueryParser qp = new QueryParser(Version.LUCENE_CURRENT, defField,
> analyzer);
> Query q = qp.parse(queryString);
> TopDocsCollector tdc = TopScoreDocCollector.create(*numHitPerPage*, true);
> IndexSearcher(is).search(q,tdc);
>
> ScoreDocs[]  noDocs  = tdc.topDocs().scoreDocs;
>
> Please let me know if any other way to search?
>
> Thanks.
> Kiran. M
>


Re: Continuously iterate over documents in index

2010-07-14 Thread Erick Erickson
Hmmm, if you somehow know the last date you processed, why wouldn't using a
range query work for you? I.e.
date:[<last processed date> TO <now>]?
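
Programmatically that's just a range query on the date field -- a sketch,
assuming the field is named "date" and indexed as a sortable string such as
"20100714103000" (searcher and lastProcessedDate are in scope):

// Everything strictly newer than the last processed value, open-ended at the top.
// (For a numeric field, NumericRangeQuery.newLongRange would be the equivalent.)
Query newerThan = new TermRangeQuery("date", lastProcessedDate, null, false, true);
TopDocs newDocs = searcher.search(newerThan, 1000);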

Best
Erick

On Wed, Jul 14, 2010 at 10:37 AM, Max Lynch  wrote:

> You could have a field within each doc say "Processed" and store a
>
> > value Yes/No, next run a searcher query which should give you the
> > collection of unprocessed ones.
> >
>
> That sounds like a reasonable idea, and I just realized that I could have
> done that in a way specific to my application.  However, I already tried
> doing something with a MatchAllDocsQuery with a custom collector and sort
> by
> date.  I store the last date and time of a doc I processed and process only
> newer ones.
>


Re: Best practices for searcher memory usage?

2010-07-14 Thread Glen Newton
There are a number of strategies, on the Java or OS side of things:
- Use huge pages[1]. Esp on 64 bit and lots of ram. For long running,
large memory (and GC busy) applications, this has achieved significant
improvements. Like 300% on EJBs. See [2],[3],[4]. For a great article
introducing and benchmarking huge pages, both in C and Java, see [5].
 To see if huge pages might help you, do
  > cat /proc/meminfo
 And check the "PageTables" line, e.g. "PageTables: 26480 kB".
 If the PageTables is, say, more than 1-2GBs, you should consider
using huge pages.
- Assuming multicore: there are times (very application dependent)
when having your application run on all cores turns out not to
produce the best performance. Taking one core out, making it available to
look after system things (I/O, etc.), sometimes will improve
performance. Use numactl[6] to bind your application to n-1 cores,
leaving one out.
  - numactl also allows you to restrict memory allocation to 1-n
cores, which may also be useful depending on your application
- The Java vm from Sun-Oracle has a number of options[7]
  - -XX:+AggressiveOpts [You should have this one on always...]
  - -XX:+StringCache
  - -XX:+UseFastAccessorMethods

  - -XX:+UseBiasedLocking  [My experience has this helping some
applications, hindering others...]
  - -XX:ParallelGCThreads= [Usually this is #cores; try reducing this to n/2]
  - -Xss128k
  - -Xmn [Make this large, like 40% of your heap (-Xmx). If you do
this, use -XX:+UseParallelGC. See [8]]
You can also play with the many GC parameters. This is pretty arcane,
but can give you good returns.

And of course, I/O is important: data on multiple disks with multiple
controllers; RAID; filesystem tuning; turning off atime; a larger readahead
buffer (change from 128k to 8MB on Linux: see [9]); OS tuning. See [9]
for a useful filesystem comparison (for Postgres).

-glen
http://zzzoot.blogspot.com/

[1]http://zzzoot.blogspot.com/2009/02/java-mysql-increased-performance-with.html
[2]http://andrigoss.blogspot.com/2008/02/jvm-performance-tuning.html
[3]http://kirkwylie.blogspot.com/2008/11/linux-fork-performance-redux-large.html
[4]http://orainternals.files.wordpress.com/2008/10/high_cpu_usage_hugepages.pdf
[5]http://lwn.net/Articles/374424/
[6]http://www.phpman.info/index.php/man/numactl/8
[7]http://java.sun.com/javase/technologies/hotspot/vmoptions.jsp#PerformanceTuning
[8]http://java.sun.com/performance/reference/whitepapers/tuning.html#section4.2.5
[9]http://assets.en.oreilly.com/1/event/27/Linux%20Filesystem%20Performance%20for%20Databases%20Presentation.pdf

On 15 July 2010 04:28, Christopher Condit  wrote:
> Hi Toke-
>> > * 20 million documents [...]
>> > * 140GB total index size
>> > * Optimized into a single segment
>>
>> I take it that you do not have frequent updates? Have you tried to see if you
>> can get by with more segments without significant slowdown?
>
> Correct - in fact there are no updates and no deletions. We index everything 
> offline when necessary and just swap the new index in...
> By more segments do you mean not call optimize() at index time?
>
>> > The application will run with 10G of -Xmx but any less and it bails out.
>> > It seems happier if we feed it 12GB. The searches are starting to bog
>> > down a bit (5-10 seconds for some queries)...
>>
>> 10G sounds like a lot for that index. Two common memory-eaters are sorting
>> by field value and faceting. Could you describe what you're doing in that
>> regard?
>
> No faceting and no sorting (other than score) for this index...
>
>> Similarly, the 5-10 seconds for some queries seems very slow. Could you give
>> some examples on the queries that causes problems together with some
>> examples of fast queries and how long they take to execute?
>
> Typically just TermQueries or BooleanQueries: (Chip OR Nacho OR Foo) AND 
> (Salsa OR Sauce) AND (This OR That)
> The latter is most typical.
>
> With a single keyword it will execute in < 1 second. In a case where there 
> are 10 clauses it becomes much slower (which I understand, just looking for 
> ways to speed it up)...
>
> Thanks,
> -Chris
>



-- 

-

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Best practices for searcher memory usage?

2010-07-14 Thread Lance Norskog
Glen, thank you for this very thorough and informative post.

Lance Norskog

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org