Re: Lucene 4.0 Index Format Finalization Timetable

2011-12-08 Thread Mark Miller
While we are in constant sync due to the merge, Lucene would still be
updated multiple times before a Solr 4 release, and that could happen at
any time - so it's really not any different.

On Wednesday, December 7, 2011, Jamie Johnson  wrote:
> Yeah, biggest issue for us is we're using the SolrCloud features.
> While I see some good things related to the Lucene and Solr code bases
> being merged, this is certainly a frustrating aspect of it as I don't
> require some of the changes that are in Lucene 4.0 (notwithstanding
> anything that SolrCloud requires, that is).
>
> I think the best solution (assuming it works) is to try to lock a
> version of Lucene 4.0 while upgrading Solr.  I'll have to test to see
> if this works or not, but at least it's something.
>
> On Wed, Dec 7, 2011 at 9:02 AM, Mike Sokolov  wrote:
>> My personal view, as a bystander with no more information than you, is that
>> one has to assume there will be further index format changes before a 4.0
>> release.  This is based on the number of changes in the last 9 months, and
>> the amount of activity on the dev list.
>>
>> For us the implication is we need to stick w/3.x for now.  You might be in a
>> different situation if you really need the 4.0 changes.  Maybe you can just
>> stick w/the current trunk and take responsibility for patching critical
>> bugfixes, hoping you won't have to recreate your index too many times...
>>
>> -Mike
>>
>>
>> On 12/06/2011 09:48 PM, Jamie Johnson wrote:
>>>
>>> I suppose that's fair enough.  Some quick googling shows that this has
>>> been asked many times with pretty much the same response.  Sorry to
>>> add to the noise.
>>>
>>> On Tue, Dec 6, 2011 at 9:34 PM, Darren Govoni wrote:
>>>

 I asked here[1] and it said "Ask again later."

 [1] http://8ball.tridelphia.net/


 On 12/06/2011 08:46 PM, Jamie Johnson wrote:

>
> Thanks Robert.  Is there a timetable for that?  I'm trying to gauge
> whether it is appropriate to push for my organization to move to the
> current lucene 4.0 implementation (we're using solr cloud which is
> built against trunk) or if it's expected there will be changes to what
> is currently on trunk.  I'm not looking for anything hard, just trying
> to plan as much as possible understanding that this is one of the
> implications of using trunk.
>
> On Tue, Dec 6, 2011 at 6:48 PM, Robert Muir wrote:
>
>>
>> On Tue, Dec 6, 2011 at 6:41 PM, Jamie Johnson wrote:
>>
>>>
>>> Is there a timetable for when it is expected to be finalized?
>>>
>>
>> it will be finalized when Lucene 4.0 is released.
>>
>> --
>> lucidimagination.com
>>
>
>




>>>

-- 
- Mark

http://www.lucidimagination.com


Re: "read past EOF" when merge

2012-11-03 Thread Mark Miller
Can you file a JIRA, Markus? This is probably related to the new code that uses 
Directory for replication.

- Mark

On Nov 2, 2012, at 6:53 AM, Markus Jelsma  wrote:

> Hi,
> 
> For what it's worth, we have seen similar issues with Lucene/Solr from this 
> week's trunk. The issue manifests itself when it wants to replicate. The 
> servers had not been taken offline and did not crash when this happened. 
> 
> 2012-10-30 16:12:51,061 WARN [solr.handler.ReplicationHandler] - [http-8080-exec-3] - : Exception while writing response for params: file=_p_Lucene41_0.doc&command=filecontent&checksum=true&generation=6&qt=/replication&wt=filestream
> java.io.EOFException: read past EOF: MMapIndexInput(path="/opt/solr/cores/openindex_h/data/index.20121030152234973/_p_Lucene41_0.doc")
>     at org.apache.lucene.store.ByteBufferIndexInput.readBytes(ByteBufferIndexInput.java:100)
>     at org.apache.solr.handler.ReplicationHandler$DirectoryFileStream.write(ReplicationHandler.java:1065)
>     at org.apache.solr.handler.ReplicationHandler$3.write(ReplicationHandler.java:932)
> 
> 
> Markus
> 
> -Original message-
>> From:Michael McCandless 
>> Sent: Fri 02-Nov-2012 11:46
>> To: java-user@lucene.apache.org
>> Subject: Re: "read past EOF" when merge
>> 
>> Are you able to reproduce the corruption?
>> 
>> If at any time you accidentally had two writers open on the same
>> index, it could have created this corruption.
>> 
>> Writing to an index over NFS ought to be OK, however, it's not well
>> tested.  You should use SimpleFSLockFactory (not the default
>> NativeFSLockFactory).
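>>
>> A minimal sketch of that setup (assuming the 3.x FSDirectory API; the
>> path is a placeholder):
>>
>>   import java.io.File;
>>   import org.apache.lucene.store.Directory;
>>   import org.apache.lucene.store.FSDirectory;
>>   import org.apache.lucene.store.SimpleFSLockFactory;
>>
>>   File indexDir = new File("/mnt/nfs/shared-index"); // placeholder path
>>   // SimpleFSLockFactory uses plain lock files, which behave more
>>   // predictably over NFS than the native OS locks NativeFSLockFactory uses.
>>   Directory dir = FSDirectory.open(indexDir, new SimpleFSLockFactory(indexDir));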
>> 
>> The more "typical" way people use NFS is to write to an index on a
>> local disk, and then other machines read from that index using NFS.
>> 
>> In any event performance is usually much worse than using local disks ...
>> 
>> Mike McCandless
>> 
>> http://blog.mikemccandless.com
>> 
>> On Thu, Nov 1, 2012 at 10:32 PM, superruiye  wrote:
>>> Oh, thanks - I didn't know about CheckIndex before... I used it to fix my
>>> broken index, and it is OK now.
>>> I use NFS to share my index, and have not changed the LockFactory.
>>> How could I avoid this problem, rather than only fixing the index after it
>>> has suddenly broken?
>>> 
>>> 
>>> 
>>> --
>>> View this message in context: 
>>> http://lucene.472066.n3.nabble.com/read-past-EOF-when-merge-tp4017179p4017734.html
>>> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>>> 
>> 
> 





Re: Lucene 4.1 tentative release

2012-12-12 Thread Mark Miller
We are hoping for 4.1 very soon! With the holidays it is difficult to say - 
but 4.1 talk has been going on for some time now. It's really a matter of 
wrapping up some short-term work and getting some guys to do the release work.

I don't think anyone can give you a date, but it's certainly in the works!

- Mark

On Dec 12, 2012, at 6:50 AM, Ramprakash Ramamoorthy wrote:

> Hello,
> 
> Any 'tentative' release date for 4.1 would help. I know it is
> difficult to pin down a date, but I still couldn't resist asking, so we can
> plan accordingly. Thanks in advance.
> 
> -- 
> With Thanks and Regards,
> Ramprakash Ramamoorthy,
> India.





Re: Luke?

2013-03-15 Thread Mark Miller
If anyone is able to donate some effort, a nice future scenario could be that 
Luke comes fully up to date with every Lucene release: 
https://issues.apache.org/jira/browse/LUCENE-2562

- Mark

On Mar 15, 2013, at 5:58 AM, Eric Charles  wrote:

> For the record, I happily use Luke (with Lucene 4.1) compiled from 
> https://github.com/sonarme/luke. It is also mavenized (shipped with a 
> pom.xml).
> 
> Thx, Eric
> 
> 
> On 14/03/2013 09:10, dizh wrote:
>> OK, tomorrow I will put it somewhere such as GitHub or Google Code.
>> 
>> But I really didn't look into the details; when I compiled the Luke source,
>> I found about ten errors.
>> 
>> Most were TermEnum API issues, so I fixed them.
>> 
> 





[ANNOUNCE] Apache Lucene 4.2.1 released

2013-04-03 Thread Mark Miller
April 2013, Apache Lucene™ 4.2.1 available

The Lucene PMC is pleased to announce the release of Apache Lucene 4.2.1.

Apache Lucene is a high-performance, full-featured text search engine
library written entirely in Java. It is a technology suitable for nearly
any application that requires full-text search, especially cross-platform.

This release contains numerous bug fixes, optimizations, and
improvements, some of which are highlighted below. The release
is available for immediate download at:
   http://lucene.apache.org/core/mirrors-core-latest-redir.html

See the CHANGES.txt file included with the release for a full list of
details.

Lucene 4.2.1 Release Highlights:

* Lucene 4.2.1 includes 9 bug fixes and 3 optimizations, including a
  fix for a serious bug that could result in the loss of an index.

Please read CHANGES.txt for a full list of changes.

Please report any feedback to the mailing lists
(http://lucene.apache.org/core/discussion.html)

Note: The Apache Software Foundation uses an extensive mirroring network
for distributing releases.  It is possible that the mirror you are using
may not have replicated the release yet.  If that is the case, please
try another mirror.  This also goes for Maven access.

Happy searching,
Lucene/Solr developers


[ANN] Lucene/Solr Meetup in NYC on May 11th

2010-05-08 Thread Mark Miller
If you haven't heard, there is a Lucene/Solr meetup in New York next 
week: http://www.meetup.com/NYC-Apache-Lucene-Solr-Meetup/calendar/13325754/


The scheduled talks are (in addition to lightning talks):

Solr 1.5 and Beyond:

Yonik Seeley, author of Solr and co-founder of Lucid Imagination. Topics will 
include new faceting functionality, new function queries, increased 
scalability, field collapsing, and spatial search. There will also be a 
discussion about the recently announced Lucene/Solr merge, the 
rationale, its implications, and plans for its completion. The talk will 
span features already included in trunk, features slated for the next 
release, as well as incomplete features under consideration for future 
releases.


Cool Linguistic Tricks to Apply to Search Results:

A source-code-level demonstration using LingPipe, by Breck Baldwin, Founder, 
LingPipe. The talk will cover post-processing options for Twitter 
searches. I will cover clustering and classification with some potential 
for conceptual indexing at the level of persons/orgs/locations. There 
will be a demo available for download that runs the examples and 
contains the source.


--
- Mark

http://www.lucidimagination.com




Re: NumericField API

2010-06-01 Thread Mark Miller

On 6/1/10 9:34 AM, Mindaugas Žakšauskas wrote:

> It's just an early observation as historically Lucene has been doing an
> amazing job in terms of API stability.


Yes it has :)

Get ready for even more change in that area though :)

--
- Mark

http://www.lucidimagination.com




[ANN] Free technical webinar: Mastering the Lucene Index: Wednesday, August 11, 2010 11:00 AM PST / 2:00 PM EST / 20:00 CET

2010-08-09 Thread Mark Miller
Hey all - apologies for the quick cross-post - just to let you know,
Andrzej is giving a free webinar this Wednesday. His presentations are always
fantastic, so check it out:

Lucid Imagination Presents a free technical webinar:  Mastering the
Lucene Index
Wednesday, August 11, 2010 11:00 AM PST / 2:00 PM EST / 20:00 CET

Sign up here:
http://www.eventsvc.com/lucidimagination/081110?trk-AP

Lucene/Solr index implementation is critical to the performance of your
search application and the quality of your results -- and not just at
indexing time. If you're developing applications in Lucene/Solr, your
index will reward care and attention -- adding power to your running
search application -- all the more so as you inevitably increase the
scope of your query traffic and the dimensions of your data.

Join Andrzej Bialecki, Lucene Committer and inventor of the Luke index
utility, for an advanced workshop on cutting edge techniques for keeping
your Lucene/Solr index at its peak potential. Andrzej will discuss and
present essential strategies for index post-processing, including:
* Single-pass index splitting -- reshaping indexes for flexible deployment
* Index pruning, filtering and multi-tiered search, or how to serve
indexes (mostly) from RAM
* Bit-wise search -- or how to find the best bit-wise matches - and
applications in text fingerprinting

About the presenter: Andrzej Bialecki is a committer of the Apache
Lucene project, a Lucene PMC member, and chairman of the Apache Nutch
project. He is also the author of Luke, the Lucene Index Toolbox.
Andrzej participates in many commercial projects that use Lucene, Solr,
Nutch and Hadoop to implement enterprise and vertical search.

Sign up here:
http://www.eventsvc.com/lucidimagination/081110?trk-AP




Re: Difference between regular Highlighter and Fast Vector Highlighter ?

2011-04-11 Thread Mark Miller
The general and short answer is:

Highlighter: highlights more query types, has a fairly rich API, doesn't scale 
well to very large documents (though 
https://issues.apache.org/jira/browse/LUCENE-2939 is going to help a lot here) 
- does not require that you store term vectors, but is faster if you do.

FVH: works with fewer query types and requires that you store term vectors - 
but scales better than the standard Highlighter on very large documents.
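
A rough sketch of the two APIs (3.x-era names from org.apache.lucene.search.highlight
and org.apache.lucene.search.vectorhighlight; the field name, analyzer, and stored
text are placeholders, and for FVH the field must be indexed with term vectors
with positions and offsets):

  // Standard Highlighter: works from a TokenStream (re-analyzes the stored
  // text if no term vectors are available).
  QueryScorer scorer = new QueryScorer(query, "content");
  Highlighter highlighter = new Highlighter(new SimpleHTMLFormatter(), scorer);
  TokenStream tokenStream =
      TokenSources.getAnyTokenStream(reader, docId, "content", analyzer);
  String[] fragments = highlighter.getBestFragments(tokenStream, storedText, 3);

  // FastVectorHighlighter: reads positions/offsets straight from term vectors.
  FastVectorHighlighter fvh = new FastVectorHighlighter();
  FieldQuery fieldQuery = fvh.getFieldQuery(query);
  String[] fvhFragments =
      fvh.getBestFragments(fieldQuery, reader, docId, "content", 100, 3);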

- Mark Miller
lucidimagination.com

Lucene/Solr User Conference
May 25-26, San Francisco
www.lucenerevolution.org

On Apr 1, 2011, at 8:32 AM, shrinath.m wrote:

> I was wondering whats the difference between the Lucene's 2 implementation of
> highlighters... 
> I saw the javadoc of FVH, but it only says "another implementation of Lucene
> Highlighter" ...
> 
> Can someone throw some more light on this ? 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Difference-between-regular-Highlighter-and-Fast-Vector-Highlighter-tp2763162p2763162.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
> 











Re: NRT consistency

2011-04-11 Thread Mark Miller

On Apr 10, 2011, at 4:34 AM, Em wrote:

> Hello list,
> 
> I am currently trying to understand Lucene's Near-Real-Time-Feature which
> was covered in "Lucene in Action, Second Edition".
> 
> Let's say I got a distributed system with a master and a slave.
> 
> In Solr replication is solved by checking for any differences in the
> index-directory and to consume those differences to keep indices consistent.
> 
> How is this possible within a NRT-System? Is there any possibility to
> consume snapshots of the internal buffer of the index writer to send them to
> the slave?

I think for near real time, Solr index replication may not be appropriate. 
Though I think it would be cool to use Andrzej's mythical single pass index 
splitter to create a single+ doc segment that could be shipped around.

Most likely, a system that just sends each doc to each replica is probably 
going to work a lot better. Introduces other issues of course - some of which 
we hope to alleviate with further SolrCloud work.

> 
> Regards,
> Em
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/NRT-consistency-tp2801878p2801878.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
> 

- Mark Miller
lucidimagination.com

Lucene/Solr User Conference
May 25-26, San Francisco
www.lucenerevolution.org









Re: NRT consistency

2011-04-11 Thread Mark Miller

On Apr 11, 2011, at 1:05 PM, Em wrote:

> Thank you both!
> 
> Mark, could you explain what you mean? I have never heard of such an
> index-splitter. BTW: The idea of having a segment per document sounds a lot
> like an exception for too many FileDescriptors :)

This is just an idea for rebalancing, I suppose - an index splitter lets you 
split up an index - there is a multi-pass splitter in contrib. So if you wanted 
to move a few documents around (to rebalance after a couple of servers go down, 
perhaps), you might split out another index (just the docs you want to move), 
and then ship off that already analyzed and indexed bunch of documents to other 
servers.
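
For reference, a rough sketch of the contrib splitter (3.x-era contrib/misc
API as I recall it - check the exact signature in your version; the paths are
placeholders):

  import java.io.File;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.index.MultiPassIndexSplitter;
  import org.apache.lucene.store.Directory;
  import org.apache.lucene.store.FSDirectory;

  IndexReader input = IndexReader.open(FSDirectory.open(new File("/srv/index")));
  Directory[] outputs = new Directory[] {
      FSDirectory.open(new File("/srv/index-part1")),
      FSDirectory.open(new File("/srv/index-part2"))
  };
  // seq=false deals documents out round-robin; seq=true splits into
  // contiguous ranges of document IDs.
  new MultiPassIndexSplitter().split(input, outputs, false);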

> 
> Mike, as you said, the segments are flushed like normal.
> Let's say my server dies for whatever reason; when restarting it and
> reopening the index-writer, does the IW delete the flushed file because it
> is not mentioned in the SegmentInfos file, or how does Lucene handle this
> internally?
> 
> Regards,
> Em
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/NRT-consistency-tp2801878p2807475.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
> 

- Mark Miller
lucidimagination.com

Lucene/Solr User Conference
May 25-26, San Francisco
www.lucenerevolution.org









Re: NRT consistency

2011-04-11 Thread Mark Miller

On Apr 11, 2011, at 2:41 PM, Otis Gospodnetic wrote:

> I think what's being described here is a lot like what I *think* ElasticSearch 
> does, where there is no single master and index changes made to any node get 
> propagated to N-1 other nodes (N=number of index replicas).  I'm not sure how it 
> deals with situations where "incompatible" index changes are made to the same 
> index via 2 different nodes at the same time.  Is that what vector clocks are 
> about?

Right - you have to have some sort of conflict detection/resolution - Amazon 
Dynamo uses vector clocks for this.
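
As an illustration only (a toy sketch, not Lucene/Solr code): each replica
keeps a counter per node, and one write causally descends from another only if
it dominates on every counter. Two clocks where neither descends from the
other are concurrent - that's the "incompatible changes" case that needs
application-level resolution.

  import java.util.HashMap;
  import java.util.Map;

  final class VectorClock {
    private final Map<String, Long> counters = new HashMap<String, Long>();

    void increment(String nodeId) {
      Long c = counters.get(nodeId);
      counters.put(nodeId, c == null ? 1L : c + 1L);
    }

    /** True if this clock is >= the other on every node's counter. */
    boolean descendsFrom(VectorClock other) {
      for (Map.Entry<String, Long> e : other.counters.entrySet()) {
        Long mine = counters.get(e.getKey());
        if (mine == null || mine < e.getValue()) {
          return false;
        }
      }
      return true;
    }
  }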

> 
> Otis
> 
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/
> 
> 
> 
> - Original Message 
>> From: Mark Miller 
>> To: java-user@lucene.apache.org
>> Sent: Mon, April 11, 2011 11:52:05 AM
>> Subject: Re: NRT consistency
>> 
>> 
>> On Apr 10, 2011, at 4:34 AM, Em wrote:
>> 
>>> Hello list,
>>> 
>>> I am currently trying to understand Lucene's Near-Real-Time-Feature which
>>> was covered in "Lucene in Action, Second Edition".
>>> 
>>> Let's say I got a distributed system with a master and a slave.
>>> 
>>> In Solr replication is solved by checking for any differences in the
>>> index-directory and to consume those differences to keep indices consistent.
>>> 
>>> How is this possible within a NRT-System? Is there any possibility to
>>> consume snapshots of the internal buffer of the index writer to send them to
>>> the slave?
>> 
>> I think for near real time, Solr index replication may not be appropriate.
>> Though I think it would be cool to use Andrzej's mythical single pass index
>> splitter to create a single+ doc segment that could be shipped around.
>> 
>> Most likely, a system that just sends each doc to each replica is probably
>> going to work a lot better. Introduces other issues of course - some of which
>> we hope to alleviate with further SolrCloud work.
>> 
>>> 
>>> Regards,
>>> Em
>>> 
>>> --
>>> View this message in context: 
>> http://lucene.472066.n3.nabble.com/NRT-consistency-tp2801878p2801878.html
>>> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>>> 
>> 
>> - Mark Miller
>> lucidimagination.com
>> 
>> Lucene/Solr User Conference
>> May 25-26, San Francisco
>> www.lucenerevolution.org
>> 
>> 
>> 
>> 
>> 
>> 
> 

- Mark Miller
lucidimagination.com

Lucene/Solr User Conference
May 25-26, San Francisco
www.lucenerevolution.org









Re: Extracting span terms using WeightedSpanTermExtractor

2011-07-06 Thread Mark Miller
Sorry - kind of my fault. When I fixed this to use maxDocCharsToAnalyze, I 
didn't set a default other than 0 because I didn't really count on this being 
used beyond how it is in the Highlighter - which always sets 
maxDocCharsToAnalyze with its default.

You've got to explicitly set it higher than 0 for now.
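
For example (a sketch against the 3.x API, mirroring the code below - this
assumes the setter is publicly reachable on WeightedSpanTermExtractor in your
version, as it is via QueryScorer):

  WeightedSpanTermExtractor wste = new WeightedSpanTermExtractor(fieldName);
  wste.setExpandMultiTermQuery(true);
  wste.setMaxDocCharsToAnalyze(Integer.MAX_VALUE); // anything > 0 works
  Map weightedSpanTerms =
      wste.getWeightedSpanTerms(query, tokenStream, fieldName);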

Feel free to create a JIRA issue and we can give it its own default greater 
than 0.

- Mark Miller
lucidimagination.com


On Jul 6, 2011, at 5:34 PM, Jahangir Anwari wrote:

> I have a CustomHighlighter that extends the SolrHighlighter and overrides
> the doHighlighting() method. Then for each document I am trying to extract
> the span terms so that later I can use it to get the span Positions. I tried
> to get the weightedSpanTerms using WeightedSpanTermExtractor but was
> unsuccessful. Below is the code that I have. Is there something missing
> that needs to be added to get the span terms?
> 
> // in CustomHighlighter.java
> @Override
> public NamedList doHighlighting(DocList docs, Query query, SolrQueryRequest req,
>     String[] defaultFields) throws IOException {
> 
>   NamedList highlightedSnippets = super.doHighlighting(docs, query, req,
>       defaultFields);
> 
>   IndexReader reader = req.getSearcher().getIndexReader();
> 
>   String[] fieldNames = getHighlightFields(query, req, defaultFields);
>   for (String fieldName : fieldNames) {
>     QueryScorer scorer = new QueryScorer(query, null);
>     scorer.setExpandMultiTermQuery(true);
>     scorer.setMaxDocCharsToAnalyze(51200);
> 
>     DocIterator iterator = docs.iterator();
>     for (int i = 0; i < docs.size(); i++) {
>       int docId = iterator.nextDoc();
>       System.out.println("DocId: " + docId);
>       TokenStream tokenStream = TokenSources.getTokenStream(reader, docId,
>           fieldName);
>       WeightedSpanTermExtractor wste = new WeightedSpanTermExtractor(fieldName);
>       wste.setExpandMultiTermQuery(true);
>       wste.setWrapIfNotCachingTokenFilter(true);
> 
>       Map weightedSpanTerms =
>           wste.getWeightedSpanTerms(query, tokenStream, fieldName); // this is always
>           // empty
>       System.out.println("weightedSpanTerms: " + weightedSpanTerms.values());
>     }
>   }
>   return highlightedSnippets;
> }
> 
> Thanks,
> Jahangir














Re: Extracting span terms using WeightedSpanTermExtractor

2011-07-07 Thread Mark Miller

On Jul 7, 2011, at 5:14 PM, Jahangir Anwari wrote:

> I did noticed a strange issue though. When the query is just a
> PhraseQuery(e.g. "everlasting glory"), getWeightedSpanTerms() returns all
> the span terms along with their span positions. But when the query is a
> BooleanQuery containing phrase and non-phrase terms(e.g. "everlasting
> glory"+unity), getWeightedSpanTerms() returns all the span terms but the
> span positions are returned only for the phrase terms(i.e. "everlasting" and
> "glory"). Span positions for the non-phrase term(i.e. "unity") is empty. Any
> ideas why this could be happening?


Positions are only collected for "position sensitive" queries. The Highlighter 
framework that I plugged this into already runs through the TokenStream one 
token at a time - to highlight a TermQuery, there is no need to consult 
positions - just highlight every occurrence seen while marching through the 
TokenStream. Which means there is no need to find those positions either.

If you are looking for those positions, here is a patch to calculate them for 
TermQuerys as well. If you open a JIRA issue, this seems like a reasonable 
option to add to the class.

Index: lucene/contrib/highlighter/src/java/org/apache/lucene/search/highlight/WeightedSpanTermExtractor.java
===================================================================
--- lucene/contrib/highlighter/src/java/org/apache/lucene/search/highlight/WeightedSpanTermExtractor.java    (revision 1143407)
+++ lucene/contrib/highlighter/src/java/org/apache/lucene/search/highlight/WeightedSpanTermExtractor.java    (working copy)
@@ -133,7 +133,7 @@
       sp.setBoost(query.getBoost());
       extractWeightedSpanTerms(terms, sp);
     } else if (query instanceof TermQuery) {
-      extractWeightedTerms(terms, query);
+      extractWeightedSpanTerms(terms, new SpanTermQuery(((TermQuery) query).getTerm()));
     } else if (query instanceof SpanQuery) {
       extractWeightedSpanTerms(terms, (SpanQuery) query);
     } else if (query instanceof FilteredQuery) {


- Mark Miller
lucidimagination.com












Re: Extracting span terms using WeightedSpanTermExtractor

2011-07-08 Thread Mark Miller

On Jul 8, 2011, at 5:43 AM, Jahangir Anwari wrote:

> I don't think this is the best
> solution, am open to other alternatives.


Could also make it static public where it is? Either way.


- Mark Miller
lucidimagination.com












[Announce] Lucene-Eurocon Call for Participation Closes Friday, JULY 15

2011-07-12 Thread Mark Miller
Hey all - just a friendly FYI reminder:

CALL FOR PARTICIPATION CLOSES FRIDAY, JULY 15!
TO SUBMIT A TOPIC, GO TO: http://2011.lucene-eurocon.org/pages/cfp

Now in its second year, Apache Lucene Eurocon 2011 comes to Barcelona, Spain, 
providing an unparalleled opportunity for European search application 
developers and technologists to connect and network. The conference takes 
place October 19 - 20, preceded by two days of optional training workshops 
October 17 - 18.

Get Involved Today! The Call for Participation Closes This Week!   

Consider presenting at Apache Lucene EuroCon 2011. Submit your ideas by July 
15. If you have a great Solr or Lucene story to tell, the community wants to 
hear about it. Share your expertise and innovations! To submit a topic, go to:
http://2011.lucene-eurocon.org/pages/cfp

Sample topics of interest include:

* Lucene and Solr in the Enterprise (case studies, implementation, return on 
investment, etc.)
* “How We Did It”  Development Case Studies
* Relevance in Practice
* Spatial/Geo search
* Lucene and Solr in the Cloud
* Scalability and Performance Tuning
* Large Scale Search
* Real Time Search
* Data Integration/Data Management
* Tika, Nutch and Mahout
* Faceting and Categorization
* Lucene & Solr for Mobile Applications
* Multi-language Support
* Indexing and Analysis Techniques
* Advanced Topics in Lucene & Solr Development

Want to be added to the conference mailing list? Is your organization 
interested in sponsorship opportunities? Please send an email to  
i...@lucene-eurocon.org 

Best Regards,

Suzanne Kushner
Lucid Imagination Corporate Marketing
www.lucidimagination.com

DATE: OCTOBER 17 - 20 2011

LOCATION:
Hotel Meliá Barcelona
C/ Avenida Sarriá,
50 Barcelona - SPAIN 08029
Tel: (0034) 93 4106060

Apache Lucene EuroCon 2011 is presented by Lucid Imagination, the commercial 
entity for Apache Solr/Lucene Open Source Search; proceeds of the conference 
benefit The Apache Software Foundation.

"Lucene" and "Apache Solr" are trademarks of the Apache Software Foundation.





- Mark Miller
lucidimagination.com












Re: Questions on index Writer

2011-07-16 Thread Mark Miller
My advice: Don't close the IndexWriter - just call commit. Don't worry about 
forcing merges - let them happen as they do when you call commit.

If you are going to use the IndexWriter again, you generally do not want to 
close it. Calling commit is the preferred option.
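
A minimal sketch of that pattern (3.1-era API; the directory, analyzer, and
the two helper methods are placeholders):

  import org.apache.lucene.document.Document;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.index.IndexWriterConfig;
  import org.apache.lucene.util.Version;

  IndexWriter writer = new IndexWriter(dir,
      new IndexWriterConfig(Version.LUCENE_31, analyzer));
  while (running) {
    for (Document doc : fetchNewDocuments()) { // assumed helper
      writer.addDocument(doc);
    }
    writer.commit(); // durable checkpoint; no close/reopen, no forced merges
    waitForNextBatch(); // assumed helper
  }
  writer.close(); // only on shutdown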

- Mark Miller
lucidimagination.com

On Jul 15, 2011, at 3:03 PM, Saurabh Gokhale wrote:

> Hi All,
> 
> I have following questions about lucene indexWriter. I am using version
> 3.1.0.
> 
> While indexing documents,
> 1. When is the good time to commit changes? (indexWriter.commit) or just
> close the writer after the indexing is done so that commit automatically
> happens.
> 2. When is the good time to merge indexes (indexWriter.maybeMerge()).  Is it
> just before committing the changes or after indexing say X number of
> documents? (I recently upgraded from 2.9.4 to 3.1, and I see Lucene 3.1
> generates a lot of small index files while indexing documents.)
> 
> Also I have a problem where my Lucene index files sometimes get deleted
> from the index folder. I am not sure what code snippet is causing the
> existing index files to accidentally get removed.
> 
> My indexer runs in a thread loop where it indexes files whenever they are
> available. When no more files are available, the indexer thread closes the
> writer and goes to sleep; after a specific time, it again creates a new index
> on the same folder and starts indexing new files if any are available.
> 
> A. Is it a wrong way to index files?
> B. Because I close the index and open it again later, am I seeing my lucene
> index files getting deleted?
> 
> Thanks
> 
> Saurabh














Re: Search within a sentence (revisited)

2011-07-20 Thread Mark Miller

On Jul 20, 2011, at 11:27 AM, Peter Keegan wrote:

> Mark Miller's 'SpanWithinQuery' patch
> seems to have the same issue.

If I remember right (it's been more than a couple of years), I did index the 
sentence markers at the same position as the last word in the sentence. And I 
think the limitation that I ate was that the word could belong to both its 
true sentence and the one after it.

- Mark Miller
lucidimagination.com












Re: Search within a sentence (revisited)

2011-07-20 Thread Mark Miller

On Jul 20, 2011, at 7:44 PM, Mark Miller wrote:

> 
> On Jul 20, 2011, at 11:27 AM, Peter Keegan wrote:
> 
>> Mark Miller's 'SpanWithinQuery' patch
>> seems to have the same issue.
> 
> If I remember right (it's been more than a couple of years), I did index the 
> sentence markers at the same position as the last word in the sentence. And I 
> think the limitation that I ate was that the word could belong to both its 
> true sentence and the one after it.
> 
> - Mark Miller
> lucidimagination.com

Perhaps you could index the sentence marker at both the last word of the 
sentence and at the first word of the next sentence, if there is one. This 
would seem to solve the above limitation as well?
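
Something like this 3.x TokenFilter could do the first part (an illustrative
sketch only - the boundary detection here, a trailing period left on the
token, is a stand-in for a real sentence detector; co-locating the marker with
the first word of the next sentence would work the same way, by emitting that
word with a position increment of 0 after the marker):

  import java.io.IOException;
  import org.apache.lucene.analysis.TokenFilter;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
  import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

  final class SentenceMarkerFilter extends TokenFilter {
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final PositionIncrementAttribute posIncrAtt =
        addAttribute(PositionIncrementAttribute.class);
    private boolean pendingMarker = false;

    SentenceMarkerFilter(TokenStream in) {
      super(in);
    }

    @Override
    public boolean incrementToken() throws IOException {
      if (pendingMarker) {
        // Emit "$" at the same position as the sentence-final word.
        clearAttributes();
        termAtt.setEmpty().append("$");
        posIncrAtt.setPositionIncrement(0);
        pendingMarker = false;
        return true;
      }
      if (!input.incrementToken()) {
        return false;
      }
      int len = termAtt.length();
      if (len > 1 && termAtt.charAt(len - 1) == '.') {
        termAtt.setLength(len - 1); // strip the period from the word itself
        pendingMarker = true;       // inject the marker on the next call
      }
      return true;
    }

    @Override
    public void reset() throws IOException {
      super.reset();
      pendingMarker = false;
    }
  }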

- Mark Miller
lucidimagination.com












Re: Search within a sentence (revisited)

2011-07-21 Thread Mark Miller
Hey Peter,

Getting sucked back into Spans...

That test should pass now - I uploaded a new patch to 
https://issues.apache.org/jira/browse/LUCENE-777

Further tests may be needed though. 

- Mark


On Jul 21, 2011, at 9:28 AM, Peter Keegan wrote:

> Hi Mark,
> 
> Here is a unit test using a version of 'SpanWithinQuery' modified for 3.2
> ('getTerms' removed) . The last test fails (search for "1" and "3").
> 
> package org.apache.lucene.search.spans;
> 
> import java.io.Reader;
> 
> import org.apache.lucene.analysis.Analyzer;
> import org.apache.lucene.analysis.TokenStream;
> import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
> import
> org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
> import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
> import org.apache.lucene.document.Document;
> import org.apache.lucene.document.Field;
> import org.apache.lucene.index.IndexReader;
> import org.apache.lucene.index.RandomIndexWriter;
> import org.apache.lucene.index.Term;
> import org.apache.lucene.store.Directory;
> import org.apache.lucene.search.IndexSearcher;
> import org.apache.lucene.search.PhraseQuery;
> import org.apache.lucene.search.ScoreDoc;
> import org.apache.lucene.search.TermQuery;
> import org.apache.lucene.search.spans.SpanNearQuery;
> import org.apache.lucene.search.spans.SpanQuery;
> import org.apache.lucene.search.spans.SpanTermQuery;
> import org.apache.lucene.util.LuceneTestCase;
> 
> public class TestSentence extends LuceneTestCase {
>   public static final String field = "field";
>   public static final String START = "^";
>   public static final String END = "$";
> 
>   public void testSetPosition() throws Exception {
>     Analyzer analyzer = new Analyzer() {
>       @Override
>       public TokenStream tokenStream(String fieldName, Reader reader) {
>         return new TokenStream() {
>           private final String[] TOKENS = {"1", "2", "3", END, "4", "5", "6", END, "9"};
>           private final int[] INCREMENTS = {1,1,1,0,1,1,1,0,1};
>           private int i = 0;
> 
>           PositionIncrementAttribute posIncrAtt =
>               addAttribute(PositionIncrementAttribute.class);
>           CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
>           OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
> 
>           @Override
>           public boolean incrementToken() {
>             assertEquals(TOKENS.length, INCREMENTS.length);
>             if (i == TOKENS.length)
>               return false;
>             clearAttributes();
>             termAtt.append(TOKENS[i]);
>             offsetAtt.setOffset(i, i);
>             posIncrAtt.setPositionIncrement(INCREMENTS[i]);
>             i++;
>             return true;
>           }
>         };
>       }
>     };
>     Directory store = newDirectory();
>     RandomIndexWriter writer = new RandomIndexWriter(random, store, analyzer);
>     Document d = new Document();
>     d.add(newField("field", "bogus", Field.Store.YES, Field.Index.ANALYZED));
>     writer.addDocument(d);
>     IndexReader reader = writer.getReader();
>     writer.close();
>     IndexSearcher searcher = newSearcher(reader);
> 
>     SpanTermQuery startSentence = makeSpanTermQuery(START);
>     SpanTermQuery endSentence = makeSpanTermQuery(END);
>     SpanQuery[] clauses = new SpanQuery[2];
>     clauses[0] = makeSpanTermQuery("1");
>     clauses[1] = makeSpanTermQuery("2");
>     SpanNearQuery allKeywords = new SpanNearQuery(clauses, Integer.MAX_VALUE,
>         false); // SpanAndQuery equivalent
>     SpanWithinQuery query = new SpanWithinQuery(allKeywords, endSentence, 0);
>     System.out.println("query: " + query);
>     ScoreDoc[] hits = searcher.search(query, null, 1000).scoreDocs;
>     assertEquals(hits.length, 1);
> 
>     clauses[1] = makeSpanTermQuery("4");
>     allKeywords = new SpanNearQuery(clauses, Integer.MAX_VALUE, false); // SpanAndQuery equivalent
>     query = new SpanWithinQuery(allKeywords, endSentence, 0);
>     System.out.println("query: " + query);
>     hits = searcher.search(query, null, 1000).scoreDocs;
>     assertEquals(hits.length, 0);
> 
>     PhraseQuery pq = new PhraseQuery();
>     pq.add(new Term(field, "3"));
>     pq.add(new Term(field, "4"));
>     hits = searcher.search(pq, null, 1000).scoreDocs;
>     assertEquals(hits.length, 1);
> 
>     clauses[1] = makeSpanTermQuery("3");
>     allKeywords = new SpanNearQuery(clauses, Integer.MAX_VALUE, false); // SpanAndQuery equivalent
>     query = new SpanWithinQuery(allKeywords, endSentence, 0);
>     System.out.println("query: " + query);
>     hits = searcher.search(query, null, 1000).scoreDocs;
>     assertEquals(hits.length, 1);
>   }
> 
>   public SpanTermQuery makeSpanTermQuery(String text) {
>     // (completed from the full copy of this test later in the thread)
>     return new SpanTermQuery(new Term(field, text));
>   }
> }

Re: Search within a sentence (revisited)

2011-07-21 Thread Mark Miller
Yeah, it's off trunk - I'll submit a 3X patch in a bit - just have to change 
that to an IndexReader I believe.

- Mark

On Jul 21, 2011, at 4:01 PM, Peter Keegan wrote:

> Does this patch require the trunk version? I'm using 3.2 and
> 'AtomicReaderContext' isn't there.
> 
> Peter
> 
> On Thu, Jul 21, 2011 at 3:07 PM, Mark Miller  wrote:
> 
>> Hey Peter,
>> 
>> Getting sucked back into Spans...
>> 
>> That test should pass now - I uploaded a new patch to
>> https://issues.apache.org/jira/browse/LUCENE-777
>> 
>> Further tests may be needed though.
>> 
>> - Mark
>> 
>> 
>> On Jul 21, 2011, at 9:28 AM, Peter Keegan wrote:
>> 
>>> Hi Mark,
>>> 
>>> Here is a unit test using a version of 'SpanWithinQuery' modified for 3.2
>>> ('getTerms' removed) . The last test fails (search for "1" and "3").
>>> 
>>> package org.apache.lucene.search.spans;
>>> 
>>> import java.io.Reader;
>>> 
>>> import org.apache.lucene.analysis.Analyzer;
>>> import org.apache.lucene.analysis.TokenStream;
>>> import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
>>> import
>>> org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
>>> import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
>>> import org.apache.lucene.document.Document;
>>> import org.apache.lucene.document.Field;
>>> import org.apache.lucene.index.IndexReader;
>>> import org.apache.lucene.index.RandomIndexWriter;
>>> import org.apache.lucene.index.Term;
>>> import org.apache.lucene.store.Directory;
>>> import org.apache.lucene.search.IndexSearcher;
>>> import org.apache.lucene.search.PhraseQuery;
>>> import org.apache.lucene.search.ScoreDoc;
>>> import org.apache.lucene.search.TermQuery;
>>> import org.apache.lucene.search.spans.SpanNearQuery;
>>> import org.apache.lucene.search.spans.SpanQuery;
>>> import org.apache.lucene.search.spans.SpanTermQuery;
>>> import org.apache.lucene.util.LuceneTestCase;
>>> 
>>> public class TestSentence extends LuceneTestCase {
>>> public static final String field = "field";
>>> public static final String START = "^";
>>> public static final String END = "$";
>>> public void testSetPosition() throws Exception {
>>> Analyzer analyzer = new Analyzer() {
>>> @Override
>>> public TokenStream tokenStream(String fieldName, Reader reader) {
>>> return new TokenStream() {
>>> private final String[] TOKENS = {"1", "2", "3", END, "4", "5", "6", END,
>>> "9"};
>>> private final int[] INCREMENTS = {1,1,1,0,1,1,1,0,1};
>>> private int i = 0;
>>> 
>>> PositionIncrementAttribute posIncrAtt =
>>> addAttribute(PositionIncrementAttribute.class);
>>> CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
>>> OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
>>> 
>>> @Override
>>> public boolean incrementToken() {
>>> assertEquals(TOKENS.length, INCREMENTS.length);
>>> if (i == TOKENS.length)
>>> return false;
>>> clearAttributes();
>>> termAtt.append(TOKENS[i]);
>>> offsetAtt.setOffset(i,i);
>>> posIncrAtt.setPositionIncrement(INCREMENTS[i]);
>>> i++;
>>> return true;
>>> }
>>> };
>>> }
>>> };
>>> Directory store = newDirectory();
>>> RandomIndexWriter writer = new RandomIndexWriter(random, store,
>> analyzer);
>>> Document d = new Document();
>>> d.add(newField("field", "bogus", Field.Store.YES, Field.Index.ANALYZED));
>>> writer.addDocument(d);
>>> IndexReader reader = writer.getReader();
>>> writer.close();
>>> IndexSearcher searcher = newSearcher(reader);
>>> 
>>> SpanTermQuery startSentence = makeSpanTermQuery(START);
>>> SpanTermQuery endSentence = makeSpanTermQuery(END);
>>> SpanQuery[] clauses = new SpanQuery[2];
>>> clauses[0] = makeSpanTermQuery("1");
>>> clauses[1] = makeSpanTermQuery("2");
>>> SpanNearQuery allKeywords = new SpanNearQuery(clauses, Integer.MAX_VALUE,
>>> false); // SpanAndQuery equivalent
>>> SpanWithinQuery query = new SpanWithinQuery(allKeywords, endSentence, 0);
>>> System.out.println("query: "+query);

Re: Search within a sentence (revisited)

2011-07-21 Thread Mark Miller

I just uploaded a patch for 3X that will work for 3.2.

On Jul 21, 2011, at 4:25 PM, Mark Miller wrote:

> Yeah, it's off trunk - I'll submit a 3X patch in a bit - just have to change 
> that to an IndexReader I believe.
> 
> - Mark
> 
> On Jul 21, 2011, at 4:01 PM, Peter Keegan wrote:
> 
>> Does this patch require the trunk version? I'm using 3.2 and
>> 'AtomicReaderContext' isn't there.
>> 
>> Peter
>> 
>> On Thu, Jul 21, 2011 at 3:07 PM, Mark Miller  wrote:
>> 
>>> Hey Peter,
>>> 
>>> Getting sucked back into Spans...
>>> 
>>> That test should pass now - I uploaded a new patch to
>>> https://issues.apache.org/jira/browse/LUCENE-777
>>> 
>>> Further tests may be needed though.
>>> 
>>> - Mark
>>> 
>>> 
>>> On Jul 21, 2011, at 9:28 AM, Peter Keegan wrote:
>>> 
>>>> Hi Mark,
>>>> 
>>>> Here is a unit test using a version of 'SpanWithinQuery' modified for 3.2
>>>> ('getTerms' removed) . The last test fails (search for "1" and "3").
>>>> 
>>>> package org.apache.lucene.search.spans;
>>>> 
>>>> import java.io.Reader;
>>>> 
>>>> import org.apache.lucene.analysis.Analyzer;
>>>> import org.apache.lucene.analysis.TokenStream;
>>>> import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
>>>> import
>>>> org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
>>>> import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
>>>> import org.apache.lucene.document.Document;
>>>> import org.apache.lucene.document.Field;
>>>> import org.apache.lucene.index.IndexReader;
>>>> import org.apache.lucene.index.RandomIndexWriter;
>>>> import org.apache.lucene.index.Term;
>>>> import org.apache.lucene.store.Directory;
>>>> import org.apache.lucene.search.IndexSearcher;
>>>> import org.apache.lucene.search.PhraseQuery;
>>>> import org.apache.lucene.search.ScoreDoc;
>>>> import org.apache.lucene.search.TermQuery;
>>>> import org.apache.lucene.search.spans.SpanNearQuery;
>>>> import org.apache.lucene.search.spans.SpanQuery;
>>>> import org.apache.lucene.search.spans.SpanTermQuery;
>>>> import org.apache.lucene.util.LuceneTestCase;
>>>> 
>>>> public class TestSentence extends LuceneTestCase {
>>>> public static final String field = "field";
>>>> public static final String START = "^";
>>>> public static final String END = "$";
>>>> public void testSetPosition() throws Exception {
>>>> Analyzer analyzer = new Analyzer() {
>>>> @Override
>>>> public TokenStream tokenStream(String fieldName, Reader reader) {
>>>> return new TokenStream() {
>>>> private final String[] TOKENS = {"1", "2", "3", END, "4", "5", "6", END,
>>>> "9"};
>>>> private final int[] INCREMENTS = {1,1,1,0,1,1,1,0,1};
>>>> private int i = 0;
>>>> 
>>>> PositionIncrementAttribute posIncrAtt =
>>>> addAttribute(PositionIncrementAttribute.class);
>>>> CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
>>>> OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
>>>> 
>>>> @Override
>>>> public boolean incrementToken() {
>>>> assertEquals(TOKENS.length, INCREMENTS.length);
>>>> if (i == TOKENS.length)
>>>> return false;
>>>> clearAttributes();
>>>> termAtt.append(TOKENS[i]);
>>>> offsetAtt.setOffset(i,i);
>>>> posIncrAtt.setPositionIncrement(INCREMENTS[i]);
>>>> i++;
>>>> return true;
>>>> }
>>>> };
>>>> }
>>>> };
>>>> Directory store = newDirectory();
>>>> RandomIndexWriter writer = new RandomIndexWriter(random, store,
>>> analyzer);
>>>> Document d = new Document();
>>>> d.add(newField("field", "bogus", Field.Store.YES, Field.Index.ANALYZED));
>>>> writer.addDocument(d);
>>>> IndexReader reader = writer.getReader();
>>>> writer.close();
>>>> IndexSearcher searcher = newSearcher(reader);
>>>> 
>>>> SpanTermQuery startSentence = makeSpanTermQuery(START);

Re: Search within a sentence (revisited)

2011-07-25 Thread Mark Miller
Thanks Peter - if you supply the unit tests, I'm happy to work on the fixes.

I can likely look at this later today.

- Mark Miller
lucidimagination.com

On Jul 25, 2011, at 10:14 AM, Peter Keegan wrote:

> Hi Mark,
> 
> Sorry to bug you again, but there's another case that fails the unit test
> (search within the second sentence), as shown here in the last test:
> 
> package org.apache.lucene.search.spans;
> 
> import java.io.Reader;
> 
> import org.apache.lucene.analysis.Analyzer;
> import org.apache.lucene.analysis.TokenStream;
> import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
> import
> org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
> import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
> import org.apache.lucene.document.Document;
> import org.apache.lucene.document.Field;
> import org.apache.lucene.index.IndexReader;
> import org.apache.lucene.index.RandomIndexWriter;
> import org.apache.lucene.index.Term;
> import org.apache.lucene.store.Directory;
> import org.apache.lucene.search.IndexSearcher;
> import org.apache.lucene.search.PhraseQuery;
> import org.apache.lucene.search.ScoreDoc;
> import org.apache.lucene.search.TermQuery;
> import org.apache.lucene.search.spans.SpanNearQuery;
> import org.apache.lucene.search.spans.SpanQuery;
> import org.apache.lucene.search.spans.SpanTermQuery;
> import org.apache.lucene.util.LuceneTestCase;
> 
> public class TestSentence extends LuceneTestCase {
> public static final String field = "field";
> public static final String START = "^";
> public static final String END = "$";
> public void testSetPosition() throws Exception {
> Analyzer analyzer = new Analyzer() {
> @Override
> public TokenStream tokenStream(String fieldName, Reader reader) {
> return new TokenStream() {
> private final String[] TOKENS = {"1", "2", "3", END, "4", "5", "6", END,
> "9"};
> private final int[] INCREMENTS = {1,1,1,0,1,1,1,0,1};
> private int i = 0;
> PositionIncrementAttribute posIncrAtt =
> addAttribute(PositionIncrementAttribute.class);
> CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
> OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
> @Override
> public boolean incrementToken() {
> assertEquals(TOKENS.length, INCREMENTS.length);
> if (i == TOKENS.length)
> return false;
> clearAttributes();
> termAtt.append(TOKENS[i]);
> offsetAtt.setOffset(i,i);
> posIncrAtt.setPositionIncrement(INCREMENTS[i]);
> i++;
> return true;
> }
> };
> }
> };
> Directory store = newDirectory();
> RandomIndexWriter writer = new RandomIndexWriter(random, store, analyzer);
> Document d = new Document();
> d.add(newField("field", "bogus", Field.Store.YES, Field.Index.ANALYZED));
> writer.addDocument(d);
> IndexReader reader = writer.getReader();
> writer.close();
> IndexSearcher searcher = newSearcher(reader);
> SpanTermQuery startSentence = makeSpanTermQuery(START);
> SpanTermQuery endSentence = makeSpanTermQuery(END);
> SpanQuery[] clauses = new SpanQuery[2];
> clauses[0] = makeSpanTermQuery("1");
> clauses[1] = makeSpanTermQuery("2");
> SpanNearQuery allKeywords = new SpanNearQuery(clauses, Integer.MAX_VALUE,
> false); // SpanAndQuery equivalent
> SpanWithinQuery query = new SpanWithinQuery(allKeywords, endSentence, 0);
> System.out.println("query: "+query);
> ScoreDoc[] hits = searcher.search(query, null, 1000).scoreDocs;
> assertEquals(1, hits.length);
> clauses[1] = makeSpanTermQuery("4");
> allKeywords = new SpanNearQuery(clauses, Integer.MAX_VALUE, false); //
> SpanAndQuery equivalent
> query = new SpanWithinQuery(allKeywords, endSentence, 0);
> System.out.println("query: "+query);
> hits = searcher.search(query, null, 1000).scoreDocs;
> assertEquals(0, hits.length);
> PhraseQuery pq = new PhraseQuery();
> pq.add(new Term(field, "3"));
> pq.add(new Term(field, "4"));
> System.out.println("query: "+pq);
> hits = searcher.search(pq, null, 1000).scoreDocs;
> assertEquals(1, hits.length);
> clauses[0] = makeSpanTermQuery("4");
> clauses[1] = makeSpanTermQuery("6");
> allKeywords = new SpanNearQuery(clauses, Integer.MAX_VALUE, false); //
> SpanAndQuery equivalent
> query = new SpanWithinQuery(allKeywords, endSentence, 0);
> System.out.println("query: "+query);
> hits = searcher.search(query, null, 1000).scoreDocs;
> assertEquals(1, hits.length);
> }
> 
> public SpanTermQuery makeSpanTermQuery(String text) {
> return new SpanTermQuery(new Term(field, text));

Re: Search within a sentence (revisited)

2011-07-25 Thread Mark Miller
Sorry Peter - I introduced this problem with some kind of typo-type issue - I 
somehow changed an includeSpans variable to excludeSpans - but I certainly 
didn't mean to - it makes no sense. So I'm not sure how it happened, and 
surprised the tests that passed still passed!

We could probably use even more tests before feeling too confident here…

I've attached a patch for 3X with the new test and fix (changed that include 
back to exclude).

- Mark Miller
lucidimagination.com

On Jul 25, 2011, at 10:29 AM, Mark Miller wrote:

> Thanks Peter - if you supply the unit tests, I'm happy to work on the fixes.
> 
> I can likely look at this later today.
> 
> - Mark Miller
> lucidimagination.com
> 
> On Jul 25, 2011, at 10:14 AM, Peter Keegan wrote:
> 
>> Hi Mark,
>> 
>> Sorry to bug you again, but there's another case that fails the unit test
>> (search within the second sentence), as shown here in the last test:
>> 
>> package org.apache.lucene.search.spans;
>> 
>> import java.io.Reader;
>> 
>> import org.apache.lucene.analysis.Analyzer;
>> import org.apache.lucene.analysis.TokenStream;
>> import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
>> import
>> org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
>> import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
>> import org.apache.lucene.document.Document;
>> import org.apache.lucene.document.Field;
>> import org.apache.lucene.index.IndexReader;
>> import org.apache.lucene.index.RandomIndexWriter;
>> import org.apache.lucene.index.Term;
>> import org.apache.lucene.store.Directory;
>> import org.apache.lucene.search.IndexSearcher;
>> import org.apache.lucene.search.PhraseQuery;
>> import org.apache.lucene.search.ScoreDoc;
>> import org.apache.lucene.search.TermQuery;
>> import org.apache.lucene.search.spans.SpanNearQuery;
>> import org.apache.lucene.search.spans.SpanQuery;
>> import org.apache.lucene.search.spans.SpanTermQuery;
>> import org.apache.lucene.util.LuceneTestCase;
>> 
>> public class TestSentence extends LuceneTestCase {
>> public static final String field = "field";
>> public static final String START = "^";
>> public static final String END = "$";
>> public void testSetPosition() throws Exception {
>> Analyzer analyzer = new Analyzer() {
>> @Override
>> public TokenStream tokenStream(String fieldName, Reader reader) {
>> return new TokenStream() {
>> private final String[] TOKENS = {"1", "2", "3", END, "4", "5", "6", END,
>> "9"};
>> private final int[] INCREMENTS = {1,1,1,0,1,1,1,0,1};
>> private int i = 0;
>> PositionIncrementAttribute posIncrAtt =
>> addAttribute(PositionIncrementAttribute.class);
>> CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
>> OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
>> @Override
>> public boolean incrementToken() {
>> assertEquals(TOKENS.length, INCREMENTS.length);
>> if (i == TOKENS.length)
>> return false;
>> clearAttributes();
>> termAtt.append(TOKENS[i]);
>> offsetAtt.setOffset(i,i);
>> posIncrAtt.setPositionIncrement(INCREMENTS[i]);
>> i++;
>> return true;
>> }
>> };
>> }
>> };
>> Directory store = newDirectory();
>> RandomIndexWriter writer = new RandomIndexWriter(random, store, analyzer);
>> Document d = new Document();
>> d.add(newField("field", "bogus", Field.Store.YES, Field.Index.ANALYZED));
>> writer.addDocument(d);
>> IndexReader reader = writer.getReader();
>> writer.close();
>> IndexSearcher searcher = newSearcher(reader);
>> SpanTermQuery startSentence = makeSpanTermQuery(START);
>> SpanTermQuery endSentence = makeSpanTermQuery(END);
>> SpanQuery[] clauses = new SpanQuery[2];
>> clauses[0] = makeSpanTermQuery("1");
>> clauses[1] = makeSpanTermQuery("2");
>> SpanNearQuery allKeywords = new SpanNearQuery(clauses, Integer.MAX_VALUE,
>> false); // SpanAndQuery equivalent
>> SpanWithinQuery query = new SpanWithinQuery(allKeywords, endSentence, 0);
>> System.out.println("query: "+query);
>> ScoreDoc[] hits = searcher.search(query, null, 1000).scoreDocs;
>> assertEquals(1, hits.length);
>> clauses[1] = makeSpanTermQuery("4");
>> allKeywords = new SpanNearQuery(clauses, Integer.MAX_VALUE, false); //
>> SpanAndQuery equivalent
>> query = new SpanWithinQuery(allKeywords, endSentence, 0);
>> System.out.println("query: "+query);

Re: Search within a sentence (revisited)

2011-07-26 Thread Mark Miller
As long as you are happy with the results, I'm good. Always nice to have an 
excuse to dip back into Lucene. Just don't want you to feel overconfident in 
the code without proper testing of it - I coded to fix the broken tests rather 
than taking the time to write a bunch more corner-case tests, as I likely 
should if I were going to commit this thing.

- Mark Miller
lucidimagination.com

On Jul 26, 2011, at 8:56 AM, Peter Keegan wrote:

> Thanks Mark! The new patch is working fine with the tests and a few more. If
> you have particular test cases in mind, I'd be happy to add them.
> 
> Thanks,
> Peter
> 
> On Mon, Jul 25, 2011 at 5:56 PM, Mark Miller  wrote:
> 
>> Sorry Peter - I introduced this problem with some kind of typo type issue -
>> I somehow changed an includeSpans variable to excludeSpans - but I certainly
>> didn't mean too - it makes no sense. So not sure how it happened, and
>> surprised the tests that passed still passed!
>> 
>> We could probably use even more tests before feeling too confident here…
>> 
>> I've attached a patch for 3X with the new test and fix (changed that
>> include back to exclude).
>> 
>> - Mark Miller
>> lucidimagination.com
>> 
>> On Jul 25, 2011, at 10:29 AM, Mark Miller wrote:
>> 
>>> Thanks Peter - if you supply the unit tests, I'm happy to work on the
>> fixes.
>>> 
>>> I can likely look at this later today.
>>> 
>>> - Mark Miller
>>> lucidimagination.com
>>> 
>>> On Jul 25, 2011, at 10:14 AM, Peter Keegan wrote:
>>> 
>>>> Hi Mark,
>>>> 
>>>> Sorry to bug you again, but there's another case that fails the unit
>> test
>>>> (search within the second sentence), as shown here in the last test:
>>>> 
>>>> package org.apache.lucene.search.spans;
>>>> 
>>>> import java.io.Reader;
>>>> 
>>>> import org.apache.lucene.analysis.Analyzer;
>>>> import org.apache.lucene.analysis.TokenStream;
>>>> import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
>>>> import
>>>> org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
>>>> import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
>>>> import org.apache.lucene.document.Document;
>>>> import org.apache.lucene.document.Field;
>>>> import org.apache.lucene.index.IndexReader;
>>>> import org.apache.lucene.index.RandomIndexWriter;
>>>> import org.apache.lucene.index.Term;
>>>> import org.apache.lucene.store.Directory;
>>>> import org.apache.lucene.search.IndexSearcher;
>>>> import org.apache.lucene.search.PhraseQuery;
>>>> import org.apache.lucene.search.ScoreDoc;
>>>> import org.apache.lucene.search.TermQuery;
>>>> import org.apache.lucene.search.spans.SpanNearQuery;
>>>> import org.apache.lucene.search.spans.SpanQuery;
>>>> import org.apache.lucene.search.spans.SpanTermQuery;
>>>> import org.apache.lucene.util.LuceneTestCase;
>>>> 
>>>> public class TestSentence extends LuceneTestCase {
>>>>   public static final String field = "field";
>>>>   public static final String START = "^";
>>>>   public static final String END = "$";
>>>>
>>>>   public void testSetPosition() throws Exception {
>>>>     Analyzer analyzer = new Analyzer() {
>>>>       @Override
>>>>       public TokenStream tokenStream(String fieldName, Reader reader) {
>>>>         return new TokenStream() {
>>>>           private final String[] TOKENS = {"1", "2", "3", END, "4", "5", "6", END, "9"};
>>>>           private final int[] INCREMENTS = {1, 1, 1, 0, 1, 1, 1, 0, 1};
>>>>           private int i = 0;
>>>>           PositionIncrementAttribute posIncrAtt = addAttribute(PositionIncrementAttribute.class);
>>>>           CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
>>>>           OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
>>>>
>>>>           @Override
>>>>           public boolean incrementToken() {
>>>>             assertEquals(TOKENS.length, INCREMENTS.length);
>>>>             if (i == TOKENS.length)
>>>>               return false;
>>>>             clearAttributes();
>>>>             termAtt.append(TOKENS[i]);
>>>>             offsetAtt.setOffset(i, i);
>>>>             posIncrAtt.setPositionIncrement(INCREMENTS[i]);
>>>>             i++;
>>>>             return true;
>>>

Re: implicit closing of an IndexWriter

2011-07-26 Thread Mark Miller

On Jul 26, 2011, at 9:52 AM, Clemens Wyss wrote:

> Side note: I am using threads when writing and these threads are (by design) 
> interrupted (from time to time)

Perhaps you are seeing this: https://issues.apache.org/jira/browse/LUCENE-2239

- Mark Miller
lucidimagination.com









-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: optimize with num segments > 1 index keeps growing

2011-09-12 Thread Mark Miller

On Sep 9, 2011, at 3:35 PM, Robert Muir wrote:

> On Fri, Sep 9, 2011 at 3:07 PM, Uwe Schindler  wrote:
>> Hi,
>> 
>> This is still some kind of bug, because expungeDeletes is documented to 
>> remove all deletes. Maybe we need to modify MergePolicy?
>> 
> 
> we should correct the javadocs for expungeDeletes here I think: so
> that its more consistent with the javadocs for optimize?
> 
> "Requests an expunge operation..." ?
> 

+1 - it's a documentation bug now.

- Mark Miller
lucidimagination.com
2011.lucene-eurocon.org | Oct 17-20 | Barcelona











-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: ElasticSearch

2011-11-17 Thread Mark Miller
The XML query parser can map to Lucene one to one as well - hasn't seemed
to pick up enough steam to be included with Solr yet, but there has been
some commotion so it's likely to go in at some point. Not enough demand yet
I guess. https://issues.apache.org/jira/browse/SOLR-839 XML Query Parser
Support

-- 
- Mark

http://www.lucidimagination.com

On Thu, Nov 17, 2011 at 6:11 PM, Peter Karich  wrote:

>
>
> > I don't think it's possible.
>
> Eh, of course it's possible (if I understood it, I would do it. No,
> no, just joking ;))
>
> and yes, with Solr it's shorter for some common use cases. I don't think
> that there is a 'best', but JSON can map 1:1 to Lucene.
>
> The biggest problem with ES's syntax is that you can have super big
> queries where you miss the big picture or some closing bracket (probably
> XML would be better ;))
> => so this makes it sometimes harder to 'parse' for humans (for bigger
> queries) and more chatty
>
> The biggest problem with Solr's syntax is that you need to escape here
> and there and you have all the different brackets and dots (e.g. for
> ranges, local params, term filter, ...),
> which makes it hard to parse for *non*-humans and sub-intelligent people
> IMO. An advantage is that you can put the URL into the browser with
> Solr, which is only possible via additional software for ES (called
> Elasticsearch-head), although some parameters are available as URL
> parameters as well in ES
>
> Regards,
> Peter.
>
>
> > On Thu, Nov 17, 2011 at 3:44 PM, Michael McCandless
> >  wrote:
> >> Maybe someone can post the equivalent query in ElasticSearch?
> > I don't think it's possible.  Hoss threw in the kitchen sink into his
> > "contrived' example.
> > Here's a super simple example:
> >
> > JSON:
> >
> > {
> > "sort" : [
> > { "age" : {"order" : "asc"} }
> > ],
> > "query" : {
> > "term" : { "user" : "jack" }
> > }
> > }
> >
> > Solr's HTTP:
> >
> > q=user:jack&sort=age asc
> >
> > -Yonik
> > http://www.lucidimagination.com
> >
>
>
>
> --
> http://jetsli.de news reader for geeks
>
>


Re: Regarding Compression Tool

2013-09-16 Thread Mark Miller
Have you considered storing your indexes server-side? I haven't used
compression but usually the trade-off of compression is CPU usage which
will also be a drain on battery life. Or maybe consider how important the
highlighter is to your users - is it worth the trade-off of either disk
space or battery life? If it's more of a nice-to-have then maybe hold off
on the feature for a later release until you've had some feedback and some
more time to figure out the best solution. Of course I don't know much
about your application, so take my advice with a grain of salt.


On Mon, Sep 16, 2013 at 2:22 AM, Jebarlin Robertson wrote:

> I am using Apache Lucene in Android. I have around 1 GB of Text documents
> (Logs). When I Index these text documents using this
> *new Field(ContentIndex.KEY_TEXTCONTENT, contents, Field.Store.YES,
> Field.Index.ANALYZED,TermVector.WITH_POSITIONS_OFFSETS)*, the index
> directory is consuming 1.59GB memory size.
> But without Field Store it will be around 0.59 GB indexed size. If the
> Lucene indexing is taking this much space to create the index and to store the
> original text just to use the highlight feature, it will be a big problem for
> mobile devices. So I just want some help: is there any alternative
> way to do this without occupying more space, to use the highlight feature on
> Android powered devices.
>
>
> On Sun, Sep 15, 2013 at 3:26 AM, Erick Erickson  >wrote:
>
> > bq: I thought that I can use the CompressionTool to minimize the memory
> > size.
> >
> > This doesn't make a lot of sense. Highlighting needs the raw data to
> > figure out what to highlight, so I don't see how the CompressionTool
> > will help you there.
> >
> > And unless you have a huge document and only a very few of them, then
> > the memory occupied by the uncompressed data should be trivial
> > compared to the various low-level caches. This really is seeming like
> > an XY problem. Perhaps if you backed up and explained _why_ this
> > seems important to do people could be more helpful.
> >
> >
> > Best,
> > Erick
> >
> >
> > On Sat, Sep 14, 2013 at 12:21 PM, Jebarlin Robertson  > >wrote:
> >
> > > Thank you very much Erick. Actually I was using Highlighter tool, that
> > > needs the entire data to be stored to get the relevant searched
> sentence.
> > > But when I use that, It was consuming more memory (Indexed data size +
> > >  Store.YES - the entire content) than the actual documents size.
> > > I thought that I can use the CompressionTool to minimize the memory
> size.
> > > You can help, if there is any possiblities or way to store the entire
> > > content and to use the highlighter feature.
> > >
> > > Thankyou
> > >
> > >
> > > On Fri, Sep 13, 2013 at 6:54 PM, Erick Erickson <
> erickerick...@gmail.com
> > > >wrote:
> > >
> > > > Compression is for the _stored_ data, which is not searched. Ignore
> > > > the compression and insure that you index the data.
> > > >
> > > > The compressing/decompressing for looking at stored
> > > > values is, I believe, done at a very low level that you don't
> > > > need to care about at all.
> > > >
> > > > If you index the data in the field, you shouldn't have to do
> > > > anything special to search it.
> > > >
> > > > Best,
> > > > Erick
> > > >
> > > >
> > > > On Fri, Sep 13, 2013 at 1:19 AM, Jebarlin Robertson <
> > jebar...@gmail.com
> > > > >wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > I am trying to store all the Field values using CompressionTool,
> But
> > > > When I
> > > > > search for any content, it is not finding any results.
> > > > >
> > > > > Can you help me, how to create the Field with CompressionTool to
> add
> > to
> > > > the
> > > > > Document and how to decompress it when searching for any content in
> > it.
> > > > >
> > > > > --
> > > > > Thanks & Regards,
> > > > > Jebarlin Robertson.R
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Thanks & Regards,
> > > Jebarlin Robertson.R
> > > GSM: 91-9538106181.
> > >
> >
>
>
>
> --
> Thanks & Regards,
> Jebarlin Robertson.R
> GSM: 91-9538106181.
>



-- 
Mark J. Miller
Blog: http://www.developmentalmadness.com
LinkedIn: http://www.linkedin.com/in/developmentalmadness


[ANNOUNCE] Apache Lucene 4.5.1 released.

2013-10-24 Thread Mark Miller
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

October 2013, Apache Lucene™ 4.5.1 available

The Lucene PMC is pleased to announce the release of Apache Lucene 4.5.1

Apache Lucene is a high-performance, full-featured text search engine
library written entirely in Java. It is a technology suitable for
nearly any application that requires full-text search, especially
cross-platform.

Lucene 4.5.1 includes 8 bug fixes. The release is available for
immediate download at:

http://lucene.apache.org/core/mirrors-core-latest-redir.html


See the CHANGES.txt file included with the release for a full list of
changes and further details.

Please report any feedback to the mailing lists
(http://lucene.apache.org/core/discussion.html)

Note: The Apache Software Foundation uses an extensive mirroring
network for distributing releases. It is possible that the mirror you
are using may not have replicated the release yet. If that is the
case, please try another mirror. This also goes for Maven access.

Happy searching,

Lucene/Solr developers
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.14 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/

iQIcBAEBAgAGBQJSaUdcAAoJED+/0YJ4eWrI+WMP/2SJySsdpGdO2QRT3cj+5y5f
b62LlhTMpMG3vVkETphWyVaRFrDyDBmG7co1ZAQ86YEesJ5VeumJqIrLN6gheT30
DJM/j70BKvPDhESCSSJocJ59peHkfbz5DI4UOdDjHqyNgM6sPHBfMuxLTQkg4NkY
CStKXo/X5GWu5sscwUSuUazI59Qm8gAMj1fLnqhRJPpjgNvYLs8+XG12jA0phL6y
pDClThi0eYekf2x6t3Rlzm4GaF0wFBBLJhaJZr+YhbJRApNXsYwJNUbtYPyspHWE
Xye8HRep0Q26FHmUPas3sLew92MhE/xqUPjeHooDbVlfGFeJIUkKcT482V2+MHXW
ubOno1MA6LVGr1LGu56rx+VHUz7BiNFP9vi2tvfNoifTPWsQ0+38ptk5HrEchgB4
sayhEJepyrGRVKu7i8AvwGb/CLXAjE7SmHAbftOFTUFkNMchqN9Evb2mS+F4aZYA
bbpoz5hX92C5UTmRnKk/Lm+I6p1Vu5OlErCpxVqFMAI+NAdMMyakLZ1OS6itAESa
V9uIVOfN89jBO8h4xjxYROQpCeoBWx1B6kCnEozKpr07B7jV9VOrzzlhqb/VSGor
0tZ12Gq1BLUdVE8Hl5ZA86JDeV/l6Y+Aoo2ibxKDDCnvdQOugyVQXtOYCElJ1rCu
7TQSntmfiwN5BB37jJWQ
=uO0Z
-END PGP SIGNATURE-

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



[ANNOUNCE] Apache Lucene 4.10.3 released

2014-12-29 Thread Mark Miller
December 2014, Apache Lucene™ 4.10.3 available

The Lucene PMC is pleased to announce the release of Apache Lucene 4.10.3

Apache Lucene is a high-performance, full-featured text search engine
library written entirely in Java. It is a technology suitable for
nearly any application that requires full-text search, especially
cross-platform.

The release is available for immediate download at:

http://lucene.apache.org/core/mirrors-core-latest-redir.html

Lucene 4.10.3 includes 12 bug fixes.

See the CHANGES.txt file included with the release for a full list of
changes and further details.

Please report any feedback to the mailing lists
(http://lucene.apache.org/core/discussion.html)

Note: The Apache Software Foundation uses an extensive mirroring
network for distributing releases. It is possible that the mirror you
are using may not have replicated the release yet. If that is the
case, please try another mirror. This also goes for Maven access.

Happy Holidays,

Mark Miller

http://www.about.me/markrmiller

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Lucene in action

2023-06-10 Thread Mark Miller
Nature abhors being anything but an author by name on a second tech book.
The ruse is up after one when you have the inputs crystallized and the
hourly wage in hand. Hard to find anything but executive producers after
that. I’d shoot for a persuasive crowdfunding attempt.


Re: Analyzer at Query time

2008-08-28 Thread Mark Miller

Dino Korah wrote:

Hi All,
 
If I am to completely avoid the query parser and use the BooleanQuery along
with TermQuery, RangeQuery, PrefixQuery, PhraseQuery, etc., do the search
words still get to the Analyzer before actually doing the real search?
 
Many thanks,

Dino
 

  

Answer: no

The QueryParser applies the analyzer and builds a Query object tree 
based on the results. You will have to apply the analyzer yourself if 
you're going to forgo QueryParser.
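
For example (a hedged sketch - this uses the attribute-based TokenStream API
of later Lucene versions, and the field name and text are made up):

  BooleanQuery bq = new BooleanQuery();
  TokenStream ts = analyzer.tokenStream("body", new StringReader("some query text"));
  CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
  ts.reset();
  while (ts.incrementToken()) {
    // one TermQuery per token the analyzer produced
    bq.add(new TermQuery(new Term("body", termAtt.toString())), BooleanClause.Occur.SHOULD);
  }
  ts.close();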


- Mark

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: phrases and slop

2008-08-28 Thread Mark Miller

Andy Goodell wrote:

I thought I understood phrases and slop until one of my coworkers
brought by the following example

For a document that contains
"quick brown fox"

"quick brown fox"~0
"quick fox brown"~2
"fox quick brown"~3

all match.

I would have expected "fox quick brown" to require a 4 instead of a 3,
two to transpose brown and fox, two to transpose quick and fox.  Why
is this only 3?

- andy g

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

  
I think it's this: push fox past quick for move 1, then past brown for 
move 2, then into the last spot for move 3: quick brown fox.
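
A hedged sketch of the query in question (field name made up):

  // document: "quick brown fox" at positions 0, 1, 2
  PhraseQuery pq = new PhraseQuery();
  pq.add(new Term("f", "fox"));
  pq.add(new Term("f", "quick"));
  pq.add(new Term("f", "brown"));
  pq.setSlop(3); // the three moves above; a slop of 2 would not match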


- Mark

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Performance, yet again

2008-09-02 Thread Mark Miller

Andre Rubin wrote:

Hi all,

Most of our queries are very simple, of the type:

Query query = new PrefixQuery(new Term(LABEL_FIELD, prefix));
Hits hits = searcher.search(query, new Sort(new SortField(LABEL_FIELD)))
  
You might want to check out Solr's ConstantScorePrefixQuery and compare 
performance.

Which sometimes result in 10, 20, sometimes 40 thousand hits.

I get good performance if hits.length is 20,000 or less (less than 0.5
seconds). However, if it is 40,000 or more, querying takes over a second,
up to 2.5 seconds. The point here is that this solution is not scaling.
Any ideas I can try?

I already exhausted the ideas from http://wiki.apache.org/lucene
-java/ImproveSearchingSpeed

I was reading about TopDocs and TopFieldDocs. Is this search method (using
TopDocs) preferred over Hits? Also, there's no constructor for them without
a Filter, can I just pass null?
  
It is preferred over Hits. Hits has been deprecated and you should 
really migrate away from it.

Is it possible to pre-sort the index, so I don't have to sort every time I
perform a query?

Any other ideas?
  
I think in general, sorting and prefix queries can be slower operations in 
Lucene (though sorting is generally pretty fast after the field caches 
are loaded). You might try the first couple of suggestions there though, 
and others may fill in other steps you can take as well.


- Mark


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Performance, yet again

2008-09-02 Thread Mark Miller

Andre Rubin wrote:

On Tue, Sep 2, 2008 at 10:16 AM, Mark Miller <[EMAIL PROTECTED]> wrote:

  

Andre Rubin wrote:



Hi all,

Most of our queries are very simple, of the type:

Query query = new PrefixQuery(new Term(LABEL_FIELD, prefix));
Hits hits = searcher.search(query, new Sort(new SortField(LABEL_FIELD)))


  

You might want to check out solrs ConstantScorePrefixQuery and compare
performance.




I'm not familiar with Solr. It is not standard Lucene, is it?
  
Sorry about that. Solr is a search server that is a subproject of the 
Apache Lucene project. You can just copy the Query from Solr's source 
code and use it with Lucene.  ConstantScorePrefixQuery may be faster for 
you than PrefixQuery, and it doesn't have MaxClause exception issues 
when your prefix matches too many terms in the index. Please report back 
the speed difference if you can.
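
If you don't want to pull in the Solr class, roughly the same behavior can
be had in plain Lucene with a filter (a hedged sketch; LABEL_FIELD and
prefix as in your code):

  Query q = new ConstantScoreQuery(new PrefixFilter(new Term(LABEL_FIELD, prefix)));

The filter enumerates the matching terms into a bit set, so there is no
boolean clause expansion and every hit gets the same constant score.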


http://lucene.apache.org/solr/


  

 Which sometimes result in 10, 20, sometimes 40 thousand hits.


I get good performance if hits.length is 20.000 or less (less than 0.5
seconds). I However, if it is 40.000 or more, querying takes over a
second,
up to 2.5 seconds. Point in check here is that this solution is not
scaling.
Any ideas I can try?

I already exhausted the ideas from http://wiki.apache.org/lucene
-java/ImproveSearchingSpeed

I was reading about TopDocs and TopFieldDocs. Is this search method (using
TopDocs) preferred over Hits? Also, there's no constructor for them
without
a Filter, can I just pass null?


  

It is preferred over Hits. Hits has been deprecated and you should really
migrate away from it.




I was trying, before, to use it, but it doesn't seem as straightforward as
Hits. Is there example code somewhere?
  

I think work was done on this when Hits was deprecated. Anyone know?
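
In the meantime, a hedged sketch of the migration (2.4-era API; query and
LABEL_FIELD as in your earlier code):

  TopFieldDocs top = searcher.search(query, null, 20, new Sort(new SortField(LABEL_FIELD)));
  for (ScoreDoc sd : top.scoreDocs) {
    Document doc = searcher.doc(sd.doc);
    // read stored fields from doc here
  }

And yes - the Filter argument can simply be null.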


  

 Is it possible to pre-sort the index, so I don't have to every time I


perform a query?

Any other ideas?


  

I think in general, sorting and prefix query can be slower operations in
Lucene (though sorting is generally pretty fast after the field caches are
loaded). You might try the first couple suggestions there though, and others
may fill on other steps you can take as well.

- Mark





Thanks, Mark.


Andre

  



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lucene Memory Leak

2008-09-02 Thread Mark Miller

You should really close the IndexSearcher rather than the directory.
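
I.e., something along these lines (a hedged sketch; note that Hits loads
documents lazily from the searcher, so consume your results before closing):

  IndexSearcher searcher = new IndexSearcher(FSDirectory.getDirectory(indexDirectory));
  try {
    Hits results = searcher.search(query, sortOrder);
    // use the results here, while the searcher is still open
  } finally {
    searcher.close(); // also closes the reader the searcher opened
  }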

Andy33 wrote:

I have a memory leak in my lucene search code. I am able to run a few queries
fine, but I eventually run out of memory. Please note that I do close and
set to null the ivIndexSearcher object elsewhere. Here is the code I am
using... 



private synchronized Hits doQuery(String field, String queryStr, Sort sortOrder,
    String indexDirectory) throws Exception
{
  Directory directory = null;

  try
  {
    Analyzer analyzer = new StandardAnalyzer();
    directory = FSDirectory.getDirectory(indexDirectory);

    // search the index
    ivIndexSearcher = new IndexSearcher(directory);

    QueryParser parser = new QueryParser(field, analyzer);
    Query query = parser.parse(queryStr);
    Hits results = ivIndexSearcher.search(query, sortOrder);

    return results;
  }
  finally
  {
    if (null != directory)
    {
      directory.close();
    }
    directory = null;
  }
}
  



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: PhraseQuery issues - differences with SpanNearQuery

2008-09-04 Thread Mark Miller
Sounds like it's more in line with what you are looking for. If I 
remember correctly, the phrase query factors the edit distance into 
scoring, but the SpanNearQuery will just use the combined idf for each 
of the terms in it, so distance shouldn't matter with spans (I'm sure 
Paul will correct me if I am wrong).


- Mark

Yannis Pavlidis wrote:

Hi,

I am having an issue when using the PhraseQuery which is best illustrated with 
this example:

I have created 2 documents to emulate URLs. One with a URL of 
"http://www.airballoon.com" and title "air balloon", and the second one with URL
"http://www.balloonair.com" and title "balloon air".

Test1 (PhraseQuery)
==
Now when I use the phrase query with - title: "air balloon" ~2
I get back:

url: "http://www.airballoon.com"; - score: 1.0
url: "http://www.balloonair.com"; - score: 0.57

Test2 (PhraseQuery)
==
Now when I use the phrase query with - title: "balloon air" ~2
I get back:
url: "http://www.balloonair.com"; - score: 1.0
url: "http://www.airballoon.com"; - score: 0.57

Test3 (PhraseQuery)
==
Now when I use the phrase query with - title: "air balloon" ~2 title: "balloon 
air" ~2
I get back:
url: "http://www.airballoon.com"; - score: 1.0
url: "http://www.balloonair.com"; - score: 1.0

Test4 (SpanNearQuery)
===
spanNear([title:air, title:balloon], 2, false)
I get back:
url: "http://www.airballoon.com"; - score: 1.0
url: "http://www.balloonair.com"; - score: 1.0

I would have expected that Test1 and Test2 would actually return both URLs with 
a score of 1.0 since I am setting the slop to 2. It seems though that Lucene 
really favors an absolute exact match.

Is it safe to assume that for what I am looking for (basically scoring the docs the same regardless 
of whether someone is searching for "air balloon" or "balloon air") it would be 
better to use the SpanNearQuery rather than the PhraseQuery?

Any input would be appreciated. 


Thanks in advance,

Yannis.

  



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: PhraseQuery issues - differences with SpanNearQuery

2008-09-05 Thread Mark Miller

Paul Elschot wrote:

On Thursday 04 September 2008 20:39:13, Mark Miller wrote:
  

Sounds like its more in line with what you are looking for. If I
remember correctly, the phrase query factors in the edit distance in
scoring, but the NearSpanQuery will just use the combined idf for
each of the terms in it, so distance shouldnt matter with spans (I'm
sure Paul will correct me if I am wrong).



SpanScorer will use the similarity slop factor for each matching
span size to adjust the effective frequency.
The span size is the difference in position between the first
and last matching term, and idf is not used for scoring Spans.
The reason why idf is not used could be that there is no basic
score value associated with inner spans; only top level spans
are scored by SpanScorer.
For more details, please consult the SpanScorer code.

Regards,
Paul Elschot
  
Right, my fault - it's the query normalization in the weight which uses 
idf (by pulling from each clause in the span). So it's kind of factored 
into the score, but not in the way I implied. Sorry, my bad on the info.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: PhraseQuery issues - differences with SpanNearQuery

2008-09-05 Thread Mark Miller



SpanScorer will use the similarity slop factor for each matching
span size to adjust the effective frequency.


Regards,
Paul Elschot
  
You have pointed this out to me before. One day I will remember. 
Every time I look things over again I miss it, and I couldn't find that 
email in the archives. It's done here, if the original questioner is interested:


SpanScorer

  protected boolean setFreqCurrentDoc() throws IOException {
    if (!more) {
      return false;
    }
    doc = spans.doc();
    freq = 0.0f;
    while (more && doc == spans.doc()) {
      int matchLength = spans.end() - spans.start();
      freq += getSimilarity().sloppyFreq(matchLength);
      more = spans.next();
    }
    return more || (freq != 0);
  }

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Frequently updated fields

2008-09-12 Thread Mark Miller
You might check out the TagIndex issue in JIRA as well. Haven't looked at 
it myself, but I believe it's supposed to be an option for this.


Gerardo Segura wrote:
I think the important question is: in general how to cope with 
frequently changing fields.



Karl Wettin wrote:

Hi Wojciech,

can you please give us a bit more specific information about the meta 
data fields that will change? I would recommend you looking at 
creating filters from your primary persistency for query clauses such 
as unread/read, mailbox folders, et c.


  karl

On 12 Sep 2008, at 13:57, Wojciech Strzałka wrote:


Hi.

  I'm new to Lucene and I would like to get a few answers (they can
  be lame)

  I want to index a large amount of emails using Lucene (maybe SOLR), not only
  the contents but also some metadata like state or flags. The
  problem is that the metadata will change during the mail lifecycle;
  although the metadata is much smaller, updating it will require
  reindexing the whole mail content, which I see as a performance bottleneck.

  I have the data in DB also so my first question is:

  - are there any best practices to implement my needs (querying both
  lucene & DB and then merging in memory?, close one eye and re-index
  the whole content on every metadata change? others?)

  - is at all Lucene good solution for my problem?

  - are there any plans to implement field updates in more efficient 
way then

  delete/insert the whole document? if yes what's the time horizon?


   Best regards
  Wojtek


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: StandardAnalyzer exclude numbers

2008-09-22 Thread Mark Miller

[EMAIL PROTECTED] wrote:

Hello

Is it possible to exclude numbers using StandardAnalyzer just like 
SimpleAnalyzer?


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

It's possible but it's tricky. You would want to copy the StandardAnalyzer 
into your own Analyzer and then modify the grammar. 
StandardTokenizerImpl.jflex is where to look, but you will have to learn 
how to use/compile JFlex (look at the build file) to build the parser 
classes. What you would do, though, is start by trying to remove the 
digit from the Alphanum regex in StandardTokenizerImpl.jflex. You might 
want to rename alphanum after such a move. That may be as far as you 
need to go.



- Mark

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: StandardAnalyzer exclude numbers

2008-09-22 Thread Mark Miller
Agreed. I am always diving into that analyzer too fast. Possibly
premature optimization thoughts on my part as well. Scanning each token
afterwards in a filter and skipping it if you find a number will be much
easier, and possibly not much slower. Depends on how involved you are/want
to get, I suppose. Personally I would prefer to start a new analyzer for
such a significant change, but for the average Lucene user, pre/post
processing is always going to make more sense. Plus there is enough
overlap in the code that I can see plenty of people preferring not to
split off.
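
A hedged sketch of that filter approach (attribute-based API of later Lucene
versions; on older versions you would override next(Token) instead):

  import java.io.IOException;
  import org.apache.lucene.analysis.TokenFilter;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

  public final class NoNumbersFilter extends TokenFilter {
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

    public NoNumbersFilter(TokenStream input) {
      super(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
      while (input.incrementToken()) {
        boolean hasDigit = false;
        for (int i = 0; i < termAtt.length(); i++) {
          if (Character.isDigit(termAtt.charAt(i))) {
            hasDigit = true;
            break;
          }
        }
        if (!hasDigit) {
          return true; // keep tokens that contain no digits
        }
        // otherwise skip this token and try the next one
      }
      return false; // stream exhausted
    }
  }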

黄成 wrote:
> why not use a token filter?
>
> On Mon, Sep 22, 2008 at 8:36 PM, Mark Miller <[EMAIL PROTECTED]> wrote:
>
>   
>> [EMAIL PROTECTED] wrote:
>>
>> 
>>> Hello
>>>
>>> Is it possible to exclude numbers using StandardAnalyzer just like
>>> SimpleAnalyzer?
>>>
>>> -
>>> To unsubscribe, e-mail: [EMAIL PROTECTED]
>>> For additional commands, e-mail: [EMAIL PROTECTED]
>>>
>>>  Its possible but its tricky. You would want to copy the StandardAnalyzer
>>>   
>> into your own Analyzer and then modify the grammar.
>> StandardTokenizerImpl.jflex is where to look, but you will have to learn how
>> to use/compile jflex (look at the build file) to build the parser classes.
>> What you would do though, is start by trying to remove the digit from the
>> Alphanum regex in StandardTokenizerImpl.jflex. You might want to rename
>> alphanum after such a move. That may be as far as you need to go.
>>
>>
>> - Mark
>>
>>
>> -
>> To unsubscribe, e-mail: [EMAIL PROTECTED]
>> For additional commands, e-mail: [EMAIL PROTECTED]
>>
>>
>> 
>
>
>   


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: sharing SearchIndexer

2008-09-25 Thread Mark Miller

simon litwan wrote:

hi all

I tried to reuse the IndexSearcher among all of the threads that are 
doing searches as described in 
(http://wiki.apache.org/lucene-java/LuceneFAQ#head-48921635adf2c968f7936dc07d51dfb40d638b82) 



This works fine, but our application does continuous indexing, so the 
index is changing and the IndexSearcher initialized at startup does not 
seem to be notified to reload the index.

Is there a way to force the IndexSearcher to reload the index if the 
index has changed?


thanks in advance

simon

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

You want to reopen the Reader under the IndexSearcher, or open a new 
IndexSearcher.
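
A minimal hedged sketch, assuming a Lucene version with IndexReader.reopen():

  IndexReader current = searcher.getIndexReader();
  IndexReader fresh = current.reopen(); // returns the same instance if nothing changed
  if (fresh != current) {
    searcher = new IndexSearcher(fresh);
    current.close(); // close the stale reader once in-flight searches finish
  }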


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: QueryParser

2008-10-18 Thread Mark Miller

Right, just don't share the same instance across threads.
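
E.g., one cheap way (a hedged sketch; the field name and analyzer are
illustrative):

  private static final ThreadLocal<QueryParser> PARSER = new ThreadLocal<QueryParser>() {
    @Override
    protected QueryParser initialValue() {
      return new QueryParser("body", new StandardAnalyzer());
    }
  };

Or simply construct a new QueryParser per request - they are cheap to create.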

- Mark


On Oct 18, 2008, at 3:11 PM, "Rafael Almeida" <[EMAIL PROTECTED]>  
wrote:



QueryParser's documentation says:
"Note that QueryParser is not thread-safe."
it only means that the same instance of QueryParser can't be used by
multiple threads, right? But if each thread has its own QueryParser
instance, then it's OK, right?

BTW, the link http://lucene.apache.org/java/docs/queryparsersyntax.html
on
http://lucene.apache.org/java/2_4_0/api/core/org/apache/lucene/queryParser/QueryParser.html
seems to be broken.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Hiring etiquette

2008-10-19 Thread Mark Miller

Richard Marr wrote:

Hi all,

Is there a mailing-list-appropriate way to hire coders with Lucene
experience? I don't want to just spam the list because I don't want to
crap where I live. I'm a programmer not a recruiter if that makes any
difference.

Cheers,

Rich

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

  
Generally, people just throw out the request to the list and no one 
really complains, but I do think it's frowned upon. I'm sure someone else 
can give the 'official' stance (since we are not a job board/list I 
assume it's against).


You might instead limit your email to those that have agreed to be 
contacted at http://wiki.apache.org/lucene-java/Support


- Mark

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Multi -threaded indexing of large number of PDF documents

2008-10-23 Thread Mark Miller
It sounds like you might have some thread synchronization issues outside 
of Lucene. To simplify things a bit, you might try just using one 
IndexWriter. If I remember right, the IndexWriter is now pretty 
efficient, and there isn't much need to index to smaller indexes and 
then merge. There is a lot of juggling to get wrong with that approach.
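
A hedged sketch of the single-writer setup (2.x-era constructor; buildDocument()
is a made-up placeholder for your PDFBox text extraction, and the thread pool
classes are from java.util.concurrent):

  final IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
  ExecutorService pool = Executors.newFixedThreadPool(10);
  for (final File pdf : pdfFiles) {
    pool.execute(new Runnable() {
      public void run() {
        try {
          writer.addDocument(buildDocument(pdf)); // addDocument is thread-safe
        } catch (Exception e) {
          // log and continue so one bad PDF can't hang the whole batch
        }
      }
    });
  }
  pool.shutdown();
  pool.awaitTermination(Long.MAX_VALUE, TimeUnit.SECONDS); // declare or handle InterruptedException
  writer.close(); // no separate merge step needed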


- Mark

Sudarsan, Sithu D. wrote:

Hi,

We are trying to index large collection of PDF documents, sizes varying
from few KB to few GB.  Lucene 2.3.2 with jdk 1.6.0_01 (with PDFBox for
text extraction) and on Windows as well as CentOS Linux. Used java -Xms
and -Xmx options, both at 1080m, even though we have 4GB on Windows and
32 GB on Linux with sufficient swap space.

With just one thread, though it takes time, the indexing happens. To
speed up, we tried multi-threaded approach with one Indexwriter for each
thread. After all the threads finish their indexing, they are merged.
With about 100 sample files and 10 threads, the program works pretty
well and it does speed up. But when we run on a document collection of
about 25GB, a couple of threads just hang, while the rest have completed
their indexing. The program never gracefully exits, and the threads that
seem to have died ensure that the final index merging does not take
place. The program needs to be manually terminated.


Tried both with simple analyzer as well as standard analyzer, with
similar results.

Any useful tips / solutions welcome.

Thanks in advance,
Sithu Sudarsan
Graduate Research Assistant, UALR
& Visiting Researcher, CDRH/OSEL

[EMAIL PROTECTED]
[EMAIL PROTECTED]


  



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Multi -threaded indexing of large number of PDF documents

2008-10-23 Thread Mark Miller

Glen Newton wrote:

2008/10/23 Mark Miller <[EMAIL PROTECTED]>:
  

It sounds like you might have some thread synchronization issues outside of
Lucene. To simplify things a bit, you might try just using one IndexWriter.
If I remember right, the IndexWriter is now pretty efficient, and there
isn't much need to index to smaller indexes and then merge. There is a lot
of juggling to get wrong with that approach.



While I agree it is easier to have a single IndexWriter, if you have
multiple cores you will get significant speed-ups with multiple
IndexWriters, even with the impact of merging at the end.
#IndexWriters = # physical cores is a reasonable rule of thumb.

General speed-up estimate: # cores * 0.6 - 0.8  over single IndexWriter
YMMV

When I get around to it, I'll re-run my tests varying the # of
IndexWriters & post.

-Glen
  
Hey Mr. McCandless, what's up with that? Can IndexWriter be made to be as 
efficient as using multiple writers? Where do you suppose the holdup 
is? Number of threads doing merges? Sync contention? I hate the idea of 
multiple IndexWriters/Readers being more efficient than a single 
instance. In an ideal Lucene world, a single instance would hide the 
complexity and use the number of threads needed to match 
multiple-instance performance.
  

- Mark

Sudarsan, Sithu D. wrote:


Hi,

We are trying to index large collection of PDF documents, sizes varying
from few KB to few GB.  Lucene 2.3.2 with jdk 1.6.0_01 (with PDFBox for
text extraction) and on Windows as well as CentOS Linux. Used java -Xms
and -Xmx options, both at 1080m, even though we have 4GB on Windows and
32 GB on Linux with sufficient swap space.

With just one thread, though it takes time, the indexing happens. To
speed up, we tried multi-threaded approach with one Indexwriter for each
thread. After all the threads finish their indexing, they are merged.
With about 100 sample files and 10 threads, the program works pretty
well and it does speed up. But, when we run on document collection of
about 25GB, couple of threads just hang, while the rest have completed
their indexing. The program never gracefully exits, and the threads that
seem to have died ensure that the final index merging does not take
place. The program needs to be manually terminated.
Tried both with simple analyzer as well as standard analyzer, with
similar results.

Any useful tips / solutions welcome.

Thanks in advance,
Sithu Sudarsan
Graduate Research Assistant, UALR
& Visiting Researcher, CDRH/OSEL

[EMAIL PROTECTED]
[EMAIL PROTECTED]



  

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]







  



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Change the merge factor for an existing index?

2008-10-28 Thread Mark Miller
Just change it. Merges will start obeying the new merge factor  
seamlessly.
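
E.g. (a hedged sketch; dir and analyzer assumed, 2.x API):

  IndexWriter writer = new IndexWriter(dir, analyzer, false); // false = open the existing index
  writer.setMergeFactor(10); // applies to all merges from this point on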


- Mark


On Oct 27, 2008, at 1:07 PM, Tom Saulpaugh <[EMAIL PROTECTED]>  
wrote:



Hello,

We are currently using lucene v2.1 and we are planning to upgrade to  
lucene v2.4.
Can we change the merge factor for an existing index and then add  
more documents to that index?  Is there some kind of upgrade path  
like using optimize to move an existing index to a different merge  
factor?


Thanks,

Tom



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: OutOfMemory Problems Lucene 2.4 / Tomcat

2008-10-29 Thread Mark Miller
How many fields are you sorting on? Lots of unique terms in those 
fields?


- Mark


On Oct 29, 2008, at 6:03 PM, "Todd Benge" <[EMAIL PROTECTED]> wrote:


Hi,

I'm the lead engineer for search on a large website using lucene for  
search.


We're indexing about 300M documents in ~ 100 indices.  The indices add
up to ~ 60G.

The indices are sorted into 4 different Multisearcher with the largest
handling ~50G.

The code is basically like the following:

private static MultiSearcher searcher;

public void init(File[] files) throws IOException {
  IndexSearcher[] searchers = new IndexSearcher[files.length];
  int i = 0;
  for (File file : files) {
    searchers[i++] = new IndexSearcher(FSDirectory.getDirectory(file));
  }
  searcher = new MultiSearcher(searchers);
}

public Searcher getSearcher() {
  return searcher;
}

We're seeing a high cache rate with Term & TermInfo in Lucene 2.4.
Performance is good but servers are consistently hanging with
OutOfMemory errors.

We're allocating 4G in the heap to each server.

Is there any way to control the amount of memory Lucene consume for
caching?  Any other suggestions on fixing the memory errors?

Thanks,

Todd

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: OutOfMemory Problems Lucene 2.4 / Tomcat

2008-10-29 Thread Mark Miller
The term, TermInfo, IndexReader internals stuff is probably on the low end 
compared to the size of your field caches (needed for sorting). If you 
are sorting by String, I think the space needed is 32 bits x number of 
docs, plus an array to hold all of the unique terms. So checking 300 million 
docs (I know you are actually breaking it up smaller than that, but for 
example), ignoring things like String chars being variable byte 
lengths and storing the length, etc., and randomly picking 50,000 unique 
terms at 6 chars per:

32 bits x 300,000,000 + 50,000 x 6 x 16 bits to MB = 1,144.98138 megabytes

That's per field you're sorting on. If you are sorting on an int field it 
should be closer to 32 bits x num docs; for shorts, 16 bits x num docs, etc.

So you have those field caches, plus the IndexReader TermInfo/term 
stuff, plus whatever RAM your app needs beyond Lucene. 4 gig might just 
not *quite* cut it, is my guess.
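
As a back-of-the-envelope check of that estimate (hedged; it ignores JVM
per-object overhead):

  long numDocs = 300000000L;        // docs across the indices
  long numTerms = 50000L;           // unique terms in the sort field
  long bits = 32L * numDocs         // one int ord per document
            + numTerms * 6 * 16;    // terms at 6 UTF-16 chars each
  System.out.println(bits / 8.0 / 1024 / 1024 + " MB"); // ~1144.98 MB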


Todd Benge wrote:

There's usually only a couple sort fields and a bunch of terms in the
various indices.  The terms are user entered on various media so the
number of terms is very large.

Thanks for the help.

Todd



On 10/29/08, Todd Benge <[EMAIL PROTECTED]> wrote:
  

Hi,

I'm the lead engineer for search on a large website using lucene for search.

We're indexing about 300M documents in ~ 100 indices.  The indices add
 up to ~ 60G.

The indices are sorted into 4 different Multisearcher with the largest
handling ~50G.

The code is basically like the following:

private static MultiSearcher searcher;

public void init(File files) {

 IndexSearcer [] searchers = new IndexSearcher[files.length] ();
 int i = 0;
 for ( File file: files ) {
  searchers[i++] = new IndexSearcher(FSDirectory.getDirectory(file);
 }

searcher = new MultiSearcher(searchers);
}

public Searcher getSearcher() {
   return searcher;
}

We're seeing a high cache rate with Term & TermInfo in Lucene 2.4.
Performance is good but servers are consistently hanging with
OutOfMemory errors.

We're allocating 4G in the heap to each server.

Is there any way to control the amount of memory Lucene consume for
caching?  Any other suggestions on fixing the memory errors?

Thanks,

Todd




  



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: OutOfMemory Problems Lucene 2.4 / Tomcat

2008-10-30 Thread Mark Miller
Michael's got some great points (he's the Lucene master), especially 
possibly turning off norms if you can, but for an index like that I'd 
recommend Solr. Solr sharding can be scaled to billions (well, a billion 
or two anyway) with few limitations (of course there are a few). Plus 
it has further caching options, IndexReader refresh management, etc.



- Mark


On Oct 29, 2008, at 10:30 PM, "Todd Benge" <[EMAIL PROTECTED]> wrote:


Thanks Mark.  I appreciate the help.

I thought our memory may be low, but wanted to verify whether there is
any way to control memory usage.  I think we'll likely upgrade the
memory on the machines but that may just delay the inevitable.

Wondering if anyone else has encountered similar issues with indices
of a similar size.  I've been thinking we will need to move to a
clustered solution and have been reading on Hadoop, Nutch, Solr &
Terracotta for possibilities such as index sharding.

Has anyone implemented a solution using hadoop or terracotta for a
large scale system?  Just wondering the pro's / con's of the various
approaches.

Thanks,

Todd

On Wed, Oct 29, 2008 at 6:07 PM, Mark Miller <[EMAIL PROTECTED]>  
wrote:
The term, terminfo, indexreader internals stuff is prob on the low  
end
compared to the size of your field caches (needed for sorting). If  
you are
sorting by String I think the space needed is 32 bits x number of  
docs + an
array to hold all of the unique terms. So checking 300 million docs  
(I know
you are actually breaking it up smaller than that, but for example)  
and
ignoring things like String chars being variable byte lengths and  
storing
the length, etc., and randomly picking 50,000 unique terms at 6 chars per:

32 bits x 300,000,000 + 50,000 x 6 x 16 bits to MB = 1,144.98138 megabytes


That's per field you're sorting on. If you are sorting on an int field it 
should be closer to 32 bits x num docs; for shorts, 16 bits x num docs, etc.


So you have those field caches, plus the IndexReader terminfo, term  
stuff,
plus whatever RAM your app needs beyond Lucene. 4 gig might just  
not *quite*

cut it is my guess.

Todd Benge wrote:


There's usually only a couple sort fields and a bunch of terms in  
the

various indices.  The terms are user entered on various media so the
number of terms is very large.

Thanks for the help.

Todd



On 10/29/08, Todd Benge <[EMAIL PROTECTED]> wrote:



Hi,

I'm the lead engineer for search on a large website using lucene  
for

search.

We're indexing about 300M documents in ~ 100 indices.  The  
indices add

up to ~ 60G.

The indices are sorted into 4 different Multisearcher with the  
largest

handling ~50G.

The code is basically like the following:

private static MultiSearcher searcher;

public void init(File files) {

   IndexSearcer [] searchers = new IndexSearcher[files.length] ();
   int i = 0;
   for ( File file: files ) {
searchers[i++] = new
IndexSearcher(FSDirectory.getDirectory(file);
   }

searcher = new MultiSearcher(searchers);
}

public Searcher getSearcher() {
 return searcher;
}

We're seeing a high cache rate with Term & TermInfo in Lucene 2.4.
Performance is good but servers are consistently hanging with
OutOfMemory errors.

We're allocating 4G in the heap to each server.

Is there any way to control the amount of memory Lucene consume for
caching?  Any other suggestions on fixing the memory errors?

Thanks,

Todd








-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Document marked as deleted

2008-10-30 Thread Mark Miller

John G wrote:

I have an index with a particular document marked as deleted. If I use the
search method that returns TopDocs and that deleted document satisfies the
search criteria, will it be included in the returned TopDocs object even
though it has been marked as deleted?

Thanks in advance.

John G.
  
Nope. It will still be loaded in the field cache and used for corpus 
statistics I believe, but it won't be returned in search results, no 
matter which search method on searcher you are using.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: OutOfMemory Problems Lucene 2.4 / Tomcat

2008-10-31 Thread Mark Miller
20 fields on a huge index? Wow - not sure there is a ton you can do with 
that... anyone have any suggestions for that one? Distributed should help, 
I suppose, but that's a lot of sort fields for a large index.


If LUCENE-831 ever gets off the ground you will be able to change the 
cache used, and possibly use something that spills over to disk.


PabloS wrote:

Hi,

I'm having a similar problem with my application, although we are using
lucene 2.3.2. The problem we have is that we are required to sort on most of
the fields (20 at least). Is there any way of changing the cache being used?
I can't seem to find a way, since the cache is being accessed using the
FieldCache.DEFAULT static field..

Any tip would be appreciated, otherwise I'll have to start looking for a
clustered solution like Todd.

Thanks in advance.
Pablo




markrmiller wrote:
  
The term, terminfo, indexreader internals stuff is prob on the low end 
compared to the size of your field caches (needed for sorting). If you 
are sorting by String I think the space needed is 32 bits x number of 
docs + an array to hold all of the unique terms. So checking 300 million 
docs (I know you are actually breaking it up smaller than that, but for 
example) and ignoring things like String chars being variable byte 
lengths and storing the length, etc., and randomly picking 50,000 unique 
terms at 6 chars per:

32 bits x 300,000,000 + 50,000 x 6 x 16 bits to MB = 1,144.98138 megabytes

That's per field you're sorting on. If you are sorting on an int field it 
should be closer to 32 bits x num docs; for shorts, 16 bits x num docs, etc.


So you have those field caches, plus the IndexReader terminfo, term 
stuff, plus whatever RAM your app needs beyond Lucene. 4 gig might just 
not *quite* cut it is my guess.


Todd Benge wrote:


There's usually only a couple sort fields and a bunch of terms in the
various indices.  The terms are user entered on various media so the
number of terms is very large.

Thanks for the help.

Todd



On 10/29/08, Todd Benge <[EMAIL PROTECTED]> wrote:
  
  

Hi,

I'm the lead engineer for search on a large website using lucene for
search.

We're indexing about 300M documents in ~ 100 indices.  The indices add
 up to ~ 60G.

The indices are sorted into 4 different Multisearcher with the largest
handling ~50G.

The code is basically like the following:

private static MultiSearcher searcher;

public void init(File files) {

 IndexSearcer [] searchers = new IndexSearcher[files.length] ();
 int i = 0;
 for ( File file: files ) {
  searchers[i++] = new
IndexSearcher(FSDirectory.getDirectory(file);
 }

searcher = new MultiSearcher(searchers);
}

public Searcher getSearcher() {
   return searcher;
}

We're seeing a high cache rate with Term & TermInfo in Lucene 2.4.
Performance is good but servers are consistently hanging with
OutOfMemory errors.

We're allocating 4G in the heap to each server.

Is there any way to control the amount of memory Lucene consume for
caching?  Any other suggestions on fixing the memory errors?

Thanks,

Todd



  
  

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]






  



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Performance of never optimizing

2008-11-03 Thread Mark Miller
Am I missing your benchmark algorithm somewhere? We need it. Something 
doesn't make sense.


- Mark


Justus Pendleton wrote:

Howdy,

I have a couple of questions regarding some Lucene benchmarking and 
what the results mean[3]. (Skip to the numbered list at the end if you 
don't want to read the lengthy exegesis :)


I'm a developer for JIRA[1]. We are currently trying to get a better 
understanding of Lucene, and our use of it, to cope with the needs of 
our larger customers. These "large" indexes are only a couple hundred 
thousand documents but our problem is compounded by the fact that they 
have a relatively high rate of modification (=delete+insert of new 
document) and our users expect these modification to show up in query 
results pretty much instantly.


Our current default behaviour is a merge factor of 4. We perform an 
optimization on the index every 4000 additions. We also perform an 
optimize at midnight. Our fundamental problem is that these 
optimizations are locking the index for unacceptably long periods of 
time, something that we want to resolve for our next major release, 
hopefully without undermining search performance too badly.


In the Lucene javadoc there is a comment, and a link to a mailing list 
discussion[2], that suggests applications such as JIRA should never 
perform optimize but should instead set their merge factor very low.


In an attempt to understand the impact of a) lowering the merge factor 
from 4 to 2 and b) never, ever optimizing on an index (over the course 
of years and millions of additions/updates) I wanted to try to 
benchmark Lucene.


I used the contrib/benchmark framework and wrote a small algorithm 
that adds documents to an index (using the Reuters doc generator), 
does a search, does an optimize, then does another search. All the 
pretty pictures can be seen at:


  http://confluence.atlassian.com/display/JIRACOM/Lucene+graphs

I have several questions, hopefully they aren't overwhelming in their 
quantity :-/


1. Why does the merge factor of 4 appear to be faster than the merge 
factor of 2?


2. Why does non-optimized searching appear to be faster than optimized 
searching once the index hits ~500,000 documents?


3. There appears to be a fairly sizable performance drop across the 
board around 450,000 documents. Why is that?


4. Searching performance appears to decrease towards a fairly 
pessimistic 20 searches per second (for a relatively simple search). 
Is this really what we should expect long-term from Lucene?


5. Does my benchmark even make sense? I am far from an expert on 
benchmarking so it is possible I'm not measuring what I think I am 
measuring.


Thanks in advance for any insight you can provide. This is an area 
that we very much want to understand better as Lucene is a key part of 
JIRA's success,


Cheers,
Justus
JIRA Developer

[1]: http://www.atlassian.com
[2]: http://www.gossamer-threads.com/lists/lucene/java-dev/47895
[3]: http://confluence.atlassian.com/display/JIRACOM/Lucene+graphs

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Performance of never optimizing

2008-11-03 Thread Mark Miller
It's been a while since I've been in the benchmark stuff, so I am going to 
take some time to look at this when I get a chance, but off the cuff I 
think you are opening and closing the reader for each search. Try using the 
OpenReader task before the 100 searches and then the CloseReader task. 
That will ensure you are reusing the same reader for each search. Hope 
to analyze further soon.
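
For example, the search blocks in the algorithm quoted below could become
(OpenReader and CloseReader are the contrib/benchmark task names; a hedged
sketch):

OpenReader
[ "UnoptSearch" Search > : 100
CloseReader
{ "Optimize" OpenIndex Optimize CloseIndex }
OpenReader
[ "OptSearch" Search > : 100
CloseReader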


- Mark

Justus Pendleton wrote:

On 03/11/2008, at 11:07 PM, Mark Miller wrote:

Am I missing your benchmark algorithm somewhere? We need it. 
Something doesn't make sense.


I thought I had included it at [1] before but apparently not; my 
apologies for that. I have updated that wiki page. I'll also reproduce 
it here:


{ "Rounds"

ResetSystemErase
{ CreateIndex >
{ AddDoc > : NUM_DOCS
{ CloseIndex >

[ "UnoptSearch" Search > : 100
{ "Optimize" OpenIndex Optimize CloseIndex }
[ "OptSearch" Search > : 100

NewRound

} : 6

NUM_DOCS increases by 5,000 for each iteration.

What constitutes a "proper warm up before measuring"?


[1]: http://confluence.atlassian.com/display/JIRACOM/Lucene+graphs



Cheers,
Justus

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: searchable archives

2008-11-07 Thread Mark Miller

Or Nabble or MarkMail


- Mark


On Nov 7, 2008, at 3:33 PM, Dragon Fly <[EMAIL PROTECTED]>  
wrote:




http://www.gossamer-threads.com/lists/lucene/java-user/


Date: Fri, 7 Nov 2008 14:27:38 -0700
From: [EMAIL PROTECTED]
To: java-user@lucene.apache.org
Subject: searchable archives

Hey,

Is this list available somewhere that you can search the entire  
archives at

one time?

Thanks,
Chad


_
Stay up to date on your PC, the Web, and your mobile phone with  
Windows Live

http://clk.atdmt.com/MRT/go/119462413/direct/01/


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Multisearcher

2008-11-08 Thread Mark Miller
Not out of the box, but it's fairly trivial to copy MultiSearcher and 
modify it so that a different query goes to each subsearcher.


- Mark


On Nov 8, 2008, at 5:45 AM, "Shishir Jain" <[EMAIL PROTECTED]>  
wrote:



Hi,

Doc1: Field1, Field2
Doc2: Field1, Field2

If I create Index such that Field1 is stored in index1 and Field2 is  
stored

in index2.

Can I use Multisearcher to search for Field1 in index1 and Field2  
index2 and

get the merged results?

Thanks & Regards,
Shishir Jain


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: ScoreDoc

2008-11-09 Thread Mark Miller
Excuse me - some unchecked logic there concerning HitCollector. A 
HitCollector hits all matching documents, not all documents. Sometimes 
that can be a lot. With TopDocs, you only ask for the top-scoring 
documents, which is usually a smaller number than all matching docs, and 
generally what people are interested in. Sorry for the confusion there - 
I need to double-check what I write...
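
For reference, the pattern that sample code is built around looks roughly
like this (2.4-era API; searcher and query assumed):

  TopDocCollector collector = new TopDocCollector(10); // keep the 10 best hits
  searcher.search(query, collector);
  ScoreDoc[] hits = collector.topDocs().scoreDocs;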


Mark Miller wrote:
There is definitely some stale javadoc in Lucene here and there. All 
of what you're talking about has been shaken up recently with the 
deprecation of Hits. Hits used to pretty much be considered the 
non-expert API, but it's been tossed in favor of the TopDocs APIs.


The HitCollector stuff has been marked expert because a lot of people 
get into trouble using something that hits every doc in the index on a 
search, not just the matching docs from the search. If you don't 
understand whats going on, you can, and many have, make some pretty 
slow code. The expert stuff just means, understand whats going on 
before you start to play here ;) I don't necessarily think it doesn't 
belong in a tutorial - assuming the guy who wrote the tutorial 
understood what he was doing.


As for the stale javadoc though, I'm sure patches would be welcome ;) 
It's a group of volunteers all scratching their own itches here, so it's 
likely you will find things like that. Best bet is to pitch in when 
you see it, and I'm sure one of the committers will apply your patch if 
it's appropriate.


- Mark

ChadDavis wrote:

In fact, the search method used to populate the collector used in that
sample code also claims to be low level.  It suggests using the
IndexSearcher.search( query ) method instead, but that method is 
deprecated.


Lower-level search API.
 

HitCollector.collect(int,float) is called for every matching document.

Applications should only use this if they need *all* of the matching
documents. The high-level search API (Searcher.search(Query)) is 
usually

more efficient, as it skips non-high-scoring hits.

Note: The score passed to this method is a raw score. In other 
words, the

score will not necessarily be a float whose value is between 0 and 1.



Is this just stale documentation ?

On Sun, Nov 9, 2008 at 3:28 PM, ChadDavis 
<[EMAIL PROTECTED]>wrote:


 

The sample code uses a ScoreDoc array to hold the hits.

ScoreDoc[] hits = collector.topDocs().scoreDocs;

But the JavaDoc says "Expert: Returned by low-level search
implementations."  Why would the tutorial sample code use an 
"expert" api?







  





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Highlighter and Phrase Queries

2008-11-10 Thread Mark Miller

Check out the SpanScorer.

- Mark


On Nov 10, 2008, at 8:25 AM, "Sertic Mirko, Bedag" <[EMAIL PROTECTED] 
> wrote:



[EMAIL PROTECTED]



I am searching for a solution to make the Highlighter run properly in
combination with phrase queries.



When I highlight text with a phrase query like "windows printserver",
I get the following highlighted:

"windows printservers" are good blah blah "windows" manages
"printserver" blah blah

So both the phrases and the single terms are highlighted, but I just want
to highlight the phrases. How could this be done?



Thanks in advance



Mirko





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Boosting results

2008-11-10 Thread Mark Miller

Michael McCandless wrote:


But: it's slow to load a field for the first time.  LUCENE-1231 
(column-stride fields) aims to greatly speed up the load time.
Test it out though. In some recent testing I was doing, it was *way* 
faster than I thought it would be based on what I had been reading. Of 
course if every term is unique it's going to be worse, but even with 
something like 10 million docs and a few hundred thousand uniques, either 
I was doing something wrong, or even on my 4200rpm laptop HD it loaded 
like nothing (of course even a second load and then a search is much 
slower than just a warmed search though). I was hoping to see some 
advantage with a payload implementation in LUCENE-831, but there really 
didn't seem to be one...


It's also memory-consuming.

Finally, you might want to instead look at Solr, which provides facet 
counting out of the box, rather than roll your own...
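
For what it's worth, a rough sketch of the roll-your-own approach under
discussion (hedged; 'reader', 'searcher' and 'query' are assumed to already
exist, and the "category" field comes from Stefan's example below):

import java.util.HashSet;
import java.util.Set;
import org.apache.lucene.search.FieldCache;
import org.apache.lucene.search.HitCollector;

// Collect the distinct category values of all docs matching the query,
// using the same field cache that sorting populates.
final String[] categories = FieldCache.DEFAULT.getStrings(reader, "category");
final Set found = new HashSet();
searcher.search(query, new HitCollector() {
  public void collect(int doc, float score) {
    found.add(categories[doc]); // e.g. "A", "B"
  }
});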


Mike

Stefan Trcek wrote:


On Friday 07 November 2008 18:46:17 Michael McCandless wrote:


Sorting populates the field cache (internal to Lucene) for that
field, meaning it loads all values for all docs and holds them in
memory. This makes the first query slow and consumes RAM in
proportion to how large your index is.


Can you point me to the API for accessing these cached values?
I'd like to have a function like: "List all unique values of the
categories (A, B, C...) for documents that match this query".

i.e. for a query "text:john" show up categories=(A,B)

Doc 1: category=A text=john
Doc 2: category=B text=mary
Doc 3: category=B text=john
Doc 4: category=C text=mary

This is intended for search refinement (I use about 200 categories).
Sorry for hijacking this thread.

Stefan

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: AW: Highlighter and Phrase Queries

2008-11-10 Thread Mark Miller
Check out the unit tests for the highlighter - there are a bunch of 
examples.


It's pretty much the same as using the standard scorer, except that it 
requires a cached token filter so that the token stream can be read more 
than once.


Once you pass the SpanScorer to the Highlighter though, it works just 
like the non phrase/span aware Highlighter.
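
For reference, a rough sketch of that wiring with the 2.4 contrib
highlighter (hedged - the unit tests are the authoritative examples;
'analyzer', 'query', 'text' and the "content" field are placeholders):

import java.io.StringReader;
import org.apache.lucene.analysis.CachingTokenFilter;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.SpanScorer;

// The token stream must be cached: the SpanScorer reads it once to find
// span positions, then the Highlighter reads it again to build fragments.
CachingTokenFilter tokenStream = new CachingTokenFilter(
    analyzer.tokenStream("content", new StringReader(text)));
SpanScorer scorer = new SpanScorer(query, "content", tokenStream);
Highlighter highlighter = new Highlighter(scorer);
tokenStream.reset(); // rewind the cached stream before highlighting
String fragment = highlighter.getBestFragment(tokenStream, text);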



- Mark


Sertic Mirko, Bedag wrote:

Hi

Thank you for your response.
Are there examples available?

Regards
Mirko

-Ursprüngliche Nachricht-
Von: Mark Miller [mailto:[EMAIL PROTECTED] 
Gesendet: Montag, 10. November 2008 14:45

An: java-user@lucene.apache.org
Betreff: Re: Highlighter and Phrase Queries

Check out the SpanScorer.

- Mark


On Nov 10, 2008, at 8:25 AM, "Sertic Mirko, Bedag" <[EMAIL PROTECTED] 
 > wrote:


  

[EMAIL PROTECTED]



I am searching for a solution to make the Highlighter run properly in
combination with phrase queries.

I want to highlight text with a phrase query like "windows printserver",
but currently I get the following highlighted:

"windows printservers" are good blah blah "windows" manages
"printserver" blah blah

So both the phrases and the single terms are highlighted, but I just want
to highlight the phrases. How could this be done?



Thanks in advance



Mirko






-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

  



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: AW: AW: Highlighter and Phrase Queries

2008-11-10 Thread Mark Miller
Right, it will work the same as the standard Highlighter except that it 
highlights spans and phrase queries based on position.


Sertic Mirko, Bedag wrote:

Ok, i will do.

I guess it will also work with BooleanQueries and combined Term/Wildcard/Phrase 
Queries?

-Ursprüngliche Nachricht-
Von: Mark Miller [mailto:[EMAIL PROTECTED] 
Gesendet: Montag, 10. November 2008 15:38

An: java-user@lucene.apache.org
Betreff: Re: AW: Highlighter and Phrase Queries

Check out the unit tests for the highlighter - there are a bunch of 
examples.


It's pretty much the same as using the standard scorer, except that it 
requires a cached token filter so that the token stream can be read more 
than once.


Once you pass the SpanScorer to the Highlighter though, it works just 
like the non phrase/span aware Highlighter.



- Mark


Sertic Mirko, Bedag wrote:
  

Hi

Thank you for your response.
Are there examples available?

Regards
Mirko

-Ursprüngliche Nachricht-
Von: Mark Miller [mailto:[EMAIL PROTECTED] 
Gesendet: Montag, 10. November 2008 14:45

An: java-user@lucene.apache.org
Betreff: Re: Highlighter and Phrase Queries

Check out the SpanScorer.

- Mark


On Nov 10, 2008, at 8:25 AM, "Sertic Mirko, Bedag" <[EMAIL PROTECTED] 
 > wrote:


  


[EMAIL PROTECTED]



I am searching for a solution to make the Highlighter run properly in
combination with phrase queries.

I want to highlight text with a phrase query like "windows printserver",
but currently I get the following highlighted:

"windows printservers" are good blah blah "windows" manages
"printserver" blah blah

So both the phrases and the single terms are highlighted, but I just want
to highlight the phrases. How could this be done?



Thanks in advance



Mirko




  

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

  




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

  



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: ScoreDoc

2008-11-09 Thread Mark Miller
There is definitely some stale javadoc in Lucene here and there. All of 
what you're talking about has been shaken up recently with the deprecation 
of Hits. Hits used to pretty much be considered the non-expert API, but 
it's been tossed in favor of the TopDocs APIs.


The HitCollector stuff has been marked expert because a lot of people 
get into trouble using something that hits every doc in the index on a 
search, not just the matching docs from the search. If you don't 
understand what's going on, you can, and many have, make some pretty slow 
code. The expert stuff just means: understand what's going on before you 
start to play here ;) I don't necessarily think it doesn't belong in a 
tutorial - assuming the guy who wrote the tutorial understood what he 
was doing.


As for the stale javadoc though, I'm sure patches would be welcome ;) 
It's a group of volunteers all scratching their own itches here, so it's 
likely you will find things like that. Best bet is to pitch in when you 
see it, and I'm sure one of the committers will apply your patch if it's 
appropriate.


- Mark

ChadDavis wrote:

In fact, the search method used to populate the collector in that
sample code also claims to be low level.  It suggests using the
IndexSearcher.search(query) method instead, but that method is deprecated.

Lower-level search API.

HitCollector.collect(int,float) is called for every matching document.

Applications should only use this if they need *all* of the matching
documents. The high-level search API (Searcher.search(Query)) is usually
more efficient, as it skips non-high-scoring hits.

Note: The score passed to this method is a raw score. In other words, the
score will not necessarily be a float whose value is between 0 and 1.



Is this just stale documentation?

On Sun, Nov 9, 2008 at 3:28 PM, ChadDavis <[EMAIL PROTECTED]>wrote:

  

The sample code uses a ScoreDoc array to hold the hits.

ScoreDoc[] hits = collector.topDocs().scoreDocs;

But the JavaDoc says "Expert: Returned by low-level search
implementations."  Why would the tutorial sample code use an "expert" api?






  



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: IndexSearcher and multi-threaded performance

2008-11-11 Thread Mark Miller

Nice! An 8 core machine with a test ready to go!

How about trying the read-only mode that was added in 2.4 on your 
IndexReader?


And if you are on unix and could try trunk and use the new 
NIOFSDirectory implementation...that would be awesome.


Those two additions are our current hope for what you're seeing...would be 
nice to know if we need to try for more (or if we need to petition the 
smart people that work on that stuff to try for more ;) ).


- Mark

Dmitri Bichko wrote:

Hi,

I'm pretty new to Lucene, so please bear with me if this has been
covered before.

The wiki suggests sharing a single IndexSearcher between threads for
best performance
(http://wiki.apache.org/lucene-java/ImproveSearchingSpeed).  I've
tested running the same set of queries with: multiple threads sharing
the same searcher, with a separate searcher for each thread, both
shared/private with a RAMDirectory in-memory index, and (just for fun)
in multiple JVMs running concurrently (the results are in milliseconds
to complete the whole job):

threads  multi-jvm  shared  per-thread  ram-shared  ram-thread
      1      72997   70883       72573       60308       60012
      2      33147   48762       35973       25498       25734
      4      16229   46828       21267       13127       27164
      6      13088   47240       14028        9858       29917
      8       9775   47020       10983        8948       10440
     10       8721   50132       11334        9587       11355
     12       7290   49002       11798        9832
     16       9365   47099       12338       11296

The shared searcher indeed behaves better with a ram-based index, but
what's going on with the disk-based one?  It's basically not scaling
beyond two threads. Am I just doing something completely wrong here?

The test consists of about 1,500 Boolean OR queries with 1-10
PhraseQueries each, with 1-20 Terms per PhraseQuery.  I'm using a
HitCollector to count the hits, so I'm not retrieving any results.
The index is about 5GB and 20 million documents.

This is running on an 8 x quad-core Opteron machine with plenty of RAM to spare.

Any idea why I would see this behaviour?

Thanks,
Dmitri

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

  



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: IndexSearcher and multi-threaded performance

2008-11-11 Thread Mark Miller




And if you are on unix and could try trunk and use the new 
NIOFSDirectory implementation...that would be awesome.
Woah...that made 2.4 too. A 2.4 release will allow both optimizations. 
Many thanks!


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: IndexSearcher and multi-threaded performance

2008-11-11 Thread Mark Miller

Dmitri Bichko wrote:

32 cores, actually :)
  

Glossed over that - even better! Killer machine to be able to test this on.

I reran the test with readonly turned on (I changed how the time is
measured a little, it should be more consistent):

fs-thread   ram-thread  fs-shared   ram-shared
1   71877   54739   73986   61595
2   34949   26735   43719   28935
3   25581   26885   38412   19624
4   20511   31742   38712   15059
5   19235   24345   39685   12509
6   16775   26896   39592   10841
7   17147   18296   46678   10183
8   18327   19043   39886   10048
9   16885   18721   40342   9483
10  17832   30757   44706   10975
11  17251   21199   39947   9704
12  17267   36284   40208   10996

I can't seem to get NIOFSDirectory working, though.  Calling
NIOFSDirectory.getDirectory("foo") just returns an FSDirectory.
  
That's a good point, and points out a bug in solr trunk for me. Frankly I 
don't see how it's done. There is no code I can see/find to use it rather 
than FSDirectory. Still assuming there must be a way, but I don't see it...


- Mark

Any ideas?

Cheers,
Dmitri

On Tue, Nov 11, 2008 at 5:09 PM, Mark Miller <[EMAIL PROTECTED]> wrote:
  

Nice! An 8 core machine with a test ready to go!

How about trying the read-only mode that was added in 2.4 on your
IndexReader?

And if you are on unix and could try trunk and use the new
NIOFSDirectory implementation...that would be awesome.

Those two additions are our current hope for what you're seeing...would be
nice to know if we need to try for more (or if we need to petition the smart
people that work on that stuff to try for more ;) ).

- Mark

Dmitri Bichko wrote:


Hi,

I'm pretty new to Lucene, so please bear with me if this has been
covered before.

The wiki suggests sharing a single IndexSearcher between threads for
best performance
(http://wiki.apache.org/lucene-java/ImproveSearchingSpeed).  I've
tested running the same set of queries with: multiple threads sharing
the same searcher, with a separate searcher for each thread, both
shared/private with a RAMDirectory in-memory index, and (just for fun)
in multiple JVMs running concurrently (the results are in milliseconds
to complete the whole job):

threads  multi-jvm  shared  per-thread  ram-shared  ram-thread
      1      72997   70883       72573       60308       60012
      2      33147   48762       35973       25498       25734
      4      16229   46828       21267       13127       27164
      6      13088   47240       14028        9858       29917
      8       9775   47020       10983        8948       10440
     10       8721   50132       11334        9587       11355
     12       7290   49002       11798        9832
     16       9365   47099       12338       11296

The shared searcher indeed behaves better with a ram-based index, but
what's going on with the disk-based one?  It's basically not scaling
beyond two threads. Am I just doing something completely wrong here?

The test consists of about 1,500 Boolean OR queries with 1-10
PhraseQueries each, with 1-20 Terms per PhraseQuery.  I'm using a
HitCollector to count the hits, so I'm not retrieving any results.
The index is about 5GB and 20 million documents.

This is running on an 8 x quad-core Opteron machine with plenty of RAM to
spare.

Any idea why I would see this behaviour?

Thanks,
Dmitri

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


  

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

  



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: IndexSearcher and multi-threaded performance

2008-11-11 Thread Mark Miller

Mark Miller wrote:
That's a good point, and points out a bug in solr trunk for me. Frankly 
I don't see how it's done. There is no code I can see/find to use it 
rather than FSDirectory. Still assuming there must be a way, but I 
don't see it...


Ah - brain freeze. What else is new :) You have to set the system 
property to change implementations: org.apache.lucene.FSDirectory.class 
is the property; set it to the implementation class name. Been a long time...
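
In code, that looks something like this (a hedged sketch - the property has
to be set before the first FSDirectory.getDirectory() call, and the path is
a placeholder):

import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

// Select the FSDirectory implementation via the system property
// described above; this must happen before FSDirectory is first used.
System.setProperty("org.apache.lucene.FSDirectory.class",
    "org.apache.lucene.store.NIOFSDirectory");
Directory dir = FSDirectory.getDirectory("/path/to/index");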


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: IndexSearcher and multi-threaded performance

2008-11-12 Thread Mark Miller

+1

- Mark


On Nov 12, 2008, at 4:50 AM, Michael McCandless <[EMAIL PROTECTED] 
> wrote:




I think we really should open up a non-static way to choose a  
different FSDirectory impl?  EG maybe add optional Class to  
FSDirectory.getDirectory?  Or maybe give NIOFSDirectory a public  
ctor?  Or something?


Mike

Mark Miller wrote:


Mark Miller wrote:
That's a good point, and points out a bug in solr trunk for me.  
Frankly I don't see how it's done. There is no code I can see/find  
to use it rather than FSDirectory. Still assuming there must be a  
way, but I don't see it...


Ah - brain freeze. What else is new :) You have to set the system  
property to change implementations:  
org.apache.lucene.FSDirectory.class is the property; set it to the  
implementation class name. Been a long time...


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: IndexSearcher and multi-threaded performance

2008-11-12 Thread Mark Miller
I'm thinking about it, so if someone else doesn't get something together 
before I have some free time...


It's just not clear to me at the moment how best to do it.

Michael McCandless wrote:


Any takers for pulling a patch together...?

Mike

Mark Miller wrote:


+1

- Mark


On Nov 12, 2008, at 4:50 AM, Michael McCandless 
<[EMAIL PROTECTED]> wrote:




I think we really should open up a non-static way to choose a 
different FSDirectory impl?  EG maybe add optional Class to 
FSDirectory.getDirectory?  Or maybe give NIOFSDirectory a public 
ctor?  Or something?


Mike

Mark Miller wrote:


Mark Miller wrote:
That's a good point, and points out a bug in solr trunk for me. 
Frankly I don't see how it's done. There is no code I can see/find 
to use it rather than FSDirectory. Still assuming there must be a 
way, but I don't see it...


Ah - brain freeze. What else is new :) You have to set the system 
property to change implementations: 
org.apache.lucene.FSDirectory.class is the property; set it to the 
implementation class name. Been a long time...


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lucene implementation/performance question

2008-11-12 Thread Mark Miller
If you're new to Lucene, this might be a little much (and maybe I'm not 
fully understanding the problem), but you might try:


Add the attributes to the words in a payload with a PayloadAnalyzer. Do 
searching as normal. Use the new PayloadSpanUtil class to get the 
payloads for the matching words. (Think of the PayloadSpanUtil as a 
highlighter - you give it a query, it gives you the payloads of the 
terms that match.) The PayloadSpanUtil class is a bit experimental, but 
I'll fix anything you run into with it.


- Mark

Greg Shackles wrote:

Hi Erick,

Thanks for the response, sorry that I was somewhat vague in the reasoning
for my implementation in the first post.  I should have mentioned that the
word details are not details of the Lucene document, but are attributes
about the word that I am storing.  Some examples are position on the actual
page, color, size, bold/italic/underlined, and most importantly, the text as
it appeared on the page.  The reason the last one matters is that things
like punctuation, spacing and capitalization can vary between the result and
the search term, and can affect how I need to process the results
afterwards.  I am certainly open to the idea of a new approach if it would
improve on things, I admit I am new to Lucene so if there are options I'm
unaware of I'd love to learn about them.

Just to sum it up with an example, let's say we have a page of text that
stores "This is a page of text."  We want to search for the text "of text",
which would span multiple words in the word index.  The final result would
need to contain "of" and "text", along with the details about each as
described before.  I hope this is more helpful!

- Greg

On Wed, Nov 12, 2008 at 11:17 AM, Erick Erickson <[EMAIL PROTECTED]>wrote:

  

If I may suggest, could you expand upon what you're trying to
accomplish? Why do you care about the detailed information
about each word? The reason I'm suggesting this is "the XY
problem". That is, people often ask for details about a specific
approach when what they really need is a different approach

There are TermFrequencies, TermPositions,
TermVectorOffsetInfo and a bunch of other stuff that I don't
know the details of that may work for you if we had
a better idea of what it is you're trying to accomplish...

Best
Erick

On Wed, Nov 12, 2008 at 10:47 AM, Greg Shackles <[EMAIL PROTECTED]>
wrote:



I hope this isn't a dumb question or anything, I'm fairly new to Lucene so
I've been picking it up as I go pretty much.  Without going into too much
detail, I need to store pages of text, and for each word on each page, store
detailed information about it.  To do this, I have 2 indexes:

1) pages: this stores the full text of the page, and identifying information
about it
2) words: this stores a single word, along with the page it was on and is
stored in the order they appear on the page

When doing a search, not only do I need to return the page it was found on,
but also the details of the matching words.  Since I couldn't think of a
better way to do it, I first search the pages index and find any matching
pages.  Then I iterate the words on those pages to find where the match
occurred.  Obviously this is costly as far as execution time goes, but at
least it only has to get done for matching pages rather than every page.
Searches still take way longer than I'd like though, and the bottleneck is
almost entirely in the code to find the matches on the page.

One simple optimization I can think of is to store the pages in smaller
blocks so that the scope of the iteration is made smaller.  This is not
really ideal, since I also need the ability to narrow down results based on
other words that can/can't appear on the same page, which would mean storing
3 full copies of every word on every page (one in each of the 3 resulting
indexes).

I know this isn't a Java performance forum so I'll try to keep this Lucene
related, but has anyone done anything similar to this, or have any
comments/ideas on how to improve it?  I'm in the process of trying to speed
things up since I need to perform many searches often over very large sets
of pages.  Thanks!

- Greg

  


  



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lucene implementation/performance question

2008-11-12 Thread Mark Miller
Here is a great PowerPoint on payloads from Michael Busch: 
www.us.apachecon.com/us2007/downloads/AdvancedIndexingLucene.ppt. 
Essentially, you can store metadata at each term position, so it's an 
excellent place to store attributes of the term - they are very fast to 
load, efficient, etc.


You can check out the spans test classes for a small example using the 
PayloadSpanUtil...it's actually fairly simple and short, and the main 
reason I consider it experimental is that it hasn't really been used too 
much to my knowledge (who knows though). If you have a problem, you'll 
know quickly and I'll fix it quickly. It should work fine though. Overall, 
the approach wouldn't take that much code, so I don't think you'd be out 
a lot of time.


The PayloadSpanUtil takes an IndexReader and a query and returns the 
payloads for the terms in the IndexReader that match the query. If you 
end up with multiple docs in the IndexReader, be sure to isolate the 
query down to the exact doc you want the payloads from (the Span scoring 
mode of the highlighter actually puts the doc in a fast MemoryIndex 
which only holds one doc, and uses an IndexReader from the MemoryIndex).
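
As a rough illustration of the indexing side (a hedged sketch against the
2.4 Token API; AttributePayloadFilter and encodeAttributes are hypothetical
names for your own code, not Lucene classes):

import java.io.IOException;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.index.Payload;

// Hypothetical filter that attaches per-word attributes as payloads.
public class AttributePayloadFilter extends TokenFilter {
  public AttributePayloadFilter(TokenStream input) {
    super(input);
  }

  public Token next(Token reusableToken) throws IOException {
    Token token = input.next(reusableToken);
    if (token != null) {
      // encodeAttributes(...) is your own serialization of color,
      // position, bold/italic, original text, etc.
      token.setPayload(new Payload(encodeAttributes(token)));
    }
    return token;
  }

  private byte[] encodeAttributes(Token token) {
    return new byte[0]; // placeholder: serialize the word's attributes
  }
}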


Greg Shackles wrote:

Hey Mark,

This sounds very interesting.  Is there any documentation or examples I
could see?  I did a quick search but didn't really find much.  It might just
be that I don't know how payloads work in Lucene, but I'm not sure how I
would see this actually doing what I need.  My reasoning is this...you'd
have an index that stores all the text for a particular page.  Would you be
able to attach payload information to individual words on that page?  In my
head it seems like that would be the job of a second index, which is exactly
why I added the word index.

Any details you can give would be great as I need to keep moving on this
project quickly.  I will also say that I'm somewhat wary of using an
experimental class since this is a really important project that really
won't be able to wait on a lot of development cycles to get the class fully
working.  That said, if it can give me serious speed improvements it's
definitely worth considering.

- Greg


On Wed, Nov 12, 2008 at 12:01 PM, Mark Miller <[EMAIL PROTECTED]> wrote:

  

If you're new to Lucene, this might be a little much (and maybe I'm not
fully understanding the problem), but you might try:

Add the attributes to the words in a payload with a PayloadAnalyzer. Do
searching as normal. Use the new PayloadSpanUtil class to get the payloads
for the matching words. (Think of the PayloadSpanUtil as a highlighter - you
give it a query, it gives you the payloads of the terms that match.) The
PayloadSpanUtil class is a bit experimental, but I'll fix anything you run
into with it.

- Mark


Greg Shackles wrote:



Hi Erick,

Thanks for the response, sorry that I was somewhat vague in the reasoning
for my implementation in the first post.  I should have mentioned that the
word details are not details of the Lucene document, but are attributes
about the word that I am storing.  Some examples are position on the
actual
page, color, size, bold/italic/underlined, and most importantly, the text
as
it appeared on the page.  The reason the last one matters is that things
like punctuation, spacing and capitalization can vary between the result
and
the search term, and can affect how I need to process the results
afterwards.  I am certainly open to the idea of a new approach if it would
improve on things, I admit I am new to Lucene so if there are options I'm
unaware of I'd love to learn about them.

Just to sum it up with an example, let's say we have a page of text that
stores "This is a page of text."  We want to search for the text "of
text",
which would span multiple words in the word index.  The final result would
need to contain "of" and "text", along with the details about each as
described before.  I hope this is more helpful!

- Greg

On Wed, Nov 12, 2008 at 11:17 AM, Erick Erickson <[EMAIL PROTECTED]
  

wrote:



  

If I may suggest, could you expand upon what you're trying to
accomplish? Why do you care about the detailed information
about each word? The reason I'm suggesting this is "the XY
problem". That is, people often ask for details about a specific
approach when what they really need is a different approach

There are TermFrequencies, TermPositions,
TermVectorOffsetInfo and a bunch of other stuff that I don't
know the details of that may work for you if we had
a better idea of what it is you're trying to accomplish...

Best
Erick

On Wed, Nov 12, 2008 at 10:47 AM, Greg Shackles <[EMAIL PROTECTED]>
wrote:





I hope this isn't a dumb question or anything, I'm fairly new to Lucene so
I've been picking it up as I go pretty much.  Without going

Re: Lucene implementation/performance question

2008-11-12 Thread Mark Miller

Greg Shackles wrote:

Thanks!  This all actually sounds promising, I just want to make sure I'm
thinking about this correctly.  Does this make sense?

Indexing process:

1) Get list of all words for a page and their attributes, stored in some
sort of data structure
2) Concatenate the text from those words (space separated) into a string
that represents the entire page
3) When adding the page document to the index, run it through a custom
analyzer that attaches the payloads to the tokens
  * this would have to follow along in the word list from #1 to get the
payload information for each token
  * would also have to tokenize the word we are storing to see how many
Lucene tokens it would translate to (to make sure the right payloads go with
the right tokens)
  
Right, sounds like you have it spot on. That second * under 3 looks like 
the potentially tricky part.

I haven't totally analyzed the searching process yet since I want to get my
head around the storage part first, but I imagine that would be the easier
part anyway.  Does this approach sound reasonable?
  

Sounds good.

My other concern is your comment about isolating results.  If I'm reading it
correctly, it means that I'd have to do the search in multiple passes, one
to get the individual docs containing the matches, and then one query for
each of those to get the payloads within them?
  
Right...you'd do it essentially how highlighting works...you do the 
search to get the docs of interest, and then redo the search somewhat to 
get the highlights/payloads for an individual doc at a time. You are 
redoing some work, but if you think about it, getting that info for every 
match (there could be tons) doesn't make much sense when someone might 
just look at the top couple results, or say 10 at a time. Depends on 
your use case whether it's feasible or not though. Most find it efficient 
enough to do highlighting with, so I'm assuming it should be good enough 
here.
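
Something like this per interesting doc (a hedged sketch; 'reader' and
'query' are assumed to be isolated down to the one doc, as described above):

import java.util.Collection;
import java.util.Iterator;
import org.apache.lucene.search.payloads.PayloadSpanUtil;

// Fetch the payloads of the terms that match 'query' in 'reader'.
PayloadSpanUtil psu = new PayloadSpanUtil(reader);
Collection payloads = psu.getPayloadsForQuery(query); // Collection of byte[]
for (Iterator it = payloads.iterator(); it.hasNext();) {
  byte[] attrs = (byte[]) it.next(); // decode your word attributes here
}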

Thanks again for your help on this one.

- Greg


On Wed, Nov 12, 2008 at 12:52 PM, Mark Miller <[EMAIL PROTECTED]> wrote:

  

Here is a great PowerPoint on payloads from Michael Busch:
www.us.apachecon.com/us2007/downloads/AdvancedIndexingLucene.ppt.
Essentially, you can store metadata at each term position, so it's an
excellent place to store attributes of the term - they are very fast to
load, efficient, etc.

You can check out the spans test classes for a small example using the
PayloadSpanUtil...it's actually fairly simple and short, and the main reason
I consider it experimental is that it hasn't really been used too much to my
knowledge (who knows though). If you have a problem, you'll know quickly and
I'll fix it quickly. It should work fine though. Overall, the approach wouldn't
take that much code, so I don't think you'd be out a lot of time.

The PayloadSpanUtil takes an IndexReader and a query and returns the
payloads for the terms in the IndexReader that match the query. If you end
up with multiple docs in the IndexReader, be sure to isolate the query down
to the exact doc you want the payloads from (the Span scoring mode of the
highlighter actually puts the doc in a fast MemoryIndex which only holds one
doc, and uses an IndexReader from the MemoryIndex).


Greg Shackles wrote:



Hey Mark,

This sounds very interesting.  Is there any documentation or examples I
could see?  I did a quick search but didn't really find much.  It might
just
be that I don't know how payloads work in Lucene, but I'm not sure how I
would see this actually doing what I need.  My reasoning is this...you'd
have an index that stores all the text for a particular page.  Would you
be
able to attach payload information to individual words on that page?  In
my
head it seems like that would be the job of a second index, which is
exactly
why I added the word index.

Any details you can give would be great as I need to keep moving on this
project quickly.  I will also say that I'm somewhat wary of using an
experimental class since this is a really important project that really
won't be able to wait on a lot of development cycles to get the class
fully
working.  That said, if it can give me serious speed improvements it's
definitely worth considering.

- Greg


On Wed, Nov 12, 2008 at 12:01 PM, Mark Miller <[EMAIL PROTECTED]>
wrote:



  

If your new to Lucene, this might be a little much (and maybe I am not
fully understand the problem), but you might try:

Add the attributes to the words in a payload with a PayloadAnalyzer. Do
searching as normal. Use the new PayloadSpanUtil class to get the
payloads
for the matching words. (Think of the PayloadSpanUtil as a highlighter -
you
give it a query, it gives you the payloads to the terms that match). The
PayloadSpanUtil class is a bit experimental, but I'll fix anything you
run
into with it.

- Mark


Greg Shackles wrote:





Hi Erick,

Thanks for the re

Re: LUCENE-831 (complete cache overhaul) -> mem use

2008-11-14 Thread Mark Miller
It's hard to predict the future of LUCENE-831. I would bet that it will 
end up in Lucene at some point in one form or another, but it's hard to 
say if that form will be what's in the available patches (I'm a contrib 
committer so I won't have any real say in that, so take that prediction 
with a grain of salt). It has strong ties to other issues and a 
committer hasn't really had their whack at it yet.


Having said that though, LUCENE-831 allows for two types for dealing 
with field values: either the old style int/string/long/etc arrays, or, 
for a small speed hit and faster reopens, an ArrayObject type that is 
basically an Object that can provide access to one or two real or 
virtual arrays. So technically you could use an ArrayObject that had a 
sparse implementation behind it. Unfortunately, you would have to 
implement new CacheKeys to do this. Trivial to do, but it reveals our 
LUCENE-831 problem of exponential CacheKey increases with every new 
little option/idea and the juggling of which to use. I haven't thought 
about it, but I'm hoping an API tweak can alleviate some of this.


- Mark

Britske wrote:
Hi, 


I recently saw activity on LUCENE-831 (Complete overhaul of FieldCache
API/Implementation) which I have interest in. 
I posted previously on this with my concern that given the current default

cache I sometimes get OOM-errors because I have a lot of fields which are
sorted on, which ultimately causes the fieldcache to grow greater then
available RAM. 


Ultimately I want to subclass the new pluggable FieldCache of LUCENE-831 to
offload to disk (using ehcache or memcachedB or something) but haven't found
the time yet. 


What I would like to know for now is if perhaps the newly implemented
standard cache in LUCENE-831 uses another strategy of caching than the
standard Fieldcache in Lucene. 


i.e. the normal cache consumes memory while generating a fieldcache entry
for every document in Lucene, even though a document may not have that field
set. 


Since my documents are very sparse in these fields I want to sort on, it
would differ a lot if documents that don't have the field in question set
didn't add up in the used memory. 

So am I lucky? Or would I indeed have to cook up something myself? 
Thanks and best regards,


Geert-Jan


  


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: LUCENE-831 (complete cache overhaul) -> mem use

2008-11-15 Thread Mark Miller
Like I said, it's pretty easy to add this, but it's also going to suck. 
Kind of exposes the fact that it's missing the right extensibility at the 
moment. Things are still a bit ugly overall.



You're going to need new CacheKeys for the data types you want to support. 
A CacheKey builds and provides access to the field data and is simply:



public abstract class CacheKey {
  public abstract CacheData buildData(IndexReader r);
  public abstract boolean equals(Object o);
  public abstract int hashCode();
  public boolean isMergable();
  public CacheData mergeData(int[] starts, CacheData[] data);
  public boolean usesObjectArray();
}


For a sparse storage implementation you would use an object array, so 
have usesObjectArray return true; isMergable can then be false, and 
you don't have to support the mergeData method.



In buildData you will load your object array and return it. Here is an 
array-backed IntObjectArrayCacheKey build method:


public CacheData buildData(IndexReader reader) throws IOException {
  final int[] retArray = getIntArray(reader);
  ObjectArray fieldValues = new ObjectArray() {
    public Object get(int index) {
      return new Integer(retArray[index]);
    }
  };
  return new CacheData(fieldValues);
}


protected int[] getIntArray(IndexReader reader) throws IOException {
  final int[] retArray = new int[reader.maxDoc()];
  TermDocs termDocs = reader.termDocs();
  TermEnum termEnum = reader.terms(new Term(field, ""));
  try {
    do {
      Term term = termEnum.term();
      if (term == null || term.field() != field)
        break;
      int termval = parser.parseInt(term.text());
      termDocs.seek(termEnum);
      while (termDocs.next()) {
        retArray[termDocs.doc()] = termval;
      }
    } while (termEnum.next());
  } finally {
    termDocs.close();
    termEnum.close();
  }
  return retArray;
}


So it should be fairly straightforward to return an ObjectArray with a 
sparse implementation behind it from your new CacheKey 
(SparseIntObjectArrayCacheKey or something).
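
For instance, the sparse buildData could look roughly like this (a sketch
assuming the patch's ObjectArray/CacheData shapes shown above; the map-based
storage is my assumption, not part of the patch):

public CacheData buildData(IndexReader reader) throws IOException {
  // Only docs that actually have the field consume memory.
  final Map docToValue = new HashMap(); // Integer docId -> Integer value
  TermDocs termDocs = reader.termDocs();
  TermEnum termEnum = reader.terms(new Term(field, ""));
  try {
    do {
      Term term = termEnum.term();
      if (term == null || term.field() != field)
        break;
      Integer termval = new Integer(parser.parseInt(term.text()));
      termDocs.seek(termEnum);
      while (termDocs.next()) {
        docToValue.put(new Integer(termDocs.doc()), termval);
      }
    } while (termEnum.next());
  } finally {
    termDocs.close();
    termEnum.close();
  }
  ObjectArray fieldValues = new ObjectArray() {
    public Object get(int index) {
      return docToValue.get(new Integer(index)); // null if field not set
    }
  };
  return new CacheData(fieldValues);
}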


Now some more ugliness: you can turn on the ObjectArray CacheKeys by 
setting the system property 'use.object.array.sort' to true. This will 
cause FieldSortedHitQueue to return ScoreDocComparators that use the 
standard ObjectArray CacheKeys: IntObjectArrayCacheKey, 
FloatObjectArrayCacheKey, etc. The method that builds each comparator 
type knows what type to build for and whether to use primitive arrays or 
ObjectArrays, i.e. (from FieldSortedHitQueue):



static ScoreDocComparator comparatorDoubleOA(final IndexReader reader, 
    final String fieldname)



does this (it has to provide the CacheKey and know the return type):


final ObjectArray fieldOrder = (ObjectArray) reader.getCachedData(
    new DoubleObjectArrayCacheKey(field)).getCachePayload();



So you have to either change all of the ObjectArray comparator builders 
to use your CacheKeys:



final ObjectArray fieldOrder = (ObjectArray) reader.getCachedData(
    new SparseIntObjectArrayCacheKey(field)).getCachePayload();



Or you have to add more options in 
FieldSortedHitQueue.CacheEntry.buildData(IndexReader reader) and more 
static comparator builders in FieldSortedHitQueue that use the right 
CacheKeys. Obviously not very extensibility-friendly at the moment. I'm 
sure with some thought, things could be much better. If you decide to 
jump into any of this, let me know if you have any suggestions or feedback.



- Mark



Britske wrote:

That ArrayObject suggestion makes sense to me. It almost seemed as if
you were saying this option (or at least the interfaces needed to
implement it) is already available as 1 of the 2 options in 831? 


Could you give me a hint at where I have to be looking to extend what you're
suggesting? 
A new Cache, CacheFactory and CacheKey implementation for all types of
cache keys? This may sound a bit ignorant, but it would be my first time
getting my head around the internals of an API instead of merely using it
to embed in a client application, so any help is highly appreciated.  


Thanks for your help,

Geert-Jan



markrmiller wrote:
  
It's hard to predict the future of LUCENE-831. I would bet that it will 
end up in Lucene at some point in one form or another, but it's hard to 
say if that form will be what's in the available patches (I'm a contrib 
committer so I won't have any real say in that, so take that prediction 
with a grain of salt). It has strong ties to other issues and a 
committer hasn't really had their whack at it yet.


Having said that though, LUCENE-831 allows for two types for dealing 
with field values: either the old style int/string/long/etc arrays, or, 
for a small speed hit and faster reopens, an ArrayObject type that is 
basically an Object that can provide access to one or two real or 
virtual arrays. So technically you could use an ArrayObject that had a 
sparse i

Re: InstantiatedIndex help

2008-11-16 Thread Mark Miller
Check out the docs at: 
http://lucene.apache.org/java/2_4_0/api/contrib-instantiated/index.html


There is a performance graph there to check out.

The code should be fairly straightforward - you can make an 
InstantiatedIndex that's empty, or seed it with an IndexReader. Then you 
can make an InstantiatedIndexReader or Writer, which take the 
InstantiatedIndex as a constructor arg.


You should be able to just wrap that InstantiatedIndexReader in a regular 
Searcher.


Darren Govoni wrote:

Hi gang,
   I am trying to trace the 2.4 API to create an InstantiatedIndex, but
it's rather difficult to connect directory, reader, searcher, index etc.
just reading the javadocs. 


I have a (POI - plain old index) directory already and want to
create a faster InstantiatedIndex and IndexSearcher to query it like
before. What's the proper order to do this? 


Also, if anyone has any empirical data on the performance or reliability
of InstantiatedIndex, I'd be curious.

Thanks for the tips!
Darren


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

  



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: InstantiatedIndex help

2008-11-16 Thread Mark Miller

Can you start with an empty index? Then how about:

// Adding these

   iindex = InstantiatedIndex()
   ireader = iindex.indexReaderFactory()
   isearcher = IndexSearcher(ireader)

If you want a copy from another IndexReader though, you have to get that reader 
from somewhere, right?
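
In Java that pseudo-code reads roughly as (a hedged sketch against the 2.4
contrib-instantiated API):

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.instantiated.InstantiatedIndex;
import org.apache.lucene.store.instantiated.InstantiatedIndexReader;

// Start empty, or seed it with an existing reader instead:
// new InstantiatedIndex(IndexReader.open(directory))
InstantiatedIndex iindex = new InstantiatedIndex();
InstantiatedIndexReader ireader = iindex.indexReaderFactory();
IndexSearcher isearcher = new IndexSearcher(ireader);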

- Mark 




Darren Govoni wrote:

Hi Mark,
  Thanks for the tips. Here's what I will try (psuedo-code)

endirectory = RAMDirectory("index/dictionary.en")
ensearcher = IndexSearcher(endirectory)
// Adding these
reader = ensearcher.getIndexReader()
iindex = InstantiatedIndex(reader)
ireader = iindex.indexReaderFactory()
isearcher = IndexSearcher(ireader)

Kind of a roundabout way to get an InstantiatedIndex I guess, but maybe
there's a briefer way?

Thank you.
Darren

On Sun, 2008-11-16 at 10:50 -0500, Mark Miller wrote:
  
Check out the docs at: 
http://lucene.apache.org/java/2_4_0/api/contrib-instantiated/index.html


There is a performance graph there to check  out.

The code should be fairly straightforward - you can make an 
InstantiatedIndex thats empty, or seed it with an IndexReader. Then you 
can make an InstantiatedReader or Writer, which take the 
InstantiatedIndex as a constructor arg.


You should be able to just wrap that InstantiatedReader in a regular 
Searcher.


Darren Govoni wrote:


Hi gang,
   I am trying to trace the 2.4 API to create an InstantiatedIndex, but
its rather difficult to connect directory,reader,search,index etc just
reading the javadocs. 


I have a (POI - plain old index) directory already and want to
create a faster InstantiatedIndex and IndexSearcher to query it like
before. What's the proper order to do this? 


Also, if anyone has any empirical data on the performance or reliability
of InstantiatedIndex, I'd be curious.

Thanks for the tips!
Darren


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

  
  

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

  



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Spread of lucene score

2008-11-19 Thread Mark Miller

excitingComm2 wrote:

Hi everybody,

as far as I know the Lucene score is an arbitrary number between 0.0 and
1.0.
Is it correct that the scores in my result set are always normalised to
this spread, or is it possible to get higher scores?

Regards,
John W.
  
Hits is the class that did the normalizing, and it's deprecated. TopDocs 
didn't normalize last I checked, so you could get scores > 1 from there.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lucene implementation/performance question

2008-11-20 Thread Mark Miller
Yeah, discussion came up on order and I believe we punted - it's up to 
you to track order and sort at the moment. I think that was to prevent 
those that didn't need it from paying the sort cost, but I have to go 
find that discussion again (maybe it's in the issue?). I'll look at the 
whole idea again though.


Greg Shackles wrote:

On Wed, Nov 19, 2008 at 12:33 PM, Greg Shackles <[EMAIL PROTECTED]> wrote:

  

In the searching phase, I would run the search across all page documents,
and then for each of those pages, do a search with
PayloadSpanUtil.getPayloadsForQuery that made it so it only got payloads for
each page at a time.  The function returns a Collection of Payloads as far
as I can tell, so is there any way of knowing which payloads go together?
That is to say, if you were to do a search for "lucene rocks" on the page
and it appeared 3 times, you would get back 6 payloads in total.  Is there a
quick way of knowing how to group them in the collection?




Just a follow-up on my post now that I was able to see what the real data
looks like when it comes back from PayloadSpanUtil.  The order of payload
terms in the collection doesn't seem useful, as I suspect it is somehow
related to the order they are stored in the index itself.  Because of that,
grouping them is going to be difficult as I suspected, but this seems like
something Lucene should be able to do for me.  Is that not correct?  I'd
like to keep as much of the logic as possible out of my own implementation
for the sake of performance so if there is some way to do this, I would love
to know.  Thanks!

By the way, the Payloads feature is really cool! Definitely way better than
how I was doing things originally.  : )

- Greg

  



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: # of fields, performance

2008-12-02 Thread Mark Miller
There is not much impact as long as you turn off Norms for the  
majority of them.
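
For example, something like this per field (a hedged sketch with 2.4 field
constants; the field names and values are placeholders):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

// Turning norms off for fields you won't score on avoids the
// one-byte-per-field-per-document norms arrays.
Document doc = new Document();
doc.add(new Field("attr_color", "red",
    Field.Store.NO, Field.Index.NOT_ANALYZED_NO_NORMS));
doc.add(new Field("body", bodyText,
    Field.Store.YES, Field.Index.ANALYZED)); // keep norms where scoring matters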


- Mark


On Dec 2, 2008, at 8:47 AM, Darren Govoni <[EMAIL PROTECTED]> wrote:


Hi,
 I saw this question asked before without a clear answer. Pardons if I
missed it in the archive elsewhere.

Is there a serious degradation of performance when using high number  
of

fields per document? Like 100's? Is the impact more on the write than
the read?

What are the performance characteristics with a high number of fields
and is anyone using indexes this way?

thank you for any thoughts.

Darren


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: lucene nicking my memory ?

2008-12-03 Thread Mark Miller
Careful here. Not only do you need to pass -server, but you need the 
ability to use it :) It will silently not work if it's not there, I 
believe. Oddly, the JRE doesn't seem to come with the server HotSpot 
implementation. The JDK always does appear to. Probably varies by OS to 
some degree.


Some awesome options for visually watching garbage collection:

straight visualgc
the netbeans visualgc plugin
the awesome visualvm and the visualgc plugin

Eric Bowman wrote:

Are you not passing -server on the command line? You need to do that.

In my experience with Sun JVM 1.6.x, the default gc strategy is really
amazingly good, as long as you pass -server.

If passing -server doesn't fix it, I would recommend enabling the
various verbose GC logs and watching what happens there, and using the
Sun tools to analyze it a bit. If you do require specific heap tuning,
the verbose gc logging will steer you in the right direction.

Good luck!

-Eric

Michael McCandless wrote:
  

Are you actually hitting OOME?

Or, you're watching heap usage and it bothers you that the GC is
taking a long time (allowing too much garbage to use up heap space)
before sweeping?

One thing to try (only for testing) might be a lower and lower -Xmx
until you do hit OOME; then you'll know the "real" memory usage of the
app.

Mike

Magnus Rundberget wrote:



Sure,

Tried with the following
Java version: build 1.5.0_16-b06-284 (dev), 1.5.0_12 (production)
OS : Mac OS/X Leopard(dev) and Windows XP(dev), Windows 2003
(production)
Container : Jetty 6.1 and Tomcat 5.5 (latter is used both in dev and
production)


current jvm options
-Xms512m -Xmx1024M -XX:MaxPermSize=256m
... tried a few gc settings as well but nothing that has helped
(rather slowed things down)

production hw running 2 XEON dual core processors

in production our memory reaches the 1024 limit after a while (a few
hours) and at some point it stops responding to forced gc (using
jconsole).

need to dig quite a bit more to figure out the exact prod settings.
But safe to say the memory usage pattern can be recreated on
different hardware configs, with different os's, different 1.5 jvms
and different containers (jetty and tomcat).



cheers
Magnus



On 3. des.. 2008, at 13.10, Glen Newton wrote:

  

Hi Magnus,

Could you post the OS, version, RAM size, swapsize, Java VM version,
hardware, #cores, VM command line parameters, etc? This can be very
relevant.

Have you tried other garbage collectors and/or tuning as described in
http://java.sun.com/javase/technologies/hotspot/gc/gc_tuning_6.html?

2008/12/3 Magnus Rundberget <[EMAIL PROTECTED]>:


Hi,

We have an application using Tomcat, Spring etc and Lucene 2.4.0.
Our index is about 100MB (in test) and has about 20 indexed fields.

Performance is pretty good, but we are experiencing a very high
usage of
memory when searching.

Looking at JConsole during a somewhat silly scenario (but
illustrates the
problem);
(Allocated 512 MB Min heap space, max 1024)

0. Initially memory usage is about 70MB
1. Search for word "er", heap memory usage goes up by 100-150MB
1.1 Wait for 30 seconds... memory usage stays the same (ie no gc)
2. Search by word "og", heap memory usage goes up another 50-100MB
2.1 See 1.1

...and so on until it seems to reach the 512 MB limit, and then a
garbage
collection is performed
i.e garbage collection doesn't seem to occur until it "hits the roof"

We believe the scenario is similar in production, were our heap
space is
limited to 1.5 GB.


Our search is basically as follows
--
1. Open an IndexSearcher
2. Build a Boolean Query searching across 4 fields (title, summary,
content
and daterangestring MMDD)
2.1 Sort on title
3. Perform search
4. Iterate over hits to build a set of custom result objects
(pretty small,
as we dont include content in these)
5. Close searcher
6. Return result objects.
  

You should not close the searcher: it can be shared by all queries.
What happens when you warm Lucene with a (large) number of queries: do
things stabilize over time?

A 100MB index is (relatively) very small for Lucene (I have indexes >
100GB). What kind of response times are you getting, independent of
memory usage?

-glen



We have tried various options based on entries on this mailing list;
a) Cache the IndexSearcher - Same results
b) Remove sorting - Same result
c) In point 4 only iterating over a limited amount of hits rather
than whole
collection - Same result in terms of memory usage, but obviously
increased
performance
d) Using RamDirectory vs FSDirectory - Same result only initial
heap usage
is higher using ramdirectory (in conjuction with cached indexsearcher)


Doing some profiling using YourKit shows a huge number of char[],
int[] and
string[], and ever increasing number of lucene related objects.



Reading through the mailing lists, suspicions are that our problem is
related to ThreadLocals and memory not being re

Re: NPE inside org.apache.lucene.index.SegmentReader.getNorms

2008-12-03 Thread Mark Miller

Sounds familiar. This may actually be in JIRA already.

- Mark


On Dec 3, 2008, at 6:25 PM, "Teruhiko Kurosaka" <[EMAIL PROTECTED]>  
wrote:



Mike,
You are right.  There was an error on my part. I think
I was, in effect, making a SpanNearQuery object of:
  new SpanNearQuery(new SpanQuery[0], 0, true);



-Original Message-
From: Michael McCandless [mailto:[EMAIL PROTECTED]
Sent: Wednesday, December 03, 2008 10:47 AM
To: java-user@lucene.apache.org
Subject: Re: NPE inside  
org.apache.lucene.index.SegmentReader.getNorms



Actually I think something "outside" Lucene is probably
setting that field.

How did you create the Query that you are searching on?

Mike



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Open IndexReader read-only

2008-12-08 Thread Mark Miller

Chris Bamford wrote:


So does that mean if you don't explicitly open an IndexReader, the 
IndexSearcher will do it for you?  Or what?


Right. The IndexReader takes a Directory, and the IndexSearcher takes an 
IndexReader - there are sugar constructors though. An IndexSearcher 
will also accept a String file path, which will be used to create a 
Directory, which is used to create an IndexReader. It will also take a 
Directory, which will be used to create an IndexReader. And it will 
also just accept the IndexReader.


So you have to find how that IndexReader is being created (or where) and 
change the code so that you get to create it, and when you do, do it 
read-only. It should be easier than that roundabout info sounds.


- Mark

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Fragment Highlighter Phrase?

2008-12-08 Thread Mark Miller

Ian Vink wrote:

Is there a way to get phrases counted in the list of fragments that come
back from Highlighter.getBestFragments() in general?
It seems to only take words into account.

Ian

  
Not sure I fully understand, but have you tried the SpanScorer? It 
allows the Highlighter to work with phrase/span queries.


- Mark

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Open IndexReader read-only

2008-12-08 Thread Mark Miller

Look for the static factory methods on IndexReader.

- Mark

Chris Bamford wrote:

Thanks Mark.

I have identified the spot where I need to do the surgery.  However, I 
discover that IndexReader is abstract, but it seems crazy that I need 
to make a concrete class for which I have no need to add any of my own 
logic...  Is there a suitable subclass I can use?  The documented ones 
- FilterIndexReader, InstantiatedIndexReader, MultiReader, 
ParallelReader - all seem too complicated for what I need.  My only 
requirement is to open it read-only!


Am I missing something?

Mark Miller wrote:

Chris Bamford wrote:


So does that mean if you don't explicitly open an IndexReader, the 
IndexSearcher will do it for you?  Or what?


Right. The IndexReader takes a Directory, and the IndexSearcher takes 
an IndexReader - there are sugar constructors though - An 
IndexSearcher will also accept a String file path, which will be used 
to create a Directory which is used to create an IndexReader. It will 
also take a Directory, which will be used to create an IndexReader. 
It will also just accept the IndexReader.


So you have to find how that IndexReader is being created (or where) 
and change the code so that you get to create it, and when you do, do 
it read-only. It should be easier than that roundabout info sounds.


- Mark

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]







-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Open IndexReader read-only

2008-12-08 Thread Mark Miller

Chris Bamford wrote:

Mark

> Look for the static factory methods on IndexReader.

I take it you mean IndexReader.open (dir, true) ?


Yeah.

If so, how do I then pass that into DelayCloseIndexSearcher() so that 
I can continue to rely on all the existing calls like:


IndexReader reader = contentSearcher.getIndexReader();

Put another way, how do I associate the static IndexReader with an 
IndexSearcher object so I can use getIndexReader() to get it again?
Find where that contentSearcher is being created. Use a different 
constructor to create the Searcher - use the one that takes an 
IndexReader. Now you control the IndexReader creation, and you can use 
the read-only option when you create it. That Searcher is 
either using a constructor that takes an IndexReader, or a Directory, or 
a String. If it's using a String constructor, instead use the Directory 
factory that takes a String, make a Directory, and use it to make an 
IndexReader that you build the IndexSearcher with. If it's using a 
Directory, use that Directory to make the IndexReader that is used for 
your IndexSearcher.
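
Put together, something like this (a hedged sketch; the path is a
placeholder):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

Directory dir = FSDirectory.getDirectory("/path/to/index");
IndexReader reader = IndexReader.open(dir, true); // true = read-only
IndexSearcher contentSearcher = new IndexSearcher(reader);
// the existing calls keep working:
IndexReader sameReader = contentSearcher.getIndexReader();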




Thanks for your continued help with this  :-)

Chris

Mark Miller wrote:

Look for the static factory methods on IndexReader.

- Mark

Chris Bamford wrote:

Thanks Mark.

I have identified the spot where I need to do the surgery.  However, 
I discover that IndexReader is abstract, but it seems crazy that I 
need to make a concrete class for which I have no need to add any of 
my own logic...  Is there a suitable subclass I can use?  The 
documented ones - FilterIndexReader, InstantiatedIndexReader, 
MultiReader, ParallelReader - all seem too complicated for what I 
need.  My only requirement is to open it read-only!


Am I missing something?

Mark Miller wrote:

Chris Bamford wrote:


So does that mean if you don't explicitly open an IndexReader, the 
IndexSearcher will do it for you?  Or what?


Right. The IndexReader takes a Directory, and the IndexSearcher 
takes an IndexReader - there are sugar constructors, though: an 
IndexSearcher will also accept a String file path, which will be 
used to create a Directory, which in turn is used to create an 
IndexReader. It will also take a Directory, which will be used to 
create an IndexReader. Or it will just accept the IndexReader directly.


So you have to find how that IndexReader is being created (or 
where) and change the code so that you get to create it, and when 
you do, do it read-only. It should be easier than that roundabout 
info sounds.


- Mark

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]







-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]







-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Has anyone written SpanFuzzyQuery?

2008-12-09 Thread Mark Miller
http://issues.apache.org/jira/browse/LUCENE-522

Note the bugs mentioned at the bottom.

- Mark



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: GWT port of Lucene's QueryParser

2008-12-11 Thread Mark Miller

Paul Libbrecht wrote:


Hello again list,

Has anyone tried to port, or simply run, Lucene's QueryParser under GWT?
It looks like a very nice way to provide direct rendering 
of the query interpretation (it could probably be made into a whole 
editor, e.g. removing or selecting parts of the query).


thanks in advance

paul
I don't think it's worth the effort, Paul (though it sounds like a fun 
thing to try, as long as JavaCC sticks to GWT-compatible core classes). 
It seems a lot easier to just run the QueryParser server side and 
move the results back and forth with RPC. As a side note, Mark Harwood 
worked on a cool GWT port of Luke.


- Mark
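
[Editor's note: a rough sketch of the server-side half of that 
suggestion, against the 2.4-era QueryParser API. The service class and 
field name are hypothetical, and the GWT RemoteServiceServlet plumbing 
is omitted - the client would call a method like this asynchronously 
over RPC.]

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

public class QueryPreviewService {
    // Hypothetical RPC endpoint body: parse the user's query server side
    // and send back a rendering the browser can display.
    public String preview(String userQuery) throws Exception {
        QueryParser parser = new QueryParser("contents", new StandardAnalyzer());
        Query q = parser.parse(userQuery);
        return q.toString("contents"); // how Lucene actually interpreted it
    }

    public static void main(String[] args) throws Exception {
        System.out.println(new QueryPreviewService()
                .preview("title:lucene +gwt \"query parser\""));
    }
}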

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Field.omitTF

2008-12-18 Thread Mark Miller

Drops positions as well.

- Mark


On Dec 18, 2008, at 4:57 PM, "John Wang"  wrote:


Hi:
  In Lucene 2.4, when Field.omitTF() is called, payload is disabled as 
well. Is this intentional? My understanding is payload is independent 
from the term frequencies.

Thanks

-John


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Field.omitTF

2008-12-18 Thread Mark Miller
No, not a bug - it's certainly the intended behavior (though the name is a 
bit tricky, isn't it? I've actually thought about that in the past 
myself). If you check out the javadoc on Fieldable you'll find:


  /** Expert:
   *
   * If set, omit term freq, positions and payloads from postings for this field.
   */
  void setOmitTf(boolean omitTf);

- Mark
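
[Editor's note: to make that concrete, a minimal sketch of turning it 
on, assuming the 2.4 Field API; the field name and text are made up.]

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class OmitTfExample {
    public static void main(String[] args) {
        Document doc = new Document();
        Field body = new Field("body", "some analyzed text",
                Field.Store.NO, Field.Index.ANALYZED);
        // Drops term frequencies - and with them positions and payloads -
        // from the postings for this field.
        body.setOmitTf(true);
        doc.add(body);
    }
}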

John Wang wrote:

Thanks Mark! I don't think it is documented (at least not in the ones I've read);
should this be considered a bug, or ... ?

Thanks

-John

On Thu, Dec 18, 2008 at 2:05 PM, Mark Miller  wrote:

  

Drops positions as well.

- Mark



On Dec 18, 2008, at 4:57 PM, "John Wang"  wrote:

 Hi:


 In lucene 2.4, when Field.omitTF() is called, payload is disabled as
well. Is this intentional? My understanding is payload is independent from
the term frequencies.

Thanks

-John

  

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org





  



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Approximate release date for Lucene 2.9

2008-12-18 Thread Mark Miller

Well, look at the issues and see for yourself :)

It's a subjective call, I think. Here's my take:

There are not going to be too many sweeping changes in the next release. 
There are tons of little bug fixes and improvements, but not a lot of 
the bullet-point type stuff that you mention in your wishlist. It's a 
whole lot of little steps forward.

When it comes to sorting, there are a couple of possible goodies coming in 
the next release:

TrieRangeQuery has been added to contrib. Super awesome, super 
efficient, large scale sorting.

Work is ongoing to change searching semantics so that sorting is much 
faster in many cases. In fact, there may be search speed improvements 
across the board in many cases (don't quote me). Sort fieldcache 
loading in the multi-segment case will likely also be *blazingly* 
faster. Also, Filters and FieldCaches may be pushed down to a single 
segment, making reopening sort fieldcaches *much* more efficient. That's 
a nice step towards realtime.

RangeQuery, PrefixQuery and WildcardQuery will all have a constant-score 
mode as well - this avoids BooleanQuery max-clause limits and is often 
much faster on very large indexes.

LocalLucene, a very cool bit of code that allows geo search, might make 
contrib for the next release.

Beyond that, there are a few more little gems, but it's a lot of little 
fixes and improvements more than big features.

Column-stride fields and flexible indexing will not be in the next 
release in my opinion, but a lot of progress towards flexible indexing 
has been made.

Keep in mind that's a biased view of the next release - I worked on two 
of those issues. Be sure to take it all with a healthy grain of salt.


- Mark
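[Editor's note: a minimal sketch of the constant-score mode mentioned 
above, assuming the MultiTermQuery rewrite-method API as it later 
shipped in 2.9; the index path and field name are hypothetical.]

import java.io.File;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MultiTermQuery;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class ConstantScorePrefixExample {
    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(new File("/path/to/index"));
        IndexReader reader = IndexReader.open(dir, true);
        IndexSearcher searcher = new IndexSearcher(reader);

        PrefixQuery q = new PrefixQuery(new Term("title", "luc"));
        // Rewrite to a constant-score filter instead of a BooleanQuery over
        // every matching term - no TooManyClauses on huge term sets.
        q.setRewriteMethod(MultiTermQuery.CONSTANT_SCORE_FILTER_REWRITE);

        TopDocs hits = searcher.search(q, 10);
        System.out.println(hits.totalHits + " hits");

        searcher.close();
        reader.close();
        dir.close();
    }
}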

Ganesh wrote:
Does Lucene 2.9 have real-time search? Any improvements in sorting? Any 
facility to store a payload per document (without updating the document)?

Please highlight the important features.

Regards
Ganesh

- Original Message - From: "Michael McCandless" 


To: 
Sent: Friday, December 19, 2008 3:40 AM
Subject: Re: Approximate release date for Lucene 2.9




Well... there are a couple threads on java-dev discussing this "now":

  http://www.nabble.com/2.9-3.0-plan---Java-1.5-td20972994.html
  http://www.nabble.com/2.9,-3.0-and-deprecation-td20099343.html

though they seem to have petered out.

Also we have 29 open issues for 2.9:


https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&&pid=12310110&fixfor=12312682&resolution=-1&sorter/field=priority&sorter/order=DESC 



For 2.4 it took at least a month to whittle the list down to 0.

So it's hard to say?  I'd love to see 2.9 out earlyish next year though.

Mike

Kay Kay wrote:


Hi -
I am just curious - what is the approximate release target date 
that we have for Lucene 2.9 (currently in beta in dev).



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org




-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Send instant messages to your online friends 
http://in.messenger.yahoo.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org




-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Approximate release date for Lucene 2.9

2008-12-18 Thread Mark Miller

Mark Miller wrote:



TrieRangeQuery has been added to contrib. Super awesome, super 
efficient, large scale sorting.


Sorry. It's way past my bedtime. Large-scale numerical range searching. 
Sorting on the brain.



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Approximate release date for Lucene 2.9

2008-12-19 Thread Mark Miller
Right, I was debating throwing that in myself - it's great stuff, but I 
wasn't sure how much of a feature benefit it brings right now. My 
understanding is that its main benefit is along the flexible indexing 
path and using multiple consumers, e.g. it's more setup for the goodness 
yet to come. My understanding is certainly less than yours though :)


- Mark
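
[Editor's note: for readers who haven't seen the AttributeSource-based 
API, a minimal sketch assuming the 2.9 TokenStream interfaces; the input 
text is made up, and the casts reflect the pre-generics 2.9 signatures.]

import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public class AttributeSourceExample {
    public static void main(String[] args) throws Exception {
        TokenStream ts = new WhitespaceTokenizer(
                new StringReader("the new token stream api"));

        // Attributes replace the old Token object: ask the stream for them once...
        TermAttribute term = (TermAttribute) ts.addAttribute(TermAttribute.class);
        OffsetAttribute offset = (OffsetAttribute) ts.addAttribute(OffsetAttribute.class);

        // ...and incrementToken() refills them in place for each token.
        while (ts.incrementToken()) {
            System.out.println(term.term() + " @ "
                    + offset.startOffset() + "-" + offset.endOffset());
        }
        ts.close();
    }
}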

Michael McCandless wrote:


The new extensible TokenStream API (based on AttributeSource) is also 
in 2.9.


Mike

Mark Miller wrote:


Well, look at the issues and see for yourself :)

It's a subjective call, I think. Here's my take:

There are not going to be too many sweeping changes in the next 
release. There are tons of little bug fixes and improvements, but not 
a lot of the bullet-point type stuff that you mention in your 
wishlist. It's a whole lot of little steps forward.

When it comes to sorting, there are a couple of possible goodies coming 
in the next release:

TrieRangeQuery has been added to contrib. Super awesome, super 
efficient, large scale sorting.

Work is ongoing to change searching semantics so that sorting is much 
faster in many cases. In fact, there may be search speed improvements 
across the board in many cases (don't quote me). Sort fieldcache 
loading in the multi-segment case will likely also be *blazingly* 
faster. Also, Filters and FieldCaches may be pushed down to a single 
segment, making reopening sort fieldcaches *much* more efficient. 
That's a nice step towards realtime.

RangeQuery, PrefixQuery and WildcardQuery will all have a 
constant-score mode as well - this avoids BooleanQuery max-clause 
limits and is often much faster on very large indexes.

LocalLucene, a very cool bit of code that allows geo search, might 
make contrib for the next release.

Beyond that, there are a few more little gems, but it's a lot of 
little fixes and improvements more than big features.

Column-stride fields and flexible indexing will not be in the next 
release in my opinion, but a lot of progress towards flexible 
indexing has been made.

Keep in mind that's a biased view of the next release - I worked on 
two of those issues. Be sure to take it all with a healthy grain of 
salt.


- Mark

Ganesh wrote:
Does Lucene 2.9 have real-time search? Any improvements in sorting? 
Any facility to store a payload per document (without updating the 
document)?

Please highlight the important features.

Regards
Ganesh

- Original Message - From: "Michael McCandless" 


To: 
Sent: Friday, December 19, 2008 3:40 AM
Subject: Re: Approximate release date for Lucene 2.9




Well... there are a couple threads on java-dev discussing this "now":

 http://www.nabble.com/2.9-3.0-plan---Java-1.5-td20972994.html
 http://www.nabble.com/2.9,-3.0-and-deprecation-td20099343.html

though they seem to have petered out.

Also we have 29 open issues for 2.9:


https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&&pid=12310110&fixfor=12312682&resolution=-1&sorter/field=priority&sorter/order=DESC 



For 2.4 it took at least a month to whittle the list down to 0.

So it's hard to say?  I'd love to see 2.9 out earlyish next year 
though.


Mike

Kay Kay wrote:


Hi -
I am just curious - what is the approximate release target date 
that we have for Lucene 2.9 (currently in beta in dev).



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org




-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Send instant messages to your online friends 
http://in.messenger.yahoo.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org




-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org




-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org




-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Optimize and Out Of Memory Errors

2008-12-23 Thread Mark Miller

Lebiram wrote:
Also, what are norms?
Norms are a byte value per field, stored in the index, that is factored 
into the score. They are used for length normalization (shorter documents = 
more important) and index-time boosting. If you want either of those, 
you need norms. When norms are loaded up into an IndexReader, they are loaded 
into a byte[maxDoc] array for each field - so even if only one document out 
of 400 million has a field, it's still going to load byte[maxDoc] for 
that field (so a lot of wasted RAM). Did you say you had 400 million 
docs and 7 fields? Google says that would be:

   400 million x 7 bytes = 2,670.29 megabytes

On top of your other RAM usage.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Optimize and Out Of Memory Errors

2008-12-23 Thread Mark Miller

Mark Miller wrote:

Lebiram wrote:
Also, what are norms?
Norms are a byte value per field, stored in the index, that is factored 
into the score. They are used for length normalization (shorter documents = 
more important) and index-time boosting. If you want either of those, 
you need norms. When norms are loaded up into an IndexReader, they are 
loaded into a byte[maxDoc] array for each field - so even if only one 
document out of 400 million has a field, it's still going to load 
byte[maxDoc] for that field (so a lot of wasted RAM). Did you say you 
had 400 million docs and 7 fields? Google says that would be:

   400 million x 7 bytes = 2,670.29 megabytes

On top of your other RAM usage.
Just to avoid confusion, that should really read a byte per document per 
field. If I remember right, it gives 255 boost possibilities, limited to 
25 with length normalization.


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Optimize and Out Of Memory Errors

2008-12-24 Thread Mark Miller
We don't know that those norms are "the" problem. Luke is loading norms if 
it's searching that index. But what else is Luke doing? What else is your 
app doing? I suspect your app requires more RAM than Luke? How much RAM 
do you have, and how much are you allocating to the JVM?


The norms are not necessarily the problem you have to solve - but it 
would appear they are taking up over 2 gig of memory. Unless you have 
some to spare (and it sounds like you may not), it could be a good idea 
to turn them off for particular fields.


- Mark
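
[Editor's note: a minimal sketch of turning norms off for a field, 
assuming the 2.4-era Field API; the field name and value are made up. 
Note that if any document in the index still has norms for a field, the 
full byte[maxDoc] array is allocated anyway.]

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class OmitNormsExample {
    public static void main(String[] args) {
        Document doc = new Document();
        Field tag = new Field("tag", "some analyzed text",
                Field.Store.NO, Field.Index.ANALYZED);
        // No length normalization or index-time boost for this field,
        // so searches won't load a byte[maxDoc] norms array for it.
        tag.setOmitNorms(true);
        doc.add(tag);
    }
}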

Lebiram wrote:

Is there a way to not factor norms data into scoring somehow?

I'm just stumped as to how Luke is able to do a search (with limit) on the docs 
but in my code it just dies with OutOfMemory errors.
How does Luke not allocate these norms?




________
From: Mark Miller 
To: java-user@lucene.apache.org
Sent: Tuesday, December 23, 2008 5:25:30 PM
Subject: Re: Optimize and Out Of Memory Errors

Mark Miller wrote:
  

Lebiram wrote:

Also, what are norms?

Norms are a byte value per field, stored in the index, that is factored into the 
score. They are used for length normalization (shorter documents = more important) 
and index-time boosting. If you want either of those, you need norms. When 
norms are loaded up into an IndexReader, they are loaded into a byte[maxDoc] array 
for each field - so even if only one document out of 400 million has a field, it's 
still going to load byte[maxDoc] for that field (so a lot of wasted RAM). Did 
you say you had 400 million docs and 7 fields? Google says that would be:

   400 million x 7 bytes = 2,670.29 megabytes

On top of your other RAM usage.


Just to avoid confusion, that should really read a byte per document per field. 
If I remember right, it gives 255 boost possibilities, limited to 25 with 
length normalization.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org


  
  



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: about TopFieldDocs

2009-01-05 Thread Mark Miller
Erick Erickson wrote:
> The number of documents
> is irrelevant here, what is relevant is the number of
> distinct terms in your "fieldName" field.
>   
Depending on the size of your index, the number of docs will matter
though. You have to store the unique terms in a String[] array, but you
also store an int[] array the size of maxDoc that indexes into the
unique-terms array. Depending on your index, this could be as much of a
cost as the unique terms, or more.

It doesn't matter how many documents you get back for a
particular search, though - it's just how many docs are in the index.

- Mark
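
[Editor's note: a minimal sketch of inspecting those two arrays, 
assuming the FieldCache API of that era; the index path and field name 
are hypothetical.]

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.FieldCache;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class StringIndexFootprint {
    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.getDirectory("/path/to/index");
        IndexReader reader = IndexReader.open(dir, true);

        // The structure behind sorted (TopFieldDocs) searches: one int per
        // document (order[]) pointing into one slot per unique term (lookup[]).
        FieldCache.StringIndex idx =
                FieldCache.DEFAULT.getStringIndex(reader, "fieldName");

        System.out.println("docs (order[].length)       = " + idx.order.length);
        System.out.println("terms (lookup[].length)     = " + idx.lookup.length);

        reader.close();
        dir.close();
    }
}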

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: ANNOUNCE: Welcome Patrick O'Leary as Contrib Committer

2009-01-16 Thread Mark Miller

Welcome Patrick!

+1 for LocalLucene.

patrick o'leary wrote:

Thanks Folks

I'm in the business well over a decade now. I started my career in my country
of origin, Ireland, and have since lived & worked in the UK and the US. I've
also traveled extensively, establishing development groups in remote offices
for my company in a few countries.

I've worked in several areas - from global publishing services, CRM /
fulfillment systems, and web server development, to technical operations - and
for the past several years have made a home for myself in search and local
search.

My background has been in CS, math and physics.
And despite the rumors my user name "pjaol" is actually an acronym of my
full name, which is only ever used
by my mother when I'm in trouble :-)

It will be a pleasure to continue working with all of you, and thank you
again for this honor.

Thanks
Patrick O'Leary



  

On Jan 16, 2009, at 1:54 PM, Ryan McKinley wrote:

 The PMC is pleased to announce that Patrick O'Leary has been voted to be a


a Lucene-Java Contrib committer.

Patrick has contributed a great foundation for integrating spatial search
with Lucene.  I look forward to future development in this area.

Patrick - traditionally we ask you to send out an introduction to the
community; it's nice for folks to get a sense of who everyone is.  Also
check that your new svn karma works by adding yourself to the list of
contrib committers.

Welcome Patrick!

ryan

  

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org





  



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: term offsets info seems to be wrong...

2009-01-16 Thread Mark Miller
Okay, Koji - hopefully I'll have better luck suggesting this this time.

Have you tried http://issues.apache.org/jira/browse/LUCENE-1448 yet? I am
not sure if it's in an applyable state, but I hope that covers your issue.

On Fri, Jan 16, 2009 at 7:15 PM, Koji Sekiguchi  wrote:

> Hello,
>
> I'm writing a highlighter by using term offsets info (yes, I borrowed
> the idea
> of LUCENE-644). In my highlighter, I'm seeing unexpected term offsets info
> when getting multi-valued field.
>
> For example, if I indexed [" "," bbb "] (multi-valued), I got term info
> bbb(7,10). This is expected result. But if I indexed [" aaa "," bbb "]
> (note that using " aaa " instead of " "), I got term info bbb(6,9)
> which
> is unexpected. I would like to get same offset info for bbb because they
> are same length of field values.
>
> Please use the following program to see the problem I'm seeing. I'm
> using trunk:
>
> public static void main(String[] args) throws Exception {
> // create an index
> Directory dir = new RAMDirectory();
> Analyzer analyzer = new WhitespaceAnalyzer();
> IndexWriter writer = new IndexWriter( dir, analyzer, true,
> MaxFieldLength.LIMITED );
> Document doc = new Document();
> doc.add( new Field( "f", " aaa ", Store.YES, Index.ANALYZED,
> TermVector.WITH_OFFSETS ) );
> //doc.add( new Field( "f", " ", Store.YES, Index.ANALYZED,
> TermVector.WITH_OFFSETS ) );
> doc.add( new Field( "f", " bbb ", Store.YES, Index.ANALYZED,
> TermVector.WITH_OFFSETS ) );
> writer.addDocument( doc );
> writer.close();
>
> // print the offsets
> IndexReader reader = IndexReader.open( dir );
> TermPositionVector tpv = (TermPositionVector)reader.getTermFreqVector(
> 0, "f" );
> for( int i = 0; i < tpv.getTerms().length; i++ ){
> System.out.print( "term = \"" + tpv.getTerms()[i] + "\"" );
> TermVectorOffsetInfo[] tvois = tpv.getOffsets( i );
> for( TermVectorOffsetInfo tvoi : tvois ){
> System.out.println( "(" + tvoi.getStartOffset() + "," +
> tvoi.getEndOffset() + ")" );
> }
> }
> reader.close();
> }
>
> regards,
>
> Koji
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


Re: Group by in Lucene ?

2009-01-28 Thread Mark Miller
Group-by in Lucene/Solr has not been solved in a great general way yet 
to my knowledge.


Ideally, we would want a solution that does not need to fit into memory. 
However, you need the value of the field for each document to do the 
grouping, and as you are finding, this is not cheap to get. Currently, the 
efficient way to get it is to use a FieldCache. This, however, requires 
that every distinct value can fit into memory.


Once you have efficient access to the values, you need to be able to 
efficiently group the results, again without being bounded by memory 
(though we are already memory-bound by the FieldCache).


There are quite a few ways to do this. The simplest is to group until 
you have used all the memory you want; then, for everything left: if a 
document's value doesn't match an existing group, write it to a file, and 
if it does, increment that group's count. Use the overflow file as the 
input for the next run, and repeat until there is no overflow. You can 
improve on that by partitioning the overflow file.
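
[Editor's note: a toy sketch of that simplest method - purely 
illustrative, not Lucene API. It assumes the group keys arrive as lines 
in a file and that the memory budget is expressed as a group count.]

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.HashMap;
import java.util.Map;

public class OverflowGrouper {
    static final int MAX_GROUPS_IN_MEMORY = 1000; // stand-in for a real memory budget

    public static Map<String, Integer> group(File input) throws IOException {
        Map<String, Integer> all = new HashMap<String, Integer>(); // a real impl would flush to disk
        File current = input;
        while (current != null) {
            Map<String, Integer> counts = new HashMap<String, Integer>();
            File overflow = File.createTempFile("group-overflow", ".txt");
            PrintWriter spill = new PrintWriter(new FileWriter(overflow));
            BufferedReader in = new BufferedReader(new FileReader(current));
            boolean spilled = false;
            for (String key; (key = in.readLine()) != null; ) {
                Integer c = counts.get(key);
                if (c != null) {
                    counts.put(key, c + 1);                 // existing group: count it
                } else if (counts.size() < MAX_GROUPS_IN_MEMORY) {
                    counts.put(key, 1);                     // room for a new group
                } else {
                    spill.println(key);                     // defer to the next pass
                    spilled = true;
                }
            }
            in.close();
            spill.close();
            all.putAll(counts);                             // flush this pass's groups
            current = spilled ? overflow : null;            // repeat until no overflow
        }
        return all;
    }
}

[Partitioning the overflow file by key hash, as suggested above, would 
let each subsequent pass scan only the keys it can actually admit.]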


And then there are a dozen other methods.

Solr has a patch in JIRA that uses a sorting method. First the results 
are sorted on the group-by field, then scanned through for grouping - 
all field values that are the same will be next to each other. Finally, 
if you really wanted to sort on a different field, another sort is 
applied. That's not ideal IMO, but it's a start.


- Mark

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org


