Re: Lucene Query

2014-08-19 Thread Tri Cao

Oh sorry guys, ignore what I said. I am going to get myself a coffee. Uwe is 
absolutely correct here.

On Aug 19, 2014, at 01:13 PM, Uwe Schindler  wrote:

Hi,
Look at his docs. He has only 2 docs; the second one has 3 keywords.

I would use a simple phrase query with a slop value smaller than the Analyzer's 
positionIncrementGap. This is the gap between values of fields with the same name. A span or 
phrase query cannot cross the gap if the slop is small enough, but the slop still has to be 
large enough to find the terms next to each other.

SpanQuery is not needed. A phrase query does all that's needed. Slop is like an edit 
distance over whole terms, so order does not matter.
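For illustration, a minimal sketch of the approach Uwe describes (the field name "label", the 
slop value, and an existing IndexSearcher named searcher are assumptions):

    // Sketch: match "united", "states", "america" in any order within one label value,
    // assuming the analyzer's positionIncrementGap (e.g. 100) is larger than the slop.
    PhraseQuery pq = new PhraseQuery();
    pq.add(new Term("label", "united"));
    pq.add(new Term("label", "states"));
    pq.add(new Term("label", "america"));
    pq.setSlop(10); // large enough to reorder/skip terms, smaller than the gap between values
    TopDocs hits = searcher.search(pq, 10);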

Uwe

Am 19. August 2014 22:05:23 MESZ, schrieb Tri Cao :
       >OR operator does that, AND only returns docs with ALL terms present.
       >
       >Note that you have two options here:
       >1. Create a BooleanQuery object (see the Java doc I linked below) and programmatically
       >add the term queries with the following constraint:
       >http://lucene.apache.org/core/4_6_0/core/org/apache/lucene/search/BooleanClause.Occur.html#MUST_NOT
       >
       >2. Use the Lucene classic QueryParser and pass in the query string "states
       >AND america AND united"
       >
       >I would suggest 1) if you are going to learn more about Lucene, and 2)
       >if you just want to get something out quickly.
       >
       >Hope this helps,
       >Tri
       >
       >On Aug 19, 2014, at 12:17 PM, Jin Guang Zheng wrote:
       >
       >Thanks for reply, but won't BooleanQuery return both doc1 and doc2 with
       >query:
       >
       >label:States AND label:America AND label:United
       >
       >Best,
       >Jin
       >
       >
       >On Tue, Aug 19, 2014 at 2:07 PM, Tri Cao wrote:
       >
       >> given that example, the easy way is a boolean AND query of all the terms:
       >>
       >> http://lucene.apache.org/core/4_6_0/core/org/apache/lucene/search/BooleanQuery.html
       >>
       >> However, if your corpus is more sophisticated you'll find that relevance
       >> ranking is not always that trivial :)
       >>
       >> On Aug 19, 2014, at 11:00 AM, Jin Guang Zheng wrote:
       >>
       >> Hi,
       >>
       >> I am wondering if someone can help me on this:
       >>
       >> I have index:
       >>
       >> doc 1 -- label: United States of America
       >>
       >> doc 2 -- label: United
       >> doc 2 -- label: America
       >> doc 2 -- label: States
       >>
       >> I am wondering how to generate a query with terms: states united america
       >>
       >> so only doc 1 returns.
       >>
       >>
       >> I was thinking SpanNearQuery, but can't make it work.
       >>
       >> Thanks,
       >> Jin
       >>
       >>

--
Uwe Schindler
H.-H.-Meier-Allee 63, 28213 Bremen
http://www.thetaphi.de


Re: Lucene Query

2014-08-19 Thread Tri Cao

Whoops, the constraint should be MUST to force all terms present:

http://lucene.apache.org/core/4_6_0/core/org/apache/lucene/search/BooleanClause.Occur.html#MUST
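For illustration, a minimal sketch of the corrected option 1, i.e. a conjunction of term 
queries (the field name "label" and the searcher variable are assumptions):

    // Sketch: all three terms must be present for a document to match.
    BooleanQuery bq = new BooleanQuery();
    bq.add(new TermQuery(new Term("label", "states")), BooleanClause.Occur.MUST);
    bq.add(new TermQuery(new Term("label", "america")), BooleanClause.Occur.MUST);
    bq.add(new TermQuery(new Term("label", "united")), BooleanClause.Occur.MUST);
    TopDocs hits = searcher.search(bq, 10);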

On Aug 19, 2014, at 01:05 PM, "Tri Cao"  wrote:

OR operator does that, AND only returns docs with ALL terms present.

Note that you have two options here:
1. Create a BooleanQuery object (see the Java doc I linked below) and programmatically
add the term queries with the following constraint:
http://lucene.apache.org/core/4_6_0/core/org/apache/lucene/search/BooleanClause.Occur.html#MUST_NOT

2. Use the Lucene classic QueryParser and pass in the query string "states AND america 
AND united"

I would suggest 1) if you are going to learn more about Lucene, and 2) if you 
just want to get something out quickly.

Hope this helps,
Tri

On Aug 19, 2014, at 12:17 PM, Jin Guang Zheng  wrote:

Thanks for reply, but won't BooleanQuery return both doc1 and doc2 with
query:

label:States AND label:America AND label:United

Best,
Jin


On Tue, Aug 19, 2014 at 2:07 PM, Tri Cao  wrote:

       > given that example, the easy way is a boolean AND query of all the terms:
       >
       >
       > http://lucene.apache.org/core/4_6_0/core/org/apache/lucene/search/BooleanQuery.html
       >
       > However, if your corpus is more sophisticated you'll find that relevance
       > ranking is not always that trivial :)
       >
       > On Aug 19, 2014, at 11:00 AM, Jin Guang Zheng wrote:
       >
       > Hi,
       >
       > I am wondering if someone can help me on this:
       >
       > I have index:
       >
       > doc 1 -- label: United States of America
       >
       > doc 2 -- label: United
       > doc 2 -- label: America
       > doc 2 -- label: States
       >
       > I am wondering how to generate a query with terms: states united america
       >
       > so only doc 1 returns.
       >
       >
       > I was thinking SpanNearQuery, but can't make it work.
       >
       > Thanks,
       > Jin
       >
       >


Re: Lucene Query

2014-08-19 Thread Tri Cao

OR operator does that, AND only returns docs with ALL terms present.

Note that you have two options here:
1. Create a BooleanQuery object (see the Java doc I linked below) and programmatically
add the term queries with the following constraint:
http://lucene.apache.org/core/4_6_0/core/org/apache/lucene/search/BooleanClause.Occur.html#MUST_NOT

2. Use the Lucene classic QueryParser and pass in the query string "states AND america 
AND united"

I would suggest 1) if you are going to learn more about Lucene, and 2) if you 
just want to get something out quickly.

Hope this helps,
Tri

On Aug 19, 2014, at 12:17 PM, Jin Guang Zheng  wrote:

Thanks for reply, but won't BooleanQuery return both doc1 and doc2 with
query:

label:States AND label:America AND label:United

Best,
Jin


On Tue, Aug 19, 2014 at 2:07 PM, Tri Cao  wrote:

       > given that example, the easy way is a boolean AND query of all the terms:
       >
       >
       > http://lucene.apache.org/core/4_6_0/core/org/apache/lucene/search/BooleanQuery.html
       >
       > However, if your corpus is more sophisticated you'll find that relevance
       > ranking is not always that trivial :)
       >
       > On Aug 19, 2014, at 11:00 AM, Jin Guang Zheng wrote:
       >
       > Hi,
       >
       > I am wondering if someone can help me on this:
       >
       > I have index:
       >
       > doc 1 -- label: United States of America
       >
       > doc 2 -- label: United
       > doc 2 -- label: America
       > doc 2 -- label: States
       >
       > I am wondering how to generate a query with terms: states united america
       >
       > so only doc 1 returns.
       >
       >
       > I was thinking SpanNearQuery, but can't make it work.
       >
       > Thanks,
       > Jin
       >
       >


Re: Lucene Query

2014-08-19 Thread Tri Cao

given that example, the easy way is a boolean AND query of all the terms:

http://lucene.apache.org/core/4_6_0/core/org/apache/lucene/search/BooleanQuery.html

However, if your corpus is more sophisticated you'll find that relevance 
ranking is not always that trivial :)

On Aug 19, 2014, at 11:00 AM, Jin Guang Zheng  wrote:

Hi,

I am wondering if someone can help me on this:

I have index:

doc 1 -- label: United States of America

doc 2 -- label: United
doc 2 -- label: America
doc 2 -- label: States

I am wondering how to generate a query with terms: states united america

so only doc 1 returns.


I was thinking SpanNearQuery, but can't make it work.

Thanks,
Jin


Re: Calculate Term Frequency

2014-08-19 Thread Tri Cao

Erick, Solr's termfreq implementation also uses DocsEnum, with the assumption that 
freq is called on ascending doc IDs, which is valid when scoring from the hit list. If freq is 
requested for an out-of-order doc, a new DocsEnum has to be created.

Bianca, can you explain your use case in more detail? What do you mean by 
having a new document? A new document is added to the index? Then you already have to reopen 
the searcher/reader anyway to get a new DocsEnum.
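For reference, a rough sketch of the DocsEnum lookup being discussed (the field and term names, 
the docId variable, and an open IndexReader named reader are assumptions):

    // Sketch: term frequency of "lucene" in field "content" for a single document.
    Term term = new Term("content", "lucene");
    DocsEnum de = MultiFields.getTermDocsEnum(reader, MultiFields.getLiveDocs(reader),
        term.field(), term.bytes());
    int freq = 0;
    if (de != null && de.advance(docId) == docId) {
      // freq() is only valid after positioning on the doc; advancing works forward only,
      // so out-of-order doc IDs need a fresh DocsEnum as noted above.
      freq = de.freq();
    }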

On Aug 19, 2014, at 08:26 AM, Erick Erickson  wrote:

Hmmm, I'm not at all an expert here, but Solr has a function
query "termfreq" that does what you're doing I think? I wonder
if the code for that function query would be a good place to
copy (or even make use of)? See TermFreqValueSource...

Maybe not helpful at all, but...
Erick

On Tue, Aug 19, 2014 at 7:04 AM, Bianca Pereira  
wrote:
       > Hi everybody,
       >
       > I would like to know your suggestions to calculate Term Frequency in a
       > Lucene document. Currently I am using MultiFields.getTermDocsEnum,
       > iterating through the DocsEnum 'de' returned and getting the frequency 
with
       > de.freq() for the desired document.
       >
       > My solution gives me the result I want but I am having time issues. For
       > instance, I want to calculate the term frequency for a given term for N
       > documents in a sequence. Then, every time I have a new document I have 
to
       > retrieve exactly the same DocsEnum again and iterate until I find the
       > document I want. Of course I cannot cache DocsEnum (yes, I did this 
huge
       > mistake) because it is an iterator.
       >
       > Do you have any suggestions on how I can get Term Frequency in a fast way?
       > The only suggestion I have had up to now was "Do it programmatically,
       > don't use Lucene". Should this be the solution?
       >
       > Thank you.
       >
       > Regards,
       > Bianca Pereira

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: deleteDocument with NRT

2014-07-14 Thread Tri Cao

Solr has the notion of "soft commit" and "hard commit". A soft commit
means Solr will reopen a new searcher. A hard commit means a flush to disk.
All the update/delete logic is in Lucene; Solr doesn't maintain deleted
doc IDs. It does maintain its own caches though.

On Jul 14, 2014, at 03:09 AM, Ganesh  wrote:


How does Solr handle this scenario? Is it reopening the reader after every 
delete, or does it maintain the list of deleted documents in a cache?


Regards
Ganesh

On 7/11/2014 4:00 AM, Tri Cao wrote:
       > You need to reopen your searcher after deleting. From Java doc for 
       > SearcherManager:

       >
       > In addition you should periodically call maybeRefresh.
       > While it's possible to call this just before running each query, this 
       > is discouraged since it penalizes the unlucky queries that do the 
       > reopen. It's better to use a separate background thread, that 
       > periodically calls maybeReopen. Finally, be sure to call close 
       > once you are done.

       >
       >
       >
       > On Jul 10, 2014, at 01:56 PM, Jamie  wrote:
       >
       >        > Hi
       >        >
       >        > I am using NRT search with the SearcherManager class. When the user
       >        > elects to delete some documents, writer.deleteDocuments(terms) is called.
       >        >
       >        > The problem is that deletes are not immediately visible. What does it
       >        > take to make them so? Even after calling commit(), the deleted
       >        > documents are still returned.
       >        >
       >        > What is the recommended way to obtain a near realtime search result that
       >        > immediately reflect all deleted documents?
       >        >
       >        > Much appreciate
       >        >
       >        > Jamie
       >        >
       >        > -
       >        > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
       >        > For additional commands, e-mail: java-user-h...@lucene.apache.org
       >        >


Finding words not followed by other words

2014-07-11 Thread Tri Cao

This is actually a tough problem in general: word sense disambiguation. In your case, I think 
it's more that you'll need to do some named entity resolution to differentiate 
"George Washington" from "George Washington Carver", as they are two different 
entities.

Do you have a list of all the entity names in your corpus (either manually curated or built by 
some pattern matching)? If you do, one thing you can do is to write a tokenizer that emits one 
token for each entity. So, for example, the string "George Washington" emits a token like 
_George_Washington_, "George Washington Carver" emits _George_Washington_Carver_, etc.

There are some open source NLP libraries that do this, but the quality 
varies, as it will most likely depend on your domain and training data set.

Hope this helps,
Tri

On Jul 11, 2014, at 07:20 AM, Michael Ryan  wrote:

I'm trying to solve the following problem...

I have 3 documents that contain the following contents:
1: "George Washington Carver blah blah blah."
2: "George Washington blah blah blah."
3: "George Washington Carver blah blah blah. George Washington blah blah blah."

I want to create a query that matches documents 2 and 3, but not 1. That is, I want to find documents that 
mention "George Washington". It's okay if they also mention "George Washington Carver", 
but I don't want documents that only mention "George Washington Carver". So simply doing something 
like this does not solve it:
"George Washington" NOT "George Washington Carver"

Is there a Query type that does this out of the box? I've looked at the various 
types of span queries, but none of them seem to do this. I think it should be 
theoretically possible given the position data that Lucene stores...

-Michael


Re: deleteDocument with NRT

2014-07-10 Thread Tri Cao

You need to reopen your searcher after deleting. From Java doc for 
SearcherManager:

In addition you should periodically call maybeRefresh. While it's possible to 
call this just before running each query, this is discouraged since it 
penalizes the unlucky queries that do the reopen. It's better to use a separate 
background thread, that periodically calls maybeReopen. Finally, be sure to 
call close once you are done.
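A rough sketch of that flow (assuming the SearcherManager was opened over the same IndexWriter, 
which is what makes deletes visible without a full commit; the id field and value are 
assumptions):

    // Sketch: delete, then refresh the NRT searcher so the delete becomes visible.
    writer.deleteDocuments(new Term("id", "42"));
    searcherManager.maybeRefresh();      // or maybeRefreshBlocking() if the caller must see it now
    IndexSearcher searcher = searcherManager.acquire();
    try {
      TopDocs hits = searcher.search(new TermQuery(new Term("id", "42")), 1);
      // hits.totalHits should now be 0 for the deleted document
    } finally {
      searcherManager.release(searcher);
    }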


On Jul 10, 2014, at 01:56 PM, Jamie  wrote:

Hi

I am using NRT search with the SearcherManager class. When the user 
elects to delete some documents, writer.deleteDocuments(terms) is called.


The problem is that deletes are not immediately visible. What does it 
take to make them so? Even after calling commit(), the deleted 
documents are still returned.


What is the recommended way to obtain a near realtime search result that 
immediately reflect all deleted documents?


Much appreciate

Jamie





-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: How to handle words that stem to stop words

2014-07-07 Thread Tri Cao

I think emitting two tokens for "vans" is the right (potentially the only) way to 
do it. You could also control the dictionary of terms that require this special treatment.

Is there any reason you're not happy with this approach?
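For illustration, a rough sketch of the two-token idea using stock filters (the version 
constant, tokenizer choice, and Dutch stemmer name are assumptions, not a drop-in for the 
poster's WordDelimiterFilter chain):

    // Sketch: index both the original token ("vans") and its stem ("van").
    Analyzer analyzer = new Analyzer() {
      @Override
      protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        Tokenizer source = new StandardTokenizer(Version.LUCENE_47, reader);
        TokenStream stream = new LowerCaseFilter(Version.LUCENE_47, source);
        stream = new KeywordRepeatFilter(stream);         // emit each token twice, one marked as keyword
        stream = new SnowballFilter(stream, "Dutch");     // the keyword-marked copy is left unstemmed
        stream = new RemoveDuplicatesTokenFilter(stream); // drop the duplicate when stemming was a no-op
        return new TokenStreamComponents(source, stream);
      }
    };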

On Jul 06, 2014, at 11:48 AM, Arjen van der Meijden  
wrote:

Hello list,

We have a fairly large Lucene database for a 30+ million post forum. 
Users post and search for all kinds of things. To make sure users don't 
have to type exact matches, we combine a WordDelimiterFilter with a 
(Dutch) SnowballFilter.


Unfortunately users sometimes find examples of words that get stemmed to 
a word that's basically a stop word. Or conversely, where a very common 
word is stemmed so that it becomes the same as a rare word.


We do index stop words, so theoretically they could still find their 
result. But when a rare word is stemmed in such a way it yields a 
million hits, that makes it very unusable...


One example is the Dutch word 'van' which is the equivalent of 'of' in 
English. A user tried to search for the shoe brand 'vans', which gets 
stemmed to 'van' and obviously gives useless results.


I already noticed the 'KeywordRepeatFilter' to index/search both 'vans' 
and 'van' and the StemmerOverrideFilter to try and prevent these cases. 
Are there any other solutions for these kinds of problems?


Best regards,

Arjen van der Meijden

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Can Lucene based application be made to work with Scaled Elastic Beanstalk environemnt on Amazon Web Services

2014-06-27 Thread Tri Cao

I would just use S3 as a data push mechanism. In your servlet's init(), you
could download the index from S3 and unpack it to a local directory, then
initialize your Lucene searcher to that directory. 
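A rough sketch of that init() flow; downloadIndexFromS3 is a hypothetical helper standing in 
for whatever S3 client call you use, and the bucket/key names are assumptions:

    // Sketch: pull the index to local disk once at startup, then search it locally.
    public class SearchServlet extends HttpServlet {
      private IndexSearcher searcher;

      @Override
      public void init() throws ServletException {
        try {
          File localIndexDir = new File(System.getProperty("java.io.tmpdir"), "lucene-index");
          // hypothetical helper that copies s3://my-bucket/index/ to localIndexDir via the AWS SDK
          downloadIndexFromS3("my-bucket", "index/", localIndexDir);
          DirectoryReader reader = DirectoryReader.open(FSDirectory.open(localIndexDir));
          searcher = new IndexSearcher(reader);
        } catch (IOException e) {
          throw new ServletException(e);
        }
      }
    }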

Downloading from S3 to EC2 instances is free, and 5G would take a minute or two.
Also, if you pack the index inside your war file, the new instance has to download 
that data anyway.

The big advantage is it also allows you to update your index without repacking your 
deployment .war. Just upload the new index to the same location in S3, then restart 
your webapp :)

Hope this helps,
Tri

On Jun 27, 2014, at 04:13 AM, Paul Taylor  wrote:

Hi

I have a simple WAR-based web application that uses Lucene-created 
indexes to provide search results in an XML format.
It works fine locally, but I want to deploy it using Elastic Beanstalk 
within Amazon Web Services.


Problem 1 is that the WAR definition doesn't seem to provide a location for 
data files (rather than config files), so when I deploy the WAR with EB 
it doesn't work at first because it has no access to the data (Lucene 
indexes). However, I solved this by connecting to the underlying EC2 
instance, copying the Lucene indexes from S3 to the instance, and 
ensuring the file location is defined in the WAR's web.xml file.


Problem 2 is more problematic. I'm looking at AWS and EB because I wanted 
a way to deploy the application with little ongoing admin overhead and I 
like the way EB does load balancing and auto scaling for you, starting 
and stopping additional instances as required to meet demand. However 
these automatically started instances will not have access to the index 
files.


Possible solutions could be

1. Is there a location where I can store the data index within the WAR itself? 
The index is only 5GB, so I do have space on my root disk to store the 
indexes in the WAR if there is a way to use them. Tomcat would also need 
to unwar the file at deployment; I can't see whether Tomcat on AWS does this.


2. A way for EC2 instances to be started with the data preloaded in some way

(BTW Im aware of CloudSearch but its not an avenue I want to go down)

Does anybody have any experience of this, please?

Paul



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: search performance

2014-06-02 Thread Tri Cao

This is an interesting performance problem and I think there is probably not
a single answer here, so I'll just layout the steps I would take to tackle this:

1. What is the variance of the query latency? You said the average is 5 minutes,
but is it due to some really bad queries or most queries have the same perf?

2. We kind of assume that index size and number of docs is the issue here.
Can you validate that assumption by trying to index with 10M, 50M, … docs
and see how much worse the performance gets as a function of size?

3. What is the average number of hits for the bad queries? If your queries match
a lot of hits, scoring will be very expensive. While you only ask for the 1000 top
scored docs, Lucene still needs to score all the hits to get those 1000 docs.
If this is the case, there could be some workarounds, but let's make sure
that it's indeed the situation we are dealing with here.

Hope this helps,
Tri

On Jun 01, 2014, at 11:50 PM, Jamie  wrote:

Greetings

Despite following all the recommended optimizations (as described at 
http://wiki.apache.org/lucene-java/ImproveSearchingSpeed) , in some of 
our installations, search performance has reached the point where it is 
unacceptably slow. For instance, in one environment, the total index 
size is 200GB, with 150 million documents indexed. With NRT enabled, 
search speed is roughly 5 minutes on average. The server resources are: 
2x6 Core Intel CPU, 128GB, 2 SSD for index and RAID 0, with Linux.


The only thing we haven't yet done, is to upgrade Lucene from 4.7.x to 
4.8.x. Is this likely to make any noticeable difference in performance?


Clearly, longer term, we need to move to a distributed search model. We 
thought to take advantage of the distributed search features offered in 
Solr, however, our solution is very tightly integrated into Lucene 
directly (since Solr didn't exist when we started out). Moving to Solr 
now seems like a daunting prospect. We've also been following the Katta 
project with interest, but it doesn't appear to support distributed 
indexing, and development on it seems to have stalled. It would be nice 
if there were a distributed search project on the Lucene level that we 
could use.


I realize this is a rather vague question, but are there any further 
suggestions on ways to improve search performance? We need cheap and 
dirty ideas, as well as longer term advice on a possible path forward.


Much appreciate

Jamie

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: maxDoc/numDocs int fields

2014-03-21 Thread Tri Cao
I ran into this issue before, and after some digging I don't think there is an easy way to 
accommodate long IDs in Lucene. So I decided to go with sharding documents into multiple 
indexes. It turned out to be a good decision in my case because I would have to shard the 
index anyway for performance reasons. (There are queries that require collecting and scoring 
a large portion of the index.)

On Mar 21, 2014, at 09:41 AM, Artem Gayardo-Matrosov  wrote:

Hi Oli,

Thanks for your reply.

I thought about this, but it feels like making a crude, inefficient implementation of what's 
already in Lucene -- CompositeReader, isn't it? It would involve writing my 
CompositeCompositeReader which would forward the requests to the underlying CompositeReader...

Is there a better way?

Thanks,
Artem.

On Fri, Mar 21, 2014 at 6:33 PM, Oliver Christ  wrote:

       > Can you split your corpus across multiple Lucene instances?
       >
       > Cheers, Oli
       >
       > -Original Message-
       > From: Artem Gayardo-Matrosov [mailto:ar...@gayardo.com]
       > Sent: Friday, March 21, 2014 12:29 PM
       > To: java-user@lucene.apache.org
       > Subject: maxDoc/numDocs int fields
       >
       > Hi all,
       >
       > I am using lucene to index a large corpus of text, with every word being a
       > separate document (this is something I cannot change), and I am hitting a
       > limitation of the CompositeReader only supporting Integer.MAX_VALUE
       > documents.
       >
       > Is there any way to work around this limitation? For the moment I have
       > implemented my own DirectoryReader and BaseCompositeReader to at least make
       > them support documents from Integer.MIN_VALUE to -1 (for twice more
       > documents supported), the problem is that all the APIs are restricted to
       > use the int type and after the docID value wraps back to 0, I have no way
       > to restore the original docID.
       >

-- 
Artem.
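For illustration, a rough sketch of the sharding idea mentioned at the top of this message 
(the shard count, routing scheme, and pre-opened writers are assumptions):

    // Sketch: route each document to one of N independent indexes to stay under the 2B doc limit.
    int numShards = 16;
    IndexWriter[] shards = new IndexWriter[numShards]; // one writer per shard directory, opened elsewhere

    void index(long externalId, Document doc) throws IOException {
      int shard = (int) (((externalId % numShards) + numShards) % numShards);
      shards[shard].addDocument(doc);
    }
    // At query time, search each shard's IndexSearcher separately and merge the top hits yourself.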

Re: How to search for terms containing negation

2014-03-17 Thread Tri Cao
StandardAnalyzer has a constructor that takes a stop word set, so I guess you can pass it an 
empty set:

http://lucene.apache.org/core/4_6_1/analyzers-common/org/apache/lucene/analysis/standard/StandardAnalyzer.html#StandardAnalyzer(org.apache.lucene.util.Version, org.apache.lucene.analysis.util.CharArraySet)

QueryParser is probably ok. I rarely use this parser, but I don't think it recognizes "not" in 
its grammar.

Hope this helps,
Tri

On Mar 17, 2014, at 12:46 PM, Natalia Connolly  wrote:

Hi Tri,

Thank you so much for your message! Yes, it looks like the negation terms have indeed been 
filtered out; when I query on "no" or "not", I get no results. I am just using StandardAnalyzer 
and the classic QueryParser:

Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_47);
QueryParser parser = new QueryParser(Version.LUCENE_47, field, analyzer);

Which analyzer/parser would you recommend?

Thank you again,

Natalia

On Mon, Mar 17, 2014 at 3:35 PM, Tri Cao <tm...@me.com> wrote:

Natalia,

First make sure that your analyzers (both index and query analyzers) do not filter out these 
as stop words. I think the standard StopFilter list has "no" and "not". You can try to see if 
your index has these terms by querying for "no" as a TermQuery. If there is no match for that 
query, then you know for sure they have been filtered out.

The next thing to check is your query parser. What query parser are you using? Some parsers 
actually understand the "not" term and rewrite it to a negation query.

Hope this helps,
Tri

On Mar 17, 2014, at 12:02 PM, Natalia Connolly <natalia.v.conno...@gmail.com> wrote:

Hi All,

Is there any way I could construct a query that would not automatically exclude negation 
terms (such as "no", "not", etc)? For example, I need to find strings like "not happy", "no 
idea", "never available". I tried using a simple analyzer with combinations such as "not AND 
happy", and similar patterns, but it does not work.

Any help would be appreciated!

Natalia
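For reference, a minimal sketch of the empty-stop-set constructor Tri mentions above (assuming 
Lucene 4.7 to match the snippet in the thread; the same analyzer must also be used at index 
time, otherwise the stop words are already gone from the index):

    // Sketch: keep "no"/"not" etc. by giving StandardAnalyzer an empty stop word set.
    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_47, CharArraySet.EMPTY_SET);
    QueryParser parser = new QueryParser(Version.LUCENE_47, field, analyzer);
    Query q = parser.parse("\"not happy\"");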

Re: How to search for terms containing negation

2014-03-17 Thread Tri Cao
Natalia,

First make sure that your analyzers (both index and query analyzers) do not filter out these 
as stop words. I think the standard StopFilter list has "no" and "not". You can try to see if 
your index has these terms by querying for "no" as a TermQuery. If there is no match for that 
query, then you know for sure they have been filtered out.

The next thing to check is your query parser. What query parser are you using? Some parsers 
actually understand the "not" term and rewrite it to a negation query.

Hope this helps,
Tri

On Mar 17, 2014, at 12:02 PM, Natalia Connolly  wrote:

Hi All,

Is there any way I could construct a query that would not automatically exclude negation 
terms (such as "no", "not", etc)? For example, I need to find strings like "not happy", "no 
idea", "never available". I tried using a simple analyzer with combinations such as "not AND 
happy", and similar patterns, but it does not work.

Any help would be appreciated!

Natalia

Re: IndexWriter croaks on large file

2014-02-19 Thread Tri Cao
John,

Sure you can add identical documents to the index if you like. I don't think Lucene requires a 
unique ID field, only Solr does. Lucene documents have internal doc IDs auto generated when 
indexing or merging index segments.

If I remember correctly, Lucene 4.1 started doing cross document compression, so if you could 
manage to index similar documents in the same chunk, it may help to reduce your stored fields.

Hope this helps,
Tri

On Feb 19, 2014, at 04:51 AM, John Cecere  wrote:

Thanks Tri. I've tried a variation of the approach you suggested here and it appears to work 
well. Just one question. Will there be a problem with adding multiple Document objects to the 
IndexWriter that have the same field names and values for the StoredFields? They all have 
different TextFields (the content). I've tried doing this and haven't found any problems with 
it, but I'm just wondering if there's anything I should be aware of.

Regards,
John

On 2/14/14 4:37 PM, Tri Cao wrote:

As docIDs are ints too, it's most likely he'll hit the limit of 2B documents per index with 
that approach though :)

I do agree that indexing huge documents doesn't seem to have a lot of value; even when you 
know a doc is a hit for a certain query, how are you going to display the results to users?

John, for huge data sets, it's usually a good idea to roll your own distributed indexes, and 
model your data schema very carefully. For example, if you are going to index log files, one 
reasonable idea is to make every 5 minutes of logs a document.

Regards,
Tri

On Feb 14, 2014, at 01:20 PM, Glen Newton <glen.new...@gmail.com> wrote:

You should consider making each _line_ of the log file a (Lucene) document (assuming it is a 
log-per-line log file)

-Glen

On Fri, Feb 14, 2014 at 4:12 PM, John Cecere <john.cec...@oracle.com> wrote:

I'm not sure in today's world I would call 2GB 'immense' or 'enormous'. At any rate, I don't 
have control over the size of the documents that go into my database. Sometimes my customer's 
log files end up really big. I'm willing to have huge indexes for these things.

Wouldn't just changing from int to long for the offsets solve the problem? I'm sure it would 
probably have to be changed in a lot of places, but why impose such a limitation? Especially 
since it's using an InputStream and only dealing with a block of data at a time.

I'll take a look at your suggestion.

Thanks,
John

On 2/14/14 3:20 PM, Michael McCandless wrote:

Hmm, why are you indexing such immense documents?

In 3.x Lucene never sanity checked the offsets, so we would silently index negative (int 
overflow'd) offsets into e.g. term vectors.

But in 4.x, we now detect this and throw the exception you're seeing, because it can lead to 
index corruption when you index the offsets into the postings.

If you really must index such enormous documents, maybe you could create a custom tokenizer 
(derived from StandardTokenizer) that "fixes" the offsets before setting them? Or maybe just 
doesn't even set them.

Note that position can also overflow, if your documents get too large.

Mike McCandless
http://blog.mikemccandless.com

On Fri, Feb 14, 2014 at 1:36 PM, John Cecere <john.cec...@oracle.com> wrote:

I'm having a problem with Lucene 4.5.1. Whenever I attempt to index a file >2GB in size, it 
dies with the following exception:

java.lang.IllegalArgumentException: startOffset must be non-negative, and endOffset must be 
>= startOffset, startOffset=-2147483648, endOffset=-2147483647

Essentially, I'm doing this:

Directory directory = new MMapDirectory(indexPath);
Analyzer analyzer = new StandardAnalyzer();
IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_45, analyzer);
IndexWriter iw = new IndexWriter(directory, iwc);

InputStream is = ;
InputStreamReader reader = new InputStreamReader(is);

Document doc = new Document();
doc.add(new StoredField("fileid", fileid));
doc.add(new StoredField("pathname", pathname));
doc.add(new TextField("content", reader));
iw.addDocument(doc);

It's the IndexWriter addDocument method that throws the exception. In looking at the Lucene 
source code, it appears that the offsets being used internally are int, which makes it 
somewhat obvious why this is happening.

This issue never happened when I used Lucene 3.6.0. 3.6.0 was perfectly capable of handling a 
file over 2GB in this manner. What has changed and how do I get around this? Is Lucene no 
longer capable of handling files this large, or is there some other way I should be doing 
this?

Here's the full stack trace sans my code:

java.lang.IllegalArgumentException: startOffset must be non-negative, and endOffset must be 
>= startOffset, startOffset=-2147483648, endOffset=-2147483647
at org.apache.lucene.analysis.tokenattributes.OffsetAttributeImpl.set

Re: IndexWriter croaks on large file

2014-02-14 Thread Tri Cao
As docIDs are ints too, it's most likely he'll hit the limit of 2B documents per index with 
that approach though :)

I do agree that indexing huge documents doesn't seem to have a lot of value; even when you 
know a doc is a hit for a certain query, how are you going to display the results to users?

John, for huge data sets, it's usually a good idea to roll your own distributed indexes, and 
model your data schema very carefully. For example, if you are going to index log files, one 
reasonable idea is to make every 5 minutes of logs a document.

Regards,
Tri

On Feb 14, 2014, at 01:20 PM, Glen Newton  wrote:

You should consider making each _line_ of the log file a (Lucene) document (assuming it is a 
log-per-line log file)

-Glen

On Fri, Feb 14, 2014 at 4:12 PM, John Cecere  wrote:

I'm not sure in today's world I would call 2GB 'immense' or 'enormous'. At any rate, I don't 
have control over the size of the documents that go into my database. Sometimes my customer's 
log files end up really big. I'm willing to have huge indexes for these things.

Wouldn't just changing from int to long for the offsets solve the problem? I'm sure it would 
probably have to be changed in a lot of places, but why impose such a limitation? Especially 
since it's using an InputStream and only dealing with a block of data at a time.

I'll take a look at your suggestion.

Thanks,
John

On 2/14/14 3:20 PM, Michael McCandless wrote:

Hmm, why are you indexing such immense documents?

In 3.x Lucene never sanity checked the offsets, so we would silently index negative (int 
overflow'd) offsets into e.g. term vectors.

But in 4.x, we now detect this and throw the exception you're seeing, because it can lead to 
index corruption when you index the offsets into the postings.

If you really must index such enormous documents, maybe you could create a custom tokenizer 
(derived from StandardTokenizer) that "fixes" the offsets before setting them? Or maybe just 
doesn't even set them.

Note that position can also overflow, if your documents get too large.

Mike McCandless
http://blog.mikemccandless.com

On Fri, Feb 14, 2014 at 1:36 PM, John Cecere  wrote:

I'm having a problem with Lucene 4.5.1. Whenever I attempt to index a file >2GB in size, it 
dies with the following exception:

java.lang.IllegalArgumentException: startOffset must be non-negative, and endOffset must be 
>= startOffset, startOffset=-2147483648, endOffset=-2147483647

Essentially, I'm doing this:

Directory directory = new MMapDirectory(indexPath);
Analyzer analyzer = new StandardAnalyzer();
IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_45, analyzer);
IndexWriter iw = new IndexWriter(directory, iwc);

InputStream is = ;
InputStreamReader reader = new InputStreamReader(is);

Document doc = new Document();
doc.add(new StoredField("fileid", fileid));
doc.add(new StoredField("pathname", pathname));
doc.add(new TextField("content", reader));
iw.addDocument(doc);

It's the IndexWriter addDocument method that throws the exception. In looking at the Lucene 
source code, it appears that the offsets being used internally are int, which makes it 
somewhat obvious why this is happening.

This issue never happened when I used Lucene 3.6.0. 3.6.0 was perfectly capable of handling a 
file over 2GB in this manner. What has changed and how do I get around this? Is Lucene no 
longer capable of handling files this large, or is there some other way I should be doing 
this?

Here's the full stack trace sans my code:

java.lang.IllegalArgumentException: startOffset must be non-negative, and endOffset must be 
>= startOffset, startOffset=-2147483648, endOffset=-2147483647
at org.apache.lucene.analysis.tokenattributes.OffsetAttributeImpl.setOffset(OffsetAttributeImpl.java:45)
at org.apache.lucene.analysis.standard.StandardTokenizer.incrementToken(StandardTokenizer.java:183)
at org.apache.lucene.analysis.standard.StandardFilter.incrementToken(StandardFilter.java:49)
at org.apache.lucene.analysis.core.LowerCaseFilter.incrementToken(LowerCaseFilter.java:54)
at org.apache.lucene.analysis.util.FilteringTokenFilter.incrementToken(FilteringTokenFilter.java:82)
at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:174)
at org.apache.lucene.index.DocFieldProcessor.processDocument(DocFieldProcessor.java:248)
at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:254)
at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:446)
at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1551)
at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1221)
at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1202)

Thanks,
John

--
John Cecere
Principal Engineer - Oracle Corporation
732-987-4317 / john.cec...@oracle.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

Re: Collector is collecting more than the specified hits

2014-02-14 Thread Tri Cao
If I understand correctly, you'd like to shortcut the execution when you reach the desired 
number of hits. Unfortunately, I don't think there's a graceful way to do that right now in 
Collector. To stop further collecting, you need to throw an IOException (or a subtype of it) 
and catch the exception later in your code.

Regards,
Tri

On Feb 14, 2014, at 09:36 AM, saisantoshi  wrote:

I am not interested in the scores at all. My requirement is simple, I only need the first 100 
hits or the numHits I specify (irrespective of their scores). The collector should stop after 
collecting the numHits specified. Is there a way to tell the collector to stop after 
collecting the numHits?

Please correct me if I am wrong. I am trying to do the following:

public void collect(int doc) throws IOException {
  if (collector.getTotalHits() <= maxHits) { // stop collecting once getTotalHits exceeds numHits
    delegate.collect(doc);
  }
}

I have to write a separate collector extending the Collector because I am not able to get the 
call to getTotalHits() if I am using PositiveScoresOnlyCollector.

--
View this message in context: http://lucene.472066.n3.nabble.com/Collector-is-collecting-more-than-the-specified-hits-tp4117329p4117441.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
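For illustration, a rough sketch of the throw-to-stop idea Tri describes at the top of this 
message (the class name, exception message, and maxHits handling are assumptions):

    // Sketch: stop collecting after maxHits by throwing; the caller catches the exception
    // around IndexSearcher.search() and then reads the delegate's results as usual.
    class MaxHitsCollector extends Collector {
      private final Collector delegate;
      private final int maxHits;
      private int count = 0;

      MaxHitsCollector(Collector delegate, int maxHits) {
        this.delegate = delegate;
        this.maxHits = maxHits;
      }

      @Override public void setScorer(Scorer scorer) throws IOException { delegate.setScorer(scorer); }
      @Override public void setNextReader(AtomicReaderContext context) throws IOException { delegate.setNextReader(context); }
      @Override public boolean acceptsDocsOutOfOrder() { return delegate.acceptsDocsOutOfOrder(); }

      @Override public void collect(int doc) throws IOException {
        if (count >= maxHits) {
          throw new IOException("collected enough hits"); // caught by the caller to end the search early
        }
        delegate.collect(doc);
        count++;
      }
    }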

Re: incrementally indexing

2012-07-05 Thread Tri Cao
If you want to index your hard drive, you'll need to keep a copy
of the current file system's directory/file structure. Otherwise, you
won't be able to remove from your index the files that have been deleted.
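For illustration, a rough sketch of such an incremental pass (written against the Lucene 4.x 
field classes for brevity; 3.6 uses the older Field constructors. The field names, the use of 
the path as the unique key, and the changed/deleted file lists are assumptions):

    // Sketch: add or update changed files and delete removed ones, instead of rebuilding the index.
    for (File f : changedOrNewFiles) {
      Document doc = new Document();
      doc.add(new StringField("path", f.getPath(), Field.Store.YES));
      doc.add(new LongField("modified", f.lastModified(), Field.Store.YES));
      doc.add(new TextField("contents", new FileReader(f)));
      writer.updateDocument(new Term("path", f.getPath()), doc); // replaces any previous version
    }
    for (String deletedPath : pathsNoLongerOnDisk) {
      writer.deleteDocuments(new Term("path", deletedPath));
    }
    writer.commit();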


On Jul 5, 2012, at 12:18 PM, Erick Erickson wrote:

> Hmmm, it's not quite clear what the problem is. But let's
> say you have indexed your hard drive. Somewhere you'll
> have to keep a record of what you've done, say the timestamp
> when you started looking at your hard drive to index it.
> 
> Next time you run, you simply only index files that have changed
> since the last timestamp, assuming you want any changed
> documents on your disk to reflect those changes. That's usually
> what's meant by "incremental indexing", you only add new/changed
> data to your index.
> 
> Hope that helps
> Erick
> 
> On Wed, Jul 4, 2012 at 7:09 AM,   wrote:
>> Hello,
>> 
>> First ask your pardon for my poor English.
>> 
>> I am making an application in Java using Lucene 3.6 for indexing the hard
>> drive, and I've read that you can index incrementally, but I can't see how
>> to enable that option, because every time I index the hard disk it overwrites
>> the existing index and rebuilds it from scratch, with the consequent expenditure of
>> time in redoing the indexing.
>> 
>> If someone could help me.
>> 
>> Regards and thanks in advance
>> 
>> 
>> -
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>> 
> 
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
> 


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: custom scoring

2012-04-08 Thread Tri Cao
Hi,

After reading through the IndexSearcher code, it seems I have to do the following:

- implement a custom Collector to collect not just the doc IDs and score, but the fields I 
  care about as well
- extend ScoreDoc to hold the extra fields
- when I get back a TopDocs from a search() call, go through the TopDocs and apply the 
  constraints I need to

I think this will work, but I have some concern about performance. What would you think?

Thanks,
Tri.

On Apr 06, 2012, at 10:06 AM, Tri Cao  wrote:

Hi all,

What would be the best approach for a custom scoring that requires a "global" view of the 
result set? For example, I have a field called "color" and I would like to have constraints 
that there are at most 3 docs with color:red and 4 docs with color:blue in the first 16 hits. 
And the items should still be sorted by their relevance scores after the constraints are 
applied.

Thanks,
Tri.
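For illustration, a rough sketch of the post-processing step described at the top of this 
message (the cap values are assumptions, and FieldScoreDoc is the hypothetical ScoreDoc 
subclass holding the collected "color" value):

    // Sketch: keep relevance order but cap how many docs of each color enter the first 16 results.
    Map<String, Integer> caps = new HashMap<String, Integer>();
    caps.put("red", 3);
    caps.put("blue", 4);

    List<FieldScoreDoc> page = new ArrayList<FieldScoreDoc>();
    Map<String, Integer> used = new HashMap<String, Integer>();
    for (ScoreDoc sd : topDocs.scoreDocs) {          // already sorted by score
      FieldScoreDoc fsd = (FieldScoreDoc) sd;        // hypothetical subclass carrying the color field
      Integer cap = caps.get(fsd.color);
      int soFar = used.containsKey(fsd.color) ? used.get(fsd.color) : 0;
      if (cap != null && soFar >= cap) {
        continue;                                    // constraint hit: skip this doc, keep scanning
      }
      used.put(fsd.color, soFar + 1);
      page.add(fsd);
      if (page.size() == 16) break;
    }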

custom scoring

2012-04-06 Thread Tri Cao
Hi all,What would be the best approach for a custom scoring that requires a "global" view of the result set. For example, I have a field call "color" and I would like to have constraints that there are at most 3 docs with color:red, 4 docs with color:blue in the first 16 hits. And the items should still be sorted in by their relevance scores after the constraints are applied.Thanks,Tri.