Re: Matching on "owned" docs -- filter or query? Or sort?

2012-07-23 Thread Erick Erickson
Well, if you indexed only a single document per title, with multiple owner IDs
in that document, you wouldn't get multiple documents back for a particular
title.

And the grouping code (http://wiki.apache.org/solr/FieldCollapsing describes
the Solr-level feature, but I assume it's all implemented at the Lucene level)
allows searches within groups to be different from searches between groups.

That may or may not work for you, though.
Erick

On Sun, Jul 22, 2012 at 1:33 PM, Uncle  wrote:
> Thanks for the reply.  I thought of using boosting, for example "((userId:14 
> AND title:have)^10 OR (title:have))" or "((userId:14^10 AND title:have) OR 
> (title:have))" or something like that.  However, there would still be 
> duplicates (all 3 docs for "To Have and To Have Not" would be included 
> whereas I would only want the one I own to be there).  This also requires 
> using the scoring for sorting so I can't apply other sorting (I would want to 
> sort the results secondarily by title for example). I might be able to go 
> this route, but it seems like some combination of custom filtering and 
> sorting would work better.
>
> I thought of somehow doing an empty query to fetch all docs, sorting them to 
> put docs with the userId first, and then running a DuplicateFilter on title 
> with KM_USE_FIRST_OCCURRENCE.  This is the duplicate elimination behavior I 
> want.  Then do a text search on the remainder.  But this seems very expensive.
>
> Randy
>
> On Jul 22, 2012, at 11:33 AM, Erick Erickson wrote:
>
>> Hmmm, what about simply boosting very high on owner, and probably
>> grouping on title?
>>
>> If you boosted on owner, you wouldn't even have to index the title
>> separately for each user, your "owner" field could be multivalued and
>> contain _all_ the owner IDs. In that case you wouldn't have to group
>> at all.
>>
>> Best
>> Erick
>>
>> On Sun, Jul 22, 2012 at 11:06 AM, Uncle  wrote:
>>> I also posted this to StackOverflow, apologies if you see this twice.
>>>
>>> I have a data set in which documents are associated with a user id. Say that
>>> the documents represent books, and each book can have one or more owners. I
>>> am indexing the titles with Lucene. When searching, I want all results
>>> owned by me to be sorted at the top of the results, before results that are
>>> not owned by me. So the data might look like:
>>>
>>> Owner ID   Book Title
>>> --------   -----------------------
>>> 13         To Have and To Have Not
>>> 14         To Have and To Have Not
>>> 19         To Have and To Have Not
>>> 18         Have a Little Faith
>>> 15         Snow Crash
>>> 17         Snow Crash
>>> 18         Cryptonomicon
>>> 14         Of Mice And Men
>>> 17         Flash Crash
>>>
>>> Say that my user id is 14 and I search on "have", I want to match on both 
>>> "To Have and To Have Not" and "Have a Little Faith", but "To Have and To 
>>> Have Not" should show up higher in my search results, because I own it.  
>>> Similarly, if I am user id 15 and search for "Crash", I will match both
>>> "Snow Crash" and "Flash Crash", but "Snow Crash" should show up first
>>> because I own it.  If I am user id 14 and I search for "crash", I would 
>>> still get a match for "Snow Crash" even though I don't own it.  If I did a 
>>> fuzzy match for "a" which would match almost all of these titles, I would 
>>> see those that I own before I see the others.
>>>
>>> I am a little stuck on whether this is a query, filter, custom sort, or 
>>> some combination, and how to get the best performance.  For example, if I 
>>> could write a filter that eliminates all duplicate titles, giving 
>>> preference to those owned by me, I could then just perform a search on the 
>>> remainder (assuming that filters are applied before searches). Then, a 
>>> custom sort based on whether or not I own the doc would be straightforward.
>>>
>>> But I am not sure how to implement the filter. It is not a simple 
>>> DuplicateFilter because it operates on two fields. It is similar to the 
>>> security filter example in section 5.6.7 of Lucene in Action, except that I 
>>> still want to be able to see documents that I don't own, if I don't own a 
>>> book with the same title. The custom filter in section 6.4 is also close, 
>>> but my problem is more complex because it depends on two fields.
>>>
>>> While iterating over the documents, the filter would have to remember which 
>>> titles have been seen, and then keep the ones that I own. For example if it 
>>> iterated over the values above in order, it would see the title "To Have 
>>> and To Have Not", not owned by me; and then see the same title again, owned 
>>> by me, and have to know that it should drop the first doc and keep the 
>>> second. I can't think of how to do this without using a lot of memory, 
>>> essentially keeping all titles in memory while iterating, which seems very 
>>> expensive. It isn't a simple "match" fun
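
One way to read Erick's single-document-per-title suggestion is sketched below in plain Java. This is only an illustration of the desired result ordering, not Lucene code: the class OwnedFirst, its methods, and the substring match standing in for a real title query are all assumptions made for this example. It collapses the sample rows into one entry per title holding a set of owner IDs, then sorts the matches so that titles owned by the searching user come first, with the title as the secondary sort key.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration only; none of these names come from Lucene.
class OwnedFirst {

    // Sample data from the thread: OWNERS[i] owns TITLES[i].
    static final int[] OWNERS = {13, 14, 19, 18, 15, 17, 18, 14, 17};
    static final String[] TITLES = {
        "To Have and To Have Not", "To Have and To Have Not", "To Have and To Have Not",
        "Have a Little Faith", "Snow Crash", "Snow Crash",
        "Cryptonomicon", "Of Mice And Men", "Flash Crash"
    };

    /** Collapse (owner, title) rows into one entry per title holding all owner IDs. */
    static Map<String, Set<Integer>> collapse(int[] owners, String[] titles) {
        Map<String, Set<Integer>> ownersByTitle = new LinkedHashMap<>();
        for (int i = 0; i < titles.length; i++) {
            ownersByTitle.computeIfAbsent(titles[i], t -> new TreeSet<>()).add(owners[i]);
        }
        return ownersByTitle;
    }

    /** Titles containing the word (case-insensitive), owned titles first, then by title. */
    static List<String> search(Map<String, Set<Integer>> ownersByTitle, String word, int userId) {
        String needle = word.toLowerCase();
        List<String> hits = new ArrayList<>();
        for (String title : ownersByTitle.keySet()) {
            if (title.toLowerCase().contains(needle)) {
                hits.add(title);
            }
        }
        // Boolean false sorts before true, so "not owned == false" puts owned titles first.
        hits.sort(Comparator
                .comparing((String t) -> !ownersByTitle.get(t).contains(userId))
                .thenComparing(Comparator.<String>naturalOrder()));
        return hits;
    }

    public static void main(String[] args) {
        Map<String, Set<Integer>> books = collapse(OWNERS, TITLES);
        System.out.println(search(books, "have", 14));
        System.out.println(search(books, "crash", 15));
    }
}
```

Because each title appears only once after collapsing, no duplicate elimination is needed, and the ownership test becomes a plain sort key rather than a score. In a real index the same idea would correspond to a multivalued owner field, as Erick describes.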

Re: QueryParser and BooleanQuery

2012-07-23 Thread Deepak Shakya
Hey Jack,

Can you let me know how I should do that? I am using Lucene 3.6
and I don't see any parse() method on StandardAnalyzer.


On Mon, Jul 23, 2012 at 8:47 AM, Jack Krupansky wrote:

> Yes, I failed to notice that the removal of the slash was yet another
> instance of the analyzer transforming its input. But the bottom line is
> that you must do 100% of the same steps that analysis performs. If in
> doubt, pass your literals through the standard analyzer itself.
>
>
> -- Jack Krupansky
>
> -----Original Message-----
> From: Deepak Shakya
> Sent: Sunday, July 22, 2012 9:35 PM
> To: java-user@lucene.apache.org
> Subject: Re: QueryParser and BooleanQuery
>
>
> I tried changing the case to lower case, but still the BooleanQuery doesn't
> return any documents.
>
> I see that the text "/blank" is converted to "blank" in the QueryParser.
> But in BooleanQuery it remains the same. When I remove the forward slash
> sign from the input string, I get the matched documents with BooleanQuery.
> Does the StandardAnalyzer strip special characters like this as
> well?
>
> On Sun, Jul 22, 2012 at 8:58 PM, Jack Krupansky wrote:
>
>> The query parser/analyzer is lower-casing the query terms automatically.
>> You have to do the same with the terms for BooleanQuery:
>> Term("cs-method", "GET") should be Term("cs-method", "get").
>>
>> StandardAnalyzer is doing the lower-casing.
>>
>> -- Jack Krupansky
>>
>> -----Original Message-----
>> From: Deepak Shakya
>> Sent: Sunday, July 22, 2012 10:17 AM
>> To: java-user@lucene.apache.org
>> Subject: QueryParser and BooleanQuery
>>
>>
>> Hi,
>>
>> I have the following dataset indexed in Lucene.
>> 2010-04-21 02:24:01 GET /blank 200 120
>> 2010-04-21 02:24:01 GET /US/registrationFrame 200 605
>> 2010-04-21 02:24:02 GET /US/kids/boys 200 785
>> 2010-04-21 02:24:02 POST /blank 304 56
>> 2010-04-21 02:24:04 GET /blank 304 233
>> 2010-04-21 02:24:04 GET /blank 500 567
>> 2010-04-21 02:24:04 GET /blank 200 897
>> 2010-04-21 02:24:04 POST /blank 200 567
>> 2010-04-21 02:24:05 GET /US/search 200 658
>> 2010-04-21 02:24:05 POST /US/shop 200 768
>> 2010-04-21 02:24:05 GET /blank 200 347
>>
>> I am querying it in two ways, first with QueryParser and then with
>> BooleanQuery.
>>
>> *QueryParser version:*
>>
>> Query q = new QueryParser(version, "cs-method", new
>> StandardAnalyzer(version)).parse("cs-method:GET AND cs-uri:/blank");
>>
>>
>> *BooleanQuery version:*
>>
>> BooleanQuery q = new BooleanQuery();
>> q.add(new TermQuery(new Term("cs-method", "GET")),
>>       BooleanClause.Occur.SHOULD);
>> q.add(new TermQuery(new Term("cs-uri", "/blank")),
>>       BooleanClause.Occur.SHOULD);
>>
>> When I run the two versions, I am able to match the documents with the
>> QueryParser version, but not with BooleanQuery. The output is as follows:
>>
>> *QueryParser output:*
>>
>> Total Number of Documents - 11
>> Query --> +cs-method:get +cs-uri:blank
>> Total Clues Found - 5
>>
>> *BooleanQuery output:*
>>
>> Total Number of Documents - 11
>> Query --> cs-method:GET cs-uri:/blank
>> Total Clues Found - 0
>>
>> Does anybody know why the BooleanQuery doesn't return any documents while
>> QueryParser does? Also, how can I change the BooleanQuery to work for the
>> above case?
>>
>> --
>> With Regards,
>> Deepak Shakya
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>>
>
> --
> With Regards,
> Deepak Shakya
> http://www.google.com/profiles/justdpk
>


-- 
With Regards,
Deepak Shakya
http://www.google.com/profiles/justdpk


Re: QueryParser and BooleanQuery

2012-07-23 Thread Ian Lea
QueryParser returns a query.  Just add that to the BooleanQuery.

QueryParser qp = ...;
BooleanQuery bq = new BooleanQuery();
Query parsedq = qp.parse("...");
bq.add(parsedq, ...);



--
Ian.


On Mon, Jul 23, 2012 at 1:16 PM, Deepak Shakya  wrote:
> Hey Jack,
>
> Can you let me know how I should do that? I am using Lucene 3.6
> and I don't see any parse() method on StandardAnalyzer.
>
>




Re: using phrase query with wildcard

2012-07-23 Thread Ahmet Arslan
> I'm trying to create a phrase query with a wildcard; from the
> forums it seems that the solution is not trivial.
> I'm trying to create queries such as "this is a
> phrase*" OR "*This is a phrase" and
> get hits on every possibility where the * resides.
> What is the best way to achieve this?

Some pointers: 

https://issues.apache.org/jira/browse/LUCENE-1486

http://lucene.apache.org/core/3_6_1/api/all/org/apache/lucene/queryParser/complexPhrase/ComplexPhraseQueryParser.html

http://wiki.apache.org/lucene-java/LuceneFAQ#Can_I_combine_wildcard_and_phrase_search.2C_e.g._.22foo_ba.2A.22.3F




Re: Usage of NoMergePolicy and its potential implications

2012-07-23 Thread Ian Lea
I can't answer your questions, but use of Lucene's document IDs as
persistent IDs is strongly discouraged, particularly in version 4.x
where I think it just won't work at all.  There was a related thread a
couple of weeks ago.  See Uwe's message at
http://mail-archives.apache.org/mod_mbox/lucene-java-user/201207.mbox/%3C033f01cd606a$1ad94960$508bdc20$%40thetaphi.de%3E
where he says "To uniquely identify documents later you *have* to use
a own key field."


--
Ian.


On Mon, Jul 23, 2012 at 12:17 AM, snehal.chennuru wrote:
> Hello Everyone,
>
> We have a legacy system which uses Lucene 2.4.1. Back then we ported a small
> hack into the Lucene source code so that the underlying segment merger code
> wouldn't reuse deleted docids. This let us use Lucene docids as persistent
> dbids as well. We now want to upgrade to Lucene 3.6, but it is nearly
> impossible to "hack" Lucene to get the same behavior.
>
> I checked out NoMergePolicy, and it seems to achieve similar behavior by
> not letting Lucene reuse deleted docids. But I guess this would increase
> the number of segments in the index. Any idea how many segments we are
> talking about here? Also, can we configure how many documents Lucene keeps
> in a given segment? Each Lucene index in this system can have at most 1M
> documents. Is there an alternative that I can consider?
>
> Thanks,
> Snehal
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Usage-of-NoMergePolicy-and-its-potential-implications-tp3996630.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>



FixedStraightBytesImpl - flushing

2012-07-23 Thread Simon McDuff

Hello, (LUCENE 4.0.0-ALPHA)

We are using the DocValues features (very nice).

We are using FixedBytesRef.

In that specific case, we were wondering why it flushes only at the end (when
we commit).

Wouldn't it be more efficient (for memory) to write its buffer as it goes?

Thank you

Simon
  

Re: Usage of NoMergePolicy and its potential implications

2012-07-23 Thread snehal.chennuru
Thanks for the heads up Ian. I know it is highly discouraged. But, like I
said, it is a legacy application and it is very hard to go back and re-do
it.  



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Usage-of-NoMergePolicy-and-its-potential-implications-tp3996630p3996784.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.




Re: QueryParser and BooleanQuery

2012-07-23 Thread Trejkaz
On Mon, Jul 23, 2012 at 10:16 PM, Deepak Shakya  wrote:
> Hey Jack,
>
> Can you let me know how I should do that? I am using Lucene 3.6
> and I don't see any parse() method on StandardAnalyzer.

In your case, presumably at indexing time you should be using a
PerFieldAnalyzerWrapper with cs-uri getting a KeywordAnalyzer.

If you pass that analyser in when you construct the QueryParser, it
won't remove the slash.

The main thing is that you should use the same analyser for indexing
and searching.

TX
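
The rule in this thread (a query term must go through the same analysis as the indexed text) can be illustrated with a toy model in plain Java. Nothing below is Lucene API: the class SameAnalysisDemo and its analyze(), buildIndex() and termLookup() methods are hypothetical, and analyze() is only a crude approximation of what an analyzer such as StandardAnalyzer does (split on non-alphanumeric characters, then lower-case). The sketch indexes the eleven log lines quoted above and shows why looking up the raw literals finds nothing while the analyzed forms do.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy model only: analyze() is a crude stand-in for an analyzer, not Lucene code.
class SameAnalysisDemo {

    // {cs-method, cs-uri} pairs from the log lines quoted in this thread.
    static final String[][] LOG = {
        {"GET", "/blank"}, {"GET", "/US/registrationFrame"}, {"GET", "/US/kids/boys"},
        {"POST", "/blank"}, {"GET", "/blank"}, {"GET", "/blank"}, {"GET", "/blank"},
        {"POST", "/blank"}, {"GET", "/US/search"}, {"POST", "/US/shop"}, {"GET", "/blank"}
    };

    /** Split on non-alphanumeric characters and lower-case each token. */
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.split("[^A-Za-z0-9]+")) {
            if (!token.isEmpty()) {
                tokens.add(token.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    /** Build "field:term" -> document ids, analyzing every value on the way in. */
    static Map<String, Set<Integer>> buildIndex(String[][] log) {
        Map<String, Set<Integer>> index = new HashMap<>();
        for (int doc = 0; doc < log.length; doc++) {
            addTerms(index, doc, "cs-method", log[doc][0]);
            addTerms(index, doc, "cs-uri", log[doc][1]);
        }
        return index;
    }

    private static void addTerms(Map<String, Set<Integer>> index, int doc, String field, String value) {
        for (String term : analyze(value)) {
            index.computeIfAbsent(field + ":" + term, k -> new TreeSet<>()).add(doc);
        }
    }

    /** Exact lookup of one term, with no analysis applied to it. */
    static Set<Integer> termLookup(Map<String, Set<Integer>> index, String field, String term) {
        Set<Integer> docs = index.get(field + ":" + term);
        return docs == null ? Collections.<Integer>emptySet() : docs;
    }

    public static void main(String[] args) {
        Map<String, Set<Integer>> index = buildIndex(LOG);
        // Raw literal: the index never stored a term with a slash in it.
        System.out.println(termLookup(index, "cs-uri", "/blank"));
        // Same literal after analysis: now it lines up with the indexed term.
        System.out.println(termLookup(index, "cs-uri", analyze("/blank").get(0)));
    }
}
```

The alternative described above, keeping cs-uri unanalyzed at both index and query time, fixes the mismatch from the other side: the slash then stays in the indexed term, so the raw literal matches.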
