Re: Internals question: BooleanQuery with many TermQuery children

2009-04-07 Thread Paul Elschot
On Tuesday 07 April 2009 05:04:44 Daniel Noll wrote:
> Hi all.
> 
> This is something I have been wondering for a while but can't find a 
> good answer by reading the code myself.
> 
> If you have a query like this:
> 
>( field:Value1 OR
>  field:Value2 OR
>  field:Value3 OR
>   ... )
> 
> How many TermEnum / TermDocs scans should this execute?
> 
> (a) One per clause, or
> (b) One for the entire boolean query?

One per clause.

> 
> I wonder because we do use a lot of queries of this nature, and I can't 
> find any direct evidence that they get logically merged, leading me to 
> believe that it's one per clause at present (and thus this becomes a 
> potential optimisation.)

The problem is not only the scanning of the TermDocs, but also the
merging by docId (on a heap) that has to take place when several of them
are used at the same time during the query search.

Some optimisations are already in place:
- By allowing docs to be scored out of order, most top-level OR queries
  can be merged with a faster algorithm (a distributive sort over docId
  ranges) that uses the term frequencies (see
  BooleanQuery.setAllowDocsOutOfOrder()).
- Various Filters merge into a bitset using a single TermDocs and
  ignore term frequencies (see MultiTermQuery.getFilter()).
- The new TrieRangeFilter pre-merges ranges at indexing time, also
  ignoring term frequencies.

Using the TermDocs one by one has another advantage in that it
reduces disk seek distances in the index. This is noticeable when
disk heads take more time to move longer distances. SSDs have no
moving heads, so they show smaller performance differences between
merging into a bitset, by distributive sort, and by a heap.

For the time being, Lucene does not have a low-level facility for key values
that occur at most once per document field, so for these it normally helps
to use a Filter.
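As an illustration of that last point, a rough sketch (the class and field
names are examples, and the exact contrib API may differ across Lucene
versions) of collecting such key values with a single bitset-backed filter
instead of many scored TermQuery clauses:

```java
// Sketch only: assumes the TermsFilter from Lucene's contrib/queries
// module and an already-open IndexSearcher named "searcher".
TermsFilter keys = new TermsFilter();
keys.addTerm(new Term("field", "Value1"));
keys.addTerm(new Term("field", "Value2"));
keys.addTerm(new Term("field", "Value3"));

// The filter merges one TermDocs per term into a single bitset,
// ignoring term frequencies, and constrains whatever query does the scoring:
TopDocs hits = searcher.search(new MatchAllDocsQuery(), keys, 10);
```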

Regards,
Paul Elschot


Re: boost and score doubt

2009-04-07 Thread Michael McCandless
Negative boosts are accepted, though rather "unusual".  Also note that
Lucene by default filters out any hits with scores <= 0.0.

Normally you'd set boost to something > 0.0 (0.1 should work).

What unexpected effect are you seeing?

If you omit norms, then indeed your per-doc boost (and per-field
boost, if used) are discarded (have no effect).
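To make that interaction concrete, a small sketch (field names are examples;
this assumes the 2.x Field API):

```java
// Boosts are folded into the field norm at index time.
Document doc = new Document();
Field title = new Field("title", titleText, Field.Store.YES,
        Field.Index.TOKENIZED);
doc.add(title);
doc.setBoost(0.1f);       // takes effect through the stored norm

// If norms are omitted, there is nowhere to record the boost,
// so it is silently discarded:
title.setOmitNorms(true); // doc/field boosts now have no effect
```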

Mike

On Mon, Apr 6, 2009 at 4:01 PM, Marc Sturlese  wrote:
>
> Hey there,
> Does the method doc.setBoost(x.y) accept negative values, or values less
> than 1? I mean... it compiles and doesn't give errors, but the behaviour is
> not exactly what I was expecting.
> In my use case I have the field title... I want to give very, very low
> relevance to documents whose title has fewer than 40 characters. I have
> tried setting the boost to negative values or to 0.1.
> Which is the best way to do that?
> Is there any valid range of values for setting the boost?
>
> And another thing that confuses me... if I omit norms in the score
> function... how does it affect the boost I am setting? Does it lose
> its effect?
>
> Thanks in advance!
> --
> View this message in context: 
> http://www.nabble.com/boost-and-score-doubt-tp22916108p22916108.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>




Re: boost and score doubt

2009-04-07 Thread Marc Sturlese

That was my problem... I was omitting norms, so the boost I gave the
document at index time was not taking effect.
Since I stopped omitting them, the results have changed completely.
Thanks!


Michael McCandless-2 wrote:
> 
> Negative boosts are accepted, though rather "unusual".  Also note that
> Lucene by default filters out any hits with scores <= 0.0.
> 
> Normally you'd set boost to something > 0.0 (0.1 should work).
> 
> What unexpected effect are you seeing?
> 
> If you omit norms, then indeed your per-doc boost (and per-field
> boost, if used) are discarded (have no effect).
> 
> Mike
> 

-- 
View this message in context: 
http://www.nabble.com/boost-and-score-doubt-tp22916108p22925764.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.





How to customize score according to field value?

2009-04-07 Thread Jinming Zhang
Hi,

I have the following situation, which needs the final score customized
according to a field value.

Suppose there are two docs in my query result, and they are ordered by
the default score sort:

doc1(field1:bookA, field2:2000-01-01) -- score:0.80
doc2(field1:bookB, field2:2009-01-01) -- score:0.70

I want "doc2" to have a higher score since its publication date is more
recent, while "doc1" should have a lower score:

doc2(field1:bookB, field2:2009-01-01) -- score:0.77
doc1(field1:bookA, field2:2000-01-01) -- score:0.73

I found this scenario is different from doc.setBoost() and field.setBoost().
Is there any way to influence the score calculated for "doc1" and "doc2"
according to the value of "field2"?

Thank you in advance!


Re: How to customize score according to field value?

2009-04-07 Thread Erick Erickson
Do you want the dates to *influence* or *determine* the order? I
don't have much help to offer if what you're after is something like
"docs that are more recent tend to rank higher", although I vaguely
remember this question coming up on the user list; a search of the
archives might turn up something helpful.

But if you want the date to completely determine the order, you can
always sort by date; see some of the IndexSearcher.search(...sort...)
methods.
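A sketch of the sort-by-date route (the field name "pubdate" is an example
and assumes the date was indexed as a sortable, untokenized value):

```java
// Sketch only: sort hits by an indexed date field, most recent first,
// instead of by relevance score.
Sort byDate = new Sort(new SortField("pubdate", SortField.LONG, true));
TopFieldDocs top = searcher.search(query, null, 10, byDate);
```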


Best
Erick


On Tue, Apr 7, 2009 at 3:08 AM, Jinming Zhang  wrote:

> Hi,
>
> I have the following situation which needs to customize the final score
> according to field value.
>
> Suppose there are two docs in my query result, and they are ordered by
> default score sort:
>
> doc1(field1:bookA, field2:2000-01-01) -- score:0.80
> doc2(field1:bookB, field2:2009-01-01) -- score:0.70
>
> I want "doc2" to have a higher score since its publication date is more
> recent, while "doc1" should have a lower score:
>
> doc2(field1:bookB, field2:2009-01-01) -- score:0.77
> doc1(field1:bookA, field2:2000-01-01) -- score:0.73
>
> I found this scenario is different from doc.setBoost() and
> field.setBoost().
> Is there any way to impact the score calculated for "doc1" & "doc2"
> according to the value of "field2"?
>
> Thank you in advance!
>


RE: Multiple Analyzer on Single field

2009-04-07 Thread Allahbaksh Mohammedali Asadullah
Hi All,
Sorry for the confused email.

Suppose I have a field "text" with the content below:

KeyWordAnalyzer is a class. this keyword is used in java.

Here "KeyWordAnalyzer" should be split into "Key Word Analyzer", and "class"
should be treated as a keyword, so that someone can search on it. Apart from
this, I want "Key Word Analyzer" to be tokenized properly so that search
becomes better.
Regards,
Allahbaksh
 
 
 
-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Monday, April 06, 2009 9:31 PM
To: java-user@lucene.apache.org
Subject: Re: Multiple Analyzer on Single field

This really doesn't make sense. KeywordAnalyzer will NOT
tokenize the input stream. StandardAnalyzer WILL tokenize
the input stream. I can't imagine what it means to do both at
the same time.

Perhaps you could give us some examples of what your desired
inputs and outputs are we could steer you in the right direction.

I suspect you're thinking more in terms of TokenFilters and/or
Tokenizers...

Best
Erick

On Mon, Apr 6, 2009 at 10:52 AM, Allahbaksh Mohammedali Asadullah <
allahbaksh_asadul...@infosys.com> wrote:

> Hi,
> I want to add multiple Analyzer on single field. I want properties of
> KeywordAnalyzer, SimpleAnalyzer, StandardAnalyzer, WhiteSpaceAnalyzer. Is
> there any easy way to have all analyzer bundled on single field.
> Regards,
> Allahbaksh
>
>
>
>
>
>
>
>  CAUTION - Disclaimer *
> This e-mail contains PRIVILEGED AND CONFIDENTIAL INFORMATION intended
> solely
> for the use of the addressee(s). If you are not the intended recipient,
> please
> notify the sender by e-mail and delete the original message. Further, you
> are not
> to copy, disclose, or distribute this e-mail or its contents to any other
> person and
> any such actions are unlawful. This e-mail may contain viruses. Infosys has
> taken
> every reasonable precaution to minimize this risk, but is not liable for
> any damage
> you may sustain as a result of any virus in this e-mail. You should carry
> out your
> own virus checks before opening the e-mail or attachment. Infosys reserves
> the
> right to monitor and review the content of all messages sent to or from
> this e-mail
> address. Messages sent to or from this e-mail address may be stored on the
> Infosys e-mail system.
> ***INFOSYS End of Disclaimer INFOSYS***
>




Re: How to customize score according to field value?

2009-04-07 Thread Tim Williams
On Tue, Apr 7, 2009 at 3:08 AM, Jinming Zhang  wrote:
> Hi,
>
> I have the following situation which needs to customize the final score
> according to field value.
>
> Suppose there are two docs in my query result, and they are ordered by
> default score sort:
>
> doc1(field1:bookA, field2:2000-01-01) -- score:0.80
> doc2(field1:bookB, field2:2009-01-01) -- score:0.70
>
> I want "doc2" to have a higher score since its publication date is more
> recent, while "doc1" should have a lower score:
>
> doc2(field1:bookB, field2:2009-01-01) -- score:0.77
> doc1(field1:bookA, field2:2000-01-01) -- score:0.73
>
> I found this scenario is different from doc.setBoost() and field.setBoost().
> Is there any way to impact the score calculated for "doc1" & "doc2"
> according to the value of "field2"?
>
> Thank you in advance!

If you have access to the MEAP for Lucene in Action, 2nd Edition, it
demonstrates using a CustomScoreQuery [1] to boost a doc's score
based on recency.

--tim

[1] - 
http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/function/CustomScoreQuery.html




Re: Multiple Analyzer on Single field

2009-04-07 Thread Erick Erickson
Hmmm. There's nothing in Lucene that I know of that will do what you
want; you'll have to do one of two things:

In general, you'll have to break up your token stream yourself, either
through pre-processing or by building your own analyzers. There's
nothing already built that I know of that will break up, for instance,
KeyWordAnalyzer into three tokens.
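For instance, the pre-processing route could start with a tiny helper like
this (plain Java, no Lucene involved; the class name is just an example) that
splits camel-case identifiers before the text is indexed:

```java
public class CamelCaseSplitter {

    /** Splits a camel-case identifier, e.g. "KeyWordAnalyzer" -> "Key Word Analyzer". */
    public static String split(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            // insert a space at every lower-to-upper case boundary
            if (i > 0 && Character.isUpperCase(c)
                    && Character.isLowerCase(s.charAt(i - 1))) {
                out.append(' ');
            }
            out.append(c);
        }
        return out.toString();
    }
}
```

Run over the raw text, this turns "KeyWordAnalyzer" into the three searchable
words while leaving ordinary words like "class" untouched.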

Part of the confusion is the use of the phrase "keyword", as in
"and class should be a Key word". If I'm reading this right, you'll
want "class" to be in a separate field since it's special (in your
context). Again, to accomplish this you either need to pre-process
the input stream, extract "class", and put it in a separate field, or
create your own analyzer that extracts only "class" from the
input stream. Then you'd feed the entire contents into *both* fields (say
"content" and "key"). The analyzer attached to the "content" field
(see PerFieldAnalyzerWrapper) would take care of breaking up
things like KeyWordAnalyzer, and the analyzer attached to the
"key" field would throw away everything except "class".
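A rough sketch of that two-field setup (names are examples, and
MyKeywordOnlyAnalyzer is a hypothetical analyzer you would write yourself):

```java
// One analyzer per field via PerFieldAnalyzerWrapper.
PerFieldAnalyzerWrapper wrapper =
        new PerFieldAnalyzerWrapper(new StandardAnalyzer());
wrapper.addAnalyzer("key", new MyKeywordOnlyAnalyzer()); // hypothetical

IndexWriter writer = new IndexWriter(indexPath, wrapper, true);

// Feed the entire contents into BOTH fields; each is analyzed differently.
Document doc = new Document();
doc.add(new Field("content", fullText, Field.Store.YES, Field.Index.TOKENIZED));
doc.add(new Field("key", fullText, Field.Store.NO, Field.Index.TOKENIZED));
writer.addDocument(doc);
```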

Hope this helps
Erick

On Tue, Apr 7, 2009 at 8:57 AM, Allahbaksh Mohammedali Asadullah <
allahbaksh_asadul...@infosys.com> wrote:

> Hi All,
> Sorry for the confused email.
>
> Suppose I have a field "text" with the content below:
>
> KeyWordAnalyzer is a class. this keyword is used in java.
>
> Here "KeyWordAnalyzer" should be split into "Key Word Analyzer", and
> "class" should be treated as a keyword, so that someone can search on it.
> Apart from this, I want "Key Word Analyzer" to be tokenized properly so
> that search becomes better.
> Regards,
> Allahbaksh
>
>
>
>


Re: Lucene and Phrase Correction

2009-04-07 Thread Glyn Darkin
Karl,

Thank you for your in-depth reply. This has given me a good basis to go on.

Regards

Glyn


2009/4/6 Karl Wettin :
> 6 apr 2009 kl. 14.59 skrev Glyn Darkin:
>
> Hi Glyn,
>
>> to be able to spell check phrases
>>
>> E.g
>>
>> "Harry Poter" is converted to "Harry Potter"
>>
>> We have a fixed dataset so can build indexes/ dictionaries from our
>> own data.
>
> the most obvious solution is to index your contrib spell checker with
> shingles. This will, however, probably only help out with exact phrases.
> Perhaps that is enough for you.
>
> If your example is a real one that you came up with by analyzing query logs,
> then you might want to consider creating a "stemmed" index to handle various
> problems associated with reading and writing disorders. Dyslexic people
> often miss out on vowels, those who suffer from dysgraphia have problems
> with q/p/d/b, others have problems with recurring characters, etc. A
> combination of these problems could end up in a secondary "fuzzy" index that
> contains weighted shingles like this for the document that points at
> "harry potter":
>
> "hary poter"^0.9
> "harry #otter"^0.8
> "hary #oter"^0.7
> "hrry pttr"^0.7
> "hry ptr"^0.5
>
> In order to get good precision/recall, your query to such an index would
> have to produce a boolean query containing all of the "stems" above if the
> input was spelled correctly.
>
>
> One alternative to the contrib/spell checker is Spelt:
> http://groups.google.com/group/spelt/ and I believe it is supposed to handle
> phrases.
>
>
> Note the difference between spell checking and suggestion schemes. Something
> can be wrong even though the spelling is correct. Consider the game "Heroes
> of Might and Magic": people might have forgotten what it was called and
> search for "Heroes of light and magic" instead. Hopefully your query would
> still yield a fairly good result for the correct document if the latter was
> entered, but if you require all terms or something similar then it might
> return no hits.
>
>
> More advanced strategies for contextual spell checking of phrases usually
> involve statistical models such as neural networks, hidden Markov models,
> etc. LingPipe contains such an implementation.
>
>
> You can also take a look at reinforcement learning, learning from the
> mistakes and corrections made by your users. It requires a lot of data
> (user query logs) in order to work, but will yield very cool results such as
> acronyms.
>
> LUCENE-626 is a multi-layered spell checker with reinforcement learning at
> the top, backed by an a priori corpus (that can be compiled from old user
> queries) used to find context. It also uses a refactored version of the
> contrib spell checker as a second-level suggestion when there is nothing to
> pick up from previous user behaviour. I never deployed this in a real
> system; it does, however, seem to work great when trained with a few hundred
> thousand query sessions.
>
>
> Finally, I recommend that you take some time to analyze user query sessions
> to find the most common problems your users have, and try to find a
> solution that best fits those problems. Too often, features are implemented
> because they are listed in a specification and not because the users need
> them.
>
>
> I hope this helps.
>
>     karl
>
>
>



-- 
Glyn Darkin

Darkin Systems Ltd
Mob: 07961815649
Fax: 08717145065
Web: www.darkinsystems.com

Company No: 6173001
VAT No: 906350835




Re: How to customize score according to field value?

2009-04-07 Thread patrick o'leary
You might want to play with both boosting and sorting on multiple fields,
and have a look at something like Solr's boost queries and boost
functions:
http://wiki.apache.org/solr/DisMaxRequestHandler#head-6862070cf279d9a09bdab971309135c7aea22fb3

Or, if you want to go down the path of a custom score, most folks
override the customScore method of CustomScoreQuery:

// create a term query to search against all documents
Query tq = new TermQuery(new Term("metafile", "doc"));

FieldScoreQuery fsQuery = new FieldScoreQuery("geo_distance", Type.FLOAT);
CustomScoreQuery customScore = new CustomScoreQuery(tq, fsQuery) {
    @Override
    public float customScore(int doc, float subQueryScore, float valSrcScore) {
        // ...
        return myFunkyScore;
    }
};


You can see a quick version in
http://svn.apache.org/viewvc/lucene/java/trunk/contrib/spatial/src/test/org/apache/lucene/spatial/tier/TestCartesian.java?revision=762801&view=markup

HTH
P

On Tue, Apr 7, 2009 at 9:01 AM, Tim Williams  wrote:

> If you have access to the MEAP for Lucene in Action, 2nd Edition, it
> demonstrates using a CustomScoreQuery [1] to boost a doc's score
> based on recency.
>
> --tim
>
> [1] -
> http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/function/CustomScoreQuery.html
>
>
>


Re: How to search a phrase using quotes in a query ???

2009-04-07 Thread Ariel
Here is my code for indexing:
[code]
public static void main(String[] args) throws IOException {
    if (args.length == 2) {
        String docsDirectory = args[0];
        String indexFilepath = args[1];
        int numIndexed = 0;
        IndexWriter writer;
        ArrayList arrayList = new ArrayList();
        try {
            Analyzer analyzer = new EnglishAnalyzer();
            writer = new IndexWriter(indexFilepath, analyzer, true);
            writer.setUseCompoundFile(true);
            File directory = new File(docsDirectory);
            String[] list = directory.list();
            for (int i = 0; i < list.length; i++) {
                File doc = new File(docsDirectory, list[i]);
                BufferedReader reader;
                try {
                    reader = new BufferedReader(new FileReader(doc));
                    String linea = reader.readLine();
                    StringBuffer texto = new StringBuffer();
                    while (linea != null) {
                        // whatever needs to be done with each line goes here
                        texto.append(linea);
                        linea = reader.readLine();
                    }
                    System.out.println(i);
                    indexFile(writer, texto.toString(), doc.getAbsolutePath());
                    arrayList.add(new String(new byte[1000]));
                    reader.close();
                } catch (FileNotFoundException e) {
                    e.printStackTrace();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
            numIndexed = writer.docCount();
            writer.optimize();
            writer.close();
        } catch (CorruptIndexException e1) {
            e1.printStackTrace();
        } catch (LockObtainFailedException e1) {
            e1.printStackTrace();
        } catch (IOException e1) {
            e1.printStackTrace();
        }
    } else {
        System.err.println("You need to provide arguments");
    }
}

// method to actually index a file using Lucene
private static void indexFile(IndexWriter writer, String content, String title)
        throws IOException {
    long init = System.currentTimeMillis();
    Document doc = new Document();
    doc.add(new Field("content", content, Field.Store.YES,
            Field.Index.TOKENIZED, Field.TermVector.YES));
    doc.add(new Field("title", title, Field.Store.YES,
            Field.Index.TOKENIZED, Field.TermVector.YES));
    writer.addDocument(doc);
    long end = System.currentTimeMillis();
    System.out.println("ms " + (end - init));
}
[/code]

And for searching:
[code]
public static void main(String[] args) {
    String path = "C:\\index";
    try {
        IndexSearcher indexSearcher = new IndexSearcher(path);
        String[] fields = new String[]{"title", "content"};
        Analyzer analyzer = new EnglishAnalyzer();
        String[] textFields = new String[]{"\"The Bank of America\"",
                "\"The Bank of America\""};
        Query query = MultiFieldQueryParser.parse(textFields, fields,
                analyzer);
        Hits hits = indexSearcher.search(query);
        System.out.println("Found: " + hits.length());

        QueryScorer scorer = new QueryScorer(query);
        Highlighter highlighter = new Highlighter(scorer);
        Fragmenter fragmenter = new SimpleFragmenter(100);
        highlighter.setTextFragmenter(fragmenter);

        for (int i = 0; i < hits.length(); i++) {
            Document document = hits.doc(i);
            String body = hits.doc(i).get("content");
            if (body == null) body = "";
            System.out.println((i + 1) + " "
                    + body.substring(0, Math.min(20, body.length())));
            System.out.println(document.get("path"));
            TokenStream stream = analyzer.tokenStream("content",
                    new StringReader(body));
            String[] fragment = highlighter.getBestFragments(stream, body, 3);
            if (fragment.length == 0) {
                fragment = new String[]{""};
            }
            StringBuilder buffer = new StringBuilder();
            for (int j = 0; j < fragment.length; j++) {
                buffer.append(fragment[j]).append("...\n");
            }
            System.out.println(buffer.toString());
        }
    } catch (CorruptIndexException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    } catch (ParseException e) {

Re: How to search a phrase using quotes in a query ???

2009-04-07 Thread Erick Erickson
Well, nothing jumps out at me, although I confess that I've not
used MultiFieldQueryParser. So here's what I'd do.

1> drop back to a simpler way of doing things. Forget about
MultiFieldQueryParser for instance. Get the really simple case
working then build back up. I'd also drop back to a very
basic analyzer (perhaps SimpleAnalyzer). Get that very simple
case to work. Then substitute your EnglishAnalyzer back in. etc.

I'm guessing that one of these steps will suddenly fail and you'll have
a good place to start.

2> Print out query.toString() and paste the results into Luke and
see what it gives you. The Explain (in Luke) should help.

Sorry I can't be more help, but I've often found that getting the easy
way of doing things to work and then adding my complications back in
produces one of those "I didn't think *that* could possibly fail"
moments.

Best
Erick

On Tue, Apr 7, 2009 at 5:12 PM, Ariel  wrote:


Re: How to customize score according to field value?

2009-04-07 Thread 김관호

Jinming Zhang wrote:

Hi,

I have the following situation which needs to customize the final score
according to field value.

Suppose there are two docs in my query result, and they are ordered by
default score sort:

doc1(field1:bookA, field2:2000-01-01) -- score:0.80
doc2(field1:bookB, field2:2009-01-01) -- score:0.70

I want "doc2" to have a higher score since its publication date is more
recent, while "doc1" should have a lower score:

doc2(field1:bookB, field2:2009-01-01) -- score:0.77
doc1(field1:bookA, field2:2000-01-01) -- score:0.73

I found this scenario is different from doc.setBoost() and field.setBoost().
Is there any way to impact the score calculated for "doc1" & "doc2"
according to the value of "field2"?

Thank you in advance!

  

hi,

If I were you, I would store the date information as a long
(as far as I know, Lucene stores any date information as a long
internally, so it should be a natural fit; you can convert between
the date and long representations very easily using Lucene's date
APIs) and define a linear function over that date value.


The input of the function is a date and the output is a simple float
value which indicates how recent a book is. Since more recent books
get linearly larger function values, you can finely adjust your score
by weighting the output of the function according to your ranking policy.


After that, simply modify your docs' scores on the fly by using the
function's output in customScore(), which is mentioned in patrick
o'leary's reply to your question.
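The linear mapping described above can be sketched in plain Java (the
class name, parameter names, and date ranges are made-up examples; only
the shape of the function matters):

```java
public class RecencyWeight {

    /**
     * Hypothetical linear recency function: maps a publication time
     * (epoch millis) onto [minW, maxW], clamping at the two reference
     * dates. More recent books get linearly larger values.
     */
    public static float weight(long pubMillis, long oldestMillis,
                               long newestMillis, float minW, float maxW) {
        if (pubMillis <= oldestMillis) return minW;
        if (pubMillis >= newestMillis) return maxW;
        float t = (float) (pubMillis - oldestMillis)
                / (float) (newestMillis - oldestMillis);
        return minW + t * (maxW - minW);
    }
}
```

The returned float can then be multiplied into subQueryScore inside
customScore() to favour recent documents.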


bye.








test

2009-04-07 Thread Antony Joseph
hi



--
DigitalGlue, India



RE: test

2009-04-07 Thread Antony Joseph
Hi,

In a long-running process, Lucene crashes in my application. Is there any
way to diagnose this, or how can I turn on debug / trace logging for
Lucene?


Thanks
Antony


--
DigitalGlue, India



