Re: Building maven artifacts

2010-07-19 Thread Pavel Minchenkov
Hi,

I don't know. I tried to set up something like this:
 

But the error is the same. Are there perhaps other parameters?

2010/7/16 Zhang, Lisheng 

> Hi,
>
> I have never done this kind of build before, but just from the error message
> I guess it could mean two variables:
>
> ${project.artifactId}
> ${project.version}
>
> are not defined (otherwise exact jar file name would be printed out)?
>
> Could it be some environment setup issue?
>
> Best regards, Lisheng
>
> -Original Message-
> From: Pavel Minchenkov [mailto:char...@gmail.com]
> Sent: Friday, July 16, 2010 8:35 AM
> To: java-user@lucene.apache.org; solr-u...@lucene.apache.org
> Subject: Building maven artifacts
>
>


API to retrieve search results without scoring or sorting

2010-07-19 Thread Naveen Kumar
Hi,
Is there any API with which I can retrieve search results that are neither
scored nor sorted (for performance reasons)? I just need the results and
don't need any extra computation on them.
Any suggestions would be very helpful.

-- 
Thanks
Naveen Kumar


Re: API to retrieve search results without scoring or sorting

2010-07-19 Thread Yonik Seeley
On Mon, Jul 19, 2010 at 6:14 AM, Naveen Kumar  wrote:
> Is there any API using which I can retrieve search results, such that they
> are neither scored nor sorted (for performance reasons). I just need the
> results, don't need any extra computation on that.

Use your own custom Collector class.
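
A minimal sketch of what such a collector has to do, written in plain Java so
it compiles without Lucene on the classpath. In Lucene 2.9 the real base class
is org.apache.lucene.search.Collector, whose callbacks are setScorer, collect,
setNextReader and acceptsDocsOutOfOrder. HitCallback below is a hypothetical
stand-in that mirrors only the parts needed here, and UnscoredHitCollector is
an illustrative name, not a Lucene class. The point it shows: ignore the
scorer, add the per-segment docBase to each id, and keep ids in a bit set
instead of a sorted priority queue.

```java
import java.util.BitSet;

/** Hypothetical stand-in for the per-segment callbacks of Lucene 2.9's Collector. */
abstract class HitCallback {
    /** Called once per index segment; docBase is the id offset of that segment. */
    abstract void setNextReader(int docBase);

    /** Called once per matching document, with a segment-relative id. */
    abstract void collect(int doc);

    /** Returns true if the order of collect() calls does not matter. */
    abstract boolean acceptsDocsOutOfOrder();
}

/** Records matching document ids only; it never asks for a score and never sorts. */
class UnscoredHitCollector extends HitCallback {
    private final BitSet hits = new BitSet();
    private int docBase = 0;

    @Override
    void setNextReader(int docBase) {
        // Remember the offset so segment-relative ids can be made index-wide.
        this.docBase = docBase;
    }

    @Override
    void collect(int doc) {
        // No scoring work here: just mark the index-wide document id.
        hits.set(docBase + doc);
    }

    @Override
    boolean acceptsDocsOutOfOrder() {
        // A bit set does not care about arrival order.
        return true;
    }

    BitSet hits() {
        return hits;
    }
}
```

With the real API the same class would extend Collector, leave setScorer
empty, and be passed to Searcher.search(Query, Collector).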

-Yonik
http://www.lucidimagination.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Get lengthNorm of a field

2010-07-19 Thread Philippe

Hi,

Is there a way to retrieve the lengthNorm for all fields (or for a
specific field) of a given document?


Regards,
Philippe




Re: Get lengthNorm of a field

2010-07-19 Thread Yonik Seeley
On Mon, Jul 19, 2010 at 9:53 AM, Philippe  wrote:
> is there a possibility to retrieve the lengthNorm for all (or a specific)
> fields in a specific document?

See IndexReader: public abstract byte[] norms(String field) throws IOException;
And Similarity: public float decodeNormValue(byte b) {

The byte[] is indexed by document id, and you can decode that into a
float value with a Similarity.
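
A small sketch of the decoding step, assuming the default norm encoding, which
as far as I know is SmallFloat.byte315ToFloat. The arithmetic is reproduced
here in plain Java so the example runs without Lucene. NormDecoder and
normForDoc are hypothetical names; in real code you would call the decode
method on Similarity rather than copy this.

```java
/** Decodes one-byte norms, assuming the SmallFloat.byte315ToFloat encoding. */
class NormDecoder {

    /** Expands a single norm byte into its float value. */
    static float decodeNorm(byte b) {
        if (b == 0) {
            return 0.0f; // zero is stored as a special case
        }
        int bits = (b & 0xff) << (24 - 3); // move the stored bits into float position
        bits += (63 - 15) << 24;           // restore the exponent offset
        return Float.intBitsToFloat(bits);
    }

    /** The norms array is indexed by document id, as described above. */
    static float normForDoc(byte[] norms, int docId) {
        return decodeNorm(norms[docId]);
    }
}
```

Because only one byte is stored per document and field, the decoded value is
coarse: under this encoding neighbouring bytes decode to 1.0 and 1.25, for
example, with nothing in between.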

-Yonik
http://www.lucidimagination.com




> Regards,
>    Philippe



Re: Get lengthNorm of a field

2010-07-19 Thread Philippe

Hi Yonik,

On 19.07.2010 16:21, Yonik Seeley wrote:
> On Mon, Jul 19, 2010 at 9:53 AM, Philippe  wrote:
>> is there a possibility to retrieve the lengthNorm for all (or a specific)
>> fields in a specific document?
>
> See IndexReader: public abstract byte[] norms(String field) throws IOException;
> And Similarity: public float decodeNormValue(byte b) {
>
> The byte[] is indexed by document id, and you can decode that into a
> float value with a Similarity.

Thanks for the quick reply. I was looking for methods in IndexSearcher,
which is why I did not find the norms method.

Cheers,
Philippe



Re: Scoring exact matches higher in a stemmed field

2010-07-19 Thread Shai Erera
If your analyzer outputs b and b$ in the same position, then the below query
is already what the QP outputs today. If you want to incorporate
boosting, I suggest that you extend QP, override newTermQuery for
example, and if the term is a stemmed term, then set the query's boost
(Query.setBoost) accordingly. Would that work for you?
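
A minimal sketch of the per-term decision such a newTermQuery override would
make, kept as a pure function so it runs without Lucene. It assumes the goal
from earlier in the thread, namely that the "$"-marked (exact) term should
count more than the bare stem. ExactTermBoost, boostFor and the factor passed
in are hypothetical; in a real parser subclass the returned value would be
handed to Query.setBoost.

```java
/** Decides the boost for a term based on the "$" exact-form marker from the thread. */
class ExactTermBoost {
    static final String EXACT_MARKER = "$";

    /**
     * Returns exactBoost for terms carrying the marker, and the neutral
     * boost 1.0 for everything else (including a lone "$").
     */
    static float boostFor(String termText, float exactBoost) {
        boolean exact = termText.length() > 1 && termText.endsWith(EXACT_MARKER);
        return exact ? exactBoost : 1.0f;
    }
}
```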

You'll need to check whether you want to boost terms inside phrases, or
entire phrases, and then override more methods from QP. But that approach
will get you the native product of the engine, I think. Alternatively, you
can set a payload on the stemmed terms and incorporate that into Similarity,
but that's more costly.

I don't follow: what's been deprecated on Sim that you cannot use anymore?
All I see are 3 deprecated static methods which are related to norms ...

Shai

On Sat, Jul 17, 2010 at 9:04 PM, Itamar Syn-Hershko wrote:

> Shai, you got it right. I want to be able to send "b bb" through the QP
> with my custom analyzer, and get back "(b b$) (b bb$)" -- 2 terms with 2
> tokens in the same position for each.
>
> I want this to be a native product of the engine, as opposed to forcing
> this from the query end. I'm using different types of queries (Bool,
> DisMax), and I'm actually interested in using the QP itself. Instead of
> going through all sub-queries post-parsing and boosting terms ending with $,
> I want some sort of a plugin mechanism to do this for me per result. The
> easiest path would be subclassing Similarity, if only the relevant functions
> hadn't been deprecated...
>
> Are there any other ways to do so? For example, is this doable with
> function queries (since access to the actual term is required)?
>
> Itamar.
>
> On 16/7/2010 8:01 PM, Shai Erera wrote:
>
>> Depends for which query no? ;)
>>
>> Sounds like you want to simulate the QP behavior
>> http://lucene.apache.org/java/2_4_0/queryparsersyntax.html for
>> boosting. Meaning, if for the query "b" you want to simulate the query
>> "b OR b$^2" and have matches of b$ count more than b, then I'd follow
>> how QP does it - create the query programmatically or something (I'm
>> not near the code at the moment so I cannot give a more concrete
>> approach).
>>
>> If you want b and b$ to count the same, then that's already the
>> behavior - i.e., docs containing both will score higher.
>>
>> If I misunderstood your question, then please correct me.
>>
>> Shai
>>
>> On Friday, July 16, 2010, Itamar Syn-Hershko  wrote:
>>
>>
>>> Hi all,
>>>
>>>
>>> Consider the following string: "the buffalo buffaloes" [1].
>>>
>>>
>>> When passed through a stemming analyzer, the resulting token would be
>>> "buffalo buffalo" (assuming a good stemmer).
>>>
>>>
>>> To enable exact searches, say I mark the original term and index it at
>>> the same term position. So "the buffalo buffaloes" ->  (buffalo buffalo$)
>>> (buffalo buffaloes$) - now exact searches are allowed on the same field
>>> without having 2 different fields [2].
>>>
>>>
>>> However, with this approach default scoring isn't working well. What is
>>> my best option at upgrading a match for an exact match of this sort, also
>>> when using the same stemming analyzer, without using payloads on the marked
>>> token?
>>>
>>>
>>> In other words - how do I make documents containing "the buffalo
>>> buffaloes" considered more relevant than docs containing the word "buffalo"
>>> only once?
>>>
>>>
>>> The trick here is to boost the marked token if found at search time.
>>> While this sounds easy to do, I can't find the best approach on implementing
>>> this - esp. since Similarity.float Idf(Index.Term term, Searcher searcher)
>>> seems to have been deprecated for some reason.
>>>
>>>
>>> Itamar.
>>>
>>>
>>> [1]
>>> http://en.wikipedia.org/wiki/Buffalo_buffalo_Buffalo_buffalo_buffalo_buffalo_Buffalo_buffalo :)
>>>
>>> [2] Rationale:
>>> http://www.code972.com/blog/2010/07/more-flexible-hebrew-indexing-hebmorph/
>>>
>>>


RE: Building maven artifacts

2010-07-19 Thread Zhang, Lisheng
Hi Pavel,

I have not done this build myself; I sent my last message based on my
experience using ant on other projects. Maybe people who have
worked on maven artifacts could help?

Best regards, Lisheng

-Original Message-
From: Pavel Minchenkov [mailto:char...@gmail.com]
Sent: Monday, July 19, 2010 3:03 AM
To: java-user@lucene.apache.org
Subject: Re: Building maven artifacts


Hi,

I don't know. I tried to set up something like this:
 

But the error is the same. Are there perhaps other parameters?

2010/7/16 Zhang, Lisheng 

> Hi,
>
> I have never done this kind of build before, but just from the error message
> I guess it could mean two variables:
>
> ${project.artifactId}
> ${project.version}
>
> are not defined (otherwise exact jar file name would be printed out)?
>
> Could it be some environment setup issue?
>
> Best regards, Lisheng
>
> -Original Message-
> From: Pavel Minchenkov [mailto:char...@gmail.com]
> Sent: Friday, July 16, 2010 8:35 AM
> To: java-user@lucene.apache.org; solr-u...@lucene.apache.org
> Subject: Building maven artifacts
>
>




Re: Scoring exact matches higher in a stemmed field

2010-07-19 Thread Itamar Syn-Hershko

On 19/7/2010 5:50 PM, Shai Erera wrote:

> If your analyzer outputs b and b$ in the same position, then the below query
> is already what the QP outputs today. If you want to incorporate
> boosting, I suggest that you extend QP, override newTermQuery for
> example, and if the term is a stemmed term, then set the query's boost
> (Query.setBoost) accordingly. Would that work for you?
   
I want to avoid overriding the QP, and do this as a pluggable extension. 
What other options do I have other than what you've suggested?


Ideally, that would be through a class or a function I can override or 
extend, so each term hit while searching will be examined. By checking 
its type and text (for suffix), that interface could double its weight 
(or boost). The similarity functions I mentioned could have provided 
this ability (see below). How can this be done without them?

> You'll need to check whether you want to boost terms inside phrases, or
> entire phrases, and then override more methods from QP. But that approach
> will get you the native product of the engine, I think.

Just to make sure we are on the same page here, here's an example 
(assuming the default tf/idf implementation in Lucene).


I want to make sure anyone searching for "song of songs" will find texts 
discussing the biblical book, and have them ranked the highest, instead 
of having short texts containing one word "song" score higher.


So what I do is have my stemming analyzer save the string "song of 
songs" like this, where each parenthesis represents a token position: 
(song song$) (song songs$).
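
A rough sketch of that token layout in plain Java, with no Lucene dependency.
Each surviving word yields its stem with position increment 1, followed by the
original word plus "$" with position increment 0, so both sit at the same
position. The stemmer, the stop-word list and the class names (MarkedToken,
ExactMarkingSketch) are toy stand-ins invented for illustration; a real
implementation would be a TokenFilter inside the analyzer, and it might leave
a position hole where a stop word was removed, which this sketch does not.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** A term plus its position increment (0 means "same position as the previous token"). */
class MarkedToken {
    final String term;
    final int positionIncrement;

    MarkedToken(String term, int positionIncrement) {
        this.term = term;
        this.positionIncrement = positionIncrement;
    }

    @Override
    public String toString() {
        return term + "/" + positionIncrement;
    }
}

class ExactMarkingSketch {
    // Toy stop-word list, just enough for the examples in this thread.
    static final Set<String> STOP_WORDS = new HashSet<String>(Arrays.asList("the", "of"));

    /** Toy stemmer: strips a trailing "es" or "s". Not a real stemming algorithm. */
    static String toyStem(String word) {
        if (word.length() > 4 && word.endsWith("es")) {
            return word.substring(0, word.length() - 2);
        }
        if (word.length() > 3 && word.endsWith("s")) {
            return word.substring(0, word.length() - 1);
        }
        return word;
    }

    /** Emits (stem, original$) pairs, both at the same position. */
    static List<MarkedToken> analyze(String text) {
        List<MarkedToken> out = new ArrayList<MarkedToken>();
        for (String word : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (word.isEmpty() || STOP_WORDS.contains(word)) {
                continue;
            }
            out.add(new MarkedToken(toyStem(word), 1)); // stem advances the position
            out.add(new MarkedToken(word + "$", 0));    // exact form stacks on top of it
        }
        return out;
    }
}
```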


The part I'm missing is how to score terms with suffixes higher. The 
best approach seems to be to look at the term read by IndexReader and 
boost that hit somehow. The assumption is that if IndexReader has read 
the term songs$, it was searched for, and therefore it is the exact 
word that was queried.


Which is the best Lucene part to hijack for this mission?

> Alternatively, you can set a payload on the stemmed terms and incorporate
> that into Similarity, but that's more costly.
   
I had mentioned payloads. They would get me exactly what I want but, as 
you say, they are quite costly when used for almost every term in the index. 
If I could replace the suffix with payloads I would have done so (byte 
vs. byte), but I'm using the suffix for one other thing.

> I don't follow: what's been deprecated on Sim that you cannot use anymore?
> All I see are 3 deprecated static methods which are related to norms ...
   

In 2.3.2 there were these functions:

public float idf(Term term, Searcher searcher)

public float idf(Collection terms, Searcher searcher)

These have been deprecated somewhere between that version and 2.9.2, and 
it seems like I could have used those for what I'm trying to do.


Thanks,

Itamar.




How to modify a document Field before the document is indexed?

2010-07-19 Thread Joe Hansen
Hey All,

I am using Apache Lucene (2.9.1) and it's fast and it works great! I
have a question in connection with Apache PDFBox.

The following command creates a Lucene Document from a PDF file:
Document document =
org.apache.pdfbox.searchengine.lucene.LucenePDFDocument.getDocument(docFile);

The Lucene Document, document, has a bunch of fields. Among those
fields, is a field named, "content". I need to add some more data to
that field. For example, I would like to add some description and
keywords. How do I go about doing that? Any pointers would be greatly
welcome! :)

Thanks for your time!

Regards,
Joe




Re: How to modify a document Field before the document is indexed?

2010-07-19 Thread Koji Sekiguchi

(10/07/20 7:31), Joe Hansen wrote:

> Hey All,
>
> I am using Apache Lucene (2.9.1) and it's fast and it works great! I
> have a question in connection with Apache PDFBox.
>
> The following command creates a Lucene Document from a PDF file:
> Document document =
> org.apache.pdfbox.searchengine.lucene.LucenePDFDocument.getDocument(docFile);
>
> The Lucene Document, document, has a bunch of fields. Among those
> fields, is a field named, "content". I need to add some more data to
> that field. For example, I would like to add some description and
> keywords. How do I go about doing that? Any pointers would be greatly
> welcome! :)
>
> Thanks for your time!
>
> Regards,
> Joe

   

Joe,

You can add your data to the document object:

Document document =
    org.apache.pdfbox.searchengine.lucene.LucenePDFDocument.getDocument(docFile);
// Adding a second field with the same name appends another value to that field.
document.add(new Field("content", "your data", Field.Store.YES, Field.Index.ANALYZED));


http://lucene.apache.org/java/2_9_3/api/all/org/apache/lucene/document/Document.html#add%28org.apache.lucene.document.Fieldable%29

Koji

--
http://www.rondhuit.com/en/





Re: How to modify a document Field before the document is indexed?

2010-07-19 Thread Joe Hansen
Thanks for your reply, Koji! Your suggestion worked fine. I thought that
adding a field named "contents" to a document that already contains
a field named "contents" would NOT do anything. But it looks like
I was wrong!

Thank you for your kind help! :)

Regards,
Joe

On Mon, Jul 19, 2010 at 5:12 PM, Koji Sekiguchi  wrote:
> (10/07/20 7:31), Joe Hansen wrote:
>>
>> Hey All,
>>
>> I am using Apache Lucene (2.9.1) and it's fast and it works great! I
>> have a question in connection with Apache PDFBox.
>>
>> The following command creates a Lucene Document from a PDF file:
>> Document document =
>>
>> org.apache.pdfbox.searchengine.lucene.LucenePDFDocument.getDocument(docFile);
>>
>> The Lucene Document, document, has a bunch of fields. Among those
>> fields, is a field named, "content". I need to add some more data to
>> that field. For example, I would like to add some description and
>> keywords. How do I go about doing that? Any pointers would be greatly
>> welcome! :)
>>
>> Thanks for your time!
>>
>> Regards,
>> Joe
>>
>>
>
> Joe,
>
> You can add your data to the document object:
>
> Document document =
> org.apache.pdfbox.searchengine.lucene.LucenePDFDocument.getDocument(docFile);
> document.add(new Field("content", "your data", Field.Store.YES, Field.Index.ANALYZED));
>
> http://lucene.apache.org/java/2_9_3/api/all/org/apache/lucene/document/Document.html#add%28org.apache.lucene.document.Fieldable%29
>
> Koji
>
> --
> http://www.rondhuit.com/en/
>
>



Re: How to modify a document Field before the document is indexed?

2010-07-19 Thread Erick Erickson
One subtlety you might be able to use to advantage: this is
where getPositionIncrementGap in your analyzer can be used
to separate the two bits of data in the same field. Say you have your
own analyzer (which could be a trivial override of an existing one)
that returns, say, 10,000 from getPositionIncrementGap. Now, if
you wanted to ensure that proximity queries only matched within a single
add to your "content" field, you could specify that all the terms had
to occur within 10,000 of each other...
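
To make the arithmetic concrete, here is a small plain-Java sketch of how
token positions could be laid out when the same field name is added several
times. It assumes whitespace tokenization, a position increment of 1 per
token, and that the gap is added once between consecutive values; the exact
off-by-one may differ between Lucene versions. PositionGapSketch and
positions are hypothetical names, not Lucene API.

```java
import java.util.ArrayList;
import java.util.List;

class PositionGapSketch {

    /**
     * Returns the position of every token across all values of one field,
     * adding the given gap between consecutive non-empty values.
     */
    static List<Integer> positions(List<String> values, int gap) {
        List<Integer> out = new ArrayList<Integer>();
        int next = 0;          // position the next token will receive
        boolean first = true;  // no gap before the first value
        for (String value : values) {
            String trimmed = value.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            if (!first) {
                next += gap;   // jump ahead so values cannot be "near" each other
            }
            first = false;
            int tokenCount = trimmed.split("\\s+").length;
            for (int i = 0; i < tokenCount; i++) {
                out.add(next++);
            }
        }
        return out;
    }
}
```

With a gap of 10,000, the last token of the first value and the first token of
the second value are more than 10,000 positions apart, so a phrase or span
query whose slop stays well below the gap cannot match across the boundary.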

FWIW
Erick

On Mon, Jul 19, 2010 at 7:56 PM, Joe Hansen  wrote:

> Thanks for your reply Koji! Your suggestion worked fine. I thought
> adding a field named "contents" to a document, even though it contains
> a field already named "contents" would NOT do anything. But looks like
> I am wrong!
>
> Thank you for your kind help! :)
>
> Regards,
> Joe
>
> On Mon, Jul 19, 2010 at 5:12 PM, Koji Sekiguchi wrote:
> > (10/07/20 7:31), Joe Hansen wrote:
> >>
> >> Hey All,
> >>
> >> I am using Apache Lucene (2.9.1) and it's fast and it works great! I
> >> have a question in connection with Apache PDFBox.
> >>
> >> The following command creates a Lucene Document from a PDF file:
> >> Document document =
> >> org.apache.pdfbox.searchengine.lucene.LucenePDFDocument.getDocument(docFile);
> >>
> >> The Lucene Document, document, has a bunch of fields. Among those
> >> fields, is a field named, "content". I need to add some more data to
> >> that field. For example, I would like to add some description and
> >> keywords. How do I go about doing that? Any pointers would be greatly
> >> welcome! :)
> >>
> >> Thanks for your time!
> >>
> >> Regards,
> >> Joe
> >>
> >>
> >
> > Joe,
> >
> > You can add your data to the document object:
> >
> > Document document =
> > org.apache.pdfbox.searchengine.lucene.LucenePDFDocument.getDocument(docFile);
> > document.add(new Field("content", "your data", Field.Store.YES, Field.Index.ANALYZED));
> >
> > http://lucene.apache.org/java/2_9_3/api/all/org/apache/lucene/document/Document.html#add%28org.apache.lucene.document.Fieldable%29
> >
> > Koji
> >
> > --
> > http://www.rondhuit.com/en/