Re: Is lucene right for us

2008-10-12 Thread Grant Ingersoll
Lucene should work quite well for this. You'll just need some
infrastructure around it to fetch each file and extract its contents
(see Lucene's Tika project). And yes, Lucene is thread-safe, so you can
index safely from multiple threads as you describe.
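
As a rough illustration of the "infrastructure around it" part, here is a
minimal sketch in plain Python (standard library only, no Lucene calls). It
walks a directory tree and extracts the four fields from the question using
a pool of worker threads. The helper names extract_fields and collect, the
worker count, and the choice of UTF-8 text decoding are all assumptions made
for the example; in a real setup each extracted record would be handed to
Lucene for indexing, and content extraction would go through something like
Tika instead of a plain read.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime, timezone


def extract_fields(path):
    """Return the four fields from the question for one file (hypothetical helper)."""
    stat = os.stat(path)
    # Assumption: files are text; undecodable bytes are replaced, not rejected.
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        content = handle.read()
    return {
        "filename": os.path.basename(path),
        "filesize": stat.st_size,
        "filedate": datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc).isoformat(),
        "filecontent": content,
    }


def collect(root, workers=4):
    """Walk root and extract the fields of every file using worker threads."""
    paths = [
        os.path.join(dirpath, name)
        for dirpath, _dirs, files in os.walk(root)
        for name in files
    ]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as paths.
        return list(pool.map(extract_fields, paths))
```

Calling collect("/some/root", workers=8) would return one dictionary per
file; the path here is only a placeholder.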



On Oct 11, 2008, at 10:22 AM, Mag Gam wrote:


Hello All,

At my university we have over 20,000 small files per directory,
ranging from 20k to 500k each, and we would like to index them. I was
wondering if Lucene is the right tool for this? The information we
would like to keep is: filename, filesize, filedate, filecontent. Also,
is it possible to run the initial indexing in multithreaded mode, since
we are talking about many directories with similar contents?

TIA

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]








--
Grant Ingersoll
Lucene Boot Camp Training Nov. 3-4, 2008, ApacheCon US New Orleans.
http://www.lucenebootcamp.com


Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ













Re: Retrieving Top Terms for a subset of the index (or for all results of a query)

2008-10-12 Thread Grant Ingersoll

How large a subset are you talking about?

You might look at the FilteredTermEnum class, but you will probably
have to do some work to extend it to do what you want.


If you are talking about a smallish subset (say, at most a couple
hundred docs), then I suspect you could store Term Vectors and use the
TermVectorMapper.
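
To make the term-vector idea concrete, here is a small sketch in plain
Python (not Lucene code). It assumes the term vectors of the matching
documents have already been loaded into a dictionary mapping a document id
to its term frequencies, and simply sums those frequencies over the subset.
The function name top_terms and the data layout are hypothetical.

```python
from collections import Counter


def top_terms(term_vectors, doc_ids, n=50):
    """Sum per-document term frequencies over doc_ids and return the n largest.

    term_vectors: {doc_id: {term: frequency}} (assumed layout, for illustration)
    doc_ids: ids of the documents matched by the filter or query
    """
    totals = Counter()
    for doc_id in doc_ids:
        # Counter.update adds the frequencies instead of replacing them.
        totals.update(term_vectors[doc_id])
    return totals.most_common(n)
```

The cost grows with the number of documents in the subset, which is why
this only looks attractive for a few hundred documents at most.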



HTH,
Grant


On Oct 11, 2008, at 6:36 AM, Aleksander M. Stensby wrote:

Hello everyone. I've been fiddling with the idea of retrieving the
top terms from a subset of the index (i.e. the top terms from the
documents retrieved by a given search). This could, for instance, be
useful for identifying the top-ranking terms in a given date span.


It would be something like getting the top 50 terms (like you can do
with Luke), but instead of doing it for the full index, I would like
to do the same procedure after applying a filter or a query. I don't
know if this is a bad explanation or whether it makes any sense at
all...


I really want to avoid iterating over all results (obviously),
so my question is whether there is a preferred approach for doing
such analysis. Has this been done in a good way before?


Thanks for any help!

Best regards,
Aleksander

--
Aleksander M. Stensby
Senior Software Developer
Integrasco A/S
+47 41 22 82 72
[EMAIL PROTECTED]







Re: Access Scoring Values of Lucene for Post-Processing

2008-10-12 Thread Grant Ingersoll
Have a look at the o.a.lucene.search.function package and the
ValueSourceQuery.  You will probably be able to factor in those pieces
during scoring, so there is no need to re-sort afterwards at all.
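
For illustration only, this is the kind of arithmetic being discussed,
written as a plain Python sketch rather than Lucene code. The formula (text
score multiplied by one plus a weighted margin), the weight, and the names
combined_score and rank are assumptions made up for the example; the point
of the function package is that a comparable combination can be applied
while scoring, so the separate re-sort step shown here goes away.

```python
def combined_score(text_score, margin, weight=0.5):
    """Blend a relevance score with a business value (arbitrary example formula)."""
    return text_score * (1.0 + weight * margin)


def rank(hits, margins, weight=0.5):
    """Re-rank hits by the combined score, highest first.

    hits: list of (product_id, text_score) pairs
    margins: {product_id: margin}; products without an entry count as 0.0
    """
    scored = [
        (product_id, combined_score(score, margins.get(product_id, 0.0), weight))
        for product_id, score in hits
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```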


-Grant

On Oct 8, 2008, at 11:15 AM, excitingComm2 wrote:



Hi everybody,

I am using Lucene for searching items in an online shop. E.g., when I
search the shop for "shirt", I get a result set from Lucene. Now I want
to improve the sort order by combining the Lucene score with my business
data, e.g. sales or margin. Is there any way to get Lucene's scoring
value, so that I can put it into my own formula and re-sort the
products?

http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc//org/apache/lucene/search/Hit.html
The method getScore() sounds great, but is unfortunately marked as
deprecated.

Regards,
ExComm2





Searching sets of documents

2008-10-12 Thread spring
Hi,

I want to search for sets of documents. For instance, I index some
folders with documents in them, and now I want to find matching folders
rather than individual documents.

Sample:

folder A
  doc 1, contains X, Y
  doc 2, contains Y, Z

folder B
  doc 3, contains X, Y
  doc 4, contains A, Z

Now I want to find all folders which match "A AND Y" -> folder B.

How can this be done?

Thank you






Detecting why a collection of documents matched a query

2008-10-12 Thread Khawaja Shams
Hello, I noticed that the IndexSearcher.explain() method is not meant to
be run for a large collection of documents, so I am looking for an
alternative that just explains why a document matched, without all the
scoring information. Basically, I would like to know which field of the
document was responsible for getting it included in the results, so I can
give users some indication of what matched. We present the results 100
documents at a time. I would appreciate any ideas or directions towards
an implementation.


Thanks!


Enumerating all the terms of a particular field

2008-10-12 Thread Khawaja Shams
Hello, how can I get a list of all the terms for a particular field? Is
the right approach to extend FilteredTermEnum?

Thanks!!


Re: Searching sets of documents

2008-10-12 Thread 叶双明
For all folders which match "A AND Y": are you searching on file names?
If so, A and Y in "A AND Y" are just strings too, so you can do it this
way: construct one Lucene Document per folder, and use the names of the
files under that folder as its searchable data.
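
A small sketch of the one-document-per-folder idea, in plain Python rather
than Lucene code. It assumes the searchable data of a folder is the union
of the terms of all documents in it (the sample from the question), and the
names build_folder_index and search_all are hypothetical.

```python
def build_folder_index(folders):
    """Collapse each folder into a single set of terms.

    folders: {folder_name: {doc_name: [terms]}} (assumed layout, for illustration)
    """
    return {
        folder: set().union(*docs.values())
        for folder, docs in folders.items()
    }


def search_all(index, required):
    """Return the folders whose combined terms contain every required term."""
    required = set(required)
    return sorted(folder for folder, terms in index.items() if required <= terms)


# The sample from the question.
sample = {
    "folder A": {"doc 1": ["X", "Y"], "doc 2": ["Y", "Z"]},
    "folder B": {"doc 3": ["X", "Y"], "doc 4": ["A", "Z"]},
}
folder_index = build_folder_index(sample)
matches = search_all(folder_index, ["A", "Y"])
```

Because the terms of all documents in a folder are merged, "A AND Y"
matches folder B even though A and Y occur in different documents there.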

2008/10/13 <[EMAIL PROTECTED]>

> Hi,
>
> I want to search for sets of documents. For instance I index some folders
> with documents in it and now I do not want to find certain documents but
> folders.
>
> Sample:
>
> folder A
>  doc 1, contains X, Y
>  doc 2, contains Y, Z
>
> folder B
>  doc 3, contains X, Y
>  doc 4, contains A, Z
>
> Now I want to find all folders which match "A AND Y" -> folder B.
>
> How can this be done?
>
> Thank you
>
>
>


-- 
Sorry for my English!! 明
Please help me correct my English expression and error in syntax


Re: Enumerating all the terms of a particular field

2008-10-12 Thread Chris Hostetter

Someone just asked this question a week ago (unfortunately they asked it on
the wrong list)...

http://www.nabble.com/Can-I-filter-the-results-returned-by-IndexReader.terms%28field%29-using-a-field--to19849593.html#a19849593

: Subject: Enumerating all the terms of a particular field



-Hoss





Re: bunch of newbie queries, PS

2008-10-12 Thread Chris Hostetter

: the "anonymous" SVN (http://svn.apache.org/repos/asf/lucene/java/trunk/)
: does not work for me (I am using Eclipse 3.3, and have the subversion
: plug-in, v. 1.2.4, and have successfully checked out code using SVN from
: other repositories). Apparently here I need a user-id and pwd -- what is
: that or where do I get one?

i'm not sure why you would be having problems with that ... i can't speak 
for Eclipse, but I just double checked and it's definitely allowing 
anonymous checkout from the command line.  can you try that and see if it 
works for you? (perhaps it's an issue with the server running subversion 
1.5 and your plugin only working with 1.4 ?)


: Allowing for the explanation below ("preserving history"), it seems like 
: there may not be a way to do what I had hoped for. Here's an example: I 
: poke around, looking for 2.2; I get to here: 
: http://lucene.apache.org/java/2_2_0/releases.html
: 
: OK, cool, now I click on ==>> Both binary and source releases are 
: available for download from the Apache Mirrors

Hmm ... this is actually the generic wording we currently use -- that page 
provides generic info on "how to get official releases"  ... nothing about 
that link (or that page) suggests that it will take you directly to a 
specific version of Lucene.  The fact that the URL has 2_2_0 in it is just 
an indicator that you are looking at the version of releases.html that was 
included in 2.2.0.

If you can suggest better wording to make it clear to novice users that
the page is *general* info about Lucene-Java Downloads, and not specific
to any one version, i'm certainly interested.

: Maybe the closest one could get is to rephrase (from now on) the 
: sentence/link above, to read something like this:
: 
: ==>> Both binary and source releases, for the current 
: version, are available for download from the Apache Mirrors

But that statement wouldn't be true: older versions are in fact
available from the mirrors.  Perhaps the most straightforward way to help
people in a similar situation in the future would be to make the archive
sub directory more prominent ... i'll try to figure out where that
README.html lives and update it with some more helpful verbiage.



-Hoss

