Re: Is lucene right for us
Lucene should work quite well for this. You'll just need some infrastructure around it to fetch each file and extract its contents (see Lucene's Tika project). And yes, Lucene is thread-safe, so you can index safely as you describe.

On Oct 11, 2008, at 10:22 AM, Mag Gam wrote:

> Hello All,
>
> At my university we have over 20,000 small files, ranging from 20k to 500k, per directory and we would like to index them. I was wondering if Lucene is the right tool for this?
>
> The information we would like to keep is: filename, filesize, filedate, filecontent.
>
> Also, is it possible to run the initial index in multithreaded mode, since we are talking about many directories with similar contents?
>
> TIA
>
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]

--
Grant Ingersoll
Lucene Boot Camp Training Nov. 3-4, 2008, ApacheCon US New Orleans.
http://www.lucenebootcamp.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ
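A minimal sketch of the "one shared, thread-safe index, one task per directory" pattern described in this reply. It uses only the Java standard library: SharedIndex and ParallelIndexer are hypothetical stand-ins, not Lucene classes, and the file contents are faked rather than read from disk or extracted with Tika.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for a thread-safe index writer; this is NOT a Lucene class.
class SharedIndex {
    private final List<String[]> docs = new ArrayList<String[]>();

    // Stores one entry per file: filename, filesize, filedate, filecontent.
    synchronized void add(String name, long size, String date, String content) {
        docs.add(new String[] { name, Long.toString(size), date, content });
    }

    synchronized int size() {
        return docs.size();
    }
}

class ParallelIndexer {
    // Submits each directory as its own task; all tasks write to one shared index.
    static int indexAll(List<List<String>> directories, int threads) {
        final SharedIndex index = new SharedIndex();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (final List<String> files : directories) {
            pool.execute(new Runnable() {
                public void run() {
                    for (String file : files) {
                        // A real indexer would read the file and extract its text here.
                        index.add(file, 0L, "unknown", "contents of " + file);
                    }
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("indexing did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while indexing", e);
        }
        return index.size();
    }

    public static void main(String[] args) {
        List<List<String>> dirs = new ArrayList<List<String>>();
        dirs.add(Arrays.asList("a/1.txt", "a/2.txt"));
        dirs.add(Arrays.asList("b/3.txt"));
        System.out.println(indexAll(dirs, 2) + " files indexed");
    }
}
```

The point of the design is that the worker threads never coordinate with each other; they only share the writer, which is responsible for its own synchronization.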
Re: Retrieving Top Terms for a subset of the index (or for all results of a query)
How large a subset are you talking about? You might look at the FilteredTermEnum class, but you will probably have to do some work to extend it to do what you want.

If you are talking about a smallish subset (say, at most a couple hundred docs), then I suspect you could store Term Vectors and use the TermVectorMapper.

HTH,
Grant

On Oct 11, 2008, at 6:36 AM, Aleksander M. Stensby wrote:

> Hello everyone.
>
> I've been fiddling with the idea of retrieving the top terms from a subset of the index (i.e. top terms from the documents retrieved by a given search). This could, for instance, be useful to identify top-ranking terms in a given date span etc. It would be something like getting the top 50 terms (like you can do with Luke), but instead of doing it for the full index, I would like to do the same procedure after applying a filter or a query. Don't know if this is a bad explanation or whether it makes any sense at all...
>
> So, I really want to avoid iterating over all results (obviously), so my question is really whether there is a preferred approach for doing such analysis, or whether this has been done in a good way before?
>
> Thanks for any help!
>
> Best regards,
> Aleksander
>
> --
> Aleksander M. Stensby
> Senior Software Developer
> Integrasco A/S
> +47 41 22 82 72
> [EMAIL PROTECTED]
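A rough sketch of the term-vector idea for a smallish result set: sum the per-document term frequencies of the matching documents and keep the N largest. This is plain Java, not Lucene API. The input shape (one term-to-frequency map per matching document) and the name SubsetTopTerms are assumptions made for illustration; in Lucene those maps would come from the stored term vectors.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class SubsetTopTerms {
    // termVectors: one term -> frequency map per matching document (hypothetical input shape).
    static List<String> topTerms(List<Map<String, Integer>> termVectors, int n) {
        final Map<String, Integer> totals = new HashMap<String, Integer>();
        for (Map<String, Integer> vector : termVectors) {
            for (Map.Entry<String, Integer> e : vector.entrySet()) {
                Integer old = totals.get(e.getKey());
                totals.put(e.getKey(), (old == null ? 0 : old) + e.getValue());
            }
        }
        List<String> terms = new ArrayList<String>(totals.keySet());
        // Highest total frequency first; ties are broken alphabetically for stable output.
        Collections.sort(terms, new Comparator<String>() {
            public int compare(String a, String b) {
                int byCount = totals.get(b).compareTo(totals.get(a));
                return byCount != 0 ? byCount : a.compareTo(b);
            }
        });
        return new ArrayList<String>(terms.subList(0, Math.min(n, terms.size())));
    }

    public static void main(String[] args) {
        Map<String, Integer> doc1 = new HashMap<String, Integer>();
        doc1.put("lucene", 3);
        doc1.put("index", 1);
        Map<String, Integer> doc2 = new HashMap<String, Integer>();
        doc2.put("lucene", 1);
        doc2.put("query", 2);
        List<Map<String, Integer>> hits = new ArrayList<Map<String, Integer>>();
        hits.add(doc1);
        hits.add(doc2);
        System.out.println(topTerms(hits, 2));
    }
}
```

The cost grows with the number of matching documents, which is why this approach only suits subsets of a few hundred documents.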
Re: Access Scoring Values of Lucene for Post-Processing
Have a look at the o.a.lucene.search.function package and the ValueSourceQuery. You will probably be able to factor in those pieces during scoring, so there is no need to re-sort at all.

-Grant

On Oct 8, 2008, at 11:15 AM, excitingComm2 wrote:

> Hi everybody,
>
> I am using Lucene for searching items in an online shop. E.g., when I search the shop for "shirt", I get a result set from Lucene. Now I want to improve the sort order by combining the Lucene score with my business data, e.g. sales or margin.
>
> Is there any possibility to get the scoring value of Lucene, so that I can put it into my own formula and re-sort the products?
>
> http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc//org/apache/lucene/search/Hit.html
>
> The method getScore() sounds great, but is unfortunately marked as deprecated.
>
> Regards,
> ExComm2
>
> --
> View this message in context: http://www.nabble.com/Access-Scoring-Values-of-Lucene-for-Post-Processing-tp19880927p19880927.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
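To make the arithmetic concrete, here is a small plain-Java sketch of combining a text relevance score with a business factor. It assumes a simple multiplicative combination, which is only one possible formula. ShopHit and BusinessRanker are hypothetical names, not Lucene classes. The sketch sorts a list only to show the resulting order; with the function-query approach suggested above, the same combination would happen during scoring, so no second sort would be needed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical hit: a text relevance score plus a business factor such as sales or margin.
class ShopHit {
    final String name;
    final float textScore;
    final float businessFactor;

    ShopHit(String name, float textScore, float businessFactor) {
        this.name = name;
        this.textScore = textScore;
        this.businessFactor = businessFactor;
    }

    // Assumed formula: relevance multiplied by the business factor.
    float combined() {
        return textScore * businessFactor;
    }
}

class BusinessRanker {
    // Returns product names ordered by combined score, highest first.
    static List<String> rank(List<ShopHit> hits) {
        List<ShopHit> sorted = new ArrayList<ShopHit>(hits);
        Collections.sort(sorted, new Comparator<ShopHit>() {
            public int compare(ShopHit a, ShopHit b) {
                return Float.compare(b.combined(), a.combined());
            }
        });
        List<String> names = new ArrayList<String>();
        for (ShopHit hit : sorted) {
            names.add(hit.name);
        }
        return names;
    }

    public static void main(String[] args) {
        List<ShopHit> hits = Arrays.asList(
                new ShopHit("plain shirt", 0.9f, 1.0f),
                new ShopHit("bestselling shirt", 0.6f, 2.0f));
        System.out.println(rank(hits));
    }
}
```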
Searching sets of documents
Hi,

I want to search for sets of documents. For instance, I index some folders with documents in them, and now I want to find not individual documents but folders.

Sample:

folder A
  doc 1, contains X, Y
  doc 2, contains Y, Z

folder B
  doc 3, contains X, Y
  doc 4, contains A, Z

Now I want to find all folders which match "A AND Y" -> folder B.

How can this be done?

Thank you
Detecting why a collection of documents matched a query
Hello, I noticed that the indexSearcher.explain() method is not supposed to be run for a large collection of documents, so I am looking for an alternative that just explains why a document matched, without all the scoring information. Basically, I would like to know which field of the document was responsible for getting it included in the results, so I can give users some indication of what matched. We present the results 100 documents at a time. I would appreciate any ideas or directions towards implementation. Thanks!
Enumerating all the terms of a particular field
Hello, How can I get a list of all the terms for a particular field? Is the right approach to extend FilteredTermEnum? Thanks!!
Re: Searching sets of documents
When you say all folders which match "A AND Y", are you searching on file names? If so, A and Y in "A AND Y" are just strings too, so you can do it this way: construct one Lucene Document per folder, and make the names of the files under that folder its searchable data.

2008/10/13 <[EMAIL PROTECTED]>

> Hi,
>
> I want to search for sets of documents. For instance I index some folders
> with documents in it and now I do not want to find certain documents but
> folders.
>
> Sample:
>
> folder A
> doc 1, contains X, Y
> doc 2, contains Y, Z
>
> folder B
> doc 3, contains X, Y
> doc 4, contains A, Z
>
> Now I want to find all folders which match "A AND Y" -> folder B.
>
> How can this be done?
>
> Thank you

--
Sorry for my English!! 明
Please help me correct my English expression and error in syntax
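The one-document-per-folder idea can be sketched in plain Java as follows. FolderIndex is a hypothetical class, not Lucene API: it merges the terms of every document in a folder into a single set, so an AND query is satisfied when the folder as a whole contains every required term. The terms may be file names, as in this reply, or content terms, as in the quoted sample.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class FolderIndex {
    // One merged term set per folder, playing the role of one indexed document per folder.
    private final Map<String, Set<String>> termsByFolder = new TreeMap<String, Set<String>>();

    // Adds the terms of a single document to its folder's merged term set.
    void addDocument(String folder, String... terms) {
        Set<String> merged = termsByFolder.get(folder);
        if (merged == null) {
            merged = new TreeSet<String>();
            termsByFolder.put(folder, merged);
        }
        merged.addAll(Arrays.asList(terms));
    }

    // Returns, in name order, the folders whose merged terms contain every required term.
    List<String> searchAll(String... required) {
        List<String> matches = new ArrayList<String>();
        for (Map.Entry<String, Set<String>> entry : termsByFolder.entrySet()) {
            if (entry.getValue().containsAll(Arrays.asList(required))) {
                matches.add(entry.getKey());
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        FolderIndex index = new FolderIndex();
        index.addDocument("folder A", "X", "Y"); // doc 1
        index.addDocument("folder A", "Y", "Z"); // doc 2
        index.addDocument("folder B", "X", "Y"); // doc 3
        index.addDocument("folder B", "A", "Z"); // doc 4
        System.out.println(index.searchAll("A", "Y"));
    }
}
```

Note the consequence of merging: a folder matches even when the required terms come from different documents inside it, which is exactly what the "A AND Y" example asks for.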
Re: Enumerating all the terms of a particular field
Someone asked this same question just a week ago (unfortunately they asked it on the wrong list)...

http://www.nabble.com/Can-I-filter-the-results-returned-by-IndexReader.terms%28field%29-using-a-field--to19849593.html#a19849593

: Subject: Enumerating all the terms of a particular field

-Hoss
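The title of the linked thread mentions IndexReader.terms(field). As a toy model of how such an enumeration can work, the sketch below assumes a term dictionary kept sorted by field name and then by term text: seek to the first possible term of the field and keep reading until the field changes. FieldTerm and TermDictionary are hypothetical plain-Java classes, not Lucene API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// A (field, text) pair ordered by field first, then by text.
class FieldTerm implements Comparable<FieldTerm> {
    final String field;
    final String text;

    FieldTerm(String field, String text) {
        this.field = field;
        this.text = text;
    }

    public int compareTo(FieldTerm other) {
        int byField = field.compareTo(other.field);
        return byField != 0 ? byField : text.compareTo(other.text);
    }
}

class TermDictionary {
    private final TreeSet<FieldTerm> terms = new TreeSet<FieldTerm>();

    void add(String field, String text) {
        terms.add(new FieldTerm(field, text));
    }

    // Seeks to the first term of the field, then stops as soon as the field changes.
    List<String> termsOf(String field) {
        List<String> result = new ArrayList<String>();
        for (FieldTerm term : terms.tailSet(new FieldTerm(field, ""))) {
            if (!term.field.equals(field)) {
                break;
            }
            result.add(term.text);
        }
        return result;
    }

    public static void main(String[] args) {
        TermDictionary dictionary = new TermDictionary();
        dictionary.add("title", "banana");
        dictionary.add("body", "zebra");
        dictionary.add("title", "apple");
        System.out.println(dictionary.termsOf("title"));
    }
}
```

Because the dictionary is sorted, the loop touches only the terms of the requested field plus one extra term, regardless of how many other fields exist.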
Re: bunch of newbie queries, PS
: the "anonymous" SVN (http://svn.apache.org/repos/asf/lucene/java/trunk/)
: does not work for me (I am using Eclipse 3.3, and have the subversion plug-in, v.
: 1.2.4, and have successfully checked out code using SVN from other repositories).
: Apparently here I need a user-id and pwd -- what is that or where do I get one?

I'm not sure why you would be having problems with that ... I can't speak for Eclipse, but I just double-checked and it's definitely allowing anonymous checkout from the command line. Can you try that and see if it works for you? (Perhaps it's an issue with the server running Subversion 1.5 and your plugin only working with 1.4?)

: Allowing for the explanation below ("preserving history"), it seems like
: there may not be a way to do what I had hoped for. Here's an example: I
: poke around, looking for 2.2; I get to here:
: http://lucene.apache.org/java/2_2_0/releases.html
:
: OK, cool, now I click on ==>> Both binary and source releases are
: available for download from the Apache Mirrors

Hmm ... this is actually the generic wording we currently use -- that page provides generic info on "how to get official releases" ... nothing about that link (or that page) suggests that it will take you directly to a specific version of Lucene. The fact that the URL has 2_2_0 in it is just an indicator that you are looking at the version of releases.html that was included in 2.2.0.

If you can suggest better wording to make it clear to novice users that the page is *general* info about Lucene-Java downloads, and not specific to any one version, I'm certainly interested.

: Maybe the closest one could get is to rephrase (from now on) the
: sentence/link above, to read something like this:
:
: ==>> Both binary and source releases, for the current
: version, are available for download from the Apache Mirrors

But that statement wouldn't be true: older versions are in fact available from the mirrors.

Perhaps the most straightforward way to help people in a similar situation in the future would be to make the archive subdirectory more prominent ... I'll try to figure out where that README.html lives and update it with some more helpful verbiage.

-Hoss