Re: lucene 2.9 sorting algorithm

2009-10-15 Thread Jake Mannix
I had to dig through the source code (actually, walk through a unit test, because that was simpler to see what was going on in the 2.9 sorting), but I think John's way has slightly lower complexity in the balanced segment size case. On Wed, Oct 14, 2009 at 8:57 PM, Yonik Seeley wrote: > Interesti

search trough single pdf document - return page number

2009-10-15 Thread IvanDrago
Hi, I have to search a single PDF document for a requested string and, if that string is found, return the page number where it was found. The requested string can be anything in the PDF document. It is a big document (about 5000 pages), so I'm asking if that is possible with Lucene. I'm

Re: lucene 2.9 sorting algorithm

2009-10-15 Thread Michael McCandless
If I'm remembering it right... this (matching MultiSearcher's approach) was nearly the first thing we tried with LUCENE-1483. But the CPU cost was higher in our tests. I think we had tested unbalanced and balanced segments, but memory is definitely somewhat hazy at this point... I suspect even in

Re: search trough single pdf document - return page number

2009-10-15 Thread Erick Erickson
It depends (tm). Do you want to permanently index this content and search it multiple times, or is each search a one-off? If the latter, I'd look for packages specific to handling PDF files. Although, since Reader takes forever to search a document, I suspect there's not much joy there. If you wan

How to set boost for a certain term in a query?

2009-10-15 Thread Chuan
For example, I want the term 'sport' to have more impact on the final rank. Thanks in advance. Chuan

Re: How to set boost for a certain term in a query?

2009-10-15 Thread Erick Erickson
This question is better posted on the user list, but the short answer is to use boosting. On Thu, Oct 15, 2009 at 10:08 AM, Chuan wrote: > > For example, I want the term 'sport' to have more impact on the final rank. > Thanks in advance. > > Chuan > -- > View this message in context: > http://ww

Re: How to set boost for a certain term in a query?

2009-10-15 Thread Anshum Gupta
Hi chuan, It'd make a better question at java user list. This one is meant for lucene core dev. --Original Message-- From: Chuan To: java-dev@lucene.apache.org ReplyTo: java-dev@lucene.apache.org Subject: How to set boost for a certain term in a query? Sent: Oct 15, 2009 19:38 For examp

Re: search trough single pdf document - return page number

2009-10-15 Thread IvanDrago
Thanks for the reply, Erick. I would like to permanently index this content and search it multiple times for different terms, so I want a permanent copy. My problem is that I don't know how to retrieve the page number where the searched string was found, so if

Re: search trough single pdf document - return page number

2009-10-15 Thread Robert Muir
if you just have a single pdf document (it seems from the subject line this is the case), and you want to retrieve pages, maybe consider splitting the PDF into single pages. there is some functionality in pdfbox to do this. then index each page as a single lucene document (so you will have 5000 l
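The one-document-per-page idea above can be illustrated without any library. What follows is only a rough sketch under stated assumptions: it uses a plain in-memory map rather than Lucene or PDFBox, the class and method names (`PageIndexSketch`, `addPage`, `pagesContaining`) are hypothetical, and the page text is assumed to have been extracted already by some other tool. It shows why indexing each page as its own unit makes the page number fall out of the search result directly.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical stand-in for "one index document per PDF page".
class PageIndexSketch {
    // term -> sorted set of 1-based page numbers whose text contains the term
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    /** Index the already-extracted text of one page under its page number. */
    void addPage(int pageNumber, String text) {
        // Very naive tokenizer: lowercase, split on anything that is not a letter or digit.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, k -> new TreeSet<>()).add(pageNumber);
            }
        }
    }

    /** Return the page numbers containing the single term, in ascending order. */
    SortedSet<Integer> pagesContaining(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT),
                                     Collections.emptySortedSet());
    }

    public static void main(String[] args) {
        PageIndexSketch index = new PageIndexSketch();
        index.addPage(1, "Introduction to searching");
        index.addPage(2, "Sorting and searching in practice");
        System.out.println(index.pagesContaining("searching"));
    }
}
```

With a real index the same shape applies: each page becomes one document carrying a stored page-number field, and a hit on that document identifies the page.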

Re: lucene 2.9 sorting algorithm

2009-10-15 Thread Yonik Seeley
On Thu, Oct 15, 2009 at 4:31 AM, Jake Mannix wrote: >> Conversion from one segment to another is only >> done as needed... only the bottom slot is converted automatically when >> the segment is switched. > > That's not what it looks like, actually: you convert the bottom slot, and > as soon as you

Re: search trough single pdf document - return page number

2009-10-15 Thread Erick Erickson
Your search would be on the "contents" field if you use LucenePDFDocument. But on a quick look, LucenePDFDocument doesn't give you any page information. So, you'd have to collect that somehow, but I don't see a clear way to. Doing it manually, you could do something like: Document doc = new Docu

Re: lucene 2.9 sorting algorithm

2009-10-15 Thread Yonik Seeley
On Thu, Oct 15, 2009 at 11:53 AM, Yonik Seeley wrote: > And it seems like a PQ per segment simply delays many of the slow > lookups until the end where the PQs must be merged. Actually, I'm wrong about that part - one can simply merge on values... there will be lots of string comparisons (and a n

Re: lucene 2.9 sorting algorithm

2009-10-15 Thread Jake Mannix
On Thu, Oct 15, 2009 at 9:12 AM, Yonik Seeley wrote: > On Thu, Oct 15, 2009 at 11:53 AM, Yonik Seeley > wrote: > > And it seems like a PQ per segment simply delays many of the slow > > lookups until the end where the PQs must be merged. > > Actually, I'm wrong about that part - one can simply mer

Re: lucene 2.9 sorting algorithm

2009-10-15 Thread Jake Mannix
On Thu, Oct 15, 2009 at 3:12 AM, Michael McCandless < luc...@mikemccandless.com> wrote: > If I remembering it right... this (matching MultiSearcher's approach) > was nearly the first thing we tried with LUCENE-1483. But the CPU > cost was higher in our tests. I think we had tested unbalanced and

Re: lucene 2.9 sorting algorithm

2009-10-15 Thread John Wang
Hi guys: I did some Big O math a few times and reached the same conclusion Jake had. I was not sure about the code tuning opportunities we could have done with the MergeAtTheEnd method as Yonik mentioned and the internal behavior with PQ Mike suggested, so I went ahead and implemented the
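For readers following the thread, the "one priority queue per segment, merge at the end" approach under discussion can be sketched roughly as below. This is not John's code or Lucene's collector: the names (`MultiQueueSortSketch`, `Hit`, `topN`) are hypothetical, segments are modelled as plain `String[]` arrays of sort values, and every comparison is a direct string comparison, so the ordinal-based comparisons and slot conversions debated in the thread are not modelled. The final step simply sorts the surviving candidates by value instead of doing a heap-based merge.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

class MultiQueueSortSketch {
    /** One hit: a global doc id plus its sort value. */
    static final class Hit {
        final int doc;
        final String value;
        Hit(int doc, String value) { this.doc = doc; this.value = value; }
    }

    // Ascending by value, ties broken by doc id.
    static final Comparator<Hit> ORDER =
        Comparator.comparing((Hit h) -> h.value).thenComparingInt(h -> h.doc);

    /** Keep the top n hits of each segment in its own queue, then merge on values. */
    static List<Hit> topN(List<String[]> segments, int n) {
        List<Hit> candidates = new ArrayList<>();
        if (n <= 0) {
            return candidates;
        }
        int docBase = 0;
        for (String[] segment : segments) {
            // Reversed order puts the worst kept hit on top, so it is cheap to evict.
            PriorityQueue<Hit> pq = new PriorityQueue<>(ORDER.reversed());
            for (int i = 0; i < segment.length; i++) {
                Hit hit = new Hit(docBase + i, segment[i]);
                if (pq.size() < n) {
                    pq.add(hit);
                } else if (ORDER.compare(hit, pq.peek()) < 0) {
                    pq.poll();
                    pq.add(hit);
                }
            }
            candidates.addAll(pq);
            docBase += segment.length;
        }
        // Merge step: at most n hits per segment survive, so this sort is small.
        candidates.sort(ORDER);
        return new ArrayList<>(candidates.subList(0, Math.min(n, candidates.size())));
    }
}
```

The merge touches at most n hits per segment, which is why its cost depends on the number of segments and the queue size rather than on the total number of matching documents.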

[jira] Created: (LUCENE-1984) DisjunctionMaxQuery - Type safety

2009-10-15 Thread Kay Kay (JIRA)
DisjunctionMaxQuery - Type safety --- Key: LUCENE-1984 URL: https://issues.apache.org/jira/browse/LUCENE-1984 Project: Lucene - Java Issue Type: Improvement Affects Versions: 2.9 Reporter: Ka

[jira] Updated: (LUCENE-1984) DisjunctionMaxQuery - Type safety

2009-10-15 Thread Kay Kay (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Kay updated LUCENE-1984: Attachment: LUCENE-1984.patch > DisjunctionMaxQuery - Type safety > ---

[jira] Updated: (LUCENE-1984) DisjunctionMaxQuery - Type safety

2009-10-15 Thread Uwe Schindler (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-1984: -- Component/s: Query/Scoring Fix Version/s: 3.0 Assignee: Uwe Schindler We are co

[jira] Commented: (LUCENE-1984) DisjunctionMaxQuery - Type safety

2009-10-15 Thread Kay Kay (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12766249#action_12766249 ] Kay Kay commented on LUCENE-1984: - Great - Thanks. For the sake of continuity - which br

[jira] Commented: (LUCENE-1984) DisjunctionMaxQuery - Type safety

2009-10-15 Thread Uwe Schindler (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12766251#action_12766251 ] Uwe Schindler commented on LUCENE-1984: --- 3.0 is currently trunk. 2.9 is a branch. P

Re: lucene 2.9 sorting algorithm

2009-10-15 Thread Michael McCandless
Nice results! Comments below... On Thu, Oct 15, 2009 at 3:58 PM, John Wang wrote: > Hi guys: > >     I did some Big O math a few times and reached the same conclusion Jake > had. > >     I was not sure about the code tuning opportunities we could have done > with the MergeAtTheEnd method as Yoni

Re: lucene 2.9 sorting algorithm

2009-10-15 Thread Michael McCandless
On Thu, Oct 15, 2009 at 3:52 PM, Jake Mannix wrote: > > On Thu, Oct 15, 2009 at 3:12 AM, Michael McCandless > wrote: >> >> If I remembering it right... this (matching MultiSearcher's approach) >> was nearly the first thing we tried with LUCENE-1483.  But the CPU >> cost was higher in our tests.  

Re: lucene 2.9 sorting algorithm

2009-10-15 Thread Jake Mannix
On Thu, Oct 15, 2009 at 2:12 PM, Michael McCandless < luc...@mikemccandless.com> wrote: > Nice results! Comments below... > > > Here are the numbers (times are measured in nanoseconds): > > > > numHits = 50: > > > > Lucene 2.9/OneComparatorNonScoringCollector: > > num string compares: 251 > > num

Re: lucene 2.9 sorting algorithm

2009-10-15 Thread Michael McCandless
On Thu, Oct 15, 2009 at 5:51 PM, Jake Mannix wrote: > > > On Thu, Oct 15, 2009 at 2:12 PM, Michael McCandless > wrote: >> >> Nice results!  Comments below... >> >> > Here are the numbers (times are measured in nanoseconds): >> > >> > numHits = 50: >> > >> > Lucene 2.9/OneComparatorNonScoringColle

Re: lucene 2.9 sorting algorithm

2009-10-15 Thread Jake Mannix
On Thu, Oct 15, 2009 at 2:33 PM, Michael McCandless < luc...@mikemccandless.com> wrote: > > I don't think we do any branch tuning on the PQ insertion -- the ifs > involved in re-heapifying the PQ are simply hard for the CPU to > predict (though, apparently, not as hard as comparing strings ;). >

Re: lucene 2.9 sorting algorithm

2009-10-15 Thread John Wang
Hi Mike: Here are the results for numHits = 10:
Lucene 2.9:
num string compares: 86
num conversions: 21
num inserts: 115
time: 15069705
cpu: 174294
my test sort:
num string compares: 49
num conversions: 0
num inserts: 778
time: 14665375
cpu: 156442
This is how the test data is indexed

Re: lucene 2.9 sorting algorithm

2009-10-15 Thread Yonik Seeley
On Thu, Oct 15, 2009 at 5:33 PM, Michael McCandless wrote: > Though it'd be odd if the switch to searching by segment > really was most of the gains here. I had assumed that much of the improvement was due to ditching MultiTermEnum/MultiTermDocs. Note that LUCENE-1483 was before LUCENE-1596... bu

Re: lucene 2.9 sorting algorithm

2009-10-15 Thread John Wang
Numbers Mike requested for Int types: only the time/cputime are posted, others are all the same since the algorithm is the same.
Lucene 2.9:
numhits: 10 time: 14619495 cpu: 146126
numhits: 20 time: 14550568 cpu: 163242
numhits: 100 time: 16467647 cpu: 178379
my test:
numHits: 10 time: 1410109

Re: lucene 2.9 sorting algorithm

2009-10-15 Thread John Wang
BTW, we have a little sandbox for these experiments, and all my test code is there. It is not very polished. https://lucene-book.googlecode.com/svn/trunk -John On Thu, Oct 15, 2009 at 3:29 PM, John Wang wrote: > Numbers Mike requested for Int types: > > only the time/cputime are posted, ot

Re: lucene 2.9 sorting algorithm

2009-10-15 Thread Michael McCandless
On Thu, Oct 15, 2009 at 5:59 PM, Jake Mannix wrote: >> I don't think we do any branch tuning on the PQ insertion -- the ifs >> involved in re-heapifying the PQ are simply hard for the CPU to >> predict (though, apparently, not as hard as comparing strings ;). > > But it does look like you do some

[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing

2009-10-15 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12766339#action_12766339 ] Michael McCandless commented on LUCENE-1458: I just committed some small impro

Re: lucene 2.9 sorting algorithm

2009-10-15 Thread Michael McCandless
On Thu, Oct 15, 2009 at 6:04 PM, Yonik Seeley wrote: > On Thu, Oct 15, 2009 at 5:33 PM, Michael McCandless > wrote: >> Though it'd be odd if the switch to searching by segment >> really was most of the gains here. > > I had assumed that much of the improvement was due to ditching > MultiTermEnum/

Re: lucene 2.9 sorting algorithm

2009-10-15 Thread Michael McCandless
OK, thanks for running these. It looks like the gains are holding up across smaller queue sizes, and for ints. Though, it's odd that sorting w/ ints is also faster; I'd expect the single PQ to win there. Mike On Thu, Oct 15, 2009 at 6:29 PM, John Wang wrote: > Numbers Mike requested for Int ty

Re: lucene 2.9 sorting algorithm

2009-10-15 Thread Michael McCandless
John, looks like this requires login -- any plans to open that up, or, post the code on an issue? How self-contained is your Multi PQ sorting? EG is it a standalone Collector impl that I can test? Mike On Thu, Oct 15, 2009 at 6:33 PM, John Wang wrote: > BTW, we are have a little sandbox for th

[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing

2009-10-15 Thread Mark Miller (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12766350#action_12766350 ] Mark Miller commented on LUCENE-1458: - {quote}// nocommit -- why scanCnt > 1?

[jira] Issue Comment Edited: (LUCENE-1458) Further steps towards flexible indexing

2009-10-15 Thread Mark Miller (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12766350#action_12766350 ] Mark Miller edited comment on LUCENE-1458 at 10/15/09 5:41 PM: -

[jira] Issue Comment Edited: (LUCENE-1458) Further steps towards flexible indexing

2009-10-15 Thread Mark Miller (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12766359#action_12766359 ] Mark Miller edited comment on LUCENE-1458 at 10/15/09 5:56 PM: -

[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing

2009-10-15 Thread Mark Miller (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12766359#action_12766359 ] Mark Miller commented on LUCENE-1458: - {code} // nocommit -- not needed? we don't

[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing

2009-10-15 Thread Mark Miller (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12766362#action_12766362 ] Mark Miller commented on LUCENE-1458: - // nocommit -- wonder if simple double-barrel

[jira] Issue Comment Edited: (LUCENE-1458) Further steps towards flexible indexing

2009-10-15 Thread Mark Miller (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12766359#action_12766359 ] Mark Miller edited comment on LUCENE-1458 at 10/15/09 6:10 PM: -

[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing

2009-10-15 Thread Mark Miller (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12766367#action_12766367 ] Mark Miller commented on LUCENE-1458: - Hmm - I'm still getting the heap space issue I

Re: lucene 2.9 sorting algorithm

2009-10-15 Thread John Wang
Hi Michael: It is open: http://code.google.com/p/lucene-book/source/checkout I think I sent the https URL instead, sorry. The multi PQ sorting is fairly self-contained. I have 2 versions, 1 for string and 1 for int; each is a Collector impl. I shouldn't say the Multi Q is fast

[jira] Commented: (LUCENE-1313) Near Realtime Search (using a built in RAMDirectory)

2009-10-15 Thread Jason Rutherglen (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12766376#action_12766376 ] Jason Rutherglen commented on LUCENE-1313: -- I think this patch has a memory leak,

[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-10-15 Thread Robert Muir (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12766396#action_12766396 ] Robert Muir commented on LUCENE-1606: - if anyone can spare a sec to take a glance/revi

Hudson build is back to normal: Lucene-trunk #980

2009-10-15 Thread Apache Hudson Server

Re: lucene 2.9 sorting algorithm

2009-10-15 Thread John Wang
Hi Michael: I added classes: ScoreDocComparatorQueue and OneSortNoScoreCollector as a more general case. I think keeping the old api for ScoreDocComparator and SortComparatorSource would work. Please take a look. Thanks -John On Thu, Oct 15, 2009 at 6:52 PM, John Wang wrote: > Hi Michae