Re: indexing rss feeds in multiple languages
Melanie Langlois wrote:
> Well, thanks, sounds like the best option to me. Does anybody use the
> PerFieldAnalyzerWrapper? I'm just curious to know if there is any impact
> on performance when using different analyzers.

I've not done any specific comparisons between using a single Analyzer and multiple Analyzers with PerFieldAnalyzerWrapper, but our indexes are typically 20-25 fields, each of which can have a different analyzer depending on language or field type, although in practice about 8-10 fields may use the non-default analyzer. Performance is pretty good in any case, and there's been no noticeable degradation when tweaking analyzers.

Antony
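For reference, a minimal sketch of this kind of setup. The field names and the choice of the French contrib analyzer are assumptions for illustration only:

    import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
    import org.apache.lucene.analysis.fr.FrenchAnalyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;

    // The default analyzer covers most fields; per-field overrides handle
    // the language-specific ones.
    PerFieldAnalyzerWrapper analyzer =
        new PerFieldAnalyzerWrapper(new StandardAnalyzer());
    analyzer.addAnalyzer("title_fr", new FrenchAnalyzer()); // hypothetical field names
    analyzer.addAnalyzer("body_fr", new FrenchAnalyzer());

    IndexWriter writer = new IndexWriter("/path/to/index", analyzer, true);

Passing the same wrapper to the QueryParser at search time keeps documents and queries analyzed consistently.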
Re: Querying fragments of a tree structure
Hi Erick, excellent insight, thanks a lot. As you would expect, this method works a treat. Thanks a lot for your time!

Emanuel

----- Original Message -----
From: "Erick Erickson" <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Wednesday, March 21, 2007 2:12:49 PM (GMT+0100) Europe/Berlin
Subject: Re: Querying fragments of a tree structure

Is it a fair restatement of your problem that you want to generate a list of all children of a node? That's what I'm reading.

Would it work for you to store the complete ancestry in each node? By that I mean (from your example)... NOTE: it's no problem in Lucene to store different values for the same field in the same document, i.e.

Document doc = new Document();
doc.add(new Field("field", "value1", Field.Store.NO, Field.Index.TOKENIZED));
doc.add(new Field("field", "value2", Field.Store.NO, Field.Index.TOKENIZED));
writer.addDocument(doc);

This is equivalent (if using WhitespaceAnalyzer in this example) to:

Document doc = new Document();
doc.add(new Field("field", "value1 value2", Field.Store.NO, Field.Index.TOKENIZED));
writer.addDocument(doc);

(There is a subtle difference between the two having to do with the position increment gap, but that's probably irrelevant for you in this problem.)

So what about just doing that for each parent node in your tree? Your "ancestry" field for documents D and E would store "C" and "A". This field is TOKENIZED, but not necessarily STORED. Document C has only "A". Now, finding the children of "A" reduces to something like +ancestry:A, which you can add to your BooleanClauses if you want to also specify other search criteria, or just use by itself if you don't.

What follows is my first idea, but I think the above is a better notion. Node A stores nothing. Nodes B and C store "A". Nodes D and E store "A$C", etc. Now, finding all the children of A reduces to doing a WildcardTermEnum on "A*" and, for each resulting term, using TermDocs.seek(term) to find the corresponding document. Note a couple of things:

1> Index the ancestry field UN_TOKENIZED. You don't need to store it.
1a> You could use something like this to form a Lucene Filter if you needed to, say, find all the nodes in the tree that were children of a specified node AND met certain search criteria.
2> You could also just search on A*, but be aware that you may have to deal with TooManyClauses exceptions. The TermEnum/TermDocs method avoids that problem, but may be overkill in your situation.
2a> Lucene 2.1 allows wildcards in the first position if you do a wildcard search, but you need to turn that on by a call which I can't bring up from memory.

Hope this helps
Erick

On 3/21/07, Emanuel Schleussinger <[EMAIL PROTECTED]> wrote:
>
> Hi,
>
> first, thanks for this great resource, and sorry if I am oversimplifying
> a few things; I am still rather new to Lucene.
>
> I have been thinking how to integrate my app with Lucene -- it is a CMS-type
> system that has documents organized in a tree-style layout. A few facts
> about the system:
> - Every node in the system is represented by a unique numeric id in a
>   field called "id"
> - There is one defined root node, and an arbitrary number of descendants.
> - Each of the nodes on any level knows its descendants in a field called
>   "child"
> - Each node also knows its parent node in a field called "parent"
>
> I am indexing all the fields from all the nodes in Lucene already, and
> thus I can use Lucene to e.g. get all the descendant node IDs of a node
> simply by issuing a query like "id:2" and then extracting the
> multivalue field "child".
>
> Now, here is what I am trying to solve -- I would like to be able to
> fetch all the nodes that match a certain criteria, if they are contained in
> some fragment of the tree.
> To visualize:
>
> Root
> +-> A
> |   +-> B
> |   +-> C
> |       +-> D
> |       +-> E
> +-> F
> +-> G
>
> I would like to issue a query that gives me all the nodes within "A": a
> flat list of results that contains B, C, D and E.
>
> Now, since per my definition D is not directly correlated with A (it knows
> its parent C, but not that it is also part of A -- only C knows that), I was
> thinking of introducing a new field for every node in my Lucene index that
> holds a list of IDs tracing back to the root element (in this case, the
> D node would have C and A in that field, in that order) -- but it strikes me
> this may not be the most elegant approach...
>
> The above is only a simplified example; in reality, I have a tree about 10
> levels deep, with thousands of nodes, and I frequently need to surface nodes
> within a certain fragment of that tree.
>
> Is there any best practice that you ran into on how to map this elegantly
> into Lucene?
>
> Thanks a ton for any pointers,
> Emanuel Schleussinger
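A minimal sketch of the "ancestry" approach Erick recommends, using the node names from the example (the variables "writer" and the exact Field flags are assumptions; each ancestor is added as a separate value of the same field):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.TermQuery;

    // Node D: parent C, grandparent A. Each ancestor is its own exact term.
    Document doc = new Document();
    doc.add(new Field("id", "D", Field.Store.YES, Field.Index.UN_TOKENIZED));
    doc.add(new Field("ancestry", "C", Field.Store.NO, Field.Index.UN_TOKENIZED));
    doc.add(new Field("ancestry", "A", Field.Store.NO, Field.Index.UN_TOKENIZED));
    writer.addDocument(doc);

    // All descendants of A, however deep, regardless of intermediate parents:
    TermQuery allUnderA = new TermQuery(new Term("ancestry", "A"));

With that one field, "descendants of X" becomes a single TermQuery that can be combined with any other clauses in a BooleanQuery.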
how ungrouped query handled?
Hi,

Can anyone explain how Lucene handles the query below? My query is *field1:source AND (field2:name OR field3:dest)*. I've given this string to the query parser and then searched using a searcher. It returns correct results. Its query.toString() print is:

+field1:source +(field2:name field3:dest)

But if I don't group my terms, i.e. my query is *field1:source AND field2:name OR field3:dest*, then it gives the results of the first two terms only. It doesn't search the 3rd term. Its query.toString() print is:

+field1:source +field2:name field3:dest

If I use the same boolean operator between all terms, then it returns correct results. Why doesn't it search the terms after the 2nd term when grouping is not used?

Thanks & Regards
RSK
bzr branches for Apache Lucene/Nutch/Solr/Hadoop at Launchpad
Hi:

First of all, apologies to those friends who follow all the lists.

Often I work offline, and I do not have commit rights to any of the projects. The modifications I make for various clients, and trying to keep up to date with the latest trunk, somehow make it difficult for me to just stick with Subversion. I have heard many things about distributed revision control systems, and I am sure there are tricks/fixes for the Subversion problem I mentioned above, but I also wanted to learn something new :-) So after some trials with many DRCSs I have decided to go for Bazaar! It's a really cool DRCS -- you've got to try it: http://bazaar-vcs.org/

Now, because SVN is an RCS and bzr is a DRCS, one needs to convert SVN repos to bzr repos. And cool enough, there is a free VCS mirroring service at Launchpad: https://launchpad.net/

So the following projects are now available via bzr branch. You can access them here:

Nutch - https://launchpad.net/nutch
Solr - https://launchpad.net/solr
Lucene - https://launchpad.net/lucene
Hadoop - https://launchpad.net/hadoop

It only mirrors "trunk". That's what I need to follow, and I don't see any reason to mirror releases.

Regards
Re: bzr branches for Apache Lucene/Nutch/Solr/Hadoop at Launchpad
Is the point of this that you can make "commits" to Lucene so that you don't lose your changes on trunk?

On Mar 22, 2007, at 7:14 AM, rubdabadub wrote:
> [original announcement snipped]

--
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org

Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ
Re: bzr branches for Apache Lucene/Nutch/Solr/Hadoop at Launchpad
On 3/22/07, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
> Is the point of this that you can make "commits" to Lucene so that
> you don't lose your changes on trunk?

Not only that. I can also make as many local branches as I like, for example customer X, customer Y. This way I can support X and Y, as they have separate features. All of the above can be done with SVN, but it's a pain, at least for me.

And of course I can work offline -- during summer, under trees :-) -- and then update the whole branch from the main repo without losing any changes. It just seems easy. I have also had a case where I needed to bake some part of Nutch and some part of Solr under one tree, i.e. a new project, and still maintain that tree against the original repos, and I could do that just fine. Bazaar commands are like SVN commands, so there's not much to learn either :-)

Regards

On Mar 22, 2007, at 7:14 AM, rubdabadub wrote:
> [original announcement snipped]
Re: bzr branches for Apache Lucene/Nutch/Solr/Hadoop at Launchpad
Nice idea, and I can see the benefit of it to you, and I don't mean to be a wet blanket on it; I just wonder about the legality of it. People may find it and think it is the official Apache Lucene, since it is branded that way. I'm not a lawyer, so I don't know for sure. I think you have the right to store and use the code, even create a whole other search product based solely on Lucene (I think); I just don't know about this kind of thing. In some sense it is like mirroring, but the fact that you can commit without going through the Apache process makes me think that others coming upon the code will be misled about what's in it. The site _definitely_ makes it look like Launchpad is the home for Lucene, with the intro and the bug tracking, etc., even though we all know this site will rank further down in the SERPs than the main site. Perhaps I am misunderstanding?

On Mar 22, 2007, at 7:42 AM, rubdabadub wrote:
> [previous reply and original announcement snipped]

--
Grant Ingersoll
http://www.grantingersoll.com/
http://lucene.grantingersoll.com
http://www.paperoftheweek.com/
Re: bzr branches for Apache Lucene/Nutch/Solr/Hadoop at Launchpad
On 3/22/07, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
> Nice idea and I can see the benefit of it to you and I don't mean to
> be a wet blanket on it, I just wonder about the legality of it. [...]
> In some sense it is like mirroring, but that fact that you can commit
> w/ out going through the

NO, NO!! I don't make any commits to the Apache trunk. Nor does anyone else, for that matter. The repo at Launchpad is just a pure mirror and will always be a mirror. Just to clarify what I meant by "commit": basically you "pull" the Lucene branch from Launchpad to your local machine, and that becomes a complete copy of the trunk; then you make another local branch from that branch. Example:

bzr branch http://bazaar.launchpad.net/~vcs-imports/lucene/trunk local.copy
bzr branch local.copy local.customerx

Then you do all your work on local.customerx and make commits there, because you want to keep local.copy exactly identical to the Launchpad version, which in turn is a mirror like any other mirror that Apache has. That's all. If I were to commit things to the Launchpad version, I would lose the whole point of mirroring and getting changes from trunk.

> Apache process makes me think that others coming upon the code will be
> misled about what's in it. The site _definitely_ makes it look like
> Launchpad is the home for Lucene with the intro and the bug tracking, etc.

I am not a lawyer or a branding expert. But if you want me to edit the description text to something like "A mirrored copy of Apache Lucene; original site at...", no problem. Please provide me the exact text so I can edit it to avoid confusion. The last thing I want to do is create confusion.

Moreover, if needs like mine exist, maybe Apache Infrastructure should consider a DRCS rather than an RCS; SVN doesn't provide the flexibility that I need. At Apache, CVS and SVN co-exist, and there are mirrors of both all over the world, so why not have a bzr branch? If Launchpad wants to host it, great; if another mirror wants to host it, great.

I hope this clarifies the misunderstanding. Please do provide exact text so we don't get into lawyer trouble :-) I don't want to take a stab at the text myself; it's better if you provide me exact instructions.

Regards.

> [rest of quoted thread snipped]
Re: Spelt, for better spelling correction
Otis,

I hadn't really thought about this, but it would be easy to build a dictionary from an existing Lucene index. The main caveat is that it would only work with "stored" fields. That's because this spellchecker boosts accuracy using pair frequencies in addition to term frequencies, and Lucene doesn't need or track pair frequencies, to my knowledge. So any field which you wanted to spellcheck would need to be indexed with Field.Store.YES. Of course, a side effect is that those fields would have to be analyzed again, with the resulting time cost. Still, this could make sense for a lot of people. I'll make sure the contribution includes an index-to-dictionary API, and thank you very much for the input.

--Martin

On 3/21/07, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:
> Martin,
>
> This sounds like the spellchecker dictionary needs to be built in parallel
> with the main Lucene index. Is it possible to create a dictionary out of an
> existing (and no longer modified) Lucene index?
>
> Otis
>
> ----- Original Message -----
> From: Martin Haye <[EMAIL PROTECTED]>
> To: Yonik Seeley <[EMAIL PROTECTED]>
> Cc: java-user@lucene.apache.org
> Sent: Wednesday, March 21, 2007 2:03:50 PM
> Subject: Re: Spelt, for better spelling correction
>
> The dictionary is generated from the corpus, with the result that a larger
> corpus gives better results. Words are queued up during an index run, and at
> the end are munged to create an optimized dictionary. It also supports
> incremental building, though the overhead would be too much for those
> applications that are continuously adding things to an index. Happily, it's
> not as important to keep the spelling dictionary absolutely up to date, so
> it would be fine to queue words over several index runs and refresh the
> dictionary less often.
>
> --Martin
>
> On 3/20/07, Yonik Seeley <[EMAIL PROTECTED]> wrote:
> > Sounds interesting, Martin!
> > Is the dictionary static, or is it generated from the corpus or from
> > user queries?
> >
> > -Yonik
> >
> > On 3/20/07, Martin Haye <[EMAIL PROTECTED]> wrote:
> > > As part of XTF, an open source publishing engine that uses Lucene, I
> > > developed a new spelling correction engine specifically to provide "Did
> > > you mean..." links for misspelled queries. I and a small group are
> > > preparing this for submission as a contrib module to Lucene. And we're
> > > inviting interested people to join the discussion about it.
> > >
> > > The new engine is being called "Spelt" and differs from the one
> > > currently in Lucene contrib in the following ways:
> > >
> > > - More accurate: Much better performance on single-word queries (90%
> > >   correct in #1 slot in my tests). On a general list including
> > >   multi-word queries, gets 80%+ correct.
> > > - Multi-word: Handles and corrects multi-word queries such as
> > >   "harrypotter" -> "harry potter".
> > > - Fast: In my tests, builds the dictionary more than 30 times faster.
> > > - Small: Dictionary size is roughly a third of that built by the
> > >   existing engine.
> > > - Other bells and whistles...
> > >
> > > There is already a standalone test program that people can try out, and
> > > we're interested in feedback. If you're interested in discussing,
> > > testing, or previewing, consider joining the Google group:
> > > http://groups.google.com/group/spelt/
> > >
> > > --Martin
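A rough sketch of what such an index-to-dictionary pass could look like, re-analyzing stored field values to recover term and pair frequencies. The "dictionary" object and its addWord/addPair calls are hypothetical stand-ins, not Spelt's actual API, and the field name is assumed:

    import java.io.StringReader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexReader;

    IndexReader reader = IndexReader.open("/path/to/index");
    Analyzer analyzer = new StandardAnalyzer();
    for (int i = 0; i < reader.maxDoc(); i++) {
        if (reader.isDeleted(i)) continue;
        String text = reader.document(i).get("contents"); // needs Field.Store.YES
        if (text == null) continue;
        TokenStream stream = analyzer.tokenStream("contents", new StringReader(text));
        String prev = null;
        for (Token t = stream.next(); t != null; t = stream.next()) {
            String word = t.termText();
            dictionary.addWord(word);                          // hypothetical API
            if (prev != null) dictionary.addPair(prev, word);  // pair frequencies
            prev = word;
        }
    }
    reader.close();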
Re: bzr branches for Apache Lucene/Nutch/Solr/Hadoop at Launchpad
On Mar 22, 2007, at 8:16 AM, rubdabadub wrote:
> NO, NO!! I don't make any commits to the Apache trunk. Nor does anyone
> else, for that matter. The repo at Launchpad is just a pure mirror and
> will always be a mirror. [...]

Gotcha. I guess I just rely on IntelliJ's built-in versioning to provide similar capabilities, plus maybe checking out multiple copies of the source. Also, I try to avoid making changes in open source libraries unless absolutely necessary.

> I am not a lawyer or a branding expert. But if you want me to edit the
> description text to something like "A mirrored copy of Apache Lucene;
> original site at...", no problem. [...]

I'll wait for some of the others that are closer to the Foundation to contribute (maybe one of the PMC members). Like I said, I don't know if it is an issue at all; I just don't want people to be confused about it. I think you could propose a DRCS to Infrastructure and make a case for it. Personally, I'm fine with SVN, but then again I used to think I was fine with CVS, and I don't think I would want to go back to that!

I am curious: how many custom changes are you making to the code that this is even an issue? Perhaps submitting patches and working to get them committed would be a more efficient strategy.

-Grant
Re: Combining score from two or more hits
Don't know if it's useful or not, but if you used TopDocs instead, you would have access to an array of ScoreDocs which you could modify freely. In my app, I used a FieldSortedHitQueue to re-sort things when I needed to.

Erick

On 3/22/07, Antony Bowesman <[EMAIL PROTECTED]> wrote:
> I have indexed objects that contain one or more attachments. Each
> attachment is indexed as a separate Document along with the object
> metadata. When I make a search, I may get hits in more than one Document
> that refer to the same object.
>
> I have a HitCollector which knows if the object has already been found, so
> I want to be able to update the score of an existing hit in a way that
> makes sense. E.g. if hit H1 has score 1.35 and hit H2 has score 2.9, is it
> possible to re-score it on the basis that the real hit result is
> (H1 AND H2)?
>
> I can take the highest score of any Document, but just wondered if this is
> possible during the HitCollector.collect method?
>
> Antony
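One way to sketch this inside a HitCollector, accumulating a combined score per logical object. The lookupObjectId mapping is hypothetical, and summing the scores is just one possible combination, not an endorsed formula; "searcher" and "query" are assumed to exist:

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.lucene.search.HitCollector;

    final Map scores = new HashMap(); // objectId -> combined score
    searcher.search(query, new HitCollector() {
        public void collect(int doc, float score) {
            String objectId = lookupObjectId(doc); // hypothetical doc->object mapping
            Float prev = (Float) scores.get(objectId);
            // Simple combination: sum the scores of all hits on the same
            // object, as a rough proxy for "H1 AND H2 both matched".
            scores.put(objectId, prev == null
                    ? new Float(score) : new Float(prev.floatValue() + score));
        }
    });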
Re: bzr branches for Apache Lucene/Nutch/Solr/Hadoop at Launchpad
Good to hear :-)

> I am curious, how many custom changes are you making to the code that
> this is even an issue? Perhaps submitting patches and working to get
> them committed would be a more efficient strategy.

Well, there are three problems I see:

1. There are very good patches in the Lucene Jira, but for one reason or another these issues never get applied to trunk. For me it's not a question of why; it's more a question of how I can use them and learn from them. So having my own local branch to do "whatever" is really great. I build, I apply a patch, I play around, I tear it down, without thinking about anything else. Yes, you could do this with various copies of the source, but often these patches only work against a certain revision, etc. It's much easier to play when you are in control of the local trunk.

2. I also have customer modifications to maintain, i.e. support, and some of the fixes only work with a certain revision of trunk. Often I make the mistake of doing svn up -- it happens -- and that does create some extra keystrokes :-)

3. You are correct about the committing strategy, but most of my changes are customer-specific, and customers have specific rules, so the changes never get back to you guys. Customer rules: I can't decide on the modifications I make.

Regards
Re: how ungrouped query handled?
This is a pretty common issue that I've been grappling with by chance recently. The main point is that the parser is NOT a boolean logic parser. Search the mail archive for the thread "bad query parser bug" and you'll find a good discussion.

I tried using PrecedenceQueryParser, but that didn't work very well for me; search the mail archive on that and you'll see some examples of why.

I solved this problem for my immediate issues by writing a very quick-and-dirty parenthesizer for my raw query. If summer weren't coming on, I might see whether I could contribute something by looking for a way to fix PrecedenceQueryParser.

Best
Erick

On 3/22/07, SK R <[EMAIL PROTECTED]> wrote:
> [question quoted above snipped]
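When precedence matters, one option is to skip the parser for the boolean skeleton and build the query programmatically. A sketch using the field names from the question:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.TermQuery;

    // field1:source AND (field2:name OR field3:dest), with explicit grouping:
    BooleanQuery inner = new BooleanQuery();
    inner.add(new TermQuery(new Term("field2", "name")), BooleanClause.Occur.SHOULD);
    inner.add(new TermQuery(new Term("field3", "dest")), BooleanClause.Occur.SHOULD);

    BooleanQuery query = new BooleanQuery();
    query.add(new TermQuery(new Term("field1", "source")), BooleanClause.Occur.MUST);
    query.add(inner, BooleanClause.Occur.MUST);

    // query.toString() prints: +field1:source +(field2:name field3:dest)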
Speeding up looping over Hits
Hi,

While looking into performance enhancements for our search feature, I noticed a significant difference in Document access time while looping over Hits. I wrote a test application that searches for a list of search terms and then, for each returned Hits object, loops twice over every single hits.doc(i):

for (int i = 0; i < numberOfDocs; i++) { doc = hits.doc(i); }

I am seeing differences like the following:

Found 16,215 hits for 'Water or Wine' in 219 ms
Processed 16,215 docs in 53,141 ms; per single doc 3.2773 ms
Processed 16,215 docs in 2,032 ms; per single doc 0.1253 ms

Interestingly, if I run the same test application a second time in my IDE, the difference between the first and the second loop is very low. I have no explanation for this difference, but it becomes a huge problem for us, because I need to extract a small set of information pieces from each document, and the first loop just takes too much time. I could not find any indication of external caching of Hits.

I am running my tests within Eclipse with memory settings of -Xms766M -Xmx1024M.

What is the explanation for the different access speeds for the same search results? Is there a way to speed up looping over the Hits data structure?

Andreas
Re: Speeding up looping over Hits
Your timing differences are probably because of caching. But as has been mentioned many times in the archive, a Hits object is intended to allow fast, simple retrieval of the first few documents in a result set (100 if memory serves). Each 100 or so calls to next() cause the search to be re-issued. See HitCollector, TopDocs, etc.

Erick

On 3/22/07, Andreas Guther <[EMAIL PROTECTED]> wrote:
> [question quoted above snipped]
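A sketch of the TopDocs route, which runs the search once with an explicit result bound instead of letting Hits silently re-query. The variables "searcher" and "query" and the bound of 20,000 are assumptions:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.search.TopDocs;

    // One search, bounded at 20,000 results, instead of paging through Hits.
    TopDocs topDocs = searcher.search(query, null, 20000);
    for (int i = 0; i < topDocs.scoreDocs.length; i++) {
        Document doc = searcher.doc(topDocs.scoreDocs[i].doc);
        // extract just the fields you need from doc...
    }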
Re: bzr branches for Apache Lucene/Nutch/Solr/Hadoop at Launchpad
rubdabadub wrote:
> On 3/22/07, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
>> Nice idea and I can see the benefit of it to you and I don't mean to
>> be a wet blanket on it, I just wonder about the legality of it.

So long as it meets the Apache license conditions regarding distribution, it's not forbidden. It could be confusing or superfluous, but it couldn't be illegal.

>> [...]
>
> NO, NO!! I don't make any commits to the Apache trunk. Nor does anyone
> else, for that matter. The repo at Launchpad is just a pure mirror and
> will always be a mirror.

Actually, I often find myself in a similar situation to rubdabadub's. I'm working on several commercial projects that use and modify Lucene/Nutch, and often such modifications are proprietary (about equally often they are not, and are submitted as patches). Over time, the issue of tracking the vendor source tree and merging from that tree (per the svnbook) into several different private svn repos becomes a tricky and time-consuming business... I'd welcome any improvements here. It seems I need to find some time to get more familiar with bzr...

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web, Embedded Unix, System Integration
http://www.sigram.com
Contact: info at sigram dot com
How can I index Phrases in Lucene?
Hi,

I know how to index terms in Lucene; now I want to see how I can index phrases like "information retrieval" in Lucene and calculate the number of times a phrase has appeared in a document. Is there any way to do it in Lucene?

Thanks
Re: How can I index Phrases in Lucene?
Well, you don't index phrases; that's done for you. You should try something like the following: create a SpanNearQuery with your terms, and specify an appropriate slop (probably 0, assuming you want them all next to each other). Now call getSpans and count... You may have to do something with overlapping spans, but you'll need to experiment a bit to understand it.

Erick

On 3/22/07, Maryam <[EMAIL PROTECTED]> wrote:
> [question quoted above snipped]
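A minimal sketch of that counting loop, assuming a field called "contents", the phrase from the question, and an open IndexReader named "reader":

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.spans.SpanNearQuery;
    import org.apache.lucene.search.spans.SpanQuery;
    import org.apache.lucene.search.spans.SpanTermQuery;
    import org.apache.lucene.search.spans.Spans;

    SpanQuery[] clauses = new SpanQuery[] {
        new SpanTermQuery(new Term("contents", "information")),
        new SpanTermQuery(new Term("contents", "retrieval"))
    };
    // Slop 0 and in-order: the terms must be adjacent, i.e. form a phrase.
    SpanNearQuery phrase = new SpanNearQuery(clauses, 0, true);

    Map countsByDoc = new HashMap(); // docId -> phrase frequency
    Spans spans = phrase.getSpans(reader);
    while (spans.next()) {
        Integer doc = new Integer(spans.doc());
        Integer prev = (Integer) countsByDoc.get(doc);
        countsByDoc.put(doc, prev == null
                ? new Integer(1) : new Integer(prev.intValue() + 1));
    }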
Re: how ungrouped query handled?
See also the FAQ entry "Why am I getting no hits / incorrect hits?" which points to...

http://wiki.apache.org/lucene-java/BooleanQuerySyntax

...I've just added some more words of wisdom there from past emails.

: Date: Thu, 22 Mar 2007 09:51:15 -0400
: From: Erick Erickson <[EMAIL PROTECTED]>
: Reply-To: java-user@lucene.apache.org
: To: java-user@lucene.apache.org
: Subject: Re: how ungrouped query handled?
:
: This is a pretty common issue that I've been grappling with by chance
: recently. The main point is that the parser is NOT a boolean logic
: parser.
:
: [rest of Erick's reply, quoted above, snipped]

-Hoss
Re: Extracting formatted text from PDF files
Mike O'Leary wrote:
> Please forgive the laziness inherent in this question, as I haven't looked
> through the PDFBox code yet. I am wondering if that code supports
> extracting text from PDF files while preserving such things as sequences of
> whitespace between characters and other layout and formatting information.
> I am working with a project that extracts and operates on certain
> table-like blocks of text from PDF files, and a lot of freeware and
> shareware PDF-to-text converters seem to either ignore formatting or try to
> preserve formatting and not get it quite right. I am wondering if PDFBox
> provides better support for this kind of thing. Thanks.

That is not so simple. Usually this information is not inside a PDF file. PDF is an output file format. It contains just the instruction to print a character "a" at position x and y. In many cases a PDF file doesn't even know words or white spaces. We read words due to the positions of characters, we see paragraphs due to the positions of characters, and we see tables due to the positions of characters. The file doesn't contain this information.

I found this code in a PDF file for the German word "Wuchsform" (growth form) and the colon ":":

/F1 1 Tf
-3.8801 -1.274 TD
[ (W) 29.60001 (uchsform:) ] TJ

First line: select a font.
Second line: move the cursor to position -3.8801, -1.274.
Third line: print the character "W", move the cursor 29.60001 units to the right, and print the characters "uchsform:".

Extracting the words from a PDF file for indexing means you first have to build words from the character positions. Recognizing paragraphs, column text, tables, captions, lists, footnotes, etc. is much more difficult.

Sören
Re: Combining score from two or more hits
Erick Erickson wrote:
> Don't know if it's useful or not, but if you used TopDocs instead, you
> would have access to an array of ScoreDocs which you could modify freely.
> In my app, I used a FieldSortedHitQueue to re-sort things when I needed to.

Thanks Erick. I've been using TopDocs, but am playing with my own HitCollector variant of TopDocHitCollector. The problem is not adjusting the score; it's what to adjust it by, i.e. is it possible to re-evaluate the scores of H1 and H2 knowing that the original query resulted in hits on H1 AND H2?

Antony
Re: Speeding up looping over Hits
Another thing you may want to look at is the newer version 2.1.0 and getFieldable. I think that will lazy-load the data; that way you are only reading the parts of the document that you need at that moment rather than the whole thing. Someone please correct me if I am wrong or point to what I really mean :)

I had a similar situation a long while back, and I was able to find a patch for the version of Lucene I was using that allowed the above. It made a huge difference. I think something similar is now built into 2.1.0.

Andreas Guther <[EMAIL PROTECTED]> wrote:
> [question quoted above snipped]
Software Product Development Job Opportunity (Baltimore, MD)
Official job description & info to submit a resume:
http://www.systemsalliance.com/careers/internal-jobs/baltimore/Software_Engineer_MD.html

Located 15 minutes north of Baltimore in Sparks, MD. The position is on a team, working with myself and others, maintaining and developing an existing content management system. Quiet working environment in a shared office with a nice view. Management that chooses to do the right thing more often than the expedient. Full stack on your own machine (IIS/Apache, ColdFusion [JRun], SQL Server/Oracle) for local development. Trac for defect tracking & source control. Java work includes Lucene, XOM and JavaCC.

Feel free to contact me with questions. Email: mlesko at systemsalliance dot com.
Re: Speeding up looping over Hits
Oh yeah. By only loading the relevant fields, my query times were reduced by over 90%. I actually wrote that up on the mailing list if you want to try to find it, but it took Andreas' message to remind me...

Erick

On 3/22/07, Santa Clause <[EMAIL PROTECTED]> wrote:
> Another thing you may want to look at is the newer version 2.1.0 and
> getFieldable. I think that will lazy-load the data...
> [rest of thread, quoted above, snipped]
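A sketch of field-selective loading with the FieldSelector API that, as far as I know, arrived in 2.1. The field names, the "reader" variable, and the "topDocs" result are assumptions:

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.FieldSelector;
    import org.apache.lucene.document.SetBasedFieldSelector;

    Set fieldsToLoad = new HashSet(Arrays.asList(new String[] { "id", "title" }));
    Set lazyFields = new HashSet(); // nothing loaded lazily in this sketch
    FieldSelector selector = new SetBasedFieldSelector(fieldsToLoad, lazyFields);

    for (int i = 0; i < topDocs.scoreDocs.length; i++) {
        // Only the selected fields are read from disk; the rest are skipped.
        Document doc = reader.document(topDocs.scoreDocs[i].doc, selector);
        String id = doc.get("id");
    }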
Re: Extracting formatted text from PDF files
Mike O'Leary wrote:
> [...] a lot of freeware and shareware PDF-to-text converters seem to
> either ignore formatting or try to preserve formatting and not get it
> quite right.

Even pdftohtml? The sample outputs I've seen from that application don't look too bad to me.

Daniel

--
Daniel Noll
Nuix Pty Ltd
Suite 79, 89 Jones St, Ultimo NSW 2007, Australia
Ph: +61 2 9280 0699  Fax: +61 2 9212 6902
Web: http://nuix.com/
problem in reading an index
Hi,

I have written this piece of code to read the index, mainly to see what terms are in each document and what the frequency of each term in the document is. This piece of code correctly calculates the number of docs in the index, but I don't know why the variable myTermFreq[] is null. Would you please let me know your ideas about it?

IndexReader reader = IndexReader.open(myInd);
for (int docNo = 0; docNo < reader.numDocs(); docNo++) {
    TermFreqVector myTermFreq[] = reader.getTermFreqVectors(docNo);
    if (myTermFreq != null) {
        for (int i = 0; i < myTermFreq.length; i++) {
            int freq[] = myTermFreq[i].getTermFrequencies();
            // String terms[] = myTermFreq[i].getTerms();
            for (int j = 0; j < freq.length; j++) {
                // ... inspect freq[j]
            }
        }
    }
}
Re: problem in reading an index
Maryam wrote:
> I have written this piece of code to read the index, mainly to see what
> terms are in each document and what the frequency of each term in the
> document is. This piece of code correctly calculates the number of docs
> in the index, but I don't know why the variable myTermFreq[] is null.

From the getTermFreqVectors javadoc:

  Return an array of term frequency vectors for the specified document.
  The array contains a vector for each vectorized field in the document.
  Each vector contains terms and frequencies for all terms in a given
  vectorized field. If no such fields existed, the method returns null.

I.e. you may not have stored the term vectors when indexing the data.

Daniel

--
Daniel Noll
Nuix Pty Ltd
Suite 79, 89 Jones St, Ultimo NSW 2007, Australia
Ph: +61 2 9280 0699  Fax: +61 2 9212 6902
Web: http://nuix.com/
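Term vectors have to be requested per field at indexing time; a minimal sketch, with the field name, "text" and "writer" assumed:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    Document doc = new Document();
    // Field.TermVector.YES makes getTermFreqVectors() return data for this field.
    doc.add(new Field("contents", text,
                      Field.Store.NO, Field.Index.TOKENIZED, Field.TermVector.YES));
    writer.addDocument(doc);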
Reverse search
Hello,

I want to manage user subscriptions to specific documents. So I would like to store the subscription (query) in the Lucene directory, and whenever I receive a new document, I will search for all the matching subscriptions in order to send the document to all subscribers. For instance, if a user subscribes to all documents with text containing (WORD1 AND WORD2) OR WORD3, how can I match the incoming document against the stored subscriptions? I was thinking of having two subfields for each field of the subscription: the AND conditions and the OR conditions.

- OR: I will tokenize the document field content and insert OR between each token, and run the query against the OR condition of the subscription.

- It's the AND that will give me an issue, because the incoming text may contain more words than the sequence I want to search.

For instance, suppose I subscribe to documents containing "lucene" and "java", and the content of the incoming document is "lucene is a great API which has been developed in java". Once I remove stopwords, my query would look like "lucene AND great AND API AND developed AND java". As the query is composed of more words than the stored subscription, I will fail to retrieve the subscription. But if I use only OR between the words, the results will not be accurate, as I could match a subscription on only "java", for instance.

Do you know how I can handle this situation? I'm not sure I can actually do this using Lucene...

Thank you,
Mélanie
Re: Reverse search
On 23 Mar 2007, at 02:12, Melanie Langlois wrote:
> I want to manage user subscriptions to specific documents. So I would
> like to store the subscription (query) in the Lucene directory, and
> whenever I receive a new document, I will search for all the matching
> subscriptions in order to send the document to all subscribers. [...]

I wrote such a thing way back, where I used the new document as the query and the user subscriptions as the index. Similar to what you describe, I had an AND, OR and NOT field. This really limited the type of queries users could store. It does however work, particularly well on systems with /huge/ amounts of subscriptions (many millions).

Today I would use something else. If you insert one document at a time into your index, take a look at MemoryIndex in contrib. If you insert documents in batches larger than one document at a time, take a look at LUCENE-550 in Jira. Add new documents to such an index and run the subscribed queries against it. Depending on the queries, the speed should be some 20-100 times faster than using a RAMDirectory. One million queries should take some 20 seconds to assemble and run against a 25-document index on my laptop.

See <http://issues.apache.org/jira/secure/attachment/12353601/12353601_HitCollectionBench.jpg> for the performance of LUCENE-550.

--
karl
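A sketch of the MemoryIndex route for the single-document case. The field name, "incomingDocumentText", and the "subscriptionQueries" collection are assumptions:

    import java.util.Iterator;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.memory.MemoryIndex;
    import org.apache.lucene.search.Query;

    // Index the one incoming document in memory...
    MemoryIndex index = new MemoryIndex();
    index.addField("content", incomingDocumentText, new StandardAnalyzer());

    // ...then run every stored subscription query against it.
    for (Iterator it = subscriptionQueries.iterator(); it.hasNext();) {
        Query subscription = (Query) it.next();
        if (index.search(subscription) > 0.0f) {
            // the document matches this subscription -> notify the subscriber
        }
    }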
Re: problem in reading an index
On 23 Mar 2007, at 02:09, Daniel Noll wrote:
> From the getTermFreqVectors javadoc: "... If no such fields existed,
> the method returns null."
>
> I.e. you may not have stored the term vectors when indexing the data.

This thread might be of interest:
http://www.nabble.com/Resolving-term-vector-even-when-not-stored--tf3412160.html#a9507268

--
karl
RE: Reverse search
Thanks Karl, the performance graph is really amazing! I have to say I would not have thought this way around would be faster, but it sounds nice if I can use this; it makes everything easier to manage. I'm just wondering what you considered when you built your graph: only the time to run the queries? Because I should add the time for creating the index any time a new document comes in (or a batch of documents if several come in at the same time), plus the indexing of these documents. The documents should not be big, around 2KB. Did you measure this part?

Mélanie
Questions about Indexing
Hi, I have three questions about indexing:

1) I am indexing HTML documents; how can I do stop word removal before indexing? I don't want to index stop words.

2) I can access the terms in one document, but how can I get the name of the document in which these terms appear?

3) I want to find phrases at the index level, e.g. find the frequency of phrases in the collection, and also their frequency in each document. How can I do this in Lucene? Is there any sample code?

Thanks
Re: contrib/benchmark questions
OK, Doron (and other benchmarkers!), on to search. Here's my alg file:

#Indexing declaration up here
OpenReader
{ "SrchSameRdr" Search > : 5000
{ "SrchTrvSameRdr" SearchTrav > : 5000
{ "SrchTrvSameRdrTopTen" SearchTrav(10) > : 5000
{ "SrchTrvRetLoadAllSameRdr" SearchTravRet > : 5000
#Skip bytes and body
{ "SrchTrvRetLoadSomeSameRdr" SearchTravRetLoadFieldSelector(docid,docname,docdate,doctitle) > : 5000
CloseReader

Never mind the last task; I will be submitting a patch shortly that will make sense of it. Essentially, it specifies which fields to load for the document. Here are the results:

Operation                       round merge max.buffered runCnt recsPerRun    rec/s elapsedSec avgUsedMem avgTotalMem
OpenReader                          0    10           10      1          1    125.0       0.01  5,385,600   9,965,568
SrchSameRdr_5000                    0    10           10      1       5000  1,184.3       4.22  5,805,120   9,965,568
SrchTrvSameRdr_5000                 0    10           10      1     427500 71,776.4       5.96  5,806,144   9,965,568
SrchTrvSameRdrTopTen_5000           0    10           10      1     427500 62,001.4       6.89  5,766,584   9,965,568
SrchTrvRetLoadAllSameRdr_5000       0    10           10      1         85  7,226.4     117.62  6,161,728   9,965,568
SrchTrvRetLoadSomeSameRdr_5000      0    10           10      1         85 10,334.0      82.25  6,162,752   9,965,568
CloseReader                         0    10           10      1          1  1,000.0       0.00  5,921,856   9,965,568

The line I'm a bit confused by is recsPerRun. For the tasks that are doing the traversal and the retrieval, why so many recsPerRun? Is it counting the hits, the traversals and the retrievals each as one record? What I am trying to do is compare:

- Search
- Search plus traversal of all hits
- Search plus traversal of the top ten
- Search plus traversal and retrieval of all documents and all fields on the document
- Search plus traversal and retrieval of all documents and some fields on the document

I think I see in the ReadTask that it is the res var that is being incremented and would have to be altered. I guess I can go by elapsed time, but even that seems slightly askew. I think this is due to the withRetrieve() function overhead inside the for loop. I have moved it out and will submit that change, too. Am I interpreting this correctly?

-Grant

On Mar 19, 2007, at 5:11 PM, Doron Cohen wrote:

Grant Ingersoll <[EMAIL PROTECTED]> wrote on 19/03/2007 13:10:16:

> So, if I am understanding correctly:
> "SearchSameRdr" Search > : 5000
> means don't collect indiv. stats for SearchSameRdr, but do whatever that task does 5000 times, right?

Almost... It should, by the way, be

{ "SearchSameRdr" Search > : 5000

and it means: run Search 5000 times, sequentially, assign the name "SearchSameRdr" to that sequence of 5000, and do not collect individual stats for the individual tasks making up that sequence. If it was just

{ Search > : 5000

it would still mean the same, just that a name would be assigned to it for you, something like "Seq_Search_5000". If it was

{ "SearchSameRdr" Search } : 5000

it would be the same as your example, except that stats would be collected not only for the entire elapsed sequence, but also broken down for each of the 5000 calls to Search. Similar logic applies with [ .. ] and [ .. >, just that the tasks making up the (parallel) sequence are executed in parallel, each in a separate thread.

> 3. Is there any way to dump out the stats as a CSV file or something? Would I implement a Task for this? Ultimately, I want to be able to create a graph in Excel that shows tradeoffs between speed and memory.

Yes, implementing a report task would be the way. ...
but when I look at how I implemented these reports, all the work is done in the class Points. It seems it should be modified a little, with more thought given to making it easier to extend reports. I may take a crack at it, but the deadline for the talk is looming. I'll take a look too, and let you know if I have anything. - Being interested in memory stats: the fact that all the rounds run in a single program, in the same JVM run, usually means that what you see is very much dependent on the GC behavior of the specific VM you are using. If it does not release memory to the OS (most likely), you would not be able to notice that round i+1 used less memory than round i. It would probably be better for something like this to put the "round" logic in an ant script, invoking each round in a separate new exec. But then things get more complicated for having a final stats report containing all rounds. What do you think about this?
Re: How can I index Phrases in Lucene?
Is there any way to find frequent phrases without knowing what you are looking for? I could index "A B C D E" as "A B C", "B C D", "C D E", etc., but that seems kind of clunky, particularly if the phrase length is large. Is there any position-offset magic that will surface frequent phrases automatically?

thanks
ryan

On 3/22/07, Erick Erickson <[EMAIL PROTECTED]> wrote:

Well, you don't index phrases, it's done for you. You should try something like the following: create a SpanNearQuery with your terms, specify an appropriate slop (probably 0, assuming you want them all next to each other), then call getSpans and count... You may have to do something with overlapping spans, but you'll need to experiment a bit to understand it.

Erick

On 3/22/07, Maryam <[EMAIL PROTECTED]> wrote:
>
> Hi,
>
> I know how to index terms in lucene; now I want to see how I can index phrases like "information retrieval" in lucene and calculate the number of times that phrase has appeared in the document. Is there any way to do it in Lucene?
>
> Thanks
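[A short sketch of Erick's SpanNearQuery suggestion, assuming Lucene 2.x, a hypothetical index path, and a "body" field that was analyzed at indexing time:]

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;
import org.apache.lucene.search.spans.Spans;

public class PhraseFrequencySketch {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open("/path/to/index"); // hypothetical location
        SpanQuery phrase = new SpanNearQuery(
                new SpanQuery[] {
                        new SpanTermQuery(new Term("body", "information")),
                        new SpanTermQuery(new Term("body", "retrieval")) },
                0,     // slop 0: the terms must be adjacent
                true); // and in order
        Spans spans = phrase.getSpans(reader);
        int collectionFreq = 0;
        while (spans.next()) { // one span per occurrence of the phrase
            System.out.println("doc " + spans.doc() + ", position " + spans.start());
            collectionFreq++;
        }
        System.out.println("collection frequency: " + collectionFreq);
        reader.close();
    }
}

[Grouping the printed doc ids then gives the per-document frequency Maryam asked about.]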
Re: contrib/benchmark questions
On Mar 22, 2007, at 11:21 PM, Grant Ingersoll wrote:

> I think I see in the ReadTask that it is the res var that is being incremented and would have to be altered. I guess I can go by elapsed time, but even that seems slightly askew. I think this is due to the withRetrieve() function overhead inside the for loop. I have moved it out and will submit that change, too.

Moving it out of the loop made little diff. so I guess it is mostly just due to it being late and me being tired and not thinking clearly. B/c if I were, I would just realize that those operations are also retrieving documents...

-Grant
Re: How can I index Phrases in Lucene?
23 mar 2007 kl. 04.25 skrev Ryan McKinley:

> Is there any way to find frequent phrases without knowing what you are looking for?

I think you are looking for association rules. Try searching for Levelwise-Scan. Weka contains GPLed Java code. CiteSeer is your best friend for whitepapers: http://citeseer.ist.psu.edu/cs

-- karl
Re: Reverse search
23 mar 2007 kl. 03.07 skrev Melanie Langlois:

> Thanks Karl, the performance graph is really amazing! [...] I'm just wondering what you considered when you built your graph, only the time to run the queries? Because I should add the time for creating the index any time a new document comes in (or a batch of documents if several come in at the same time), plus the indexing of these documents. The documents should not be big, around 2KB. Did you measure this part?

Adding a document to a MemoryIndex or InstantiatedIndex takes more or less the same time it would take to add it to an empty RAMDirectory. How many clock ticks are spent really depends on what analyzers you use.

-- karl
Re: Combining score from two or more hits
: Thanks Erick, I've been using TopDocs, but am playing with my own HitCollector
: variant of TopDocHitCollector. The problem is not adjusting the score, it's
: what to adjust it by, i.e. is it possible to re-evaluate the scores of H1 and H2
: knowing that the original query resulted in hits on H1 AND H2.

If you are using a HitCollector, then any re-evaluation is going to happen in your code using whatever mechanism you want. Once your collect method is called on a docid, Lucene is done with that docid and no longer cares about it ... it's only whatever storage you may be maintaining of high-scoring docs that needs to know that you've decided the score has changed.

Your big problem is going to be that you basically need to maintain a list of *every* doc collected, if you don't know what the score of any of them is until you've processed all the rest ... since docs are collected in increasing order of docid, you might be able to make some optimizations based on how big a gap there is between the doc you are currently collecting and the last doc you collected, if you know that you're always going to add docs that "relate" to each other in sequential bundles. But this would be some very custom code depending on your use case.

-Hoss
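[A bare-bones sketch of the collect-everything HitCollector Hoss describes, assuming the Lucene 2.x HitCollector API; the post-processing that combines related hits is left out because it is entirely application-specific:]

import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.search.HitCollector;

// Keeps every (docid, score) pair; scores can only be re-evaluated after
// collection finishes, since a doc's final score depends on which other
// docs were hit.
public class CollectAllHits extends HitCollector {
    private final List docs = new ArrayList();   // Integer docids, in increasing order
    private final List scores = new ArrayList(); // Float raw scores, parallel to docs

    public void collect(int doc, float score) {
        // Lucene is done with this docid as soon as we return
        docs.add(new Integer(doc));
        scores.add(new Float(score));
    }

    public List getDocs() { return docs; }
    public List getScores() { return scores; }
}

[Usage: searcher.search(query, collector) fills the two lists; afterwards you walk them and merge the scores of docs your application knows to be related.]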
Re: contrib/benchmark questions
Hi Grant, I think you resolved the question already, but just to make sure...

Grant Ingersoll <[EMAIL PROTECTED]> wrote on 22/03/2007 20:41:27:
>
> On Mar 22, 2007, at 11:21 PM, Grant Ingersoll wrote:
>
> > I think I see in the ReadTask that it is the res var that is being
> > incremented and would have to be altered. I guess I can go by
> > elapsed time, but even that seems slightly askew. I think this is
> > due to the withRetrieve() function overhead inside the for loop. I
> > have moved it out and will submit that change, too.
>
> Moving it out of the loop made little diff. so I guess it is mostly
> just due to it being late and me being tired and not thinking
> clearly. B/c if I were, I would just realize that those operations
> are also retrieving documents...

Seems the cause for confusion is that #recs means different things for different tasks. For all tasks, it means (at least) the number of times that task executed. For warm, it adds one for each document retrieved. For traverse, it adds one for each doc id traversed, and for traverseAndRetrieve, it also adds one for each doc being retrieved. I'll update the javadocs with this clarification. Moving the call out of the loop is the right thing of course; it changed the time only, not the #recs, right?

Regards, Doron
Re: Questions about Indexing
Maryam wrote:
> Hi, I have three questions about indexing:
> 1) I am indexing HTML documents; how can I do stop word removal before indexing? I don't want to index stop words.

The same way you would do it for indexing text documents: StopFilter.

> 2) I can access the terms in one document, but how can I get the name of the document in which these terms appear?

The usual way to do this is to store the document name as another field.

Daniel

--
Daniel Noll
Nuix Pty Ltd
Suite 79, 89 Jones St, Ultimo NSW 2007, Australia
Ph: +61 2 9280 0699  Fax: +61 2 9212 6902
Web: http://nuix.com/
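[A minimal indexing sketch following Daniel's two answers, assuming Lucene 2.x; the index path, file name and extracted text are hypothetical, and the HTML-to-text extraction is not shown:]

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class StopWordIndexing {
    public static void main(String[] args) throws Exception {
        // StopAnalyzer = lower-casing plus English stop word removal
        // (it wraps the StopFilter Daniel mentions; StandardAnalyzer
        // removes the same stop words on top of its own tokenization)
        Analyzer analyzer = new StopAnalyzer();
        IndexWriter writer = new IndexWriter("/tmp/html-index", analyzer, true);

        String extractedText = "...text pulled out of the HTML...";
        Document doc = new Document();
        // storing the file name as its own field answers question 2
        doc.add(new Field("name", "page1.html", Field.Store.YES, Field.Index.UN_TOKENIZED));
        doc.add(new Field("body", extractedText, Field.Store.NO, Field.Index.TOKENIZED));
        writer.addDocument(doc);
        writer.close();
    }
}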
Ignore Words Problem
I want to make sure whether this statement is right or not: "I am using StandardAnalyzer for indexing documents. By default it ignores some words when doing indexing. But when we search something, Lucene again includes the ignored words in the search."???

My problem is this: I indexed a Word document using StandardAnalyzer. There are many words like "is am are that the" which are ignored by Lucene. And when I want to search with a query which must match all words given by the user (an AND query), it does not return results. For example, I want to find those documents which MUST have ALL the words in "this is garden". For this I have made an AND query, but Lucene now gives no results, because "garden" is there but it cannot find the words "is" and "this", since they were ignored at indexing time. So what is the better workaround? Any help will be appreciated.
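[One common workaround, sketched here as an assumption rather than something from the replies in this thread: parse the query with the same StandardAnalyzer used at indexing time, so stop words are dropped from the query exactly as they were dropped from the index. Assuming Lucene 2.x and a hypothetical "body" field:]

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

public class StopWordQuerySketch {
    public static void main(String[] args) throws Exception {
        QueryParser parser = new QueryParser("body", new StandardAnalyzer());
        parser.setDefaultOperator(QueryParser.AND_OPERATOR); // every remaining term is required
        // StandardAnalyzer drops "this" and "is" at parse time, exactly as it
        // did at index time, so the query effectively reduces to body:garden
        // and the document matches.
        Query q = parser.parse("this is garden");
        System.out.println(q);
    }
}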
Re: how ungrouped query handled?
Thanks for your reply and these useful links.

On 3/23/07, Chris Hostetter <[EMAIL PROTECTED]> wrote:

see also the FAQ "Why am I getting no hits / incorrect hits?" which points to...

http://wiki.apache.org/lucene-java/BooleanQuerySyntax

...I've just added some more words of wisdom there from past emails.

: Date: Thu, 22 Mar 2007 09:51:15 -0400
: From: Erick Erickson <[EMAIL PROTECTED]>
: Reply-To: java-user@lucene.apache.org
: To: java-user@lucene.apache.org
: Subject: Re: how ungrouped query handled?
:
: This is a pretty common issue that I've been grappling with by chance
: recently. The main point is that the parser is NOT a boolean logic
: parser.
:
: Search the mail archive for the thread "bad query parser bug" and
: you'll find a good discussion.
:
: I tried using PrecedenceQueryParser, but that didn't work very well for
: me; search the mail archive on that and you'll see some examples of why.
:
: I solved this problem for my immediate issues by writing a very
: quick-and-dirty parenthesizer for my raw query. If it weren't going
: on summer, I might see if I can contribute something by
: seeing if there's a way to fix PrecedenceQueryParser.
:
: Best
: Erick
:
: On 3/22/07, SK R <[EMAIL PROTECTED]> wrote:
: >
: > Hi,
: > Can anyone explain how lucene handles the query below?
: > My query is *field1:source AND (field2:name OR field3:dest)*. I've
: > given this string to the queryparser and then searched using the searcher.
: > It returns correct results. Its query.toString() print is: +field1:source
: > +(field2:name field3:dest)
: > But if I don't group my terms, i.e. my query is *field1:source AND
: > field2:name OR field3:dest*, then it gives the result of the first two
: > terms' search. It doesn't search the 3rd term. Its query.toString() print
: > is: +field1:source +field2:name field3:dest.
: > If I use the same boolean operator between all terms, then it returns
: > correct results.
: > Why doesn't it search the terms after the 2nd term if grouping is not used?
: >
: > Thanks & Regards
: > RSK

-Hoss
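[When the precedence of the parsed query matters, one option (a sketch against the Lucene 2.x API, not something proposed in the thread) is to build the grouped query programmatically and bypass QueryParser entirely:]

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

public class GroupedQuerySketch {
    public static void main(String[] args) {
        // build field1:source AND (field2:name OR field3:dest) directly,
        // sidestepping QueryParser's non-boolean precedence
        BooleanQuery inner = new BooleanQuery();
        inner.add(new TermQuery(new Term("field2", "name")), BooleanClause.Occur.SHOULD);
        inner.add(new TermQuery(new Term("field3", "dest")), BooleanClause.Occur.SHOULD);

        BooleanQuery outer = new BooleanQuery();
        outer.add(new TermQuery(new Term("field1", "source")), BooleanClause.Occur.MUST);
        outer.add(inner, BooleanClause.Occur.MUST);

        // prints: +field1:source +(field2:name field3:dest),
        // the same query the grouped string produces
        System.out.println(outer);
    }
}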
MergeFactor and MaxBufferedDocs value should ...?
Hi, I've looked at the uses of MergeFactor and MaxBufferedDocs. If I set MergeFactor = 100 and MaxBufferedDocs = 250, then the first 100 segments will be merged in the RAMDir when 100 docs have arrived. At the end of the 350th doc added to the writer, the RAMDir will have 2 merged segment files + 50 separate segment files not merged together, and these are flushed to the FSDir. If this is wrong, please correct me. My doubt is whether we should set MergeFactor & MaxBufferedDocs in a proportional ratio, i.e. MaxBufferedDocs = n*MergeFactor where n = 1, 2, ..., to reduce indexing time and get greater performance, or whether there is no need to worry about their relation.

Thanks & Regards
RSK
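[For reference, a minimal sketch of where these two knobs live, assuming Lucene 2.x and using the values from the question; the index path is hypothetical:]

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class WriterTuningSketch {
    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer(), true);
        writer.setMaxBufferedDocs(250); // buffered docs are flushed to a new segment every 250 adds
        writer.setMergeFactor(100);     // merge once 100 segments accumulate at the same level
        // ... writer.addDocument(doc) calls go here ...
        writer.close();
    }
}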
Re: Ignore Words Problem
What part of Grant's and Karl's answers the last time you asked this question wasn't clear? Have you tried it?

http://www.nabble.com/Re%3A-Common-Words-ignoring-problem-p9550886.html
http://www.nabble.com/Re%3A-Common-Words-ignoring-problem-p9567881.html

: I want to make sure whether this statement is right or not:
: "I am using StandardAnalyzer for indexing documents. By default it ignores some words when doing indexing. But when we search something, Lucene again includes the ignored words in the search."??? [...]

-Hoss