Re: indexing rss feeds in multiple languages

2007-03-22 Thread Antony Bowesman

Melanie Langlois wrote:

Well, thanks, that sounds like the best option to me. Does anybody use the
PerFieldAnalyzerWrapper? I'm just curious to know if there is any impact on
performance when using different analyzers.


I've not done any specific comparisons between using a single Analyzer and 
multiple Analyzers with PFAW, but our indexes typically have 20-25 fields, each of 
which can have a different analyzer depending on language or field type, 
although in practice about 8-10 fields may use the non-default analyzer.


Performance is pretty good in any case and there's not been any noticeable 
degradation when tweaking analyzers.
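For anyone who hasn't used it, the setup is just a wrapper plus per-field 
registrations; a minimal sketch (the field names and analyzer choices here are 
illustrative, not our actual config):

PerFieldAnalyzerWrapper analyzer =
    new PerFieldAnalyzerWrapper(new StandardAnalyzer());
analyzer.addAnalyzer("title_de", new GermanAnalyzer());     // contrib analyzer
analyzer.addAnalyzer("keywords", new WhitespaceAnalyzer());
IndexWriter writer = new IndexWriter(directory, analyzer, true);

The wrapper just dispatches on field name, so the per-field overhead is 
essentially one map lookup.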

Antony








Re: Querying fragments of a tree structure

2007-03-22 Thread Emanuel Schleussinger
Hi Erick,

excellent insight, thanks a lot. As you would expect, this method works a treat.

thanks a lot for your time!
Emanuel

- Original Message -
From: "Erick Erickson" <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Wednesday, March 21, 2007 2:12:49 PM (GMT+0100) Europe/Berlin
Subject: Re: Querying fragments of a tree structure

Is it a fair restatement of your problem that you want to generate
a list of all children of a node? That's what I'm reading.

Would it work for you to store the complete ancestry in each node?
By that I mean (from your example),

NOTE: it's no problem in Lucene to store different values for the
same field in the same document, i.e.

Document doc = new Document();
doc.add(new Field("field", "value1", Field.Store.NO, Field.Index.TOKENIZED));
doc.add(new Field("field", "value2", Field.Store.NO, Field.Index.TOKENIZED));
writer.addDocument(doc);

This is equivalent (if using WhitespaceAnalyzer in this example) to:

Document doc = new Document();
doc.add(new Field("field", "value1 value2", Field.Store.NO, Field.Index.TOKENIZED));
writer.addDocument(doc);

(There is a subtle difference between the two having to do
with PositionIncrementGap, but that's probably irrelevant for you
in this problem).

So what about just doing that for each node in your tree? Your
"ancestry" field for documents D and E would then hold "C" and "A".
This field is TOKENIZED, but not necessarily STORED.

Document C has only "A".

Now, finding the children of "A" reduces to something like
+ancestry:A
which you can add to your BooleanClauses if you want to also specify
other search criteria or just use by itself if you don't.
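Concretely, a sketch of both sides (assuming WhitespaceAnalyzer; "id" and
"ancestry" are just my field names, and writer/searcher are your usual
IndexWriter/IndexSearcher):

Document d = new Document();
d.add(new Field("id", "D", Field.Store.YES, Field.Index.UN_TOKENIZED));
// all of D's ancestors, space-separated so each becomes its own token
d.add(new Field("ancestry", "A C", Field.Store.NO, Field.Index.TOKENIZED));
writer.addDocument(d);

// later: every descendant of A, at any depth
Hits hits = searcher.search(new TermQuery(new Term("ancestry", "A")));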


What follows is my first idea, but I think the above is a better notion.

Node A stores nothing
Nodes  B and C  store "A"
Nodes D and E store "A$C"
etc.

Now, finding all the children of A reduces to doing a WildcardTermEnum
on "A*" and, for each resulting term using TermDocs.seek(term) to find
the corresponding document.

Note a couple of things:
1> index the ancestry field UN_TOKENIZED. You don't need to store it.
1a> You could use something like this to form a Lucene Filter if you needed
to, say, find all the nodes in the tree that were children of a specified
node
AND met certain search criteria.

2> You could also just search on A*, but be aware that you may have to
deal with TooManyClauses exceptions. The TermEnum/TermDocs method
avoids that problem, but may be overkill in your situation.
2a> Lucene 2.1 allows wildcards in the first position if you do a wildcard
search, but you need to turn that on by a call which I can't bring up from
memory.
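A rough sketch of the TermEnum/TermDocs approach in 2>, assuming the ancestry
field holds single UN_TOKENIZED terms like "A" and "A$C" as above (and for 2a,
I believe the call is QueryParser.setAllowLeadingWildcard(true), but please
verify against your version's javadocs):

WildcardTermEnum termEnum =
    new WildcardTermEnum(reader, new Term("ancestry", "A*"));
TermDocs termDocs = reader.termDocs();
try {
    while (termEnum.term() != null) {
        termDocs.seek(termEnum.term());
        while (termDocs.next()) {
            int childDoc = termDocs.doc();   // a descendant of A
        }
        if (!termEnum.next()) break;
    }
} finally {
    termDocs.close();
    termEnum.close();
}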


Hope this helps
Erick

On 3/21/07, Emanuel Schleussinger <[EMAIL PROTECTED]> wrote:
>
> Hi,
>
> first, thanks for this great resource, and sorry if I am oversimplifying
> a few things, I am still rather new to Lucene.
>
> I have been thinking about how to integrate my app with Lucene - it is a CMS
> type system that has documents organized in a tree-style layout. A few facts
> about the system:
> - Every node in the system is represented by a unique numeric id in a
> field called "id"
> - There is one defined root node, and an arbitrary number of descendants.
> - Each of the nodes on any level knows its descendants in a field called
> "child"
> - Each node also knows its parent node in a field called "parent"
>
> I am indexing all the fields from all the nodes in Lucene already, and
> thus, I can use Lucene to e.g. get all the descendant node IDs of a node
> simply by issuing a query like "id:2" and then extracting the
> multivalue field "child".
>
> Now, here is what I am trying to solve -- I would like to be able to
> fetch all the nodes that match certain criteria, if they are contained in
> some fragment of the tree. To visualize:
>
> Root
> +-> A
> |   +-> B
> |   +-> C
> |       +-> D
> |       +-> E
> +-> F
> +-> G
>
> I would like to issue a query that gives me all the nodes within "A": a
> flat list of results that contains B, C, D and E.
>
> Now, since per my definition D is not directly correlated with A (it knows
> its parent C, but not that it's also part of A -- only C knows that), I was
> thinking of introducing a new field for every node in my Lucene index that
> holds a list of IDs that trace back to the root element (in this case, the
> D node would have C and A in that field, in that order) - but it strikes me
> this may not be the most elegant approach...
>
> The above is only a simplified example; in reality I have a tree about 10
> levels deep, with thousands of nodes, and I frequently need to surface nodes
> within a certain fragment of that tree.
>
> Is there any best practice that you ran into on how to map this elegantly
> into Lucene?
>
> Thanks a ton for any pointers,
> Emanuel Schleussinger
>
>





how ungrouped query handled?

2007-03-22 Thread SK R

Hi,
Can anyone explain how lucene handles the belowed query?
   My query is *field1:source AND (field2:name OR field3:dest)* . I've
given this string to queryparser and then searched by using searcher. It
returns correct results. It's query.toString() print is :: +field1:source
+(field2:name field3:dest)
   But if i don't group my terms (i.e) my query : *field1:source AND
field2:name OR field3:dest *,then it gives the result of  first two term's
search result. It doesn't search 3rd term. It's query.toString() print is ::
+field1:source +field2:name field3:dest.
If i use same boolean operator between all terms, then it returns correct
results.
Why it doesn't search the terms after 2nd term if grouping not used?

Thanks & Regards
RSK


bzr branches for Apache Lucene/Nutch/Solr/Hadoop at Launchpad

2007-03-22 Thread rubdabadub

Hi:

First of all, apologies to those friends who follow all the lists.

Oftentimes I work offline, and I do not have any commit rights to any
of the projects. All the modifications I make for various clients, and
trying to keep up to date with the latest trunk, somehow make it difficult
for me to just stick with "subversion". I have heard many things about
distributed revision control systems, and I am sure there are tricks/fixes
for the subversion problem I mentioned above, but I also wanted to learn
something new :-) So after some trials with many DRCSs I have decided to
go for Bazaar! It's a really cool DRCS... you've got to try it.

http://bazaar-vcs.org/.

Now, due to the fact that SVN is an RCS and bzr is a DRCS, one needs to
convert SVN repos to bzr repos. And conveniently enough, there is a free VCS
mirroring service at Launchpad:

https://launchpad.net/

So now the following projects are available via bzr branch. You can
access them here.

Nutch - https://launchpad.net/nutch
Solr - https://launchpad.net/solr
Lucene - https://launchpad.net/lucene
Hadoop - https://launchpad.net/hadoop

It only mirrors "trunk". That's all I need to follow, and I don't see any
reason to mirror releases.

Regards




Re: bzr branches for Apache Lucene/Nutch/Solr/Hadoop at Launchpad

2007-03-22 Thread Grant Ingersoll
Is the point of this that you can make "commits" to Lucene so that  
you don't lose your changes on trunk?




--
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org

Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ







Re: bzr branches for Apache Lucene/Nutch/Solr/Hadoop at Launchpad

2007-03-22 Thread rubdabadub

On 3/22/07, Grant Ingersoll <[EMAIL PROTECTED]> wrote:

Is the point of this that you can make "commits" to Lucene so that
you don't lose your changes on trunk?


Not only that. But I can make as many local branches as I like... for example
customer X, customer Y. This way I can support X and Y as they have
separate features. All of the above can be done with SVN, but it's a pain,
at least for me.

And of course work offline... during summer... under trees :-) and then update
the whole branch from the main repo without losing any changes. It just seems easy.
I have also had a case where I needed to bake some part of Nutch and some part of
Solr under one tree, i.e. a new project, and still maintain that tree with
the original repo, and I could do that just fine. Bazaar commands are like SVN
commands, so there's not much to learn either :-)

Regards







Re: bzr branches for Apache Lucene/Nutch/Solr/Hadoop at Launchpad

2007-03-22 Thread Grant Ingersoll
Nice idea, and I can see the benefit of it to you, and I don't mean to
be a wet blanket on it, I just wonder about the legality of it.
People may find it and think it is the official Apache Lucene, since
it is branded that way.  I'm not a lawyer, so I don't know for sure.
I think you have the right to store and use the code, even create a
whole other search product based solely on Lucene (I think), I just
don't know about this kind of thing.  In some sense it is like
mirroring, but the fact that you can commit w/out going through the
Apache process makes me think that others coming upon the code will
be misled about what's in it.  The site _definitely_ makes it look
like Launchpad is the home for Lucene, with the intro and the bug
tracking, etc., even though we all know this site will rank further
down in the SERPs than the main site.


Perhaps I am misunderstanding?





--
Grant Ingersoll
http://www.grantingersoll.com/
http://lucene.grantingersoll.com
http://www.paperoftheweek.com/






Re: bzr branches for Apache Lucene/Nutch/Solr/Hadoop at Launchpad

2007-03-22 Thread rubdabadub

On 3/22/07, Grant Ingersoll <[EMAIL PROTECTED]> wrote:

Nice idea and I can see the benefit of it to you and I don't mean to
be a wet blanket on it, I just wonder about the legality of it.
People may find it and think it is the official Apache Lucene, since
it is branded that way.  I'm not a lawyer, so I don't know for sure.
I think you have the right to store and use the code, even create a
whole other search product based solely on Lucene (I think), I just
don't know about this kind of thing.  In some sense it is like
mirroring, but the fact that you can commit w/out going through the


NO NO!! I don't make any commits to the Apache trunk. Nor does anyone else,
for that matter. The repo at Launchpad is just a pure mirror and will
always be a mirror.

Just to clarify what I meant by commit: basically you "pull" the Lucene
branch from Launchpad to your local machine, and that becomes a
complete copy of the trunk, and then you make another local branch from
that branch. Example:

bzr branch http://bazaar.launchpad.net/~vcs-imports/lucene/trunk local.copy
bzr branch local.copy local.customerx

then you do all your work on local.customerx and make commits there, because
you want to keep local.copy exactly identical to the Launchpad version, which
in turn is a mirror like any other mirror that Apache has, that's all. If I
were to commit things to the Launchpad version I'd lose the whole point of
mirroring and getting changes from trunk.


Apache process makes me think that others coming upon the code will
be mislead about what's in it.  The site _definitely_ makes it look
like Launchpad is the home for Lucene with the intro and the bug
tracking, etc, even though we all know this site will rank further
down in the SERPs than the main site.


I am not a lawyer or branding expert. But if you want me to edit the
description text to something like "A mirrored copy of Apache Lucene.. original
site at...", no problem. Please provide me the exact text so I can edit it to
avoid confusion etc. The last thing I want to do is create confusion.

Moreover, if needs like mine exist, maybe Apache Infrastructure should
consider a DRCS rather than an RCS... SVN doesn't provide the flexibility
that I need. At Apache, CVS and SVN co-exist, and there are mirrors of them
all over the world, so... why not have a bzr branch? If Launchpad wants to
host it, great; if another mirror wants to host it, great.

I hope this clarifies the misunderstanding. Please do provide exact text so
we don't get into some lawyer trouble :-) I don't want to take a stab at the
text; it's better if you provide me exact instructions.

Regards.


Re: Spelt, for better spelling correction

2007-03-22 Thread Martin Haye

Otis,

I hadn't really thought about this, but it would be easy to build a
dictionary from an existing Lucene index. The main caveat is that it would
only work with "stored" fields. That's because this spellchecker boosts
accuracy using pair frequencies in addition to term frequencies, and Lucene
doesn't need or track pair frequencies to my knowledge. So any field which
you wanted to spellcheck would need to be indexed with Field.Store.YES.

Of course, a side effect is that they'd have to be analyzed again, with the
resulting time cost. Still, this could make sense for a lot of people.

I'll make sure the contribution includes an index-to-dictionary API, and
thank you very much for the input.

--Martin

On 3/21/07, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:


Martin,
This sounds like the spellchecker dictionary needs to be built in parallel
with the main Lucene index.  Is it possible to create a dictionary out of an
existing (and no longer modified) Lucene index?

Otis
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share

- Original Message 
From: Martin Haye <[EMAIL PROTECTED]>
To: Yonik Seeley <[EMAIL PROTECTED]>
Cc: java-user@lucene.apache.org
Sent: Wednesday, March 21, 2007 2:03:50 PM
Subject: Re: Spelt, for better spelling correction

The dictionary is generated from the corpus, with the result that a larger
corpus gives better results.

Words are queued up during an index run, and at the end are munged to
create
an optimized dictionary. It also supports incremental building, though the
overhead would be too much for those applications that are continuously
adding things to an index. Happily, it's not as important to keep the
spelling dictionary absolutely up to date, so it would be fine to queue
words over several index runs, and refresh the dictionary less often.

--Martin

On 3/20/07, Yonik Seeley <[EMAIL PROTECTED]> wrote:
>
> Sounds interesting Martin!
> Is the dictionary static, or is it generated from the corpus or from
> user queries?
>
> -Yonik
>
> On 3/20/07, Martin Haye <[EMAIL PROTECTED]> wrote:
> > As part of XTF, an open source publishing engine that uses Lucene, I
> > developed a new spelling correction engine specifically to provide "Did
> > you mean..." links for misspelled queries. I and a small group are
> > preparing this for submission as a contrib module to Lucene. And we're
> > inviting interested people to join the discussion about it.
> >
> > The new engine is being called "Spelt" and differs from the one
> > currently in Lucene contrib in the following ways:
> >
> > - More accurate: Much better performance on single-word queries (90%
> > correct in #1 slot in my tests). On a general list including multi-word
> > queries, gets 80%+ correct.
> > - Multi-word: Handles and corrects multi-word queries such as
> > "harrypotter" -> "harry potter".
> > - Fast: In my tests, builds the dictionary more than 30 times faster.
> > - Small: Dictionary size is roughly a third of that built by the
> > existing engine.
> > - Other bells and whistles...
> >
> > There is already a standalone test program that people can try out, and
> > we're interested in feedback. If you're interested in discussing,
> > testing, or previewing, consider joining the Google group:
> > http://groups.google.com/group/spelt/
> >
> > --Martin
> >
>








Re: bzr branches for Apache Lucene/Nutch/Solr/Hadoop at Launchpad

2007-03-22 Thread Grant Ingersoll


On Mar 22, 2007, at 8:16 AM, rubdabadub wrote:


On 3/22/07, Grant Ingersoll <[EMAIL PROTECTED]> wrote:

Nice idea and I can see the benefit of it to you and I don't mean to
be a wet blanket on it, I just wonder about the legality of it.
People may find it and think it is the official Apache Lucene, since
it is branded that way.  I'm not a lawyer, so I don't know for sure.
I think you have the right to store and use the code, even create a
whole other search product based solely on Lucene (I think), I just
don't know about this kind of thing.  In some sense it is like
mirroring, but the fact that you can commit w/out going through the


NO NO!! I don't make any commits to apache trunk. Nor any one else
for that matter. The repo at launchpad is just a pure mirror and will
always be a mirror.

Just to clarify what I meant by commit: basically you "pull" the Lucene
branch from Launchpad to your local machine, and that becomes a
complete copy of the trunk, and then you make another local branch from
that branch. Example:

bzr branch http://bazaar.launchpad.net/~vcs-imports/lucene/trunk local.copy
bzr branch local.copy local.customerx

then you do all your work on local.customerx and make commits there,
because you want to keep local.copy exactly identical to the Launchpad
version, which in turn is a mirror like any other mirror that Apache
has, that's all. If I were to commit things to the Launchpad version
I'd lose the whole point of mirroring and getting changes from trunk.



Gotcha.  I guess I just rely on IntelliJ's built-in versioning to
provide similar capabilities, plus maybe checking out multiple
copies of the source.  Also, I try to avoid making changes in open
source libraries unless absolutely necessary.



Apache process makes me think that others coming upon the code will
be misled about what's in it.  The site _definitely_ makes it look
like Launchpad is the home for Lucene, with the intro and the bug
tracking, etc., even though we all know this site will rank further
down in the SERPs than the main site.


I am not a lawyer or branding expert. But if you want me to edit the
description text to something like "A mirrored copy of Apache Lucene..
original site at...", no problem. Please provide me the exact text so
I can edit it to avoid confusion etc. The last thing I want to do is
create confusion.

Moreover, if needs like mine exist, maybe Apache Infrastructure should
consider a DRCS rather than an RCS... SVN doesn't provide the
flexibility that I need. At Apache, CVS and SVN co-exist, and there
are mirrors of them all over the world, so... why not have a bzr
branch? If Launchpad wants to host it, great; if another mirror wants
to host it, great.

I hope this clarifies the misunderstanding. Please do provide exact
text so we don't get into some lawyer trouble :-) I don't want to take
a stab at the text; it's better if you provide me exact instructions.



I'll wait for some of the others who are closer to the Foundation to
contribute (maybe one of the PMC members).  Like I said, I don't know
if it is an issue at all.  I just don't want people to be confused
about it.  I think you could propose a DRCS to infrastructure and
make a case for it.  Personally, I'm fine with SVN, but then again I
used to think I was fine with CVS and I don't think I would want to
go back to that!


I am curious: how many custom changes are you making to the code that
this is even an issue?  Perhaps submitting patches and working to get
them committed would be a more efficient strategy.


-Grant








Re: Combining score from two or more hits

2007-03-22 Thread Erick Erickson

Don't know if it's useful or not, but if you used  TopDocs instead,
you have access to an array of ScoreDoc which you could modify
freely. In my app, I used a FieldSortedHitQueue to re-sort things
when I needed to.
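Something like this sketch (the score adjustment is arbitrary, just to show
the mechanics):

TopDocs topDocs = searcher.search(query, null, 100);
ScoreDoc[] docs = topDocs.scoreDocs;
docs[0].score += docs[1].score;   // e.g. fold a duplicate hit's score in
Arrays.sort(docs, new Comparator() {
    public int compare(Object a, Object b) {
        float diff = ((ScoreDoc) b).score - ((ScoreDoc) a).score;
        return diff < 0 ? -1 : (diff > 0 ? 1 : 0);
    }
});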

Erick

On 3/22/07, Antony Bowesman <[EMAIL PROTECTED]> wrote:


I have indexed objects that contain one or more attachments.  Each
attachment is
indexed as a separate Document along with the object metadata.

When I make a search, I may get hits in more than one Document that refer
to the
same object.  I have a HitCollector which knows if the object has already
been
found, so I want to be able to update the score of an existing hit in a
way that
makes sense.  e.g. If hit H1 has score 1.35 and hit H2 has score 2.9 is is
possible to re-score it on the basis that the real hit result is (H1 AND
H2).

I can take the highest score of any Document, but just wondered if this is
possible during the HitCollector.collect method?

Antony









Re: bzr branches for Apache Lucene/Nutch/Solr/Hadoop at Launchpad

2007-03-22 Thread rubdabadub

Good to hear :-)


I am curious, how many custom changes are you making to the code that
this is even an issue?  Perhaps submitting patches and working to get
them committed would be a more efficient strategy.


Well, there are 3 problems I see.

1. There are very good patches in the Lucene Jira, but for one reason or
another these issues never get applied to trunk. For me it's not a question
of why; it's more a question of how I can use them and learn from them. So
having my own local branch to do "whatever" is really great. I build, I apply
a patch, play around... tear it down without thinking about anything else.
Yes, you could do this with various copies of the source, but often these
patches only apply against a certain rev, etc. It's much easier to play when
you are in control of the local trunk.

2. I also have customer modifications to maintain, i.e. support, and some of
the fixes only work with a certain rev of trunk, and oftentimes I make a
mistake and do svn up... it happens, and that does create some extra
keystrokes :-)

3. You are correct about the committing strategy, but most of my changes are
customer-specific, and customers have specific rules, so it never gets back
to you guys. Under customer rules I can't decide on the modifications I make.

Regards




Re: how ungrouped query handled?

2007-03-22 Thread Erick Erickson

This is a pretty common issue that I've been grappling with by chance
recently. The main point is that the parser is NOT a boolean logic
parser.

Search the mail archive for the thread "bad query parser bug" and
you'll find a good discussion.

I tried using PrecedenceQueryParser, but that didn't work for
me very well, search the mail archive on that and you'll see some
examples of why.

I solved this problem for my immediate issues by writing a very
quick-and-dirty parenthesizer for my raw queries. If summer weren't
coming on, I might see if I can contribute something by looking at
whether there's a way to fix PrecedenceQueryParser.
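In the meantime, if it helps, you can sidestep the parser for a query like
yours and build the grouped form in code; this sketch reproduces
+field1:source +(field2:name field3:dest):

BooleanQuery inner = new BooleanQuery();
inner.add(new TermQuery(new Term("field2", "name")), BooleanClause.Occur.SHOULD);
inner.add(new TermQuery(new Term("field3", "dest")), BooleanClause.Occur.SHOULD);

BooleanQuery query = new BooleanQuery();
query.add(new TermQuery(new Term("field1", "source")), BooleanClause.Occur.MUST);
query.add(inner, BooleanClause.Occur.MUST);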

Best
Erick




Speeding up looping over Hits

2007-03-22 Thread Andreas Guther
Hi,

While looking into performance enhancement for our search feature I
noticed a significant difference in Documents access time while looping
over Hits.

I wrote a test application that searches for a list of search terms and then,
for each returned Hits object, loops twice over every single hits.doc(i).

for (int i = 0; i < numberOfDocs; i++) {doc = hits.doc(i);}

I am seeing differences like the following

Found 16,215 hits for 'Water or Wine' in 219 ms
Processed 16,215 docs in 53,141 ms; per single doc 3.2773 ms
Processed 16,215 docs in 2,032 ms; per single doc 0.1253 ms

Interestingly if I run the same test application a second time in my IDE
the difference between the first and the second loop is very low.

I have no explanation why I see this difference but it becomes a huge
problem for us due to the fact that I need to extract from each document
a small set of information pieces and the first time looping just takes
too much time.

I could not find any indication of external caching of Hits.  I am
running my tests within Eclipse with memory settings of -Xms766M
-Xmx1024M.

What is the explanation in the different access speed for the same
search results?

Is there a way to speed up looping over the Hits data structure?

Andreas






Re: Speeding up looping over Hits

2007-03-22 Thread Erick Erickson

Your timing differences are probably because of caching. But it has
been mentioned many times in the archive that a Hits object
is intended to allow fast, simple retrieval of the first few documents
in a result set (100 if memory serves). Every 100 or so calls to
next() cause the search to be re-issued.

See HitCollector, TopDocs, etc...
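For example, a minimal TopDocs sketch (the 1000 cap is arbitrary):

TopDocs topDocs = searcher.search(query, null, 1000);
for (int i = 0; i < topDocs.scoreDocs.length; i++) {
    Document doc = searcher.doc(topDocs.scoreDocs[i].doc);
    // pull out just the fields you need here
}

That runs the search once, instead of re-issuing it as you walk through Hits.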

Erick





Re: bzr branches for Apache Lucene/Nutch/Solr/Hadoop at Launchpad

2007-03-22 Thread Andrzej Bialecki

rubdabadub wrote:

On 3/22/07, Grant Ingersoll <[EMAIL PROTECTED]> wrote:

Nice idea and I can see the benefit of it to you and I don't mean to
be a wet blanket on it, I just wonder about the legality of it.


So long as it meets the Apache license conditions regarding
distribution, it's not forbidden. It could be confusing or superfluous,
but it couldn't be illegal.




People may find it and think it is the official Apache Lucene, since
it is branded that way.  I'm not a lawyer, so I don't know for sure.
I think you have the right to store and use the code, even create a
whole other search product based solely on Lucene (I think), I just
don't know about this kind of thing.  In some sense it is like
mirroring, but that fact that you can commit w/ out going through the


NO NO!! I don't make any commits to apache trunk. Nor any one else
for that matter. The repo at launchpad is just a pure mirror and will
always be a mirror.



Actually, I often find myself in a similar situation to "rubdabadub". 
I'm working on several commercial projects that use and modify 
Lucene/Nutch, and often such modifications are proprietary (about 
equally often they are not, and are submitted as patches).


Over time, the issue of tracking the vendor source tree and merging from 
that tree (per svnbook) to several different private svn repos becomes a 
tricky and time-consuming business ... I'd welcome any improvements here.


It seems I need to find some time to get more familiar with bzr ...

--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com





How can I index Phrases in Lucene?

2007-03-22 Thread Maryam
Hi, 

I know how to index terms in Lucene; now I want to see
how I can index phrases like "information retrieval"
in Lucene and calculate the number of times that
phrase has appeared in the document. Is there any way
to do it in Lucene?

Thanks


 




Re: How can I index Phrases in Lucene?

2007-03-22 Thread Erick Erickson

Well, you don't index phrases; that's done for you. You should try
something like the following:

Create a SpanNearQuery with your terms. Specify an appropriate
slop (probably 0, assuming you want them all next to each other).

Now call getSpans() and count the matches... You may have to do
something with overlapping spans, but you'll need to experiment
a bit to understand it.
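Something along these lines, assuming the text went into a "body" field:

SpanQuery[] clauses = new SpanQuery[] {
    new SpanTermQuery(new Term("body", "information")),
    new SpanTermQuery(new Term("body", "retrieval"))
};
SpanNearQuery phrase = new SpanNearQuery(clauses, 0, true);  // slop 0, in order
Spans spans = phrase.getSpans(reader);
while (spans.next()) {
    // one occurrence; spans.doc(), spans.start(), spans.end() locate it
}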

Erick





Re: how ungrouped query handled?

2007-03-22 Thread Chris Hostetter

see also the FAQ "Why am I getting no hits / incorrect hits?" which points
to...

http://wiki.apache.org/lucene-java/BooleanQuerySyntax

...I've just added some more words of wisdom there from past emails.





-Hoss





Re: Extracting formatted text from PDF files

2007-03-22 Thread Soeren Pekrul

Mike O'Leary wrote:

Please forgive the laziness inherent in this question, as I haven't looked
through the PDFBox code yet. I am wondering if that code supports extracting
text from PDF files while preserving such things as sequences of whitespace
between characters and other layout and formatting information. I am working
with a project that extracts and operates on certain table-like blocks of
text from PDF files, and a lot of freeware and shareware PDF to text
converters seem to either ignore formatting or try to preserve formatting
and not get it quite right. I am wondering if PDFBox provides better support
for this kind of thing. Thanks.


That is not so simple. Usually this information is not present inside a 
PDF file. PDF is an output file format; it contains just instructions like 
"print the character 'a' at position x and y". In many cases a PDF file 
doesn't even know about words or white space. We read words due to the 
position of characters, we see paragraphs due to the position of 
characters, and we see tables due to the position of characters. The 
file doesn't contain this information.
I found this code in a PDF file for the German word "Wuchsform" (growth 
form) and the colon ":":


/F1 1 Tf
-3.8801 -1.274 TD
[ (W) 29.60001 (uchsform:) ] TJ

First line: select a font.
Second line: move the cursor to position -3.8801, -1.274.
Third line: print the character "W", move the cursor 29.60001 units to the 
right, and print the characters "uchsform:".


Extracting the words from a PDF file for indexing means you first have to 
build words from the character positions. Recognizing paragraphs, column 
text, tables, captions, lists, footnotes etc. is much more difficult.
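For completeness, plain-text (layout-free) extraction with the org.pdfbox
classes is roughly:

PDDocument pdf = PDDocument.load("doc.pdf");
try {
    String text = new PDFTextStripper().getText(pdf);
} finally {
    pdf.close();
}

PDFTextStripper reconstructs words and line breaks from the glyph positions,
but it makes no attempt to rebuild tables or columns.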


Sören




Re: Combining score from two or more hits

2007-03-22 Thread Antony Bowesman

Erick Erickson wrote:

Don't know if it's useful or not, but if you used  TopDocs instead,
you have access to an array of ScoreDoc which you could modify
freely. In my app, I used a FieldSortedHitQueue to re-sort things
when I needed to.


Thanks Erick, I've been using TopDocs, but am playing with my own HitCollector 
variant of TopDocHitCollector.  The problem is not adjusting the score, it's 
what to adjust it by, i.e. is it possible to re-evaluate the scores of H1 and H2 
knowing that the original query resulted in hits on H1 AND H2.


Antony






Re: Speeding up looping over Hits

2007-03-22 Thread Santa Clause
Another thing you may want to look at is the newer version 2.1.0 and 
getFieldable. I think that will lazy-load the data, so that you are only 
reading the parts of the document that you need at that moment rather than 
the whole thing. Someone please correct me if I am wrong or point to what I 
really mean :)

I had a similar situation a long while back and I was able to find a patch 
for the version of Lucene I was using that allowed the above. It made a huge 
difference. I think something similar is now built into 2.1.0.
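A sketch of the 2.1-style selective loading (class names from memory, so
please check them against the 2.1.0 javadocs):

Set load = new HashSet();
load.add("id");                     // fields to load eagerly
Set lazy = new HashSet();
lazy.add("body");                   // fields to load only when touched
FieldSelector selector = new SetBasedFieldSelector(load, lazy);
Document doc = reader.document(docId, selector);
Fieldable body = doc.getFieldable("body");  // value is read on demand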
  


Software Product Development Job Opportunity (Baltimore, MD)

2007-03-22 Thread Lesko, Matt
Official job description & info to submit a resume:
http://www.systemsalliance.com/careers/internal-jobs/baltimore/Software_Engineer_MD.html

 

Located 15 minutes North of Baltimore in Sparks, MD

 

Position is on a team, working with myself and others, maintaining and
developing an existing content management system.

 

Quiet working environment in shared office with a nice view. 

 

Management that chooses to do the right thing more often than the
expedient. 

 

Full stack on your own machine (IIS/Apache, Coldfusion [JRun], SQL
Server/Oracle) for local development.

 

Trac for defect tracking & source control. 

 

Java work includes Lucene, XOM and JavaCC. 

 

Feel free to contact me with questions email: mlesko at systemsalliance
dot com. 

 







Re: Speeding up looping over Hits

2007-03-22 Thread Erick Erickson

Oh yeah. By only loading the relevant fields, my query times
dropped by over 90%. I actually wrote that up on the mailing list if
you want to try to find it, but it took Andreas' message to
remind me...

Erick



Re: Extracting formatted text from PDF files

2007-03-22 Thread Daniel Noll

Mike O'Leary wrote:

Please forgive the laziness inherent in this question, as I haven't looked
through the PDFBox code yet. I am wondering if that code supports extracting
text from PDF files while preserving such things as sequences of whitespace
between characters and other layout and formatting information. I am working
with a project that extracts and operates on certain table-like blocks of
text from PDF files, and a lot of freeware and shareware PDF to text
converters seem to either ignore formatting or try to preserve formatting
and not get it quite right.


Even pdftohtml?  The sample outputs I've seen from that application 
don't look too bad to me.


Daniel


--
Daniel Noll

Nuix Pty Ltd
Suite 79, 89 Jones St, Ultimo NSW 2007, Australia    Ph: +61 2 9280 0699
Web: http://nuix.com/   Fax: +61 2 9212 6902




problem in reading an index

2007-03-22 Thread Maryam
Hi, 

I have written this piece of code to read the index,
mainly to see what terms are in each document and what
the frequency of each term in the document is. This
piece of code correctly calculates the number of docs
in the index, but I don't know why the variable
myTermFreq[] is null. Would you please let me know
your idea about it?

IndexReader reader = IndexReader.open(myInd);
for (int docNo = 0; docNo < reader.numDocs(); docNo++) {
    TermFreqVector myTermFreq[] = reader.getTermFreqVectors(docNo);
    if (myTermFreq != null) {
        for (int i = 0; i < myTermFreq.length; i++) {
            int freq[] = myTermFreq[i].getTermFrequencies();
            // String terms[] = myTermFreq[i].getTerms();
            for (int j = 0; j < freq.length; j++) { ... }
        }
    }
}




Re: problem in reading an index

2007-03-22 Thread Daniel Noll

Maryam wrote:
Hi, 


I have written this piece of code to read the index,
mainly to see what terms are in each document and what
the frequency of each term in the document is. This
piece of code correctly calculates the number of docs
in the index, but I don’t know why variable
myTermFreq[] is null. Would you please let me know
your idea bout it?


From the getTermFreqVectors javadoc:
   Return an array of term frequency vectors for the specified document.
   The array contains a vector for each vectorized field in the
   document.  Each vector contains terms and frequencies for all terms
   in a given vectorized field.  If no such fields existed, the method
   returns null.

i.e. you may not have stored the term vectors when indexing the data.
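i.e. at index time you'd need something along the lines of:

doc.add(new Field("body", text, Field.Store.YES,
                  Field.Index.TOKENIZED, Field.TermVector.YES));

(the field name and Store/Index choices are just an example; the
Field.TermVector.YES part is what makes getTermFreqVectors() return data)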

Daniel



--
Daniel Noll

Nuix Pty Ltd
Suite 79, 89 Jones St, Ultimo NSW 2007, Australia    Ph: +61 2 9280 0699
Web: http://nuix.com/   Fax: +61 2 9212 6902




Reverse search

2007-03-22 Thread Melanie Langlois
Hello,

 

I want to manage user subscriptions to specific documents. So I would like to
store the subscription (query) in the Lucene directory, and whenever I receive
a new document, I will search all the matching subscriptions to send the
document to all subscribers. For instance, if a user subscribes to all
documents with text containing (WORD1 AND WORD2) OR WORD3, how can I match the
incoming document based on stored subscriptions? I was thinking of having two
subfields for each field of the subscription: the AND conditions and the OR
conditions.

- OR: I will tokenize the document field content and insert OR between each of
the tokens, and run the query against the OR condition of the subscription.

- It's the AND case where I will have an issue, because the incoming text may
contain more words than the sequence I want to search for.

For instance, if I subscribe to documents containing "lucene" and "java", and
the incoming document's content is "lucene is a great API which has been
developed in java", then once I remove stopwords my query would look like
"lucene AND great AND API AND developed AND java".

As the query is composed of more words than the stored subscription, I will
fail to retrieve the subscription. But if I use only OR terms, the results
will not be accurate, as I could match a subscription meant for "java" alone,
for instance.

Do you know how I can handle this situation? I'm not sure I can actually do
this using Lucene...

 

Thank you,

 

Mélanie 
 

 



Re: Reverse search

2007-03-22 Thread karl wettin


On 23 Mar 2007, at 02:12, Melanie Langlois wrote:

I want to manage user subscriptions to specific documents. So I
would like to store the subscription (query) in the Lucene
directory, and whenever I receive a new document, I will search all
the matching subscriptions to send the document to all subscribers.
For instance, if a user subscribes to all documents with text
containing (WORD1 AND WORD2) OR WORD3, how can I match the incoming
document based on stored subscriptions? I was thinking of having two
subfields for each field of the subscription: the AND conditions
and the OR conditions.

- OR: I will tokenize the document field content and insert OR
between each of the tokens, and run the query against the OR
condition of the subscription.

- It's the AND case where I will have an issue, because the
incoming text may contain more words than the sequence I want to
search for.

For instance, if I subscribe to documents containing "lucene" and
"java", and the incoming document's content is "lucene is a great
API which has been developed in java", then once I remove stopwords
my query would look like "lucene AND great AND API AND developed
AND java".

As the query is composed of more words than the stored subscription,
I will fail to retrieve the subscription. But if I use only OR
terms, the results will not be accurate, as I could match a
subscription meant for "java" alone, for instance.




I wrote such a thing way back, where I used the new document as the  
query and the user subscriptions as the index. Similar to what you  
describe, I had an AND, an OR and a NOT field. This really limited the  
type of queries users could store. It does work, however, particularly  
well on systems with /huge/ amounts of subscriptions (many millions).


Today I would have used something else. If you insert one document at  
a time into your index, take a look at MemoryIndex in contrib. If you  
insert documents in batches larger than one document at a time, take a  
look at LUCENE-550 in Jira. Add new documents to such an index and run  
the subscribed queries against it. Depending on the queries, the speed  
should be some 20-100 times faster than using a RAMDirectory. One  
million queries should take some 20 seconds to assemble and run against  
a 25-document index on my laptop. See <http://issues.apache.org/jira/secure/attachment/12353601/12353601_HitCollectionBench.jpg> for the performance of LUCENE-550.
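
A minimal sketch of this MemoryIndex approach, assuming the Lucene 2.x contrib API; the "text" field name and the Subscription type are illustrative, not from this thread:

import java.util.Iterator;
import java.util.List;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.memory.MemoryIndex;

public class ReverseSearch {
  /** Runs every stored subscription query against one incoming document. */
  public static void dispatch(String text, List subscriptions) {
    // MemoryIndex models exactly one document, so build a fresh one per document
    MemoryIndex index = new MemoryIndex();
    index.addField("text", text, new StandardAnalyzer());
    for (Iterator it = subscriptions.iterator(); it.hasNext();) {
      // Subscription is a hypothetical (query, subscriber) holder
      Subscription sub = (Subscription) it.next();
      if (index.search(sub.getQuery()) > 0.0f) {  // score > 0 means the doc matched
        sub.send(text);
      }
    }
  }
}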


--
karl

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: problem in reading an index

2007-03-22 Thread karl wettin


On 23 Mar 2007, at 02:09, Daniel Noll wrote:


Maryam wrote:

Hi, I have written this piece of code to read the index,
mainly to see what terms are in each document and what
the frequency of each term in the document is. This
piece of code correctly calculates the number of docs
in the index, but I don’t know why variable
myTermFreq[] is null. Would you please let me know
your idea bout it?


From TFJD:
   Return an array of term frequency vectors for the specified  
document.

   The array contains a vector for each vectorized field in the
   document.  Each vector contains terms and frequencies for all terms
   in a given vectorized field.  If no such fields existed, the method
   returns null.

i.e. you may not have stored the term vectors when indexing the data.


This thread might be of interest:

http://www.nabble.com/Resolving-term-vector-even-when-not-stored--tf3412160.html#a9507268
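
A minimal sketch of both ends of this, assuming the Lucene 2.x API (the "contents" field name is illustrative): term vectors must be requested per field at indexing time, and only then does getTermFreqVector() return non-null:

import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermFreqVector;

public class TermVectorExample {
  // Indexing side: ask for term vectors on the field.
  public static Document makeDoc(String text) {
    Document doc = new Document();
    doc.add(new Field("contents", text, Field.Store.NO,
                      Field.Index.TOKENIZED, Field.TermVector.YES));
    return doc;
  }

  // Reading side: null here means the vectors were never stored.
  public static void dumpTerms(IndexReader reader, int docId) throws IOException {
    TermFreqVector tfv = reader.getTermFreqVector(docId, "contents");
    if (tfv == null) return;
    String[] terms = tfv.getTerms();
    int[] freqs = tfv.getTermFrequencies();
    for (int i = 0; i < terms.length; i++) {
      System.out.println(terms[i] + ": " + freqs[i]);
    }
  }
}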


--
karl
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Reverse search

2007-03-22 Thread Melanie Langlois
Thanks Karl, the performance graph is really amazing!
I have to say I would not have thought that this way around would be faster, 
but it sounds nice if I can use it; it makes everything easier to manage. I'm 
just wondering what you considered when you built your graph: only the time to 
run the queries? Because I should add the time for creating the index each time 
a new document comes in (or a subset of documents if several come in at the 
same time), plus the indexing of these documents. The documents should not be 
big, around 2KB. Did you measure this part?

Mélanie 
  
-Original Message-
From: karl wettin [mailto:[EMAIL PROTECTED] 
Sent: Friday, March 23, 2007 10:35 AM
To: java-user@lucene.apache.org
Subject: Re: Reverse search


On 23 Mar 2007, at 02:12, Melanie Langlois wrote:

> I want to manage user subscriptions to specific documents. So I  
> would like to store the subscription (query) into the lucene  
> directory, and whenever I receive a new document, I will search all  
> the matching subscriptions to send the documents to all subcribers.  
> For instance if a user subscribes to all documents with text  
> containing (WORD1 and WORD2) or WORD3, how can I match the incoming  
> document based on stored subscriptions? I was thinking to have two  
> subfields for each field of the subscription: the AND conditions  
> and the OR conditions.
>
> -OR. I will tokenized the document field content and insert OR  
> between each of them, and run the query against OR condition of  
> subscription
>
> -It's for the AND that I will have an issue, because if the  
> incoming text may contains more words than the sequence I want to  
> search.
>
> For instance, if I subscribe for documents contents lucene and java  
> for instance , if the incoming document contents is lucene is a  
> great API which has been developed in java, once I removed  
> stopwords my query would look like lucene and great and API and  
> developed and java.
>
> As query is composed of more words than the stored subscription I  
> will fail to retrieve the subscription. But if I put only or words,  
> the results will not be accurate, as I can obtain subscription only  
> for java for instance.
>

I wrote such a thing way back, where I used the new document as the  
query and the user subscriptions as the index. Similar to what you  
describe, I had an AND, OR and NOT field. This really limited the  
type of queries users could store. It does however work, particullary  
well on systems with /huge/ amounts of subscriptions (many millions).

Today I would have used something else. If you insert one document at  
the time to your index, take a look at MemoryIndex in contrib. If you  
insert documents in batches larger than one document at the time,  
take a look at LUCENE-550 in the Jira. Add new documents to such an  
index and place the subscribed queries on it. Depening on the  
queries, the speed should be some 20-100 times faster than using a  
RAMDirectory. One million queries should take some 20 seconds to  
assemble and place on a 25 document index on my laptop. See <http://issues.apache.org/jira/secure/attachment/12353601/12353601_HitCollectionBench.jpg> for performance of LUCENE-550.

-- 
karl

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Questions about Indexing

2007-03-22 Thread Maryam
Hi, 

I have three questions about indexing:

1) I am indexing HTML documents. How can I do stop-word removal before 
indexing? I don't want to index stop words.

2) I can get access to the terms in one document, but how can I get access to 
the name of the document in which these terms appear?

3) I want to find phrases at the index level, e.g. find the frequency of 
phrases in the collection, and also their frequency in each document. How can 
I do this in Lucene? Is there any sample code?

Thanks



 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: contrib/benchmark questions

2007-03-22 Thread Grant Ingersoll

OK, Doron (and other benchmarkers!), on to search:

Here's my alg file:

#Indexing declaration up here

OpenReader
{ "SrchSameRdr" Search > : 5000

{ "SrchTrvSameRdr" SearchTrav > : 5000
{ "SrchTrvSameRdrTopTen" SearchTrav(10) > : 5000
{ "SrchTrvRetLoadAllSameRdr" SearchTravRet > : 5000

#Skip bytes and body
{ "SrchTrvRetLoadSomeSameRdr" SearchTravRetLoadFieldSelector 
(docid,docname,docdate,doctitle) > : 5000

CloseReader


Never mind the last task, I will be submitting a patch shortly that  
will make sense out of it.  Essentially, it specifies what fields to  
load for the document


Here are the results:

Operation                           round merge max.buffered runCnt recsPerRun    rec/s elapsedSec avgUsedMem avgTotalMem
 [java] OpenReader                      0    10           10      1          1    125.0       0.01  5,385,600   9,965,568
 [java] SrchSameRdr_5000                0    10           10      1       5000  1,184.3       4.22  5,805,120   9,965,568
 [java] SrchTrvSameRdr_5000             0    10           10      1     427500 71,776.4       5.96  5,806,144   9,965,568
 [java] SrchTrvSameRdrTopTen_5000       0    10           10      1     427500 62,001.4       6.89  5,766,584   9,965,568
 [java] SrchTrvRetLoadAllSameRdr_5000   0    10           10      1     850000  7,226.4     117.62  6,161,728   9,965,568
 [java] SrchTrvRetLoadSomeSameRdr_5000  0    10           10      1     850000 10,334.0      82.25  6,162,752   9,965,568
 [java] CloseReader                     0    10           10      1          1  1,000.0       0.00  5,921,856   9,965,568


The column I'm a bit confused by is recsPerRun.
For the tasks that are doing the traversal and the retrieval, why so  
many recsPerRun?  Is it counting the hits, the traversals and the  
retrievals each as one record?


What I am trying to do is compare:
Search
Search plus traversal of all hits
Search plus traversal of top ten
Search plus traversal and retrieval of all documents and all fields  
on the document
Search plus traversal and retrieval of all documents and some fields  
on the document


I think I see in the ReadTask that it is the res var that is being  
incremented and would have to be altered.  I guess I can go by  
elapsed time, but even that seems slightly askew.  I think this is  
due to the withRetrieve() function overhead inside the for loop.  I  
have moved it out and will submit that change, too.


Am I interpreting this correctly?

-Grant

On Mar 19, 2007, at 5:11 PM, Doron Cohen wrote:


Grant Ingersoll <[EMAIL PROTECTED]> wrote on 19/03/2007 13:10:16:


So, if I am understanding correctly:


"SearchSameRdr" Search > : 5000


means don't collect indiv. stats for SearchSameRdr, but do whatever
that task does 5000 times, right?


Almost...

It should be btw
   { "SearchSameRdr" Search > : 5000
and it means: run Search 5000 times, sequentially, assign the name
"SearchSameRdr" to that sequence of 5000, and do not collect
individual stats for the individual tasks making up that sequence.

If it was just
  { Search > : 5000
it would still mean the same, just that a name was assigned to this  
for

you, something like: "Seq_Search_5000".

If it was:
   { "SearchSameRdr" Search } : 5000
it would be the same as your example, just that stas would be  
collected not
only for the entire elapsed sequence, but also breaking it down for  
each of

the 5000 calls to "Search".

Similar logic with
  [ .. ]
and
  [ .. >
just that the tasks making the (parallel) sequence are executed in
parallel, each in a separate thread.





3. Is there anyway to dump out the stats as a CSV file or  
something?
Would I implement a Task for this?  Ultimately, I want to be  
able to

create a graph in Excel that shows tradeoffs between speed and
memory.


Yes, implementing a report task would be the way.
... but when I look at how I implemented these reports, all the
work is
done in the class Points. Seems it should be modified a little with
more
thought of making it easiert to extend reports.


I may take a crack at it, but deadline for the talk is looming


I'll take a look too, let you know if I have anything.


- Being interested in memory stats - the fact that all the rounds run in a
single program, in the same JVM run, usually means that what you see is
very much dependent on the GC behavior of the specific VM you are using.
If it does not release memory to the OS (most likely), you would not be
able to notice that round i+1 used less memory than round i. For something
like this it would probably be better to put the "round" logic in an ant
script, invoking each round in a separate new exec. But then things get
more complicated for having a final stats report containing all rounds.
What do you think about this?



Re: How can I index Phrases in Lucene?

2007-03-22 Thread Ryan McKinley

Is there any way to find frequent phrases without knowing what you are
looking for?

I could index "A B C D E" as "A B C", "B C D", "C D E" etc, but that
seems kind of clunky particularly if the phrase length is large.  Is
there any position offset magic that will surface frequent phrases
automatically?

thanks
ryan
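
One way to make the clunky version less clunky is to push it into a TokenFilter so indexing does the bookkeeping; a rough sketch for bigrams only, assuming the Lucene 2.x Token API (the class name is made up here):

import java.io.IOException;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public final class BigramFilter extends TokenFilter {
  private Token previous;

  public BigramFilter(TokenStream input) {
    super(input);
  }

  // Emits one token per adjacent pair: "A B C" becomes "A B", "B C".
  public Token next() throws IOException {
    for (Token t = input.next(); t != null; t = input.next()) {
      if (previous != null) {
        Token bigram = new Token(previous.termText() + " " + t.termText(),
                                 previous.startOffset(), t.endOffset());
        previous = t;
        return bigram;
      }
      previous = t;
    }
    return null;
  }
}

Phrase frequencies then fall out of the ordinary TermEnum/TermDocs machinery, at the cost of a bigger index.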


On 3/22/07, Erick Erickson <[EMAIL PROTECTED]> wrote:

Well, you don't index phrases, it's done for you. You should try
something like the following

Create a SpanNearQuery with your terms. Specify an appropriate
slop (probably 0 assuming you want them all next to each other).

Now call getSpans and count ... You may have to do
something with overlapping spans, but you'll need to experiment
a bit to understand it.

Erick
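
A rough sketch of the above, assuming the Lucene 2.x span API (field and class names are illustrative):

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;
import org.apache.lucene.search.spans.Spans;

public class PhraseCounter {
  public static int countPhrase(IndexReader reader, String field,
                                String word1, String word2) throws IOException {
    SpanQuery[] clauses = new SpanQuery[] {
        new SpanTermQuery(new Term(field, word1)),
        new SpanTermQuery(new Term(field, word2)) };
    // slop 0, inOrder true: the terms must be adjacent and in order
    SpanNearQuery phrase = new SpanNearQuery(clauses, 0, true);
    Spans spans = phrase.getSpans(reader);
    int count = 0;
    while (spans.next()) {  // one span per match; spans.doc() identifies the document
      count++;
    }
    return count;
  }
}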

On 3/22/07, Maryam <[EMAIL PROTECTED]> wrote:
>
> Hi,
>
> I know how to index terms in lucene, now I wanna see
> how can I index phrases like "information retreival"
> in lucene and calculate the number of times that
> phrase has appeared in the document. Is there any way
> to do it in Lucene?
>
> Thanks
>
>
>
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: contrib/benchmark questions

2007-03-22 Thread Grant Ingersoll


On Mar 22, 2007, at 11:21 PM, Grant Ingersoll wrote:

I think I see in the ReadTask that it is the res var that is being  
incremented and would have to be altered.  I guess I can go by  
elapsed time, but even that seems slightly askew.  I think this is  
due to the withRetrieve() function overhead inside the for loop.  I  
have moved it out and will submit that change, too.




Moving it out of the loop made little difference, so I guess it is mostly  
just due to it being late and me being tired and not thinking  
clearly.  B/c if I were, I would have realized that those operations  
are also retrieving documents...


-Grant

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: How can I index Phrases in Lucene?

2007-03-22 Thread karl wettin


On 23 Mar 2007, at 04:25, Ryan McKinley wrote:


Is there any way to find frequent phrases without knowing what you are
looking for?


I think you are looking for association rules. Try searching for  
Levelwise-Scan.


Weka contains GPLed Java code.
CiteSeer is your best friend for whitepapers: http://citeseer.ist.psu.edu/cs



--
karl




I could index "A B C D E" as "A B C", "B C D", "C D E" etc, but that
seems kind of clunky particularly if the phrase length is large.  Is
there any position offset magic that will surface frequent phrases
automatically?

thanks
ryan


On 3/22/07, Erick Erickson <[EMAIL PROTECTED]> wrote:

Well, you don't index phrases, it's done for you. You should try
something like the following

Create a SpanNearQuery with your terms. Specify an appropriate
slop (probably 0 assuming you want them all next to each other).

Now call getSpans and count ... You may have to do
something with overlapping spans, but you'll need to experiment
a bit to understand it.

Erick

On 3/22/07, Maryam <[EMAIL PROTECTED]> wrote:
>
> Hi,
>
> I know how to index terms in lucene, now I wanna see
> how can I index phrases like "information retreival"
> in lucene and calculate the number of times that
> phrase has appeared in the document. Is there any way
> to do it in Lucene?
>
> Thanks
>
>
>
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Reverse search

2007-03-22 Thread karl wettin


On 23 Mar 2007, at 03:07, Melanie Langlois wrote:


Thanks Karl, the performances graph is really amazing!
I have to say that it would not have think this way around would be  
faster, but sounds nice if I can use this, make everything easier  
to manage. I'm just wondering what did you consider when you build  
your graph, only the time to run the queries? Because, I should add  
the time for creating the index anytime a new document comes in (or  
a subset of documents if several comes in same time), and the  
indexing of these documents. The documents should not be big,  
around 2KB. Did you measure this part ?


Adding a document to a MemoryIndex or InstantiatedIndex takes more or  
less the same time it would take to add it to an empty RAMDirectory.  
How many clock ticks are spent really depends on what analyzers you use.


--
karl



Mélanie

-Original Message-
From: karl wettin [mailto:[EMAIL PROTECTED]
Sent: Friday, March 23, 2007 10:35 AM
To: java-user@lucene.apache.org
Subject: Re: Reverse search


On 23 Mar 2007, at 02:12, Melanie Langlois wrote:


I want to manage user subscriptions to specific documents. So I
would like to store the subscription (query) into the lucene
directory, and whenever I receive a new document, I will search all
the matching subscriptions to send the documents to all subcribers.
For instance if a user subscribes to all documents with text
containing (WORD1 and WORD2) or WORD3, how can I match the incoming
document based on stored subscriptions? I was thinking to have two
subfields for each field of the subscription: the AND conditions
and the OR conditions.

-OR. I will tokenized the document field content and insert OR
between each of them, and run the query against OR condition of
subscription

-It's for the AND that I will have an issue, because if the
incoming text may contains more words than the sequence I want to
search.

For instance, if I subscribe for documents contents lucene and java
for instance , if the incoming document contents is lucene is a
great API which has been developed in java, once I removed
stopwords my query would look like lucene and great and API and
developed and java.

As query is composed of more words than the stored subscription I
will fail to retrieve the subscription. But if I put only or words,
the results will not be accurate, as I can obtain subscription only
for java for instance.



I wrote such a thing way back, where I used the new document as the
query and the user subscriptions as the index. Similar to what you
describe, I had an AND, OR and NOT field. This really limited the
type of queries users could store. It does however work, particullary
well on systems with /huge/ amounts of subscriptions (many millions).

Today I would have used something else. If you insert one document at
the time to your index, take a look at MemoryIndex in contrib. If you
insert documents in batches larger than one document at the time,
take a look at LUCENE-550 in the Jira. Add new documents to such an
index and place the subscribed queries on it. Depening on the
queries, the speed should be some 20-100 times faster than using a
RAMDirectory. One million queries should take some 20 seconds to
assemble and place on a 25 document index on my laptop. See <http://issues.apache.org/jira/secure/attachment/12353601/12353601_HitCollectionBench.jpg> for performance of LUCENE-550.


--
karl

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Combining score from two or more hits

2007-03-22 Thread Chris Hostetter
: Thanks Erick, I've been using TopDocs, but am playing with my own HitCollector
: variant of TopDocHitCollector.  The problem is not adjusting the score, it's
: what to adjust it by, i.e. is it possible to re-evaluate the scores of H1 and 
H2
: knowing that the original query resulted in hits on H1 AND H2.

if you are using a HitCollector, then any re-evaluation is going to
happen in your code using whatever mechanism you want -- once your collect
method is called on a docid, Lucene is done with that docid and no longer
cares about it ... it's only whatever storage you are maintaining of
high-scoring docs that needs to know that you've decided the score has
changed.

your big problem is going to be that you basically need to maintain a list
of *every* doc collected if you don't know what the score of any of them
is until you've processed all the rest ... since docs are collected in
increasing order of docid, you might be able to make some optimizations
based on how big a gap there is between the doc you are currently
collecting and the last doc you collected, if you know that you're
always going to add docs that "relate" to each other in sequential bundles
-- but this would be some very custom code depending on your use case.
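
A minimal sketch of such a collector, assuming the Lucene 2.x HitCollector API (the class name and flat-list storage are illustrative; the gap/bundle optimizations above are left out):

import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.search.HitCollector;

public class BufferingCollector extends HitCollector {
  // parallel lists: docIds.get(i) received score scores.get(i)
  public final List docIds = new ArrayList();
  public final List scores = new ArrayList();

  public void collect(int doc, float score) {
    // Lucene is done with this docid once collect() returns,
    // so keep whatever the later re-evaluation needs.
    docIds.add(new Integer(doc));
    scores.add(new Float(score));
  }
}

After searcher.search(query, collector) returns, walk the two lists, adjust the scores of related hits, and keep your own top N.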





-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: contrib/benchmark questions

2007-03-22 Thread Doron Cohen
Hi Grant, I think you resolved the question already, but just to
make sure...

Grant Ingersoll <[EMAIL PROTECTED]> wrote on 22/03/2007 20:41:27:

>
> On Mar 22, 2007, at 11:21 PM, Grant Ingersoll wrote:
>
> > I think I see in the ReadTask that it is the res var that is being
> > incremented and would have to be altered.  I guess I can go by
> > elapsed time, but even that seems slightly askew.  I think this is
> > due to the withRetrieve() function overhead inside the for loop.  I
> > have moved it out and will submit that change, too.
> >
>
> Moving it out of the loop made little diff. so I guess it is mostly
> just due to it being late and me being tired and not thinking
> clearly.  B/c if I were, I would just realize that those operations
> are also retrieving documents...

Seems the cause for confusion is that #recs means different things for
different tasks. For all tasks, it means (at least) the number of times
that task executed. For warm, it adds one for each document retrieved.
For traverse, it adds one for each doc id traversed, and for
traverseAndRetrieve, it also adds one for each doc retrieved.

I'll update the javadocs with this clarification.

Moving the call out of the loop is the right thing of course; it changed
the time only, not the #recs, right?

Regards,
Doron


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Questions about Indexing

2007-03-22 Thread Daniel Noll

Maryam wrote:
Hi, 


I have three questions about indexing:

1) I am indexing HTML documents, how can I do "stop
removal" before indexing, I dont want to index stop
words? 


The same way you would do it for indexing text documents: StopFilter.


2) I can have an access to the terms in one document,
but how can I have access to the document name that
these terms has been appeared?


The usual way to do this is to store the document name as another field.
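
A minimal sketch covering both answers, assuming the Lucene 2.x API (class, field and tokenizer choices are illustrative; note that StandardAnalyzer already applies a StopFilter internally):

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseTokenizer;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class HtmlIndexing {
  // Answer 1: an analyzer that drops stop words before they reach the index.
  static class StoppingAnalyzer extends Analyzer {
    public TokenStream tokenStream(String fieldName, Reader reader) {
      return new StopFilter(new LowerCaseTokenizer(reader),
                            StopAnalyzer.ENGLISH_STOP_WORDS);
    }
  }

  // Answer 2: carry the document name alongside the indexed text.
  static Document makeDoc(String name, String strippedHtml) {
    Document doc = new Document();
    doc.add(new Field("docname", name, Field.Store.YES, Field.Index.UN_TOKENIZED));
    doc.add(new Field("contents", strippedHtml, Field.Store.NO, Field.Index.TOKENIZED));
    return doc;
  }
}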

Daniel



--
Daniel Noll

Nuix Pty Ltd
Suite 79, 89 Jones St, Ultimo NSW 2007, Australia    Ph: +61 2 9280 0699
Web: http://nuix.com/                                Fax: +61 2 9212 6902

This message is intended only for the named recipient. If you are not
the intended recipient you are notified that disclosing, copying,
distributing or taking any action in reliance on the contents of this
message or attachment is strictly prohibited.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Ignore Words Problem

2007-03-22 Thread aslam bari
I want to make sure whether this statement is right or not:
"I am using StandardAnalyzer for indexing documents. By default it ignores some 
words when indexing. But when we search something, Lucene again includes the 
ignored words in the search."
My problem is this: I indexed a Word document using StandardAnalyzer. There are 
many words like "is am are that the" which are ignored by Lucene. When I run a 
query which must match all words given by the user (an AND query), it does not 
return results. For example, I want to find those documents which MUST have ALL 
the words in "this is garden". For this I have made an AND query, but Lucene 
now gives no result: "garden" is there, but it cannot find "is" and "this" 
because they were ignored at indexing time. So what is a better workaround?
Any help will be appreciated.




Re: how ungrouped query handled?

2007-03-22 Thread SK R

Thanks for your reply and this useful links.

On 3/23/07, Chris Hostetter <[EMAIL PROTECTED]> wrote:



see also the FAQ "Why am I getting no hits / incorrect hits?" which points
to...

http://wiki.apache.org/lucene-java/BooleanQuerySyntax

...I've just added some more words of wisdom there from past emails.


: Date: Thu, 22 Mar 2007 09:51:15 -0400
: From: Erick Erickson <[EMAIL PROTECTED]>
: Reply-To: java-user@lucene.apache.org
: To: java-user@lucene.apache.org
: Subject: Re: how ungrouped query handled?
:
: This is a pretty common issue that I've been grappling with by chance
: recently. The main point is that the parser is NOT a boolean logic
: parser.
:
: Search the mail archive for the thread "bad query parser bug" and
: you'll find a good discussion.
:
: I tried using PrecedenceQueryParser, but that didn't work for
: me very well, search the mail archive on that and you'll see some
: examples of why.
:
: I solved this problem for my immediate issues by writing a very
: quick-and-dirty parenthesizer for my raw query. If it wasn't going
: on summer, I might see if I can contribute something by
: seeing if there's a way I can see to fix PrecedenceQueryParser.
:
: Best
: Erick
:
: On 3/22/07, SK R <[EMAIL PROTECTED]> wrote:
: >
: > Hi,
: >  Can anyone explain how lucene handles the belowed query?
: > My query is *field1:source AND (field2:name OR field3:dest)* .
I've
: > given this string to queryparser and then searched by using searcher.
It
: > returns correct results. It's query.toString() print is ::
+field1:source
: > +(field2:name field3:dest)
: > But if i don't group my terms (i.e) my query : *field1:source AND
: > field2:name OR field3:dest *,then it gives the result of  first two
term's
: > search result. It doesn't search 3rd term. It's query.toString() print
is
: > ::
: > +field1:source +field2:name field3:dest.
: > If i use same boolean operator between all terms, then it returns
correct
: > results.
: > Why it doesn't search the terms after 2nd term if grouping not used?
: >
: > Thanks & Regards
: > RSK
: >
:



-Hoss
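
For what it's worth, one way to sidestep the parser's precedence quirks entirely is to build the boolean structure in code; a minimal sketch against the Lucene 2.x API, using the field names from the example above:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

public class ExplicitBoolean {
  // Builds +field1:source +(field2:name field3:dest) with no parsing involved.
  public static BooleanQuery build() {
    BooleanQuery inner = new BooleanQuery();
    inner.add(new TermQuery(new Term("field2", "name")), BooleanClause.Occur.SHOULD);
    inner.add(new TermQuery(new Term("field3", "dest")), BooleanClause.Occur.SHOULD);

    BooleanQuery query = new BooleanQuery();
    query.add(new TermQuery(new Term("field1", "source")), BooleanClause.Occur.MUST);
    query.add(inner, BooleanClause.Occur.MUST);
    return query;
  }
}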


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




MergeFactor and MaxBufferedDocs value should ...?

2007-03-22 Thread SK R

Hi,
   I've looked at the uses of MergeFactor and MaxBufferedDocs.

   If I set MergeFactor = 100 and MaxBufferedDocs = 250, then the first 100
segments will be merged in the RAMDir when 100 docs have arrived. At the end of
the 350th doc added to the writer, the RAMDir will have 2 merged segment files
+ 50 separate segment files not yet merged together, and these are flushed to
the FSDir.

   If that's wrong, please correct me.

   My doubt is whether we should set MergeFactor & MaxBufferedDocs in a
proportional ratio (i.e. MaxBufferedDocs = n*MergeFactor where n = 1, 2, ...)
to reduce indexing time and get better performance, or whether there is no
need to worry about their relation.


Thanks & Regards
RSK
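
For reference, a minimal sketch of where these two knobs live, assuming the Lucene 2.x API (the path and analyzer are illustrative):

import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class WriterTuning {
  public static IndexWriter open() throws IOException {
    IndexWriter writer = new IndexWriter("/tmp/index", new StandardAnalyzer(), true);
    writer.setMergeFactor(100);      // how many segments accumulate before a merge
    writer.setMaxBufferedDocs(250);  // how many docs buffer in RAM before a flush
    return writer;
  }
}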


Re: Ignore Words Problem

2007-03-22 Thread Chris Hostetter

What part of Grant and Karl's answers the last time you asked this
question wasn't clear?  Have you tried them?

http://www.nabble.com/Re%3A-Common-Words-ignoring-problem-p9550886.html
http://www.nabble.com/Re%3A-Common-Words-ignoring-problem-p9567881.html
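
One standard fix (whether or not it is exactly what those threads suggest) is to parse the query with the same analyzer used at indexing time, so the stop words drop out of the query as well. A minimal sketch, assuming the Lucene 2.x API (the "contents" field name is illustrative):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

public class StopAwareParsing {
  public static Query parse(String userInput) throws ParseException {
    QueryParser parser = new QueryParser("contents", new StandardAnalyzer());
    parser.setDefaultOperator(QueryParser.AND_OPERATOR);
    // "this is garden" analyzes down to just contents:garden,
    // so documents containing "garden" match as the user expects.
    return parser.parse(userInput);
  }
}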

: I want to be make sure, if this statement is Right or not?
: "I am using StatndardAnaylyzer for Indexing documents. Bydefault it ignores 
some words when doing indexing. But when we search something, Lucene again 
include the ignore words in searching".???
: Myproblem is that:-
: I indexed a word document using StandarAnaylyzer. There are many words like 
"is am are that the" which are ignored by the Lucene. And When i want to search 
a query which must search all words given by user (AND query) then it does not 
return results. For example
: I want to search those documents which MUST have ALL these words "this is 
garden".  for this i have made a AND query, but Lucene now gives result because 
"garden" is there but it cannot find "is" and "this" word because they are 
ignored at indexing time. So what is the better work around.
: Any help will be appreciated.



-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]