Re: indexing rss feeds in multiple languages
Melanie Langlois wrote:
> Well, thanks, sounds like the best option to me. Does anybody use the
> PerFieldAnalyzerWrapper? I'm just curious to know if there is any impact
> on performance when using different analyzers.

I've not done any specific comparisons between using a single Analyzer and multiple Analyzers with PerFieldAnalyzerWrapper, but our indexes are typically 20-25 fields, each of which can have a different analyzer depending on language or field type, although in practice about 8-10 fields may use the non-default analyzer. Performance is pretty good in any case, and there's been no noticeable degradation when tweaking analyzers.

Antony
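For reference, a minimal sketch of this kind of setup. The field names and the choice of the French contrib analyzer are assumptions for illustration only:

    import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
    import org.apache.lucene.analysis.fr.FrenchAnalyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;

    // The default analyzer covers most fields; per-field overrides handle
    // the language-specific ones.
    PerFieldAnalyzerWrapper analyzer =
        new PerFieldAnalyzerWrapper(new StandardAnalyzer());
    analyzer.addAnalyzer("title_fr", new FrenchAnalyzer()); // hypothetical field names
    analyzer.addAnalyzer("body_fr", new FrenchAnalyzer());

    IndexWriter writer = new IndexWriter("/path/to/index", analyzer, true);

Passing the same wrapper to the QueryParser at search time keeps documents and queries analyzed consistently.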
Re: Querying fragments of a tree structure
Hi Erick, excellent insight, thanks a lot. As you would expect, this method works a treat. Thanks a lot for your time!

Emanuel

----- Original Message -----
From: "Erick Erickson" <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Wednesday, March 21, 2007 2:12:49 PM (GMT+0100) Europe/Berlin
Subject: Re: Querying fragments of a tree structure

Is it a fair restatement of your problem that you want to generate a list of all children of a node? That's what I'm reading.

Would it work for you to store the complete ancestry in each node? By that I mean (from your example)... NOTE: it's no problem in Lucene to store different values for the same field in the same document, i.e.

Document doc = new Document();
doc.add(new Field("field", "value1", Field.Store.NO, Field.Index.TOKENIZED));
doc.add(new Field("field", "value2", Field.Store.NO, Field.Index.TOKENIZED));
writer.addDocument(doc);

This is equivalent (if using WhitespaceAnalyzer in this example) to:

Document doc = new Document();
doc.add(new Field("field", "value1 value2", Field.Store.NO, Field.Index.TOKENIZED));
writer.addDocument(doc);

(There is a subtle difference between the two having to do with the position increment gap, but that's probably irrelevant for you in this problem.)

So what about just doing that for each parent node in your tree? Your "ancestry" field for documents D and E would store "C" and "A". This field is TOKENIZED, but not necessarily STORED. Document C has only "A". Now, finding the children of "A" reduces to something like +ancestry:A, which you can add to your BooleanClauses if you want to also specify other search criteria, or just use by itself if you don't.

What follows is my first idea, but I think the above is a better notion. Node A stores nothing. Nodes B and C store "A". Nodes D and E store "A$C", etc. Now, finding all the children of A reduces to doing a WildcardTermEnum on "A*" and, for each resulting term, using TermDocs.seek(term) to find the corresponding document. Note a couple of things:

1> Index the ancestry field UN_TOKENIZED. You don't need to store it.
1a> You could use something like this to form a Lucene Filter if you needed to, say, find all the nodes in the tree that were children of a specified node AND met certain search criteria.
2> You could also just search on A*, but be aware that you may have to deal with TooManyClauses exceptions. The TermEnum/TermDocs method avoids that problem, but may be overkill in your situation.
2a> Lucene 2.1 allows wildcards in the first position if you do a wildcard search, but you need to turn that on by a call which I can't bring up from memory.

Hope this helps
Erick

On 3/21/07, Emanuel Schleussinger <[EMAIL PROTECTED]> wrote:
>
> Hi,
>
> first, thanks for this great resource, and sorry if I am oversimplifying
> a few things; I am still rather new to Lucene.
>
> I have been thinking how to integrate my app with Lucene -- it is a CMS-type
> system that has documents organized in a tree-style layout. A few facts
> about the system:
> - Every node in the system is represented by a unique numeric id in a
>   field called "id"
> - There is one defined root node, and an arbitrary number of descendants.
> - Each of the nodes on any level knows its descendants in a field called
>   "child"
> - Each node also knows its parent node in a field called "parent"
>
> I am indexing all the fields from all the nodes in Lucene already, and
> thus I can use Lucene to e.g. get all the descendant node IDs of a node
> simply by issuing a query like "id:2" and then extracting the
> multivalue field "child".
>
> Now, here is what I am trying to solve -- I would like to be able to
> fetch all the nodes that match a certain criteria, if they are contained in
> some fragment of the tree.
> To visualize:
>
> Root
> +-> A
> |   +-> B
> |   +-> C
> |       +-> D
> |       +-> E
> +-> F
> +-> G
>
> I would like to issue a query that gives me all the nodes within "A": a
> flat list of results that contains B, C, D and E.
>
> Now, since per my definition D is not directly correlated with A (it knows
> its parent C, but not that it is also part of A -- only C knows that), I was
> thinking of introducing a new field for every node in my Lucene index that
> holds a list of IDs tracing back to the root element (in this case, the
> D node would have C and A in that field, in that order) -- but it strikes me
> this may not be the most elegant approach...
>
> The above is only a simplified example; in reality, I have a tree about 10
> levels deep, with thousands of nodes, and I frequently need to surface nodes
> within a certain fragment of that tree.
>
> Is there any best practice that you ran into on how to map this elegantly
> into Lucene?
>
> Thanks a ton for any pointers,
> Emanuel Schleussinger
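A minimal sketch of the "ancestry" approach Erick recommends, using the node names from the example (the variables "writer" and the exact Field flags are assumptions; each ancestor is added as a separate value of the same field):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.TermQuery;

    // Node D: parent C, grandparent A. Each ancestor is its own exact term.
    Document doc = new Document();
    doc.add(new Field("id", "D", Field.Store.YES, Field.Index.UN_TOKENIZED));
    doc.add(new Field("ancestry", "C", Field.Store.NO, Field.Index.UN_TOKENIZED));
    doc.add(new Field("ancestry", "A", Field.Store.NO, Field.Index.UN_TOKENIZED));
    writer.addDocument(doc);

    // All descendants of A, however deep, regardless of intermediate parents:
    TermQuery allUnderA = new TermQuery(new Term("ancestry", "A"));

With that one field, "descendants of X" becomes a single TermQuery that can be combined with any other clauses in a BooleanQuery.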
how ungrouped query handled?
Hi,

Can anyone explain how Lucene handles the query below? My query is *field1:source AND (field2:name OR field3:dest)*. I've given this string to the query parser and then searched using a searcher. It returns correct results. Its query.toString() print is:

+field1:source +(field2:name field3:dest)

But if I don't group my terms, i.e. my query is *field1:source AND field2:name OR field3:dest*, then it gives the results of the first two terms only. It doesn't search the 3rd term. Its query.toString() print is:

+field1:source +field2:name field3:dest

If I use the same boolean operator between all terms, then it returns correct results. Why doesn't it search the terms after the 2nd term when grouping is not used?

Thanks & Regards
RSK
bzr branches for Apache Lucene/Nutch/Solr/Hadoop at Launchpad
Hi:

First of all, apologies to those friends who follow all the lists.

Often I work offline, and I do not have commit rights to any of the projects. The modifications I make for various clients, and trying to keep up to date with the latest trunk, somehow make it difficult for me to just stick with Subversion. I have heard many things about distributed revision control systems, and I am sure there are tricks/fixes for the Subversion problem I mentioned above, but I also wanted to learn something new :-) So after some trials with many DRCSs I have decided to go for Bazaar! It's a really cool DRCS -- you've got to try it: http://bazaar-vcs.org/

Now, because SVN is an RCS and bzr is a DRCS, one needs to convert SVN repos to bzr repos. And cool enough, there is a free VCS mirroring service at Launchpad: https://launchpad.net/

So the following projects are now available via bzr branch. You can access them here:

Nutch - https://launchpad.net/nutch
Solr - https://launchpad.net/solr
Lucene - https://launchpad.net/lucene
Hadoop - https://launchpad.net/hadoop

It only mirrors "trunk". That's what I need to follow, and I don't see any reason to mirror releases.

Regards
Re: bzr branches for Apache Lucene/Nutch/Solr/Hadoop at Launchpad
Is the point of this that you can make "commits" to Lucene so that you don't lose your changes on trunk?

On Mar 22, 2007, at 7:14 AM, rubdabadub wrote:
> [original announcement snipped]

--
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org

Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ
Re: bzr branches for Apache Lucene/Nutch/Solr/Hadoop at Launchpad
On 3/22/07, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
> Is the point of this that you can make "commits" to Lucene so that
> you don't lose your changes on trunk?

Not only that. I can also make as many local branches as I like, for example customer X, customer Y. This way I can support X and Y, as they have separate features. All of the above can be done with SVN, but it's a pain, at least for me.

And of course I can work offline -- during summer, under trees :-) -- and then update the whole branch from the main repo without losing any changes. It just seems easy. I have also had a case where I needed to bake some part of Nutch and some part of Solr under one tree, i.e. a new project, and still maintain that tree against the original repos, and I could do that just fine. Bazaar commands are like SVN commands, so there's not much to learn either :-)

Regards

On Mar 22, 2007, at 7:14 AM, rubdabadub wrote:
> [original announcement snipped]
Re: bzr branches for Apache Lucene/Nutch/Solr/Hadoop at Launchpad
Nice idea, and I can see the benefit of it to you, and I don't mean to be a wet blanket on it; I just wonder about the legality of it. People may find it and think it is the official Apache Lucene, since it is branded that way. I'm not a lawyer, so I don't know for sure. I think you have the right to store and use the code, even create a whole other search product based solely on Lucene (I think); I just don't know about this kind of thing. In some sense it is like mirroring, but the fact that you can commit without going through the Apache process makes me think that others coming upon the code will be misled about what's in it. The site _definitely_ makes it look like Launchpad is the home for Lucene, with the intro and the bug tracking, etc., even though we all know this site will rank further down in the SERPs than the main site. Perhaps I am misunderstanding?

On Mar 22, 2007, at 7:42 AM, rubdabadub wrote:
> [previous reply and original announcement snipped]

--
Grant Ingersoll
http://www.grantingersoll.com/
http://lucene.grantingersoll.com
http://www.paperoftheweek.com/
Re: bzr branches for Apache Lucene/Nutch/Solr/Hadoop at Launchpad
On 3/22/07, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
> Nice idea and I can see the benefit of it to you and I don't mean to
> be a wet blanket on it, I just wonder about the legality of it. [...]
> In some sense it is like mirroring, but that fact that you can commit
> w/ out going through the

NO, NO!! I don't make any commits to the Apache trunk. Nor does anyone else, for that matter. The repo at Launchpad is just a pure mirror and will always be a mirror. Just to clarify what I meant by "commit": basically you "pull" the Lucene branch from Launchpad to your local machine, and that becomes a complete copy of the trunk; then you make another local branch from that branch. Example:

bzr branch http://bazaar.launchpad.net/~vcs-imports/lucene/trunk local.copy
bzr branch local.copy local.customerx

Then you do all your work on local.customerx and make commits there, because you want to keep local.copy exactly identical to the Launchpad version, which in turn is a mirror like any other mirror that Apache has. That's all. If I were to commit things to the Launchpad version, I would lose the whole point of mirroring and getting changes from trunk.

> Apache process makes me think that others coming upon the code will be
> misled about what's in it. The site _definitely_ makes it look like
> Launchpad is the home for Lucene with the intro and the bug tracking, etc.

I am not a lawyer or a branding expert. But if you want me to edit the description text to something like "A mirrored copy of Apache Lucene; original site at...", no problem. Please provide me the exact text so I can edit it to avoid confusion. The last thing I want to do is create confusion.

Moreover, if needs like mine exist, maybe Apache Infrastructure should consider a DRCS rather than an RCS; SVN doesn't provide the flexibility that I need. At Apache, CVS and SVN co-exist, and there are mirrors of both all over the world, so why not have a bzr branch? If Launchpad wants to host it, great; if another mirror wants to host it, great.

I hope this clarifies the misunderstanding. Please do provide exact text so we don't get into lawyer trouble :-) I don't want to take a stab at the text myself; it's better if you provide me exact instructions.

Regards.

> [rest of quoted thread snipped]
Re: Spelt, for better spelling correction
Otis,

I hadn't really thought about this, but it would be easy to build a dictionary from an existing Lucene index. The main caveat is that it would only work with "stored" fields. That's because this spellchecker boosts accuracy using pair frequencies in addition to term frequencies, and Lucene doesn't need or track pair frequencies, to my knowledge. So any field which you wanted to spellcheck would need to be indexed with Field.Store.YES. Of course, a side effect is that those fields would have to be analyzed again, with the resulting time cost. Still, this could make sense for a lot of people. I'll make sure the contribution includes an index-to-dictionary API, and thank you very much for the input.

--Martin

On 3/21/07, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:
> Martin,
>
> This sounds like the spellchecker dictionary needs to be built in parallel
> with the main Lucene index. Is it possible to create a dictionary out of an
> existing (and no longer modified) Lucene index?
>
> Otis
>
> ----- Original Message -----
> From: Martin Haye <[EMAIL PROTECTED]>
> To: Yonik Seeley <[EMAIL PROTECTED]>
> Cc: java-user@lucene.apache.org
> Sent: Wednesday, March 21, 2007 2:03:50 PM
> Subject: Re: Spelt, for better spelling correction
>
> The dictionary is generated from the corpus, with the result that a larger
> corpus gives better results. Words are queued up during an index run, and at
> the end are munged to create an optimized dictionary. It also supports
> incremental building, though the overhead would be too much for those
> applications that are continuously adding things to an index. Happily, it's
> not as important to keep the spelling dictionary absolutely up to date, so
> it would be fine to queue words over several index runs and refresh the
> dictionary less often.
>
> --Martin
>
> On 3/20/07, Yonik Seeley <[EMAIL PROTECTED]> wrote:
> > Sounds interesting, Martin!
> > Is the dictionary static, or is it generated from the corpus or from
> > user queries?
> >
> > -Yonik
> >
> > On 3/20/07, Martin Haye <[EMAIL PROTECTED]> wrote:
> > > As part of XTF, an open source publishing engine that uses Lucene, I
> > > developed a new spelling correction engine specifically to provide "Did
> > > you mean..." links for misspelled queries. I and a small group are
> > > preparing this for submission as a contrib module to Lucene. And we're
> > > inviting interested people to join the discussion about it.
> > >
> > > The new engine is being called "Spelt" and differs from the one
> > > currently in Lucene contrib in the following ways:
> > >
> > > - More accurate: Much better performance on single-word queries (90%
> > >   correct in #1 slot in my tests). On a general list including
> > >   multi-word queries, gets 80%+ correct.
> > > - Multi-word: Handles and corrects multi-word queries such as
> > >   "harrypotter" -> "harry potter".
> > > - Fast: In my tests, builds the dictionary more than 30 times faster.
> > > - Small: Dictionary size is roughly a third of that built by the
> > >   existing engine.
> > > - Other bells and whistles...
> > >
> > > There is already a standalone test program that people can try out, and
> > > we're interested in feedback. If you're interested in discussing,
> > > testing, or previewing, consider joining the Google group:
> > > http://groups.google.com/group/spelt/
> > >
> > > --Martin
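A rough sketch of what such an index-to-dictionary pass could look like, re-analyzing stored field values to recover term and pair frequencies. The "dictionary" object and its addWord/addPair calls are hypothetical stand-ins, not Spelt's actual API, and the field name is assumed:

    import java.io.StringReader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexReader;

    IndexReader reader = IndexReader.open("/path/to/index");
    Analyzer analyzer = new StandardAnalyzer();
    for (int i = 0; i < reader.maxDoc(); i++) {
        if (reader.isDeleted(i)) continue;
        String text = reader.document(i).get("contents"); // needs Field.Store.YES
        if (text == null) continue;
        TokenStream stream = analyzer.tokenStream("contents", new StringReader(text));
        String prev = null;
        for (Token t = stream.next(); t != null; t = stream.next()) {
            String word = t.termText();
            dictionary.addWord(word);                          // hypothetical API
            if (prev != null) dictionary.addPair(prev, word);  // pair frequencies
            prev = word;
        }
    }
    reader.close();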
Re: bzr branches for Apache Lucene/Nutch/Solr/Hadoop at Launchpad
On Mar 22, 2007, at 8:16 AM, rubdabadub wrote:
> NO, NO!! I don't make any commits to the Apache trunk. Nor does anyone
> else, for that matter. The repo at Launchpad is just a pure mirror and
> will always be a mirror. [...]

Gotcha. I guess I just rely on IntelliJ's built-in versioning to provide similar capabilities, plus maybe checking out multiple copies of the source. Also, I try to avoid making changes in open source libraries unless absolutely necessary.

> I am not a lawyer or a branding expert. But if you want me to edit the
> description text to something like "A mirrored copy of Apache Lucene;
> original site at...", no problem. [...]

I'll wait for some of the others that are closer to the Foundation to contribute (maybe one of the PMC members). Like I said, I don't know if it is an issue at all; I just don't want people to be confused about it. I think you could propose a DRCS to Infrastructure and make a case for it. Personally, I'm fine with SVN, but then again I used to think I was fine with CVS, and I don't think I would want to go back to that!

I am curious: how many custom changes are you making to the code that this is even an issue? Perhaps submitting patches and working to get them committed would be a more efficient strategy.

-Grant
Re: Combining score from two or more hits
Don't know if it's useful or not, but if you used TopDocs instead, you would have access to an array of ScoreDocs which you could modify freely. In my app, I used a FieldSortedHitQueue to re-sort things when I needed to.

Erick

On 3/22/07, Antony Bowesman <[EMAIL PROTECTED]> wrote:
> I have indexed objects that contain one or more attachments. Each
> attachment is indexed as a separate Document along with the object
> metadata. When I make a search, I may get hits in more than one Document
> that refer to the same object.
>
> I have a HitCollector which knows if the object has already been found, so
> I want to be able to update the score of an existing hit in a way that
> makes sense. E.g. if hit H1 has score 1.35 and hit H2 has score 2.9, is it
> possible to re-score it on the basis that the real hit result is
> (H1 AND H2)?
>
> I can take the highest score of any Document, but just wondered if this is
> possible during the HitCollector.collect method?
>
> Antony
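One way to sketch this inside a HitCollector, accumulating a combined score per logical object. The lookupObjectId mapping is hypothetical, and summing the scores is just one possible combination, not an endorsed formula; "searcher" and "query" are assumed to exist:

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.lucene.search.HitCollector;

    final Map scores = new HashMap(); // objectId -> combined score
    searcher.search(query, new HitCollector() {
        public void collect(int doc, float score) {
            String objectId = lookupObjectId(doc); // hypothetical doc->object mapping
            Float prev = (Float) scores.get(objectId);
            // Simple combination: sum the scores of all hits on the same
            // object, as a rough proxy for "H1 AND H2 both matched".
            scores.put(objectId, prev == null
                    ? new Float(score) : new Float(prev.floatValue() + score));
        }
    });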
Re: bzr branches for Apache Lucene/Nutch/Solr/Hadoop at Launchpad
Good to hear :-)

> I am curious, how many custom changes are you making to the code that
> this is even an issue? Perhaps submitting patches and working to get
> them committed would be a more efficient strategy.

Well, there are three problems I see:

1. There are very good patches in the Lucene Jira, but for one reason or another these issues never get applied to trunk. For me it's not a question of why; it's more a question of how I can use them and learn from them. So having my own local branch to do "whatever" is really great. I build, I apply a patch, I play around, I tear it down, without thinking about anything else. Yes, you could do this with various copies of the source, but often these patches only work against a certain revision, etc. It's much easier to play when you are in control of the local trunk.

2. I also have customer modifications to maintain, i.e. support, and some of the fixes only work with a certain revision of trunk. Often I make the mistake of doing svn up -- it happens -- and that does create some extra keystrokes :-)

3. You are correct about the committing strategy, but most of my changes are customer-specific, and customers have specific rules, so the changes never get back to you guys. Customer rules: I can't decide on the modifications I make.

Regards
Re: how ungrouped query handled?
This is a pretty common issue that I've been grappling with by chance recently. The main point is that the parser is NOT a boolean logic parser. Search the mail archive for the thread "bad query parser bug" and you'll find a good discussion.

I tried using PrecedenceQueryParser, but that didn't work very well for me; search the mail archive on that and you'll see some examples of why.

I solved this problem for my immediate issues by writing a very quick-and-dirty parenthesizer for my raw query. If summer weren't coming on, I might see whether I could contribute something by looking for a way to fix PrecedenceQueryParser.

Best
Erick

On 3/22/07, SK R <[EMAIL PROTECTED]> wrote:
> [question quoted above snipped]
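When precedence matters, one option is to skip the parser for the boolean skeleton and build the query programmatically. A sketch using the field names from the question:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.TermQuery;

    // field1:source AND (field2:name OR field3:dest), with explicit grouping:
    BooleanQuery inner = new BooleanQuery();
    inner.add(new TermQuery(new Term("field2", "name")), BooleanClause.Occur.SHOULD);
    inner.add(new TermQuery(new Term("field3", "dest")), BooleanClause.Occur.SHOULD);

    BooleanQuery query = new BooleanQuery();
    query.add(new TermQuery(new Term("field1", "source")), BooleanClause.Occur.MUST);
    query.add(inner, BooleanClause.Occur.MUST);

    // query.toString() prints: +field1:source +(field2:name field3:dest)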
Speeding up looping over Hits
Hi,

While looking into performance enhancements for our search feature, I noticed a significant difference in Document access time while looping over Hits. I wrote a test application that searches for a list of search terms and then, for each returned Hits object, loops twice over every single hits.doc(i):

for (int i = 0; i < numberOfDocs; i++) { doc = hits.doc(i); }

I am seeing differences like the following:

Found 16,215 hits for 'Water or Wine' in 219 ms
Processed 16,215 docs in 53,141 ms; per single doc 3.2773 ms
Processed 16,215 docs in 2,032 ms; per single doc 0.1253 ms

Interestingly, if I run the same test application a second time in my IDE, the difference between the first and the second loop is very low. I have no explanation for this difference, but it becomes a huge problem for us, because I need to extract a small set of information pieces from each document, and the first loop just takes too much time. I could not find any indication of external caching of Hits.

I am running my tests within Eclipse with memory settings of -Xms766M -Xmx1024M.

What is the explanation for the different access speeds for the same search results? Is there a way to speed up looping over the Hits data structure?

Andreas
Re: Speeding up looping over Hits
Your timing differences are probably because of caching. But as has been mentioned many times in the archive, a Hits object is intended to allow fast, simple retrieval of the first few documents in a result set (100 if memory serves). Each 100 or so calls to next() cause the search to be re-issued. See HitCollector, TopDocs, etc.

Erick

On 3/22/07, Andreas Guther <[EMAIL PROTECTED]> wrote:
> [question quoted above snipped]
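A sketch of the TopDocs route, which runs the search once with an explicit result bound instead of letting Hits silently re-query. The variables "searcher" and "query" and the bound of 20,000 are assumptions:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.search.TopDocs;

    // One search, bounded at 20,000 results, instead of paging through Hits.
    TopDocs topDocs = searcher.search(query, null, 20000);
    for (int i = 0; i < topDocs.scoreDocs.length; i++) {
        Document doc = searcher.doc(topDocs.scoreDocs[i].doc);
        // extract just the fields you need from doc...
    }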
Re: bzr branches for Apache Lucene/Nutch/Solr/Hadoop at Launchpad
rubdabadub wrote:
> On 3/22/07, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
>> Nice idea and I can see the benefit of it to you and I don't mean to
>> be a wet blanket on it, I just wonder about the legality of it.

So long as it meets the Apache license conditions regarding distribution, it's not forbidden. It could be confusing or superfluous, but it couldn't be illegal.

>> [...]
>
> NO, NO!! I don't make any commits to the Apache trunk. Nor does anyone
> else, for that matter. The repo at Launchpad is just a pure mirror and
> will always be a mirror.

Actually, I often find myself in a similar situation to rubdabadub's. I'm working on several commercial projects that use and modify Lucene/Nutch, and often such modifications are proprietary (about equally often they are not, and are submitted as patches). Over time, the issue of tracking the vendor source tree and merging from that tree (per the svnbook) into several different private svn repos becomes a tricky and time-consuming business... I'd welcome any improvements here. It seems I need to find some time to get more familiar with bzr...

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web, Embedded Unix, System Integration
http://www.sigram.com
Contact: info at sigram dot com
How can I index Phrases in Lucene?
Hi,

I know how to index terms in Lucene; now I want to see how I can index phrases like "information retrieval" in Lucene and calculate the number of times a phrase has appeared in a document. Is there any way to do it in Lucene?

Thanks
Re: How can I index Phrases in Lucene?
Well, you don't index phrases; that's done for you. You should try something like the following: create a SpanNearQuery with your terms, and specify an appropriate slop (probably 0, assuming you want them all next to each other). Now call getSpans and count... You may have to do something with overlapping spans, but you'll need to experiment a bit to understand it.

Erick

On 3/22/07, Maryam <[EMAIL PROTECTED]> wrote:
> [question quoted above snipped]
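A minimal sketch of that counting loop, assuming a field called "contents", the phrase from the question, and an open IndexReader named "reader":

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.spans.SpanNearQuery;
    import org.apache.lucene.search.spans.SpanQuery;
    import org.apache.lucene.search.spans.SpanTermQuery;
    import org.apache.lucene.search.spans.Spans;

    SpanQuery[] clauses = new SpanQuery[] {
        new SpanTermQuery(new Term("contents", "information")),
        new SpanTermQuery(new Term("contents", "retrieval"))
    };
    // Slop 0 and in-order: the terms must be adjacent, i.e. form a phrase.
    SpanNearQuery phrase = new SpanNearQuery(clauses, 0, true);

    Map countsByDoc = new HashMap(); // docId -> phrase frequency
    Spans spans = phrase.getSpans(reader);
    while (spans.next()) {
        Integer doc = new Integer(spans.doc());
        Integer prev = (Integer) countsByDoc.get(doc);
        countsByDoc.put(doc, prev == null
                ? new Integer(1) : new Integer(prev.intValue() + 1));
    }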
Re: how ungrouped query handled?
See also the FAQ entry "Why am I getting no hits / incorrect hits?" which points to...

http://wiki.apache.org/lucene-java/BooleanQuerySyntax

...I've just added some more words of wisdom there from past emails.

: Date: Thu, 22 Mar 2007 09:51:15 -0400
: From: Erick Erickson <[EMAIL PROTECTED]>
: Reply-To: java-user@lucene.apache.org
: To: java-user@lucene.apache.org
: Subject: Re: how ungrouped query handled?
:
: This is a pretty common issue that I've been grappling with by chance
: recently. The main point is that the parser is NOT a boolean logic
: parser.
:
: [rest of Erick's reply, quoted above, snipped]

-Hoss
Re: Extracting formatted text from PDF files
Mike O'Leary wrote:
> Please forgive the laziness inherent in this question, as I haven't looked
> through the PDFBox code yet. I am wondering if that code supports
> extracting text from PDF files while preserving such things as sequences of
> whitespace between characters and other layout and formatting information.
> I am working with a project that extracts and operates on certain
> table-like blocks of text from PDF files, and a lot of freeware and
> shareware PDF-to-text converters seem to either ignore formatting or try to
> preserve formatting and not get it quite right. I am wondering if PDFBox
> provides better support for this kind of thing. Thanks.

That is not so simple. Usually this information is not inside a PDF file. PDF is an output file format. It contains just the instruction to print a character "a" at position x and y. In many cases a PDF file doesn't even know words or white spaces. We read words due to the positions of characters, we see paragraphs due to the positions of characters, and we see tables due to the positions of characters. The file doesn't contain this information.

I found this code in a PDF file for the German word "Wuchsform" (growth form) and the colon ":":

/F1 1 Tf
-3.8801 -1.274 TD
[ (W) 29.60001 (uchsform:) ] TJ

First line: select a font.
Second line: move the cursor to position -3.8801, -1.274.
Third line: print the character "W", move the cursor 29.60001 units to the right, and print the characters "uchsform:".

Extracting the words from a PDF file for indexing means you first have to build words from the character positions. Recognizing paragraphs, column text, tables, captions, lists, footnotes, etc. is much more difficult.

Sören
Re: Combining score from two or more hits
Erick Erickson wrote:
> Don't know if it's useful or not, but if you used TopDocs instead, you
> would have access to an array of ScoreDocs which you could modify freely.
> In my app, I used a FieldSortedHitQueue to re-sort things when I needed to.

Thanks Erick. I've been using TopDocs, but am playing with my own HitCollector variant of TopDocHitCollector. The problem is not adjusting the score; it's what to adjust it by, i.e. is it possible to re-evaluate the scores of H1 and H2 knowing that the original query resulted in hits on H1 AND H2?

Antony
Re: Speeding up looping over Hits
Another thing you may want to look at is the newer version 2.1.0 and getFieldable. I think that will lazy-load the data; that way you are only reading the parts of the document that you need at that moment rather than the whole thing. Someone please correct me if I am wrong or point to what I really mean :)

I had a similar situation a long while back, and I was able to find a patch for the version of Lucene I was using that allowed the above. It made a huge difference. I think something similar is now built into 2.1.0.

Andreas Guther <[EMAIL PROTECTED]> wrote:
> [question quoted above snipped]
Software Product Development Job Opportunity (Baltimore, MD)
Official job description & info to submit a resume:
http://www.systemsalliance.com/careers/internal-jobs/baltimore/Software_Engineer_MD.html

Located 15 minutes north of Baltimore in Sparks, MD. The position is on a team, working with myself and others, maintaining and developing an existing content management system. Quiet working environment in a shared office with a nice view. Management that chooses to do the right thing more often than the expedient. Full stack on your own machine (IIS/Apache, ColdFusion [JRun], SQL Server/Oracle) for local development. Trac for defect tracking & source control. Java work includes Lucene, XOM and JavaCC.

Feel free to contact me with questions. Email: mlesko at systemsalliance dot com.
Re: Speeding up looping over Hits
Oh yeah. By only loading the relevant fields, my query times were reduced by over 90%. I actually wrote that up on the mailing list if you want to try to find it, but it took Andreas' message to remind me...

Erick

On 3/22/07, Santa Clause <[EMAIL PROTECTED]> wrote:
> Another thing you may want to look at is the newer version 2.1.0 and
> getFieldable. I think that will lazy-load the data...
> [rest of thread, quoted above, snipped]
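A sketch of field-selective loading with the FieldSelector API that, as far as I know, arrived in 2.1. The field names, the "reader" variable, and the "topDocs" result are assumptions:

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.FieldSelector;
    import org.apache.lucene.document.SetBasedFieldSelector;

    Set fieldsToLoad = new HashSet(Arrays.asList(new String[] { "id", "title" }));
    Set lazyFields = new HashSet(); // nothing loaded lazily in this sketch
    FieldSelector selector = new SetBasedFieldSelector(fieldsToLoad, lazyFields);

    for (int i = 0; i < topDocs.scoreDocs.length; i++) {
        // Only the selected fields are read from disk; the rest are skipped.
        Document doc = reader.document(topDocs.scoreDocs[i].doc, selector);
        String id = doc.get("id");
    }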
Re: Extracting formatted text from PDF files
Mike O'Leary wrote:
> [...] a lot of freeware and shareware PDF-to-text converters seem to
> either ignore formatting or try to preserve formatting and not get it
> quite right.

Even pdftohtml? The sample outputs I've seen from that application don't look too bad to me.

Daniel

--
Daniel Noll
Nuix Pty Ltd
Suite 79, 89 Jones St, Ultimo NSW 2007, Australia
Ph: +61 2 9280 0699  Fax: +61 2 9212 6902
Web: http://nuix.com/
problem in reading an index
Hi,

I have written this piece of code to read the index, mainly to see what terms are in each document and what the frequency of each term in the document is. This piece of code correctly calculates the number of docs in the index, but I don't know why the variable myTermFreq[] is null. Would you please let me know your ideas about it?

IndexReader reader = IndexReader.open(myInd);
for (int docNo = 0; docNo < reader.numDocs(); docNo++) {
    TermFreqVector myTermFreq[] = reader.getTermFreqVectors(docNo);
    if (myTermFreq != null) {
        for (int i = 0; i < myTermFreq.length; i++) {
            int freq[] = myTermFreq[i].getTermFrequencies();
            // String terms[] = myTermFreq[i].getTerms();
            for (int j = 0; j < freq.length; j++) {
                // ... inspect freq[j]
            }
        }
    }
}
Re: problem in reading an index
Maryam wrote:
> I have written this piece of code to read the index, mainly to see what
> terms are in each document and what the frequency of each term in the
> document is. This piece of code correctly calculates the number of docs
> in the index, but I don't know why the variable myTermFreq[] is null.

From the getTermFreqVectors javadoc:

  Return an array of term frequency vectors for the specified document.
  The array contains a vector for each vectorized field in the document.
  Each vector contains terms and frequencies for all terms in a given
  vectorized field. If no such fields existed, the method returns null.

I.e. you may not have stored the term vectors when indexing the data.

Daniel

--
Daniel Noll
Nuix Pty Ltd
Suite 79, 89 Jones St, Ultimo NSW 2007, Australia
Ph: +61 2 9280 0699  Fax: +61 2 9212 6902
Web: http://nuix.com/
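Term vectors have to be requested per field at indexing time; a minimal sketch, with the field name, "text" and "writer" assumed:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    Document doc = new Document();
    // Field.TermVector.YES makes getTermFreqVectors() return data for this field.
    doc.add(new Field("contents", text,
                      Field.Store.NO, Field.Index.TOKENIZED, Field.TermVector.YES));
    writer.addDocument(doc);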
Reverse search
Hello,

I want to manage user subscriptions to specific documents. So I would like to store the subscription (query) in the Lucene directory, and whenever I receive a new document, I will search for all the matching subscriptions in order to send the document to all subscribers. For instance, if a user subscribes to all documents with text containing (WORD1 AND WORD2) OR WORD3, how can I match the incoming document against the stored subscriptions? I was thinking of having two subfields for each field of the subscription: the AND conditions and the OR conditions.

- OR: I will tokenize the document field content and insert OR between each token, and run the query against the OR condition of the subscription.

- It's the AND that will give me an issue, because the incoming text may contain more words than the sequence I want to search.

For instance, suppose I subscribe to documents containing "lucene" and "java", and the content of the incoming document is "lucene is a great API which has been developed in java". Once I remove stopwords, my query would look like "lucene AND great AND API AND developed AND java". As the query is composed of more words than the stored subscription, I will fail to retrieve the subscription. But if I use only OR between the words, the results will not be accurate, as I could match a subscription on only "java", for instance.

Do you know how I can handle this situation? I'm not sure I can actually do this using Lucene...

Thank you,
Mélanie
Re: Reverse search
On 23 Mar 2007, at 02:12, Melanie Langlois wrote:
> I want to manage user subscriptions to specific documents. So I would
> like to store the subscription (query) in the Lucene directory, and
> whenever I receive a new document, I will search for all the matching
> subscriptions in order to send the document to all subscribers. [...]

I wrote such a thing way back, where I used the new document as the query and the user subscriptions as the index. Similar to what you describe, I had an AND, OR and NOT field. This really limited the type of queries users could store. It does however work, particularly well on systems with /huge/ amounts of subscriptions (many millions).

Today I would use something else. If you insert one document at a time into your index, take a look at MemoryIndex in contrib. If you insert documents in batches larger than one document at a time, take a look at LUCENE-550 in Jira. Add new documents to such an index and run the subscribed queries against it. Depending on the queries, the speed should be some 20-100 times faster than using a RAMDirectory. One million queries should take some 20 seconds to assemble and run against a 25-document index on my laptop.

See <http://issues.apache.org/jira/secure/attachment/12353601/12353601_HitCollectionBench.jpg> for the performance of LUCENE-550.

--
karl
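A sketch of the MemoryIndex route for the single-document case. The field name, "incomingDocumentText", and the "subscriptionQueries" collection are assumptions:

    import java.util.Iterator;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.memory.MemoryIndex;
    import org.apache.lucene.search.Query;

    // Index the one incoming document in memory...
    MemoryIndex index = new MemoryIndex();
    index.addField("content", incomingDocumentText, new StandardAnalyzer());

    // ...then run every stored subscription query against it.
    for (Iterator it = subscriptionQueries.iterator(); it.hasNext();) {
        Query subscription = (Query) it.next();
        if (index.search(subscription) > 0.0f) {
            // the document matches this subscription -> notify the subscriber
        }
    }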
Re: problem in reading an index
On 23 Mar 2007, at 02:09, Daniel Noll wrote:
> From the getTermFreqVectors javadoc: "... If no such fields existed,
> the method returns null."
>
> I.e. you may not have stored the term vectors when indexing the data.

This thread might be of interest:
http://www.nabble.com/Resolving-term-vector-even-when-not-stored--tf3412160.html#a9507268

--
karl
RE: Reverse search
Thanks Karl, the performance graph is really amazing! I have to say I would not have thought this way around would be faster, but it sounds nice if I can use this; it makes everything easier to manage. I'm just wondering what you considered when you built your graph: only the time to run the queries? Because I should add the time for creating the index any time a new document comes in (or a batch of documents if several come in at the same time), plus the indexing of these documents. The documents should not be big, around 2KB. Did you measure this part?

Mélanie
Questions about Indexing
Hi, I have three questions about indexing:

1) I am indexing HTML documents; how can I do stop word removal before indexing? I don't want to index stop words.

2) I can access the terms in one document, but how can I get the name of the document in which these terms appear?

3) I want to find phrases at the index level, e.g. find the frequency of phrases in the collection, and also their frequency in each document. How can I do this in Lucene? Is there any sample code?

Thanks
Re: contrib/benchmark questions
OK, Doron (and other benchmarkers!), on to search. Here's my alg file:

#Indexing declaration up here
OpenReader
{ "SrchSameRdr" Search > : 5000
{ "SrchTrvSameRdr" SearchTrav > : 5000
{ "SrchTrvSameRdrTopTen" SearchTrav(10) > : 5000
{ "SrchTrvRetLoadAllSameRdr" SearchTravRet > : 5000
#Skip bytes and body
{ "SrchTrvRetLoadSomeSameRdr" SearchTravRetLoadFieldSelector(docid,docname,docdate,doctitle) > : 5000
CloseReader

Never mind the last task; I will be submitting a patch shortly that will make sense of it. Essentially, it specifies which fields to load for the document. Here are the results:

Operation                       round merge max.buffered runCnt recsPerRun    rec/s elapsedSec avgUsedMem avgTotalMem
OpenReader                          0    10           10      1          1    125.0       0.01  5,385,600   9,965,568
SrchSameRdr_5000                    0    10           10      1       5000  1,184.3       4.22  5,805,120   9,965,568
SrchTrvSameRdr_5000                 0    10           10      1     427500 71,776.4       5.96  5,806,144   9,965,568
SrchTrvSameRdrTopTen_5000           0    10           10      1     427500 62,001.4       6.89  5,766,584   9,965,568
SrchTrvRetLoadAllSameRdr_5000       0    10           10      1         85  7,226.4     117.62  6,161,728   9,965,568
SrchTrvRetLoadSomeSameRdr_5000      0    10           10      1         85 10,334.0      82.25  6,162,752   9,965,568
CloseReader                         0    10           10      1          1  1,000.0       0.00  5,921,856   9,965,568

The line I'm a bit confused by is recsPerRun. For the tasks that are doing the traversal and the retrieval, why so many recsPerRun? Is it counting the hits, the traversals and the retrievals each as one record? What I am trying to do is compare:

- Search
- Search plus traversal of all hits
- Search plus traversal of the top ten
- Search plus traversal and retrieval of all documents and all fields on the document
- Search plus traversal and retrieval of all documents and some fields on the document

I think I see in the ReadTask that it is the res var that is being incremented and would have to be altered. I guess I can go by elapsed time, but even that seems slightly askew. I think this is due to the withRetrieve() function overhead inside the for loop. I have moved it out and will submit that change, too. Am I interpreting this correctly?

-Grant

On Mar 19, 2007, at 5:11 PM, Doron Cohen wrote:

Grant Ingersoll <[EMAIL PROTECTED]> wrote on 19/03/2007 13:10:16:

> So, if I am understanding correctly:
> "SearchSameRdr" Search > : 5000
> means don't collect indiv. stats for SearchSameRdr, but do whatever that task does 5000 times, right?

Almost... It should, by the way, be

{ "SearchSameRdr" Search > : 5000

and it means: run Search 5000 times, sequentially, assign the name "SearchSameRdr" to that sequence of 5000, and do not collect individual stats for the individual tasks making up that sequence. If it was just

{ Search > : 5000

it would still mean the same, just that a name would be assigned to it for you, something like "Seq_Search_5000". If it was

{ "SearchSameRdr" Search } : 5000

it would be the same as your example, except that stats would be collected not only for the entire elapsed sequence, but also broken down for each of the 5000 calls to Search. Similar logic applies with [ .. ] and [ .. >, just that the tasks making up the (parallel) sequence are executed in parallel, each in a separate thread.

> 3. Is there any way to dump out the stats as a CSV file or something? Would I implement a Task for this? Ultimately, I want to be able to create a graph in Excel that shows tradeoffs between speed and memory.

Yes, implementing a report task would be the way. ...
but when I look at how I implemented these reports, all the work is done in the class Points. It seems it should be modified a little, with more thought given to making it easier to extend reports. I may take a crack at it, but the deadline for the talk is looming. I'll take a look too, and let you know if I have anything. - Being interested in memory stats: the fact that all the rounds run in a single program, in the same JVM run, usually means that what you see is very much dependent on the GC behavior of the specific VM you are using. If it does not release memory to the OS (most likely), you would not be able to notice that round i+1 used less memory than round i. It would probably be better for something like this to put the "round" logic in an ant script, invoking each round in a separate new exec. But then things get more complicated for having a final stats report containing all rounds. What do you think about this?
Re: How can I index Phrases in Lucene?
Is there any way to find frequent phrases without knowing what you are looking for? I could index "A B C D E" as "A B C", "B C D", "C D E", etc., but that seems kind of clunky, particularly if the phrase length is large. Is there any position-offset magic that will surface frequent phrases automatically?

thanks
ryan

On 3/22/07, Erick Erickson <[EMAIL PROTECTED]> wrote:

Well, you don't index phrases, it's done for you. You should try something like the following: create a SpanNearQuery with your terms, specify an appropriate slop (probably 0, assuming you want them all next to each other), then call getSpans and count... You may have to do something with overlapping spans, but you'll need to experiment a bit to understand it.

Erick

On 3/22/07, Maryam <[EMAIL PROTECTED]> wrote:
>
> Hi,
>
> I know how to index terms in lucene; now I want to see how I can index phrases like "information retrieval" in lucene and calculate the number of times that phrase has appeared in the document. Is there any way to do it in Lucene?
>
> Thanks
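[A short sketch of Erick's SpanNearQuery suggestion, assuming Lucene 2.x, a hypothetical index path, and a "body" field that was analyzed at indexing time:]

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;
import org.apache.lucene.search.spans.Spans;

public class PhraseFrequencySketch {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open("/path/to/index"); // hypothetical location
        SpanQuery phrase = new SpanNearQuery(
                new SpanQuery[] {
                        new SpanTermQuery(new Term("body", "information")),
                        new SpanTermQuery(new Term("body", "retrieval")) },
                0,     // slop 0: the terms must be adjacent
                true); // and in order
        Spans spans = phrase.getSpans(reader);
        int collectionFreq = 0;
        while (spans.next()) { // one span per occurrence of the phrase
            System.out.println("doc " + spans.doc() + ", position " + spans.start());
            collectionFreq++;
        }
        System.out.println("collection frequency: " + collectionFreq);
        reader.close();
    }
}

[Grouping the printed doc ids then gives the per-document frequency Maryam asked about.]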
Re: contrib/benchmark questions
On Mar 22, 2007, at 11:21 PM, Grant Ingersoll wrote:

> I think I see in the ReadTask that it is the res var that is being incremented and would have to be altered. I guess I can go by elapsed time, but even that seems slightly askew. I think this is due to the withRetrieve() function overhead inside the for loop. I have moved it out and will submit that change, too.

Moving it out of the loop made little diff. so I guess it is mostly just due to it being late and me being tired and not thinking clearly. B/c if I were, I would just realize that those operations are also retrieving documents...

-Grant
Re: How can I index Phrases in Lucene?
23 mar 2007 kl. 04.25 skrev Ryan McKinley:

> Is there any way to find frequent phrases without knowing what you are looking for?

I think you are looking for association rules. Try searching for Levelwise-Scan. Weka contains GPLed Java code. CiteSeer is your best friend for whitepapers: http://citeseer.ist.psu.edu/cs

-- karl
Re: Reverse search
23 mar 2007 kl. 03.07 skrev Melanie Langlois:

> Thanks Karl, the performance graph is really amazing! [...] I'm just wondering what you considered when you built your graph, only the time to run the queries? Because I should add the time for creating the index any time a new document comes in (or a batch of documents if several come in at the same time), plus the indexing of these documents. The documents should not be big, around 2KB. Did you measure this part?

Adding a document to a MemoryIndex or InstantiatedIndex takes more or less the same time it would take to add it to an empty RAMDirectory. How many clock ticks are spent really depends on what analyzers you use.

-- karl
Re: Combining score from two or more hits
: Thanks Erick, I've been using TopDocs, but am playing with my own HitCollector
: variant of TopDocHitCollector. The problem is not adjusting the score, it's
: what to adjust it by, i.e. is it possible to re-evaluate the scores of H1 and H2
: knowing that the original query resulted in hits on H1 AND H2.

If you are using a HitCollector, then any re-evaluation is going to happen in your code using whatever mechanism you want. Once your collect method is called on a docid, Lucene is done with that docid and no longer cares about it ... it's only whatever storage you may be maintaining of high-scoring docs that needs to know that you've decided the score has changed.

Your big problem is going to be that you basically need to maintain a list of *every* doc collected, if you don't know what the score of any of them is until you've processed all the rest ... since docs are collected in increasing order of docid, you might be able to make some optimizations based on how big a gap there is between the doc you are currently collecting and the last doc you collected, if you know that you're always going to add docs that "relate" to each other in sequential bundles. But this would be some very custom code depending on your use case.

-Hoss
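[A bare-bones sketch of the collect-everything HitCollector Hoss describes, assuming the Lucene 2.x HitCollector API; the post-processing that combines related hits is left out because it is entirely application-specific:]

import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.search.HitCollector;

// Keeps every (docid, score) pair; scores can only be re-evaluated after
// collection finishes, since a doc's final score depends on which other
// docs were hit.
public class CollectAllHits extends HitCollector {
    private final List docs = new ArrayList();   // Integer docids, in increasing order
    private final List scores = new ArrayList(); // Float raw scores, parallel to docs

    public void collect(int doc, float score) {
        // Lucene is done with this docid as soon as we return
        docs.add(new Integer(doc));
        scores.add(new Float(score));
    }

    public List getDocs() { return docs; }
    public List getScores() { return scores; }
}

[Usage: searcher.search(query, collector) fills the two lists; afterwards you walk them and merge the scores of docs your application knows to be related.]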
Re: contrib/benchmark questions
Hi Grant, I think you resolved the question already, but just to make sure...

Grant Ingersoll <[EMAIL PROTECTED]> wrote on 22/03/2007 20:41:27:
>
> On Mar 22, 2007, at 11:21 PM, Grant Ingersoll wrote:
>
> > I think I see in the ReadTask that it is the res var that is being
> > incremented and would have to be altered. I guess I can go by
> > elapsed time, but even that seems slightly askew. I think this is
> > due to the withRetrieve() function overhead inside the for loop. I
> > have moved it out and will submit that change, too.
>
> Moving it out of the loop made little diff. so I guess it is mostly
> just due to it being late and me being tired and not thinking
> clearly. B/c if I were, I would just realize that those operations
> are also retrieving documents...

Seems the cause for confusion is that #recs means different things for different tasks. For all tasks, it means (at least) the number of times that task executed. For warm, it adds one for each document retrieved. For traverse, it adds one for each doc id traversed, and for traverseAndRetrieve, it also adds one for each doc being retrieved. I'll update the javadocs with this clarification. Moving the call out of the loop is the right thing of course; it changed the time only, not the #recs, right?

Regards, Doron
Re: Questions about Indexing
Maryam wrote:
> Hi, I have three questions about indexing:
> 1) I am indexing HTML documents; how can I do stop word removal before indexing? I don't want to index stop words.

The same way you would do it for indexing text documents: StopFilter.

> 2) I can access the terms in one document, but how can I get the name of the document in which these terms appear?

The usual way to do this is to store the document name as another field.

Daniel

--
Daniel Noll
Nuix Pty Ltd
Suite 79, 89 Jones St, Ultimo NSW 2007, Australia
Ph: +61 2 9280 0699  Fax: +61 2 9212 6902
Web: http://nuix.com/
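[A minimal indexing sketch following Daniel's two answers, assuming Lucene 2.x; the index path, file name and extracted text are hypothetical, and the HTML-to-text extraction is not shown:]

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class StopWordIndexing {
    public static void main(String[] args) throws Exception {
        // StopAnalyzer = lower-casing plus English stop word removal
        // (it wraps the StopFilter Daniel mentions; StandardAnalyzer
        // removes the same stop words on top of its own tokenization)
        Analyzer analyzer = new StopAnalyzer();
        IndexWriter writer = new IndexWriter("/tmp/html-index", analyzer, true);

        String extractedText = "...text pulled out of the HTML...";
        Document doc = new Document();
        // storing the file name as its own field answers question 2
        doc.add(new Field("name", "page1.html", Field.Store.YES, Field.Index.UN_TOKENIZED));
        doc.add(new Field("body", extractedText, Field.Store.NO, Field.Index.TOKENIZED));
        writer.addDocument(doc);
        writer.close();
    }
}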
Ignore Words Problem
I want to make sure whether this statement is right or not: "I am using StandardAnalyzer for indexing documents. By default it ignores some words when doing indexing. But when we search something, Lucene again includes the ignored words in the search."???

My problem is this: I indexed a Word document using StandardAnalyzer. There are many words like "is am are that the" which are ignored by Lucene. And when I want to search with a query which must match all words given by the user (an AND query), it does not return results. For example, I want to find those documents which MUST have ALL the words in "this is garden". For this I have made an AND query, but Lucene now gives no results, because "garden" is there but it cannot find the words "is" and "this", since they were ignored at indexing time. So what is the better workaround? Any help will be appreciated.
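[One common workaround, sketched here as an assumption rather than something from the replies in this thread: parse the query with the same StandardAnalyzer used at indexing time, so stop words are dropped from the query exactly as they were dropped from the index. Assuming Lucene 2.x and a hypothetical "body" field:]

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

public class StopWordQuerySketch {
    public static void main(String[] args) throws Exception {
        QueryParser parser = new QueryParser("body", new StandardAnalyzer());
        parser.setDefaultOperator(QueryParser.AND_OPERATOR); // every remaining term is required
        // StandardAnalyzer drops "this" and "is" at parse time, exactly as it
        // did at index time, so the query effectively reduces to body:garden
        // and the document matches.
        Query q = parser.parse("this is garden");
        System.out.println(q);
    }
}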
Re: how ungrouped query handled?
Thanks for your reply and these useful links.

On 3/23/07, Chris Hostetter <[EMAIL PROTECTED]> wrote:

see also the FAQ "Why am I getting no hits / incorrect hits?" which points to...

http://wiki.apache.org/lucene-java/BooleanQuerySyntax

...I've just added some more words of wisdom there from past emails.

: Date: Thu, 22 Mar 2007 09:51:15 -0400
: From: Erick Erickson <[EMAIL PROTECTED]>
: Reply-To: java-user@lucene.apache.org
: To: java-user@lucene.apache.org
: Subject: Re: how ungrouped query handled?
:
: This is a pretty common issue that I've been grappling with by chance
: recently. The main point is that the parser is NOT a boolean logic
: parser.
:
: Search the mail archive for the thread "bad query parser bug" and
: you'll find a good discussion.
:
: I tried using PrecedenceQueryParser, but that didn't work very well for
: me; search the mail archive on that and you'll see some examples of why.
:
: I solved this problem for my immediate issues by writing a very
: quick-and-dirty parenthesizer for my raw query. If it weren't going
: on summer, I might see if I can contribute something by
: seeing if there's a way to fix PrecedenceQueryParser.
:
: Best
: Erick
:
: On 3/22/07, SK R <[EMAIL PROTECTED]> wrote:
: >
: > Hi,
: > Can anyone explain how lucene handles the query below?
: > My query is *field1:source AND (field2:name OR field3:dest)*. I've
: > given this string to the queryparser and then searched using the searcher.
: > It returns correct results. Its query.toString() print is: +field1:source
: > +(field2:name field3:dest)
: > But if I don't group my terms, i.e. my query is *field1:source AND
: > field2:name OR field3:dest*, then it gives the result of the first two
: > terms' search. It doesn't search the 3rd term. Its query.toString() print
: > is: +field1:source +field2:name field3:dest.
: > If I use the same boolean operator between all terms, then it returns
: > correct results.
: > Why doesn't it search the terms after the 2nd term if grouping is not used?
: >
: > Thanks & Regards
: > RSK

-Hoss
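[When the precedence of the parsed query matters, one option (a sketch against the Lucene 2.x API, not something proposed in the thread) is to build the grouped query programmatically and bypass QueryParser entirely:]

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

public class GroupedQuerySketch {
    public static void main(String[] args) {
        // build field1:source AND (field2:name OR field3:dest) directly,
        // sidestepping QueryParser's non-boolean precedence
        BooleanQuery inner = new BooleanQuery();
        inner.add(new TermQuery(new Term("field2", "name")), BooleanClause.Occur.SHOULD);
        inner.add(new TermQuery(new Term("field3", "dest")), BooleanClause.Occur.SHOULD);

        BooleanQuery outer = new BooleanQuery();
        outer.add(new TermQuery(new Term("field1", "source")), BooleanClause.Occur.MUST);
        outer.add(inner, BooleanClause.Occur.MUST);

        // prints: +field1:source +(field2:name field3:dest),
        // the same query the grouped string produces
        System.out.println(outer);
    }
}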
MergeFactor and MaxBufferedDocs value should ...?
Hi, I've looked at the uses of MergeFactor and MaxBufferedDocs. If I set MergeFactor = 100 and MaxBufferedDocs = 250, then the first 100 segments will be merged in the RAMDir when 100 docs have arrived. At the end of the 350th doc added to the writer, the RAMDir will have 2 merged segment files + 50 separate segment files not merged together, and these are flushed to the FSDir. If this is wrong, please correct me. My doubt is whether we should set MergeFactor & MaxBufferedDocs in a proportional ratio, i.e. MaxBufferedDocs = n*MergeFactor where n = 1, 2, ..., to reduce indexing time and get greater performance, or whether there is no need to worry about their relation.

Thanks & Regards
RSK
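[For reference, a minimal sketch of where these two knobs live, assuming Lucene 2.x and using the values from the question; the index path is hypothetical:]

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class WriterTuningSketch {
    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer(), true);
        writer.setMaxBufferedDocs(250); // buffered docs are flushed to a new segment every 250 adds
        writer.setMergeFactor(100);     // merge once 100 segments accumulate at the same level
        // ... writer.addDocument(doc) calls go here ...
        writer.close();
    }
}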
Re: Ignore Words Problem
What part of Grant's and Karl's answers the last time you asked this question wasn't clear? Have you tried it?

http://www.nabble.com/Re%3A-Common-Words-ignoring-problem-p9550886.html
http://www.nabble.com/Re%3A-Common-Words-ignoring-problem-p9567881.html

: I want to make sure whether this statement is right or not:
: "I am using StandardAnalyzer for indexing documents. By default it ignores some words when doing indexing. But when we search something, Lucene again includes the ignored words in the search."??? [...]

-Hoss