RE: demo IndexHTML parser breaks unicode?
In org.apache.lucene.demo.HTMLDocument you need to change the input stream to use a different encoding. Replace the fis with this:

    fis = new InputStreamReader(new FileInputStream(f), "UTF-16");

-----Original Message-----
From: Fred Toth [mailto:[EMAIL PROTECTED]
Sent: Friday, September 24, 2004 9:25 PM
To: Lucene Users List
Subject: Re: demo IndexHTML parser breaks unicode?

Sorry, that didn't cure it. Again, anyone want to point me to the quickest replacement HTML parser (that's unicode clean)?

Thanks,
Fred

At 03:17 PM 9/24/2004, you wrote:
> On Friday 24 September 2004 19:58, Fred Toth wrote:
> > I've got unicode in my source HTML. In particular, within meta tags,
> > and it's getting broken by the indexer. Note that I'm not trying to
> > query on any of this, just store and retrieve document titles with
> > unicode characters.
>
> Please try again with the code from CVS, Christoph Goller committed a fix
> for this problem (at least I think it was this problem) 1-3 weeks ago.
>
> Regards
>  Daniel
>
> --
> http://www.danielnaber.de

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
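The fix above works because the decoder, not the byte stream, determines how multi-byte characters come out. A small illustration in modern Java (StandardCharsets is Java 7+; the 2004-era code would pass the charset name as a String instead): decoding the same bytes with the wrong charset mangles the text.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetDemo {
    // Decode a byte stream with an explicit charset, mirroring the
    // InputStreamReader fix suggested above.
    static String decode(byte[] bytes, Charset cs) {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new InputStreamReader(new ByteArrayInputStream(bytes), cs)) {
            int c;
            while ((c = r.read()) != -1) sb.append((char) c);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String original = "caf\u00e9";               // the accent is two bytes in UTF-8
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);

        // The right charset round-trips the text; the wrong one mangles it.
        System.out.println(decode(utf8, StandardCharsets.UTF_8));      // café
        System.out.println(decode(utf8, StandardCharsets.ISO_8859_1)); // cafÃ©
    }
}
```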
TopTerms on query results
Can anyone help me with code to get the top terms of a given field for a query result set? Here is code modified from Luke to get the top terms for a field:

public TermInfo[] mostCommonTerms(String fieldName, int numberOfTerms) {
    // make sure min will get a positive number
    if (numberOfTerms < 1) {
        numberOfTerms = Integer.MAX_VALUE;
    }
    numberOfTerms = Math.min(numberOfTerms, 50);
    try {
        IndexReader reader = IndexReader.open(indexPath);
        TermInfoQueue tiq = new TermInfoQueue(numberOfTerms);
        TermEnum terms = reader.terms();
        int minFreq = 0;
        while (terms.next()) {
            if (fieldName.equalsIgnoreCase(terms.term().field())) {
                if (terms.docFreq() > minFreq) {
                    tiq.put(new TermInfo(terms.term(), terms.docFreq()));
                    if (tiq.size() >= numberOfTerms) {            // if tiq overfull
                        tiq.pop();                                // remove lowest in tiq
                        minFreq = ((TermInfo) tiq.top()).docFreq; // reset minFreq
                    }
                }
            }
        }
        TermInfo[] res = new TermInfo[tiq.size()];
        for (int i = 0; i < res.length; i++) {
            res[res.length - i - 1] = (TermInfo) tiq.pop();
        }
        reader.close();
        return res;
    } catch (IOException ioe) {
        logger.error("IOException: " + ioe.getMessage());
    }
    return null;
}
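The TermInfoQueue logic above can be sketched with the JDK's own PriorityQueue: keep a min-heap of at most N entries and evict the lowest document frequency whenever it overflows. (Pure-Java sketch; TermEnum and TermInfo are replaced by a plain frequency map for illustration.)

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

public class TopTerms {
    // Return the n terms with the highest frequencies, most frequent first.
    // A min-heap holds the current top n; the lowest-frequency entry is
    // evicted whenever the heap overflows, just like the TermInfoQueue above.
    static List<String> topTerms(Map<String, Integer> freqs, int n) {
        PriorityQueue<Map.Entry<String, Integer>> heap =
                new PriorityQueue<>(Map.Entry.comparingByValue());
        for (Map.Entry<String, Integer> e : freqs.entrySet()) {
            heap.offer(e);
            if (heap.size() > n) heap.poll();   // drop the lowest-frequency term
        }
        List<String> result = new ArrayList<>();
        while (!heap.isEmpty()) result.add(0, heap.poll().getKey()); // highest first
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> freqs = Map.of("lucene", 42, "index", 17, "query", 29, "field", 5);
        System.out.println(topTerms(freqs, 2)); // [lucene, query]
    }
}
```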
getting most common terms for a smaller set of documents
Dear Lucene Users:

What is the best way to get the most common terms for a subset of the total documents in your index? I know how to get the most common terms for a field across the entire index, but what is the most efficient way to do this for a subset of documents? Here is the code I am using to get the top numberOfTerms common terms for the field fieldName:

public TermInfo[] mostCommonTerms(String fieldName, int numberOfTerms) {
    // make sure min will get a positive number
    if (numberOfTerms < 1) {
        numberOfTerms = Integer.MAX_VALUE;
    }
    numberOfTerms = Math.min(numberOfTerms, 50);
    try {
        IndexReader reader = IndexReader.open(indexPath);
        TermInfoQueue tiq = new TermInfoQueue(numberOfTerms);
        TermEnum terms = reader.terms();
        int minFreq = 0;
        while (terms.next()) {
            if (fieldName.equalsIgnoreCase(terms.term().field())) {
                if (terms.docFreq() > minFreq) {
                    tiq.put(new TermInfo(terms.term(), terms.docFreq()));
                    if (tiq.size() >= numberOfTerms) {            // if tiq overfull
                        tiq.pop();                                // remove lowest in tiq
                        minFreq = ((TermInfo) tiq.top()).docFreq; // reset minFreq
                    }
                }
            }
        }
        TermInfo[] res = new TermInfo[tiq.size()];
        for (int i = 0; i < res.length; i++) {
            res[res.length - i - 1] = (TermInfo) tiq.pop();
        }
        reader.close();
        return res;
    } catch (IOException ioe) {
        logger.error("IOException: " + ioe.getMessage());
    }
    return null;
}
RE: Spam:too many open files
A note to developers: the code checked into Lucene CVS around August 15th, post 1.4.1, was causing frequent index corruptions. When I reverted back to version 1.4 I no longer got the corruptions. I was unable to trace the problem to anything specific, but was using the newer code to take advantage of the sort fixes.

-----Original Message-----
From: Patrick Kates [mailto:[EMAIL PROTECTED]
Sent: Wednesday, September 01, 2004 1:30 PM
To: [EMAIL PROTECTED]
Subject: Spam:too many open files

I am having two problems with my client's Lucene indexes.

One, we are getting a FileNotFound exception (too many open files). This would seem to indicate that I need to increase the number of open files on our SuSE 9.0 Pro box. I have our sys admin working on this problem for me.

Two, because of this error and subsequent restarting of the box, we seem to have lost an index segment or two. My client's tape backups do not contain the segments we know about. I am concerned about the missing index segments as they seem to be preventing any further update of the index.

Does anyone have any suggestions as to how to fix this besides a full re-index of the problem indexes? I was wondering if maybe a merge of the index might solve the problem? I could move our nightly merge of the index files to sooner, but I am afraid that the merge might make matters worse?

Any ideas or helpful speculation would be greatly appreciated.

Patrick
RE: Spam:too many open files
I sent out an email to this list a few weeks ago about how to fix a corrupt index. I basically edited the segments file with a hex editor, removing the entry for the missing file, and decremented the total count of files in the file count that is near the beginning of the segments file.

-----Original Message-----
From: Patrick Kates [mailto:[EMAIL PROTECTED]
Sent: Wednesday, September 01, 2004 1:30 PM
To: [EMAIL PROTECTED]
Subject: Spam:too many open files

[snip: original message quoted in full above]
RE: Restoring a corrupt index
Change 02 to be 01 and delete the bytes that represent the one record that is bad. It was easier to see what a record was in my file because I had about 30 _files.

-----Original Message-----
From: Honey George [mailto:[EMAIL PROTECTED]
Sent: Tuesday, August 17, 2004 10:39 AM
To: Lucene Users List
Subject: RE: Restoring a corrupt index

I think attachments are filtered. This is what I see when I open the segments file in the hex editor:

0000: 00 04 e0 af 00 00 00 02 05 5f 36 75 6e 67 00 04   ..à¯._6ung..
0010: 1e fb 05 5f 36 75 6e 69 00 00 00 01 00 00 00 00   .û._6uni
0020: 00 00 c1 b4                                       ..Á´

-George

--- Honey George <[EMAIL PROTECTED]> wrote:
> Wallen,
> Which hex editor have you used? I am also facing a
> similar problem. I tried to use KHexEdit and it
> doesn't seem to help. I am attaching with this email
> my segments file. I think only the segment with name
> _ung is a valid one; I wanted to delete the
> remaining, but couldn't. Can you help?
>
> -George
>
> --- [EMAIL PROTECTED] wrote:
> > I fixed my own problem, but hope this might help
> > someone else in the future:
> >
> > I went into my segments file (with a hex editor),
> > deleted the record for _cu0v and changed the length
> > 0x20 to be 0x1f, and it seems I have most of my
> > index back!
> >
> > Maybe a developer could elaborate on this?
RE: Restoring a corrupt index
http://www.ultraedit.com/ is the best! However, I cannot imagine how another hex editor wouldn't work.

-----Original Message-----
From: Honey George [mailto:[EMAIL PROTECTED]
Sent: Tuesday, August 17, 2004 10:35 AM
To: Lucene Users List
Subject: RE: Restoring a corrupt index

Wallen,
Which hex editor have you used? I am also facing a similar problem. I tried to use KHexEdit and it doesn't seem to help. I am attaching with this email my segments file. I think only the segment with name _ung is a valid one; I wanted to delete the remaining, but couldn't. Can you help?

-George

--- [EMAIL PROTECTED] wrote:
> I fixed my own problem, but hope this might help
> someone else in the future:
>
> I went into my segments file (with a hex editor),
> deleted the record for _cu0v and changed the length
> 0x20 to be 0x1f, and it seems I have most of my
> index back!
>
> Maybe a developer could elaborate on this?
RE: Restoring a corrupt index
I fixed my own problem, but hope this might help someone else in the future:

I went into my segments file (with a hex editor), deleted the record for _cu0v and changed the length 0x20 to be 0x1f, and it seems I have most of my index back!

Maybe a developer could elaborate on this?

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
Sent: Monday, August 16, 2004 2:16 PM
To: [EMAIL PROTECTED]
Subject: Restoring a corrupt index

Dear fellow Luceners,

I had a disk failure while indexing and am now unable to get ANY of the documents stored in my index. I am interested in restoring as many documents as possible from what is a mostly complete index. Is there something I can alter by hand to at least get most of the data back?

I am getting an EOF error on the file/segment _cu0v, which was presumably the file that was being written when the index crashed. Is there a reference to that file in segments that I could edit out?

I have included what I hope is useful information below.

Thank you,
Will

This is the call stack from an optimize call:

IndexWriter writer = new IndexWriter(path, new StandardAnalyzer(), false);
--> writer.optimize();
logger.debug(writer.docCount() + "");
writer.close();

---Call Stack---
java.io.IOException: read past EOF
        at org.apache.lucene.store.InputStream.refill(InputStream.java:154)
        at org.apache.lucene.store.InputStream.readByte(InputStream.java:43)
        at org.apache.lucene.store.InputStream.readVInt(InputStream.java:83)
        at org.apache.lucene.index.CompoundFileReader.<init>(CompoundFileReader.java:66)
        at org.apache.lucene.index.SegmentReader.initialize(SegmentReader.java:104)
        at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:94)
        at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:480)
        at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:366)
        at TryStuff.tryFixingLuceneIndex(TryStuff.java:60)
        at TryStuff.main(TryStuff.java:49)

---Directory listing---
-rw-rw-r--  1 wallen devs    383461 Jul 27 16:48 _1wtg.cfs
-rw-rw-r--  1 wallen devs 754131765 Jul 27 21:12 _262q.cfs
-rw-rw-r--  1 wallen devs 754345785 Jul 29 11:43 _4c49.cfs
-rw-rw-r--  1 wallen devs 719608798 Jul 31 04:38 _6i6l.cfs
-rw-rw-r--  1 wallen devs 773242798 Aug  2 03:05 _8o79.cfs
-rw-rw-r--  1 wallen devs 791843591 Aug  3 12:13 _au8j.cfs
-rw-rw-r--  1 wallen devs  77665301 Aug  3 14:35 _b21n.cfs
-rw-rw-r--  1 wallen devs  79123000 Aug  3 17:49 _b9uk.cfs
-rw-rw-r--  1 wallen devs  71718714 Aug  3 22:05 _bhnf.cfs
-rw-rw-r--  1 wallen devs  81537292 Aug  4 02:50 _bpga.cfs
-rw-rw-r--  1 wallen devs  80611946 Aug  4 07:44 _bx95.cfs
-rw-rw-r--  1 wallen devs  77923836 Aug  4 13:23 _c523.cfs
-rw-rw-r--  1 wallen devs         0 Aug  4 14:20 _caip.fnm
-rw-rw-r--  1 wallen devs  79987096 Aug  4 15:29 _ccxt.cfs
-rw-rw-r--  1 wallen devs  84966054 Aug  4 16:25 _ckqo.cfs
-rw-rw-r--  1 wallen devs  90829602 Aug  4 19:14 _csjj.cfs
-rw-rw-r--  1 wallen devs   7486317 Aug  4 19:23 _ctbm.cfs
-rw-rw-r--  1 wallen devs   1148765 Aug  4 19:24 _ctef.cfs
-rw-rw-r--  1 wallen devs    958149 Aug  4 19:27 _cth8.cfs
-rw-rw-r--  1 wallen devs    909911 Aug  4 19:28 _ctk1.cfs
-rw-rw-r--  1 wallen devs    918952 Aug  4 19:28 _ctmu.cfs
-rw-rw-r--  1 wallen devs    957856 Aug  4 19:31 _ctpn.cfs
-rw-rw-r--  1 wallen devs    651717 Aug  4 19:32 _ctsg.cfs
-rw-rw-r--  1 wallen devs    790354 Aug  4 19:32 _ctv9.cfs
-rw-rw-r--  1 wallen devs    890058 Aug  4 19:35 _cty2.cfs
-rw-rw-r--  1 wallen devs         0 Aug  4 19:35 _cu0v.cfs
-rw-rw-r--  1 wallen devs    891397 Aug  5 13:36 _cu3o.cfs
-rw-rw-r--  1 wallen devs   1085511 Aug  5 13:40 _cu6h.cfs
-rw-rw-r--  1 wallen devs    754877 Aug  5 13:40 _cu9b.cfs
-rw-rw-r--  1 wallen devs   1610682 Aug  5 13:40 _cuc5.cfs
-rw-rw-r--  1 wallen devs   1039577 Aug  5 13:41 _cuez.cfs
-rw-rw-r--  1 wallen devs    831174 Aug  5 13:41 _cuht.cfs
-rw-rw-r--  1 wallen devs    930858 Aug  5 13:56 _cuko.cfs
-rw-rw-r--  1 wallen devs    911844 Aug  5 13:56 _cuni.cfs
-rw-rw-r--  1 wallen devs       340 Aug  5 13:56 segments
-rw-rw-r--  1 wallen devs         4 Aug  5 13:56 deletable
drwxrwxrwx  2 wallen devs    929792 Aug  5 13:56 .
drwxrwxr-x  5 wallen devs        40 Aug 10 14:13 ..
RE: Finding All?
A range query that covers the full range does the same thing. Of course it is also inefficient with term generation:

    myField:[a TO z]

-----Original Message-----
From: Patrick Burleson [mailto:[EMAIL PROTECTED]
Sent: Friday, August 13, 2004 3:58 PM
To: Lucene Users List
Subject: Re: Finding All?

That is a very interesting idea. I might give that a shot.

Thanks,
Patrick

On Fri, 13 Aug 2004 15:36:11 -0400, Tate Avery <[EMAIL PROTECTED]> wrote:
>
> I had to do this once and I put a field called "all" with a value of "true" for every document.
>
> _doc.addField(Field.Keyword("all", "true"));
>
> Then, if there was an empty query, I would substitute it for the query "all:true". And, of course, every doc would match this.
>
> There might be a MUCH more elegant solution, but this certainly worked for me and was quite easy to incorporate. And, it appears to order the documents by the order in which they were indexed.
>
> T
>
> p.s. You can probably do something using IndexReader directly... but the nice thing about this approach is that you are still just using a simple query.
>
> -----Original Message-----
> From: Patrick Burleson [mailto:[EMAIL PROTECTED]
> Sent: Friday, August 13, 2004 3:25 PM
> To: Lucene Users List
> Subject: Finding All?
>
> Is there a way for lucene to find all documents? Say if I have a
> search input and someone puts nothing in, I want to go ahead and
> return everything. Passing "*" to QueryParser was not pretty.
>
> Thanks,
> Patrick
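One caveat about the full-range workaround quoted above (an observation, not from the thread): a term range is a lexicographic comparison, so an inclusive upper bound of "z" silently excludes any term that sorts after the bare string "z". A pure-Java sketch of the comparison:

```java
import java.util.List;
import java.util.stream.Collectors;

public class RangeCaveat {
    // Emulate an inclusive term range [lo TO hi] over a list of terms.
    static List<String> inRange(List<String> terms, String lo, String hi) {
        return terms.stream()
                .filter(t -> t.compareTo(lo) >= 0 && t.compareTo(hi) <= 0)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> terms = List.of("alpha", "query", "zebra");
        // "zebra" sorts after "z", so the range [a TO z] silently drops it:
        System.out.println(inRange(terms, "a", "z")); // [alpha, query]
    }
}
```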
RE: Question on the minimum value for DateField
The date is stored as a long that is the number of milliseconds since January 1970. Anything before that would be negative.

-----Original Message-----
From: Terence Lai [mailto:[EMAIL PROTECTED]
Sent: Wednesday, August 04, 2004 6:25 PM
To: Lucene Users List
Subject: Question on the minimum value for DateField

Hi All,

I realize that the DateField cannot accept a value which is before the year 1970, specifically in the org.apache.lucene.document.DateField.timeToString() method. Is there any technical reason for this limitation?

Thanks,
Terence

--
Get your free email account from http://www.trekspace.com
Your Internet Virtual Desktop!
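To see why the encoding bottoms out at 1970: java.util.Date counts milliseconds from 1970-01-01T00:00:00 UTC, and any earlier instant is a negative number, which an encoding that assumes non-negative values cannot represent. A quick stdlib check:

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class EpochDemo {
    public static void main(String[] args) {
        // java.util.Date counts milliseconds since 1970-01-01T00:00:00 UTC;
        // instants before the epoch come out negative.
        Date epoch = new Date(0L);
        Date before = new Date(-86_400_000L);   // one day before the epoch

        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd");
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));

        System.out.println(fmt.format(epoch));   // 1970-01-01
        System.out.println(fmt.format(before));  // 1969-12-31
        System.out.println(before.getTime());    // -86400000
    }
}
```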
RE: TermFreqVector Beginner Question
Are you certain that you are storing the field "contents" in your documents, not just tokenizing it? If you use the overloaded method that takes a Reader, you lose the content.

-----Original Message-----
From: Grant Ingersoll [mailto:[EMAIL PROTECTED]
Sent: Wednesday, July 28, 2004 5:35 PM
To: [EMAIL PROTECTED]
Subject: Re: TermFreqVector Beginner Question

Can you post the whole section of related code? Sounds like you are doing things right. In the Lucene source code, there is a file called TestTermVectors.java; take a look at that and see how your stuff compares. I ran the test against the HEAD and it worked.

>>> [EMAIL PROTECTED] 07/28/04 04:51PM >>>
Howdy,

I am new to Lucene and thus far I am very impressed. Thanks to all who have worked on this project!

I am working on a project where I want to do the following:
1.) Index a bunch of documents.
2.) Pluck out one of the documents by Lucene document number.
3.) Get a term frequency for that document.

After some digging and playing I came across this method...

    IndexReader.getTermFreqVector(int docNumber, String field)

This is exactly what I want. So I ran the IndexFiles demo program with some test documents and started poking at the index with an IndexReader. But when I called IndexReader.getTermFreqVector(someDocNumber, "contents") I got NULL back. After a little more digging I found that for a TermVector to exist the Field has to have the TermVector flag set. So I changed some lines in the demo FileDocument.Document method to:

    FileInputStream is = new FileInputStream(f);
    Reader reader = new BufferedReader(new InputStreamReader(is));
    doc.add(Field.Text("contents", reader.toString(), true));

with the "true" parameter causing the new Field to turn on the storeTermVector flag, right? So then I reindex and get the same results - getTermFreqVector returns NULL.

So I inspect the field list of the Document from the index:

    Document d = ir.document(td.doc());
    System.out.println(" Path: " + d.get("path"));
    for (Enumeration e = d.fields(); e.hasMoreElements();) {
        System.out.println(((Field) e.nextElement()).toString());
    }

and I discover that there is now NO "contents" Field. If I change the parameter in Field.Text to false, I get a "contents" Field but no TermVector. To date I haven't been able to figure out how to get a TermFreqVector at all. What am I missing?

I have looked at the documents - all the tutorials I have found just cover the basics. I have read the newsgroup postings related to "TermVectors" and "TermFreqVectors" and everybody says stuff like "the new 1.4 Vector stuff is great". So how do they know? Where can I learn about this? Are there any more complete user tutorials/references that cover TermVector features? Oh, I am using the 1.4 Lucene release in case it matters.

Thanks in advance,
Matt Galloway
Tulsa, Oklahoma

(BTW, I also tried Field.UnStored with the same results.)
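One observation about the snippet above (my note, not from the thread): reader.toString() never reads the stream. It is Object's default toString(), so the field would be built from a string like java.io.BufferedReader@1b6d3586 rather than the file contents. A minimal demonstration:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

public class ReaderToStringPitfall {
    public static void main(String[] args) throws IOException {
        BufferedReader reader = new BufferedReader(new StringReader("some document text"));

        // toString() is inherited from Object: it prints the class name and a
        // hash code; it does NOT read the stream.
        String wrong = reader.toString();
        System.out.println(wrong); // e.g. java.io.BufferedReader@1b6d3586

        // Draining the reader is what actually yields the content.
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = reader.read()) != -1) sb.append((char) c);
        System.out.println(sb); // some document text
    }
}
```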
RE: Lucene vs. MySQL Full-Text
I also question whether it could handle extreme volume with such good query speed. Has anyone done numbers with 1+ million documents?

-----Original Message-----
From: Daniel Naber [mailto:[EMAIL PROTECTED]
Sent: Tuesday, July 20, 2004 5:44 PM
To: Lucene Users List
Subject: Re: Lucene vs. MySQL Full-Text

On Tuesday 20 July 2004 21:29, Tim Brennan wrote:

> Does anyone out there have
> anything more concrete they can add?

Stemming is still on the MySQL TODO list:
http://dev.mysql.com/doc/mysql/en/Fulltext_TODO.html

Also, for most people it's easier to extend Lucene than MySQL (as MySQL is written in C(++?)), and there are more powerful queries in Lucene, e.g. fuzzy phrase search.

Regards
 Daniel

--
http://www.danielnaber.de
RE: Very slow IndexReader.open() performance
It could also be that your disk space is filling up and the OS runs out of swap room.

-----Original Message-----
From: Mark Florence [mailto:[EMAIL PROTECTED]
Sent: Tuesday, July 20, 2004 1:52 PM
To: Lucene Users List
Subject: Very slow IndexReader.open() performance

Hi -- We have a large index (~4m documents, ~14gb) that we haven't been able to optimize for some time, because the JVM throws OutOfMemory after climbing to the maximum we can throw at it, 2gb. In fact, the OutOfMemory condition occurred most recently during a segment merge operation.

maxMergeDocs was set to the default, and we seem to have gotten around this problem by setting it to some lower value, currently 100,000. The index is highly interactive, so I took the hint from earlier posts to set it to this value.

Good news! No more OutOfMemory conditions. Bad news: now, calling IndexReader.open() is taking 20+ seconds, and it is killing performance.

I followed the design pattern in another earlier post from Doug. I take a batch of deletes, open an IndexReader, perform the deletes, then close it. Then I take a batch of adds, open an IndexWriter, perform the adds, then close it. Then I get a new IndexSearcher for searching. But because the index is so interactive, this sequence repeats itself all the time.

My question is, is there a better way? Performance was fine when I could optimize. Can I hold onto a singleton IndexReader/IndexWriter/IndexSearcher to avoid the overhead of the open?

Any help would be most gratefully received.

Mark Florence, CTO, AIRS
[EMAIL PROTECTED]
800-897-7714 x1703
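On the singleton question: a common pattern is to share one searcher across requests and swap in a fresh instance only after a batch of updates, so the open cost is paid per batch rather than per query. A sketch of the sharing mechanics only (Searcher here is a stand-in class, not Lucene's IndexSearcher):

```java
import java.util.concurrent.atomic.AtomicReference;

public class SearcherHolder {
    // Stand-in for Lucene's IndexSearcher; only the sharing pattern matters here.
    static class Searcher {
        final int version;
        Searcher(int version) { this.version = version; }
    }

    private final AtomicReference<Searcher> current =
            new AtomicReference<>(new Searcher(0));

    // Readers share the current instance instead of reopening per request.
    Searcher acquire() { return current.get(); }

    // Writers publish a fresh searcher once, after a whole batch of updates,
    // so the expensive open happens per batch rather than per query.
    void refresh() {
        current.set(new Searcher(current.get().version + 1));
    }

    public static void main(String[] args) {
        SearcherHolder holder = new SearcherHolder();
        System.out.println(holder.acquire() == holder.acquire()); // true: one shared instance
        holder.refresh();
        System.out.println(holder.acquire().version);             // 1
    }
}
```

In real code the writer would also close the old searcher once in-flight queries finish; that lifecycle bookkeeping is omitted here.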
RE: Searching against Database
If you know ahead of time which documents are viewable by a certain user group, you could add a field, such as group, and when you index the document you put in the names of the user groups that are allowed to view that document. Then your query tool can append, for example, "AND group:developers" to the user's query. Then you will not have to merge results.

-Will

-----Original Message-----
From: Sergiu Gordea [mailto:[EMAIL PROTECTED]
Sent: Thursday, July 15, 2004 2:58 AM
To: Lucene Users List
Subject: Re: Searching against Database

Hi,

I have a similar problem. I'm working on a web application in which the users have different permissions. Not all information stored in the index is public for all users. The documents in the index are identified by the same ID that the rows have in the database tables. I can get the IDs of the documents that can be accessed by the user, but if these are 1000, what will happen in Lucene? Is this a valid solution? Can anyone provide a better idea?

Thanks,
Sergiu

lingaraju wrote:

> Hello
>
> Even I am searching for the same code, as all my web display information is
> stored in a database.
> An early response will be very much helpful.
>
> Thanks and regards
> Raju
>
> ----- Original Message -----
> From: "Hetan Shah" <[EMAIL PROTECTED]>
> To: "Lucene Users List" <[EMAIL PROTECTED]>
> Sent: Thursday, July 15, 2004 5:56 AM
> Subject: Searching against Database
>
> > Hello All,
> >
> > I have got all the answers from this fantastic mailing list. I have
> > another question ;)
> >
> > What is the best way (Best Practices) to integrate Lucene with a live
> > database, Oracle to be more specific. Any pointers are really very much
> > appreciated.
> >
> > thanks guys.
> > -H
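The group-field suggestion above amounts to wrapping the user's query with a security clause. A hedged sketch of the string assembly (the field name "group" follows the suggestion; the method name is mine, and real code would also escape user input):

```java
import java.util.List;

public class GroupFilter {
    // Wrap the user's query so only documents indexed with one of the user's
    // groups can match; appended as a single AND clause, no result merging needed.
    static String restrict(String userQuery, List<String> groups) {
        StringBuilder sb = new StringBuilder("(").append(userQuery).append(") AND (");
        for (int i = 0; i < groups.size(); i++) {
            if (i > 0) sb.append(" OR ");
            sb.append("group:").append(groups.get(i));
        }
        return sb.append(")").toString();
    }

    public static void main(String[] args) {
        System.out.println(restrict("title:lucene", List.of("developers", "qa")));
        // (title:lucene) AND (group:developers OR group:qa)
    }
}
```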
corrupt indexes?
Has anyone had any experience with their index getting corrupted? Are there any tools to repair it should it get corrupted? I have not had any problems, but was curious how resilient this data store seems to be.

-Will
RE: Field.java -> STORED, NOT_STORED, etc...
I have two suggestions:

1) Use Eclipse, or an IDE that shows the javadoc on mouseover.

2) If you are going to create constants, consider using bitflags. Then your constants can have power-of-two values, i.e.

    STORED = 1
    INDEXED = 2
    TOKENIZED = 4

Then you can have the constructor look like:

    new Field("name", "value", STORED + TOKENIZED)

The constructor would break that down bitwise.

-----Original Message-----
From: Kevin A. Burton [mailto:[EMAIL PROTECTED]
Sent: Sunday, July 11, 2004 5:05 AM
To: Lucene Users List
Subject: Field.java -> STORED, NOT_STORED, etc...

I've been working with the Field class doing index conversions between an old index format and my new external content store proposal (thus the email about the 14M convert).

Anyway... I find the whole Field.Keyword, Field.Text thing confusing. The main problem is that the constructor to Field just takes booleans, and if you forget the ordering of the booleans it's very confusing:

    new Field("name", "value", true, false, true);

So looking at that you have NO idea what it's doing without fetching javadoc. So I added a few constants to my class:

    new Field("name", "value", NOT_STORED, INDEXED, NOT_TOKENIZED);

which IMO is a lot easier to maintain. Why not add these constants to Field.java:

    public static final boolean STORED = true;
    public static final boolean NOT_STORED = false;
    public static final boolean INDEXED = true;
    public static final boolean NOT_INDEXED = false;
    public static final boolean TOKENIZED = true;
    public static final boolean NOT_TOKENIZED = false;

Of course you still have to remember the order, but this becomes a lot easier to maintain.

Kevin

--
Please reply using PGP.

    http://peerfear.org/pubkey.asc

    NewsMonster - http://www.newsmonster.org/

Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
AIM/YIM - sfburtonator, Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster
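The bitflag suggestion above can be made concrete; note that combining flags with | rather than + is safer, since adding the same flag twice with + silently sets a different bit. A sketch with hypothetical constants (this is not Lucene's actual Field API):

```java
public class FieldFlags {
    // Hypothetical flag constants, as suggested above; powers of two so
    // they can be combined and tested independently.
    static final int STORED    = 1;
    static final int INDEXED   = 2;
    static final int TOKENIZED = 4;

    final String name, value;
    final boolean stored, indexed, tokenized;

    // The constructor decomposes the combined flags bitwise.
    FieldFlags(String name, String value, int flags) {
        this.name = name;
        this.value = value;
        this.stored    = (flags & STORED)    != 0;
        this.indexed   = (flags & INDEXED)   != 0;
        this.tokenized = (flags & TOKENIZED) != 0;
    }

    public static void main(String[] args) {
        FieldFlags f = new FieldFlags("name", "value", STORED | TOKENIZED);
        System.out.println(f.stored + " " + f.indexed + " " + f.tokenized); // true false true
    }
}
```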
RE: Problem with match on a non tokenized field.
I do not know how to work around that. It is indeed an interesting situation that would require more understanding as to how the analyzer (in this case NullAnalyzer) interacts with the special characters such as the * and ~. You could try using the whitespace analyzer instead of the nullanalyzer! -Will -Original Message- From: Polina Litvak [mailto:[EMAIL PROTECTED] Sent: Friday, July 09, 2004 4:45 PM To: 'Lucene Users List' Subject: RE: Problem with match on a non tokenized field. Thanks a lot for your help. I've done what you suggested and it works great except in this particular case: I am trying to search for something like "abc-ef*" - i.e. I want to find all fields that start with: "abc-ef". I use PerFieldAnalyzerWrapper together with NullAnalyzer to make sure this field doesn't get tokenized on the "-", but at the same time I need the analyzer to realize that '*' is the wildcard search, not part of the field value itself. Would you know how to work around this ? Thank you, Polina -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] Sent: July 8, 2004 1:10 PM To: [EMAIL PROTECTED] Subject: RE: Problem with match on a non tokenized field. The PerFieldAnalyzerWrapper is constructed with your default analyzer, suppose this is the analyzer you use to tokenize. You then call the addAnalyzer method for each non-tokenized/keyword fields. In the case below, url is a keyword, all other fields are tokenized: PerFieldAnalyzerWrapper analyzer = new org.apache.lucene.analysis.PerFieldAnalyzerWrapper(new MyAnalyzer()); analyzer.addAnalyzer("url", new NullAnalyzer()); query = QueryParser.parse(searchQuery,"contents",analyzer); -Original Message- From: Polina Litvak [mailto:[EMAIL PROTECTED] Sent: Thursday, July 08, 2004 10:19 AM To: 'Lucene Users List' Subject: RE: Problem with match on a non tokenized field. Thanks a lot for your help. 
I have one more question: how would you handle a query consisting of two fields combined with a Boolean operator, where one field is only indexed and stored (a Keyword) and the other is tokenized, indexed and stored? Is it possible to have parts of the same query analyzed with different analyzers? -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] Sent: July 7, 2004 4:38 PM To: [EMAIL PROTECTED] Subject: RE: Problem with match on a non tokenized field. Use org.apache.lucene.analysis.PerFieldAnalyzerWrapper. Here is how I use it: PerFieldAnalyzerWrapper analyzer = new org.apache.lucene.analysis.PerFieldAnalyzerWrapper(new MyAnalyzer()); analyzer.addAnalyzer("url", new NullAnalyzer()); try { query = QueryParser.parse(searchQuery, "contents", analyzer); -Original Message- From: Polina Litvak [mailto:[EMAIL PROTECTED] Sent: Wednesday, July 07, 2004 4:20 PM To: [EMAIL PROTECTED] Subject: Problem with match on a non tokenized field. I have a Lucene Document with a field named Code which is stored and indexed but not tokenized. The value of the field is ABC5-LB. The only way I can match the field when searching is by entering Code:"ABC5-LB", because when I drop the quotes, every Analyzer I've tried breaks my query into Code:ABC5 -Code:LB. I need to be able to match this field by doing something like Code:ABC5-L*, so always using quotes is not an option. How would I go about writing my own analyzer that will not tokenize the query? Thanks, Polina
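To see why the unquoted query breaks, here is a pure-JDK sketch (no Lucene classes; the class and method names are illustrative, not Lucene APIs) contrasting a StandardAnalyzer-style tokenization, which splits on '-' and lowercases, with a NullAnalyzer/keyword-style one that keeps the whole value as a single token:

```java
import java.util.Arrays;
import java.util.List;

public class TokenizeSketch {
    // Mimics an analyzer that lowercases and splits on non-alphanumerics,
    // the way StandardAnalyzer treats "ABC5-LB".
    static List<String> standardLike(String text) {
        return Arrays.asList(text.toLowerCase().split("[^a-z0-9]+"));
    }

    // Mimics a keyword/NullAnalyzer field: the entire value is one token.
    static List<String> keywordLike(String text) {
        return Arrays.asList(text);
    }

    public static void main(String[] args) {
        // The standard-style analyzer breaks the code into two terms,
        // so the single indexed keyword term "ABC5-LB" can never match.
        System.out.println(standardLike("ABC5-LB")); // [abc5, lb]
        System.out.println(keywordLike("ABC5-LB"));  // [ABC5-LB]
    }
}
```

This is why the query only matches when the untokenized field value is queried as one unit, and why the analyzer choice must be made per field.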
QueryParser and Keyword Fields
Can anyone give me advice on the best way to keep your keyword fields from being analyzed by QueryParser? Even though it seems like it would be a common problem, I have read the FAQ and found only this relevant thread, with no real answers: http://issues.apache.org/eyebrowse/[EMAIL PROTECTED] he.org&msgId=1235589 "QueryParser has some nasty habits of analyzing everything." Can't it be smart and not analyze fields that are keywords (i.e. not tokenized by the analyzer)? Thank you, Will
RE: Demo 3 on windows
Use forward slashes / instead of \ for your path: c:/apache/group/index or, if c: is your main drive, /apache/group/index. -Original Message- From: Hetan Shah [mailto:[EMAIL PROTECTED] Sent: Monday, June 21, 2004 5:55 PM To: [EMAIL PROTECTED] Subject: Demo 3 on windows Hello, I have been trying to build the index on my Windows machine with the following syntax and getting this message back from Lucene: java org.apache.lucene.demo.IndexHTML -create -index {index-dir} .. In my case it looks like: java org.apache.lucene.demo.IndexHTML -create -index c:\apache group\index .. and the message that I am getting is: Usage: IndexHTML [-create] [-index <index>] <root_directory> Any idea why I keep getting this message? TIA. -H
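Note that besides the backslashes, the space in "c:\apache group\index" is a likely culprit: an unquoted space makes the shell pass two separate arguments, so the demo's argument parsing fails and it prints its usage line. A small pure-JDK sketch (the helper name splitUnquoted is illustrative) of both points:

```java
import java.io.File;

public class PathArgs {
    // Mimics how a shell splits an unquoted command-line tail into args.
    static String[] splitUnquoted(String commandLineTail) {
        return commandLineTail.split(" ");
    }

    public static void main(String[] args) {
        // The unquoted space yields two args ("c:\apache" and "group\index"),
        // so IndexHTML sees a stray argument and prints its usage message.
        String[] broken = splitUnquoted("c:\\apache group\\index");
        System.out.println(broken.length); // 2

        // java.io.File accepts forward slashes regardless of platform.
        File f = new File("c:/apache/group/index");
        System.out.println(f.getName()); // index
    }
}
```

Quoting the path ("c:\apache group\index") or using a directory without spaces avoids the split; forward slashes then work fine inside Java.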
RE: search "" and ""
This depends on the analyzer you use. http://lucene.sourceforge.net/cgi-bin/faq/faqmanager.cgi?file=chapter.indexing&toc=faq#q13 -Original Message- From: Lynn Li [mailto:[EMAIL PROTECTED] Sent: Friday, June 18, 2004 5:03 PM To: '[EMAIL PROTECTED]' Subject: search "" and "" When I search "" or "", QueryParser parses them into "text". How can I make it not remove the angle brackets and slashes? Thank you in advance, Lynn
RE: help needed in starting lucene
It sounds to me like you need a newer version of Java. -Original Message- From: milind honrao [mailto:[EMAIL PROTECTED] Sent: Wednesday, June 02, 2004 5:36 PM To: [EMAIL PROTECTED] Subject: help needed in starting lucene Hi, I am just a beginner. I installed Lucene according to the instructions provided and made all the changes to the environment variables. When I try to run the test program for building indexes with the following command: java org.apache.lucene.demo.IndexFiles test/Doc I get the following exception: Exception in thread "main" class java.lang.ExceptionInInitializerError: java.lang.RuntimeException: java.security.NoSuchAlgorithmException: MD5: Class not found.
RE: Problem Indexing Large Document Field
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexWriter.html#DEFAULT_MAX_FIELD_LENGTH maxFieldLength (public int maxFieldLength): The maximum number of terms that will be indexed for a single field in a document. This limits the amount of memory required for indexing, so that collections with very large files will not crash the indexing process by running out of memory. Note that this effectively truncates large documents, excluding from the index terms that occur further in the document. If you know your source documents are large, be sure to set this value high enough to accommodate the expected size. If you set it to Integer.MAX_VALUE, then the only limit is your memory, but you should anticipate an OutOfMemoryError. By default, no more than 10,000 terms will be indexed for a field. -Original Message- From: Gilberto Rodriguez [mailto:[EMAIL PROTECTED] Sent: Wednesday, May 26, 2004 4:04 PM To: [EMAIL PROTECTED] Subject: Problem Indexing Large Document Field I am trying to index a field in a Lucene document with about 90,000 characters. The problem is that it only indexes part of the document. It seems to index only about 65,000 characters. So, if I search on terms that are at the beginning of the text, the search works, but it fails for terms that are at the end of the document. Is there a limitation on how many characters can be stored in a document field? Any help would be appreciated, thanks Gilberto Rodriguez Software Engineer 370 CenterPointe Circle, Suite 1178 Altamonte Springs, FL 32701-3451 407.339.1177 (Ext.112) phone 407.339.6704 fax [EMAIL PROTECTED] email www.conviveon.com web This e-mail contains legally privileged and confidential information intended only for the individual or entity named within the message.
If the reader of this message is not the intended recipient, or the agent responsible to deliver it to the intended recipient, the recipient is hereby notified that any review, dissemination, distribution or copying of this communication is prohibited. If this communication was received in error, please notify me by reply e-mail and delete the original message.
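The truncation behaviour described in the reply above can be sketched with plain JDK code (no Lucene classes; the class and method names are illustrative): only the first maxFieldLength terms of a field get indexed, so terms past the cutoff are simply not searchable.

```java
import java.util.HashSet;
import java.util.Set;

public class TruncatingIndexSketch {
    // Index at most maxFieldLength whitespace-separated terms of a field,
    // mirroring IndexWriter's per-field term limit.
    static Set<String> indexTerms(String text, int maxFieldLength) {
        String[] tokens = text.toLowerCase().split("\\s+");
        Set<String> indexed = new HashSet<>();
        for (int i = 0; i < tokens.length && i < maxFieldLength; i++) {
            indexed.add(tokens[i]);
        }
        return indexed;
    }

    public static void main(String[] args) {
        // 20,000 distinct "terms", but only the first 10,000 get indexed.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 20000; i++) sb.append("term").append(i).append(' ');
        Set<String> indexed = indexTerms(sb.toString(), 10000);

        System.out.println(indexed.contains("term42"));    // true
        System.out.println(indexed.contains("term19999")); // false: truncated
    }
}
```

With an average English word around 5-6 characters plus a space, a 10,000-term limit lands near the ~65,000-character cutoff the original poster observed.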
RE: Memory usage
This sounds like a memory leak. If you are using Tomcat, I would suggest you make sure you are on a recent version, as version 4 is known to have some memory leaks. It doesn't make sense that repeated queries would use more memory than the most demanding single query unless objects are not getting freed from memory. -Will -Original Message- From: James Dunn [mailto:[EMAIL PROTECTED] Sent: Wednesday, May 26, 2004 3:02 PM To: [EMAIL PROTECTED] Subject: Memory usage Hello, I was wondering if anyone has had problems with memory usage and MultiSearcher. My index is composed of two sub-indexes that I search with a MultiSearcher. The total size of the index is about 3.7GB, with the larger sub-index being 3.6GB and the smaller being 117MB. I am using Lucene 1.3 Final with the compound file format. Also, I search across about 50 fields, but I don't use wildcard or range queries. Doing repeated searches in this way seems to eventually chew up about 500MB of memory, which seems excessive to me. Does anyone have any ideas where I could look to reduce the memory my queries consume? Thanks, Jim
RE: Performance profile of optimization...
My understanding is that hard drive IO is the main bottleneck, as the operation is mainly a file copy. So, to directly answer your question, I believe the overall file size of your indexes will linearly affect the performance profile of your optimizations. -Original Message- From: Michael Giles [mailto:[EMAIL PROTECTED] Sent: Monday, May 24, 2004 3:13 PM To: Lucene Users List Subject: Performance profile of optimization... What is the performance profile of optimizing an index? By that I mean, what are the primary variables that negatively impact its speed (i.e. index size (bytes, docs), number of adds/deletes since last optimization, etc.)? For example, if I add a single document to a small (i.e. < 10K docs) index and still have that index open (but would otherwise close it until the next update, a few minutes later), what type of a performance hit would optimizing the index be? Does that cost change as the index gets bigger, or is it tied to the number of changes that need to be rolled in? -Mike
RE: Rebuild after corruption
Make sure you close your IndexWriter. http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexWriter.html#close() -Original Message- From: Steve Rajavuori [mailto:[EMAIL PROTECTED] Sent: Friday, May 21, 2004 7:49 PM To: '[EMAIL PROTECTED]' Subject: Rebuild after corruption I have a problem periodically where the process updating my Lucene files terminates abnormally. When I try to open the Lucene files afterward, I get an exception indicating that files are missing. Does anyone know how I can recover at this point without having to rebuild the whole index from scratch?
RE: Searching Microsoft Word , Excel and PPT files for Japanese
I am not sure. See what Google gives you. I would guess you need to get a table of entities and compare it to the Unicode character. So if you parse the Word file you might see something like "&u12312;" (without quotes); this corresponds to a single Unicode character, and you can use the Java API to get that character. -Will -Original Message- From: Ankur Goel [mailto:[EMAIL PROTECTED] Sent: Thursday, May 20, 2004 1:18 PM To: 'Lucene Users List' Subject: RE: Searching Microsoft Word , Excel and PPT files for Japanese Hi, Can you tell me how to convert the Windows-1252 characters/entities to Unicode (UTF-8 or UTF-16)? Sorry, I am new to this. Looks like first I will have to parse the text out of these files. I tried Jakarta POI also, but for Japanese it was also not very good. Regards, Ankur -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] Sent: Thursday, May 20, 2004 10:43 PM To: [EMAIL PROTECTED] Subject: RE: Searching Microsoft Word , Excel and PPT files for Japanese I believe MS apps store non-ASCII characters as entities internally instead of using Unicode. You can see evidence of this if you save your file as an HTML file and look at the source. You will have to adjust your parser to convert the Windows-1252 characters/entities to Unicode (UTF-8 or UTF-16). -Will -Original Message- From: Ankur Goel [mailto:[EMAIL PROTECTED] Sent: Thursday, May 20, 2004 1:10 PM To: 'Lucene Users List' Subject: Searching Microsoft Word , Excel and PPT files for Japanese Hi, I am using the CJK Tokenizer for searching Japanese documents. I am able to search Japanese documents which are text files, but I am not able to search Microsoft Word and Excel files with content in Japanese. Can you tell me how I can search Japanese content in Microsoft Word, Excel and PPT files?
Thanks, Ankur -Original Message- From: Ankur Goel [mailto:[EMAIL PROTECTED] Sent: Sunday, April 04, 2004 1:36 AM To: 'Lucene Users List' Subject: RE: Boolean Phrase Query question Thanks, Erik, for the solution. I have the fileName field because I have to give the end user the facility to search on file name also. That's why I am using a Text field for fileName as well. "By using true on the finalQuery.add calls, you have said that both fields must have the word "temp" in them. Is that what you meant? Or did you mean an OR type of query?" I need an OR type of query. I mean the word can be in the file name or in the contents of the file, but I am not able to do this. Can you tell me how to do it? Regards, Ankur -Original Message- From: Erik Hatcher [mailto:[EMAIL PROTECTED] Sent: Sunday, April 04, 2004 1:27 AM To: Lucene Users List Subject: Re: Boolean Phrase Query question On Apr 3, 2004, at 12:13 PM, Ankur Goel wrote: > > Hi, > I have to provide a functionality which provides search on both file > name and contents of the file. > > For indexing I use the following code: > > > org.apache.lucene.document.Document doc = new org.apache. > lucene.document.Document(); > doc.add(Field.Keyword("fileId","" + document.getFileId())); > doc.add(Field.Text("fileName", fileName)); > doc.add(Field.Text("contents", new FileReader(new File(fileName)))); I'm not sure what you plan on doing with the fileName field, but you probably want to use a Keyword field for it. And you may want to glue the file name and contents together into a single field to facilitate searches that span both.
(Be sure to put a space in between if you do this.) > For searching a text, say "temp", I use the following code to look both > in the file name and the contents of the file: > > BooleanQuery finalQuery = new BooleanQuery(); > Query titleQuery = QueryParser.parse("temp","fileName",analyzer); > Query mainQuery = QueryParser.parse("temp","contents",analyzer); > > finalQuery.add(titleQuery, true, false); > finalQuery.add(mainQuery, true, false); > > Hits hits = is.search(finalQuery); By using true on the finalQuery.add calls, you have said that both fields must have the word "temp" in them. Is that what you meant? Or did you mean an OR type of query? Erik
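The entity-to-Unicode conversion suggested earlier in this thread can be sketched with plain JDK regex code. Note the "&u12312;" form quoted above is not standard HTML; the sketch below handles only the standard decimal numeric character reference form "&#NNNN;" and is an assumption about what the parser would actually encounter.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EntityDecoder {
    // Matches decimal numeric character references such as "&#12354;".
    private static final Pattern DECIMAL_REF = Pattern.compile("&#(\\d+);");

    // Replace each numeric reference with the Unicode character it names.
    static String decode(String text) {
        Matcher m = DECIMAL_REF.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int codePoint = Integer.parseInt(m.group(1));
            m.appendReplacement(out,
                Matcher.quoteReplacement(new String(Character.toChars(codePoint))));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("&#12354;"));    // U+3042, HIRAGANA LETTER A
        System.out.println(decode("plain ascii")); // unchanged
    }
}
```

Once the entities are decoded to real Unicode characters, the CJK tokenizer sees the same text it would get from a plain UTF-8 file.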
RE: AW: Problem indexing Spanish Characters
Here is an example method in org.apache.lucene.demo.html.HTMLParser that uses a different buffered reader for a different encoding:

public Reader getReader() throws IOException {
  if (pipeIn == null) {
    pipeInStream = new MyPipedInputStream();
    pipeOutStream = new PipedOutputStream(pipeInStream);
    pipeIn = new InputStreamReader(pipeInStream);
    pipeOut = new OutputStreamWriter(pipeOutStream);
    // check the first bytes for the FFFE marker; if it is there,
    // we know the input is UTF-16 encoded
    if (useUTF16) {
      try {
        pipeIn = new BufferedReader(new InputStreamReader(pipeInStream, "UTF-16"));
      } catch (Exception e) {
      }
    }
    Thread thread = new ParserThread(this);
    thread.start(); // start parsing
  }
  return pipeIn;
}

-Original Message- From: Martin Remy [mailto:[EMAIL PROTECTED] Sent: Wednesday, May 19, 2004 2:09 PM To: 'Lucene Users List' Subject: RE: AW: Problem indexing Spanish Characters The tokenizers deal with Unicode characters (CharStream, char), so the problem is not there. This problem must be solved at the point where the bytes from your source files are turned into CharSequences/Strings, i.e. by connecting an InputStreamReader to your FileInputStream (or whatever you're using) and specifying "UTF-8" (or whatever encoding is appropriate) in the InputStreamReader constructor. You must either detect the encoding from HTTP headers or XML declarations or, if you know that it's the same for all of your source files, just hardcode UTF-8, for example. Martin -Original Message- From: Hannah c [mailto:[EMAIL PROTECTED] Sent: Wednesday, May 19, 2004 10:35 AM To: [EMAIL PROTECTED] Subject: RE: AW: Problem indexing Spanish Characters Hi, I had a quick look at the sandbox, but my problem is that I don't need a Spanish stemmer. However, there must be a replacement tokenizer that supports foreign characters to go along with the foreign-language Snowball stemmers. Does anyone know where I could find one? In answer to Peter's question: yes, I'm also using "UTF-8" encoded XML documents as the source.
I also put below an example of what happens when I tokenize the text using the StandardTokenizer. Thanks, Hannah --- text I'm trying to index: century palace known as la "Fundación Hospital de Na. Señora del Pilar" --- tokens output from StandardTokenizer: century | palace | known | as | la | Fundaci | n | Hospital | de | Na | Se | ora | del | Pilar (the words are cut at the accented characters) --- >From: "Peter M Cipollone" <[EMAIL PROTECTED]> >To: <[EMAIL PROTECTED]> >Subject: Re: Problem indexing Spanish Characters >Date: Wed, 19 May 2004 11:41:28 -0400 > >Could you send some sample text that causes this to happen? > >- Original Message - >From: "Hannah c" <[EMAIL PROTECTED]> >To: <[EMAIL PROTECTED]> >Sent: Wednesday, May 19, 2004 11:30 AM >Subject: Problem indexing Spanish Characters > > Hi, > > I am indexing a number of English articles on Spanish resorts. As such, there are a number of Spanish characters throughout the text; most of these are in the place names, which are the type of words I would like to use as queries. My problem is with the StandardTokenizer class, which cuts the word in two when it comes across any of the Spanish characters. I had a look at the source, but the code was generated by JavaCC and so is not very readable. I was wondering if there was a way around this problem, or which area of the code I would need to change to avoid this. > > Thanks > > Hannah Cumming >From: PEP AD Server Administrator ><[EMAIL PROTECTED]> >Reply-To: "Lucene Users List" <[EMAIL PROTECTED]> >To: "'Lucene Users List'" <[EMAIL PROTECTED]> >Subject: AW: Problem indexing Spanish Characters >Date: Wed, 19 May 2004 18:08:56 +0200 > >Hi Hannah, Otis >I cannot help, but I have exactly the same problems with special German >characters.
I used the Snowball analyser, but this does not help because the >problem (tokenizing) appears before the analyser comes into action. >I just posted the question "Problem tokenizing UTF-8 with german umlauts" >some minutes ago, which describes my problem; it and Hannah's seem to be similar. >Do you also have UTF-8 encoded pages? > >Pet
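The BOM check the HTMLParser snippet above alludes to can be sketched with plain JDK streams: peek at the first two bytes, and if they are a UTF-16 byte-order mark (FE FF or FF FE), open the reader as UTF-16; otherwise fall back to a default encoding. Class and method names here are illustrative, not the demo's actual API.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;

public class BomSniffer {
    // Peek at the first two bytes; choose UTF-16 if a BOM is present.
    static Reader openReader(InputStream in, String fallbackEncoding)
            throws IOException {
        PushbackInputStream pin = new PushbackInputStream(in, 2);
        byte[] bom = new byte[2];
        int n = pin.read(bom);
        if (n > 0) pin.unread(bom, 0, n); // push the bytes back either way
        boolean utf16 = n == 2
            && ((bom[0] == (byte) 0xFE && bom[1] == (byte) 0xFF)
             || (bom[0] == (byte) 0xFF && bom[1] == (byte) 0xFE));
        return new InputStreamReader(pin, utf16 ? "UTF-16" : fallbackEncoding);
    }

    // Convenience wrapper for demonstration: first decoded char of a buffer.
    static int firstChar(byte[] bytes, String fallbackEncoding) {
        try {
            return openReader(new ByteArrayInputStream(bytes), fallbackEncoding).read();
        } catch (IOException e) {
            return -1;
        }
    }

    public static void main(String[] args) {
        // UTF-16LE bytes for "A" with a BOM; the decoder consumes the BOM.
        byte[] utf16le = {(byte) 0xFF, (byte) 0xFE, 'A', 0};
        System.out.println((char) firstChar(utf16le, "UTF-8")); // A
    }
}
```

Java's "UTF-16" charset itself detects and consumes the BOM, so the reader hands back clean characters; the same idea extends to sniffing the UTF-8 BOM (EF BB BF) if your sources carry one.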
Can documents be appended to?
Is it possible to append to an existing document? Judging by my own tests and this thread, no: http://issues.apache.org/eyebrowse/[EMAIL PROTECTED] he.org&msgNo=3971 Wouldn't it be possible to look up an individual document (based upon a uid of sorts), load the Fields off the old one, delete it, then add the new document? Is there any hope of doing this efficiently? This would run into problems when merging indexes: you would get duplicates if a document existed in more than one of your original indexes. Thank you, Will
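The delete-then-add pattern described above can be sketched with a plain JDK map standing in for the index (no Lucene classes; all names are illustrative): an update keyed on a uid removes the old document first, which is also what prevents the duplicate problem when the same uid would otherwise end up in the store twice.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class UidStoreSketch {
    // uid -> document fields; stands in for an index keyed on a uid field.
    private final Map<String, Map<String, String>> docsByUid =
        new LinkedHashMap<>();

    // "Update" = delete the old version (if any), then add the new one.
    void update(String uid, Map<String, String> fields) {
        docsByUid.remove(uid);
        docsByUid.put(uid, fields);
    }

    int size() {
        return docsByUid.size();
    }

    public static void main(String[] args) {
        UidStoreSketch store = new UidStoreSketch();
        store.update("doc-1", Map.of("title", "first version"));
        store.update("doc-1", Map.of("title", "second version"));
        System.out.println(store.size()); // 1: no duplicate for the same uid
    }
}
```

In a real index the delete-by-uid step is what a merge of independent indexes lacks, which is exactly why duplicates appear there: nothing removes the older copy before the newer one is added.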