Re: Indexing other documents type than html and txt

2001-11-30 Thread Cecil, Paula New
Here is another version of something I had posted earlier. It attempts to read the text out of binary files. It is not perfect and doesn't work at all on PDF. It permits you to use the reader form of a Field to index. import java.util.*; import java.io.*; /** This class is designed to retrieve text

Attribute Search Bug

2001-11-28 Thread Cecil, Paula New
This program illustrates what may be a bug. It creates an index containing a document with two fields. The second field is the problem. I use the Field constructor to make a field that is not stored, is indexed, and is not tokenized (there is no factory method for this combination). The program then queries

Re: PDF parser for Lucene

2001-11-23 Thread Cecil, Paula New
Inspired by the Unix strings command, I have written a subclass of FilterReader, which I have called BinaryReader. The idea is simply to index any proprietary file format by filtering out all non-printable characters. The assumption is that text is text. It will end up with more than the
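
The filtering idea described in this post can be sketched as follows. This is a minimal illustration only, not the BinaryReader from the thread: the class name PrintableFilterReader, the helper filter(), and the choice to treat only ASCII 0x20 to 0x7E as printable and to replace everything else with a space are all assumptions made for the example.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

/**
 * Hypothetical sketch (not the original BinaryReader): passes printable
 * ASCII through unchanged and maps every other character to a space, so a
 * tokenizer downstream only ever sees text-like runs.
 */
class PrintableFilterReader extends FilterReader {

    PrintableFilterReader(Reader in) {
        super(in);
    }

    // Assumption: "printable" means the ASCII range 0x20..0x7E.
    private static boolean isPrintable(int c) {
        return c >= 0x20 && c <= 0x7E;
    }

    @Override
    public int read() throws IOException {
        int c = super.read();
        if (c == -1) {
            return -1; // end of stream is passed through untouched
        }
        return isPrintable(c) ? c : ' ';
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        // When n is -1 (end of stream) the loop body never runs.
        for (int i = off; i < off + n; i++) {
            if (!isPrintable(buf[i])) {
                buf[i] = ' ';
            }
        }
        return n;
    }

    /** Convenience helper for the example: filters a whole string. */
    static String filter(String raw) throws IOException {
        StringBuilder out = new StringBuilder();
        try (Reader r = new PrintableFilterReader(new StringReader(raw))) {
            char[] buf = new char[1024];
            int n;
            while ((n = r.read(buf, 0, buf.length)) != -1) {
                out.append(buf, 0, n);
            }
        }
        return out.toString();
    }
}
```

Replacing rather than dropping the non-printable characters keeps words on either side of binary debris from being glued together. A stricter variant, closer to the Unix strings command, would also discard printable runs shorter than some minimum length.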

Re: Attribute Search

2001-11-21 Thread Cecil, Paula New
into account the accent if Latin type of locale? -Original Message- From: Cecil, Paula New [mailto:[EMAIL PROTECTED]] Sent: Monday, November 19, 2001 9:47 PM To: LUCENE Text Search Subject: Attribute Search I am trying to index a set of data, storing only a primary key. This primary

Re: Can't locate field

2001-11-19 Thread Cecil, Paula New
Cecil, Paula New wrote: This is my first message to this list. I have successfully created several little tests of the Lucene API. In my last test, I am trying to index data records. Only the primary key needs to be stored (and I did not even index this field). For the others I want

Attribute Search

2001-11-19 Thread Cecil, Paula New
I am trying to index a set of data, storing only a primary key. This primary key I left un-indexed. There is one text field that I indexed and tokenized. The others I want neither to store nor to tokenize. My reasoning was that not tokenizing would produce the smallest index. The remaining