Re: search through all fields
Sorry, I use Compass, an object mapper for Lucene, and it provides a special field "all"; I thought it was a Lucene feature.

M.

Renaud Waldura wrote:
> Documents can often be divided into "metadata" and "contents" sections.
> Say you're indexing Web pages: you could index all the HEAD data in one
> field and the BODY content in another, while also creating separate
> fields for every HEAD field, e.g. TITLE.
>
> At search time, you rewrite every query to become "+head:(query)
> +body:(query)" using MultiFieldQueryParser. This way you don't have to
> create an "all" field that contains everything, head + body.
>
> It will increase your index size, no doubt. It might increase indexing
> time too.
>
> --Renaud
>
> Mohammad Norouzi wrote:
> > On 7/14/07, Grant Ingersoll wrote:
> > > I think he means index all your different fields into a single field
> > > named "all". Not sure what makes it special; it is just like any
> > > other field.
> >
> > But that is really impossible! I have millions of records to be
> > indexed, so this would slow down indexing and increase the index size.
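(As an aside, a minimal sketch of the rewrite Renaud describes, against the Lucene 2.x API; the field names "head" and "body" come from his example, and the wrapper class is only for illustration:)

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryParser.MultiFieldQueryParser;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.Query;

    public class HeadBodyQuery {
        // Rewrites a user query into "+head:(query) +body:(query)" so that
        // both sections must match, with no combined "all" field needed.
        public static Query rewrite(String userQuery) throws Exception {
            return MultiFieldQueryParser.parse(
                userQuery,
                new String[] { "head", "body" },
                new BooleanClause.Occur[] { BooleanClause.Occur.MUST,
                                            BooleanClause.Occur.MUST },
                new StandardAnalyzer());
        }
    }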
Re: search through all fields
Mathieu,

I need an object mapper for Lucene. Would you please give me the Compass web site? Is it open source?

Thanks.

On 7/17/07, Mathieu Lecarme wrote:
> Sorry, I use Compass, an object mapper for Lucene, and it provides a
> special field "all"; I thought it was a Lucene feature.
> [...]

--
Regards,
Mohammad
--
see my blog: http://brainable.blogspot.com/ another in Persian:
http://fekre-motefavet.blogspot.com/
Re: search through all fields
http://www.opensymphony.com/compass/

The project is free, it follows new Lucene versions quickly, the forum is great, and the lead developer reacts quickly.

M.

Mohammad Norouzi wrote:
> Mathieu,
> I need an object mapper for Lucene. Would you please give me the Compass
> web site? Is it open source?
>
> Thanks.
getting problem while indexing pdf files with pdfbox
http://www.nabble.com/file/p11647342/DRra0026.pdf DRra0026.pdf

Hi all,

I am able to convert a PDF into a text file using PDFBox, and this is the code that I used, but I am not able to index it.

    // code for parsing and making index
    public Document getDocument(InputStream is) {
        COSDocument cosDoc = null;
        try {
            PDFParser parser = new PDFParser(is);
            parser.parse();
            cosDoc = parser.getDocument();
        } catch (IOException e) {
            e.printStackTrace();
        }
        String docText = null;
        try {
            PDFTextStripper stripper = new PDFTextStripper();
            docText = stripper.getText(new PDDocument(cosDoc));
        } catch (IOException e) {
            e.printStackTrace();
        }
        Document doc = new Document();
        if (docText != null) {
            doc.add(new Field("body", docText, Field.Store.YES,
                    Field.Index.TOKENIZED));
        }
        return doc;
    }

    public static void main(String[] args) throws Exception {
        TestPDFParser handler = new TestPDFParser();
        Document doc = handler.getDocument(new FileInputStream(
                new File("D:\\lucenePdf\\DRra0026.pdf")));
        System.out.println(doc);

        // Following code is for making index
        IndexWriter f_writer = new IndexWriter("D:\\lucenePdf",
                new StandardAnalyzer(), true);
        f_writer.addDocument(doc);
    }

    // code for searching a particular string
    public static void main(String[] args) throws Exception {
        String indexDir = "D:\\lucenePdf";
        String q = "RA0083";

        Directory fsDir = FSDirectory.getDirectory(indexDir);
        IndexSearcher is = new IndexSearcher(fsDir);

        Query query = new QueryParser("body",
                new StandardAnalyzer()).parse(q);
        Hits hits = is.search(query);
        System.out.println("Found " + hits.length() +
                " documents that matched query '" + q + "':");
        for (int i = 0; i < hits.length(); i++) {
            Document doc = hits.doc(i);
        }
    }

When I run the above code, I get the following output as a result of running the indexer class:

    Document<... : RA0083
    99062620002100100220468148001102006PAYOUT : RA0083
    99062630002100100330468153601102006PAYOUT : RA0083
    99062647002100100440468155401102006PAYOUT : RA0083
    99062657002100100550468156201102006PAYOUT : RA0083
    ...>

and the following files are generated in the specified path:

    segments.gen
    write.lock
    segments_4

But when I run the search class it gives the result:

    Found 0 documents that matched query 'RA0083':

I am also attaching the corresponding PDF file for reference. It seems as if the index is not getting created. Please help me with some of your inputs; it will be very helpful for me.
Re: getting problem while indexing pdf files with pdfbox
Offhand, I'd assume that your problem is in your use of PDFBox. Have you tried printing out the docText string you get back from

    docText = stripper.getText(new PDDocument(cosDoc));

I'd recommend you assure yourself that you get valid text back from the PDF document before worrying about indexing it.

Best
Erick

On 7/17/07, neetika wrote:
> I am able to convert a PDF into a text file using PDFBox, and this is
> the code that I used, but I am not able to index it.
> [...]
Re: Does Index have a Tokenizer Built into it
Hi,

I've been looking into indexing documents with term vectors that store positions in order to solve my problem. However, I've run into a bit of a snag. After indexing, I have been able to retrieve the TermPositionVector from the index, and it has all of the data, but I cannot find a way, given a position, to retrieve the term at that position -- which is how I was hoping to create my contextual snippets. There are methods where, given a term, you can get its positions, but I see no method to achieve the reverse. Is there another class I need to use for this?

--JP

On 7/16/07, John Paul Sondag wrote:
> Some of the data sets that we will be using have about 2 TB of data (90
> million web pages). I would like the snippets I generate to include the
> words that are being queried, so I don't want to simply store the first
> 2 or 3 lines. I have looked at the HighlighterTest, and I do believe
> that it requires the entire text of the document. However, unlike the
> highlighter, I know the term offsets in the document. The input to my
> snippet generator will be a vector of query words and their offsets in
> the document (not their positions in the document).
>
> I'm reading about the "term vectors" option I can enable while indexing
> my data. It seems to be much more efficient than storing the entire
> document; I'm just not sure if a "term offset" is the same as a "token
> offset". Here's what I'm reading, in case I'm totally off the ball here
> and this is useless to me:
>
> http://lucene.apache.org/java/docs/fileformats.html#Term%20Vectors
>
> It seems like this has all the information that I would have if I
> tokenized the document anyway, or am I missing something?
>
> Thanks again for all the help!
>
> --JP
>
> On 7/16/07, Ard Schrijvers wrote:
> > Hello,
> >
> > > Ard,
> > >
> > > I do have access to the URLs of the documents, but because I will
> > > be making short snippets for many pages (suppose there are about 20
> > > hits per page and I need to make snippets for each of them), I was
> > > worried it would be inefficient to open each "hit", tokenize it,
> > > and then make the snippet, of
> >
> > Yes, getting all the documents over http just to get the snippet, for
> > example the first 2 lines, is really bad for your performance in
> > search overviews.
> >
> > Logically, what you want to show, you need to store in your index.
> > For example, if for search hits you need to show the title and
> > subtitle, just store these two in the index. If you want a
> > Google-like highlighter of text snippets where the term occurred, you
> > need to store the entire text IIRC (see HighlighterTest in Lucene).
> >
> > How many docs are you talking about, that you cannot store the entire
> > content?
> >
> > You could also just index the content and not store it, and in
> > another Lucene field store the first 2 or 3 lines of the document,
> > which would serve as the text snippet. Making correct extracts of
> > text snippets is very hard (see LingPipe, for example).
> >
> > Regards Ard
> >
> > > course the price of this may be worth the price of the increased
> > > index size. I have been looking into storing field vectors with
> > > positions in the index. It seems that by doing this I will have
> > > access to everything that the tokenizer is giving me, correct? Will
> > > I need to store "term text" in order to be able to access the
> > > actual term instead of stemmed words?
> > >
> > > Thanks for all your help,
> > >
> > > --JP
> > >
> > > On 7/13/07, Ard Schrijvers wrote:
> > > > Hello,
> > > >
> > > > > I'm wondering if after opening the index I can retrieve the
> > > > > tokens (not the terms) of a document, something akin to
> > > > > IndexReader.document(n).getTokenizer().
> > > >
> > > > It is obviously not possible to get the original tokens of the
> > > > document back when you haven't stored the document, because:
> > > >
> > > > 1) the analyzer might have removed stop words in the first place
> > > > 2) the terms in a Lucene index are perhaps stemmed words /
> > > >    synonyms / etc.
> > > > 3) how would you expect things like spaces, commas, dots, etc. to
> > > >    be restored?
> > > >
> > > > And I think what you want does not comply with an inverted index.
> > > > When you do not store the document, you always lose information
> > > > about the document during indexing/analyzing.
> > > >
> > > > How many documents are you talking about? They must be either
> > > > somewhere on the FS or accessible over http... when you need the
> > > > document, why not just provide a link to the original location?
> > > >
> > > > Regards Ard
> > > >
> > > > > In summary:
> > > > >
> > > > > My current (too wasteful) implementation is this:
> > > > >
> > > > > StandardTokenizer(BufferedReader(
> > > > >     IndexReader.document(n).getField("text")))
> > > > >
> > > > > I'm wondering if Lucene has a more efficient manner to retrieve
> > > > > the tokens of a document from an index, because it seems like
> > > > > it has information about
Re: getting problem while indexing pdf files with pdfbox
Hi Erick,

Before indexing, I have printed the doc, and I have given the output also; it is printing well. Kindly please check my post again, in particular:

    System.out.println(doc);
    // Following code is for making index

The corresponding output is:

    Document<... : RA0083
    99062620002100100220468148001102006PAYOUT : RA0083
    99062630002100100330468153601102006PAYOUT : RA0083
    99062647002100100440468155401102006PAYOUT : RA0083
    99062657002100100550468156201102006PAYOUT : RA0083
    ...>

which is as expected, but my problem is that the index file is not getting generated.

Please help.

Erick Erickson wrote:
> Offhand, I'd assume that your problem is in your use of PDFBox. Have you
> tried printing out the docText string you get back from
>
>     docText = stripper.getText(new PDDocument(cosDoc));
>
> I'd recommend you assure yourself that you get valid text back from
> the PDF document before worrying about indexing it.
> [...]
Re: getting problem while indexing pdf files with pdfbox
You have NOT supplied an example of the text you extracted from the document, but let's assume that the interesting string is exactly what you expect.

Have you looked at your index with Luke to see if the data is there? I *strongly* suggest you get a copy of Luke (google "lucene luke") to examine indexes with.

The existence of the write.lock file suggests that you haven't closed your index prior to searching it, although flushing it would probably work. Be aware that you cannot see changes to an index if the reader you use was opened before the indexing operation. Also, there is some period of time when the indexed data is buffered by the writer, and I'm unsure (but doubt) it's available until it's been flushed.

I suspect that your problem is not related to PDF, but rather to whether you've properly indexed data and closed your index prior to searching it.

The other possibility is that your analyzer is parsing things "interestingly". StandardAnalyzer does some interesting things when tokenizing, including lowercasing the input stream, although that shouldn't have been a problem since you use the same analyzer for indexing and searching.

Also, try query.toString() to see what is actually searched; that often gives insights. The aforementioned Luke will allow you to submit queries to the index, including explaining what the actual query produced is.

What are the file sizes of your index files?

Best
Erick

On 7/17/07, neetika wrote:
> Before indexing, I have printed the doc, and I have given the output
> also; it is printing well.
> [...]
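(A minimal sketch of the pattern Erick is pointing at, reusing the TestPDFParser code from earlier in this thread -- the key line is the close(), and the query.toString() call is the diagnostic he suggests:)

    public static void main(String[] args) throws Exception {
        // Index, close the writer, and only then open the searcher.
        Document doc = new TestPDFParser().getDocument(
                new FileInputStream(new File("D:\\lucenePdf\\DRra0026.pdf")));

        IndexWriter writer = new IndexWriter("D:\\lucenePdf",
                new StandardAnalyzer(), true);
        writer.addDocument(doc);
        writer.close();  // flushes buffered docs and releases write.lock

        IndexSearcher searcher = new IndexSearcher("D:\\lucenePdf");
        Query query = new QueryParser("body",
                new StandardAnalyzer()).parse("RA0083");
        System.out.println("Actual query: " + query.toString("body"));
        Hits hits = searcher.search(query);
        System.out.println("Found " + hits.length() + " hits");
        searcher.close();
    }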
WildcardQuery and SpanQuery
Hi everybody,

We recently need to support the wildcard search terms "*" and "?" together with SpanQuery. It seems that there's no SpanWildcardQuery available. After looking into the Lucene source code for a while, I guess we can either:

1. Use SpanRegexQuery, or

2. Write our own SpanWildcardQuery, and implement the rewrite(IndexReader) method to rewrite the query into a SpanOrQuery over some SpanTermQuerys.

Of the two approaches, option 1 seems to be easier, but I am rather concerned about the performance of using regular expressions. On the other hand, I am not sure if there are any other concerns I am not aware of for option 2 (i.e. is there a reason why there's no SpanWildcardQuery in the first place?).

Any advice?

Cedric
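(For concreteness, a minimal sketch of what option 2 could look like against the Lucene 2.x API, modeled on how the contrib SpanRegexQuery rewrites itself; the class is hypothetical, and a cap on the number of expanded terms, which Paul Elschot's reply later in this digest notes is necessary, is omitted:)

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.WildcardTermEnum;
    import org.apache.lucene.search.spans.SpanOrQuery;
    import org.apache.lucene.search.spans.SpanQuery;
    import org.apache.lucene.search.spans.SpanTermQuery;

    // Hypothetical SpanWildcardQuery: expands the wildcard term against
    // the index and rewrites to a SpanOrQuery over plain SpanTermQuerys.
    public class SpanWildcardQuery extends SpanTermQuery {
        public SpanWildcardQuery(Term term) {
            super(term);
        }

        public Query rewrite(IndexReader reader) throws IOException {
            List spanTerms = new ArrayList();
            WildcardTermEnum enumerator = new WildcardTermEnum(reader, getTerm());
            try {
                do {
                    Term t = enumerator.term();
                    if (t != null) {
                        spanTerms.add(new SpanTermQuery(t));
                    }
                } while (enumerator.next());
            } finally {
                enumerator.close();
            }
            return new SpanOrQuery((SpanQuery[]) spanTerms.toArray(
                    new SpanQuery[spanTerms.size()]));
        }
    }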
Re: getting problem while indexing pdf files with pdfbox
Hi Erick,

I am able to get the result fine now. The problem was that I forgot to close the writer, and so the index file (.cfs) was not getting generated.

Thanks a lot for the timely help.

Regards,
Neetika

Erick Erickson wrote:
> You have NOT supplied an example of the text you extracted from the
> document, but let's assume that the interesting string is exactly what
> you expect.
>
> Have you looked at your index with Luke to see if the data is there?
> I *strongly* suggest you get a copy of Luke (google "lucene luke") to
> examine indexes with.
>
> The existence of the write.lock file suggests that you haven't closed
> your index prior to searching it.
> [...]
Re: Does Index have a Tokenizer Built into it
: After indexing I have been able to retrieve the TermPositionVector from the
: index and it has all of the data, but I cannot find a way where given a
: position I can retrieve the term at that position. Which is how I was hoping
: to create my contextual snippets.

There is no easy way to go from a position to a term -- coincidentally, there is a very recent thread on this on java-dev...

http://www.nabble.com/Best-Practices-for-getting-Strings-from-a-position-range-tf4084187.html

...a new API may come out of it, but in the meantime you may be interested in taking the approach the current highlighter uses (as mentioned in that thread) of using the TermPositionVector to rebuild the original token stream, then skipping ahead to the positions you are interested in.

-Hoss
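(A rough sketch of inverting the vector by hand along those lines -- this assumes the field was indexed with term vectors and positions enabled, and the map-based lookup is an illustration, not a Lucene API:)

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.TermPositionVector;

    public class PositionLookup {
        // Builds a position -> term map from a document's TermPositionVector
        // so a snippet builder can ask "which term sits at position p?".
        public static Map invert(IndexReader reader, int docId, String field)
                throws IOException {
            TermPositionVector tpv =
                (TermPositionVector) reader.getTermFreqVector(docId, field);
            String[] terms = tpv.getTerms();
            Map positionToTerm = new HashMap();
            for (int i = 0; i < terms.length; i++) {
                int[] positions = tpv.getTermPositions(i);
                for (int j = 0; j < positions.length; j++) {
                    positionToTerm.put(new Integer(positions[j]), terms[i]);
                }
            }
            return positionToTerm;
        }
    }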
Re: WildcardQuery and SpanQuery
On Wednesday 18 July 2007 05:58, Cedric Ho wrote:
> We recently need to support the wildcard search terms "*" and "?"
> together with SpanQuery. It seems that there's no SpanWildcardQuery
> available.
> [...]
> Any advice?

The basic problem you are facing is that in Lucene the expansion of the terms is tightly coupled to the generation of a combination query using the expanded terms.

In contrib/surround, the term expansion and the query generation are decoupled using a visitor pattern for the terms. The code is here:

http://svn.apache.org/viewvc/lucene/java/trunk/contrib/surround/src/java/org/apache/lucene/queryParser/surround/query

In surround, a wildcard term can provide either an OR of normal term queries or a SpanOrQuery of span term queries. This query generation is in the class SimpleTerm, which has one method for a normal boolean OR query over the terms and one for a span query over the terms.

In both cases surround uses a regular expression to expand the matching terms, but that could be changed to use other wildcard expansion mechanisms than the ones in SrndPrefixQuery and SrndTruncQuery, which are subclasses of SimpleTerm.

With the term expansion and the query combination split, it is also necessary to limit the maximum number of expanded terms in another way than Lucene does. In surround, the classes BasicQueryFactory and TooManyBasicQueries are used for that.

Regards,
Paul Elschot