Re: search through all fields
Sorry, I use Compass, an object mapper for Lucene, and it provides a special field "all"; I thought it was a Lucene feature.

M.

Renaud Waldura wrote:
> Documents can often be divided into "metadata" and "contents" sections.
> Say you're indexing Web pages: you could index all the HEAD data in one
> field and the BODY content in another, while also creating separate
> fields for every HEAD field, e.g. TITLE.
>
> At search time, you rewrite every query to become "+head:(query)
> +body:(query)" using MultiFieldQueryParser. This way you don't have to
> create an "all" field that contains everything, head + body.
>
> It will increase your index size, no doubt. It might increase indexing
> time too.
>
> --Renaud
>
> Mohammad Norouzi wrote:
> > On 7/14/07, Grant Ingersoll wrote:
> > > I think he means index all your different fields into a single field
> > > named "all". Not sure what makes it special; it is just like any
> > > other field.
> >
> > But that is really impossible! I have millions of records to be
> > indexed, so this would slow down indexing and increase the index size.
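(As an aside, a minimal sketch of the rewrite Renaud describes, against the Lucene 2.x API; the field names "head" and "body" come from his example, and the wrapper class is only for illustration:)

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryParser.MultiFieldQueryParser;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.Query;

    public class HeadBodyQuery {
        // Rewrites a user query into "+head:(query) +body:(query)" so that
        // both sections must match, with no combined "all" field needed.
        public static Query rewrite(String userQuery) throws Exception {
            return MultiFieldQueryParser.parse(
                userQuery,
                new String[] { "head", "body" },
                new BooleanClause.Occur[] { BooleanClause.Occur.MUST,
                                            BooleanClause.Occur.MUST },
                new StandardAnalyzer());
        }
    }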
Re: search through all fields
Mathieu,

I need an object mapper for Lucene. Would you please give me the Compass web site? Is it open source?

Thanks.

On 7/17/07, Mathieu Lecarme wrote:
> Sorry, I use Compass, an object mapper for Lucene, and it provides a
> special field "all"; I thought it was a Lucene feature.
> [...]

--
Regards,
Mohammad
--
see my blog: http://brainable.blogspot.com/ another in Persian:
http://fekre-motefavet.blogspot.com/
Re: search through all fields
http://www.opensymphony.com/compass/

The project is free, it follows new Lucene versions quickly, the forum is great, and the lead developer reacts quickly.

M.

Mohammad Norouzi wrote:
> Mathieu,
> I need an object mapper for Lucene. Would you please give me the Compass
> web site? Is it open source?
>
> Thanks.
getting problem while indexing pdf files with pdfbox
http://www.nabble.com/file/p11647342/DRra0026.pdf DRra0026.pdf

Hi all,

I am able to convert a PDF into a text file using PDFBox, and this is the code that I used, but I am not able to index it.

    // code for parsing and making index
    public Document getDocument(InputStream is) {
        COSDocument cosDoc = null;
        try {
            PDFParser parser = new PDFParser(is);
            parser.parse();
            cosDoc = parser.getDocument();
        } catch (IOException e) {
            e.printStackTrace();
        }
        String docText = null;
        try {
            PDFTextStripper stripper = new PDFTextStripper();
            docText = stripper.getText(new PDDocument(cosDoc));
        } catch (IOException e) {
            e.printStackTrace();
        }
        Document doc = new Document();
        if (docText != null) {
            doc.add(new Field("body", docText, Field.Store.YES,
                    Field.Index.TOKENIZED));
        }
        return doc;
    }

    public static void main(String[] args) throws Exception {
        TestPDFParser handler = new TestPDFParser();
        Document doc = handler.getDocument(new FileInputStream(
                new File("D:\\lucenePdf\\DRra0026.pdf")));
        System.out.println(doc);

        // Following code is for making index
        IndexWriter f_writer = new IndexWriter("D:\\lucenePdf",
                new StandardAnalyzer(), true);
        f_writer.addDocument(doc);
    }

    // code for searching a particular string
    public static void main(String[] args) throws Exception {
        String indexDir = "D:\\lucenePdf";
        String q = "RA0083";

        Directory fsDir = FSDirectory.getDirectory(indexDir);
        IndexSearcher is = new IndexSearcher(fsDir);

        Query query = new QueryParser("body",
                new StandardAnalyzer()).parse(q);
        Hits hits = is.search(query);
        System.out.println("Found " + hits.length() +
                " documents that matched query '" + q + "':");
        for (int i = 0; i < hits.length(); i++) {
            Document doc = hits.doc(i);
        }
    }

When I run the above code, I get the following output as a result of running the indexer class:

    Document<... : RA0083
    99062620002100100220468148001102006PAYOUT : RA0083
    99062630002100100330468153601102006PAYOUT : RA0083
    99062647002100100440468155401102006PAYOUT : RA0083
    99062657002100100550468156201102006PAYOUT : RA0083
    ...>

and the following files are generated in the specified path:

    segments.gen
    write.lock
    segments_4

But when I run the search class it gives the result:

    Found 0 documents that matched query 'RA0083':

I am also attaching the corresponding PDF file for reference. It seems as if the index is not getting created. Please help me with some of your inputs; it will be very helpful for me.
Re: getting problem while indexing pdf files with pdfbox
Offhand, I'd assume that your problem is in your use of PDFBox. Have you tried printing out the docText string you get back from

    docText = stripper.getText(new PDDocument(cosDoc));

I'd recommend you assure yourself that you get valid text back from the PDF document before worrying about indexing it.

Best
Erick

On 7/17/07, neetika wrote:
> I am able to convert a PDF into a text file using PDFBox, and this is
> the code that I used, but I am not able to index it.
> [...]
Re: Does Index have a Tokenizer Built into it
Hi,

I've been looking into indexing documents with term vectors that store positions in order to solve my problem. However, I've run into a bit of a snag. After indexing, I have been able to retrieve the TermPositionVector from the index, and it has all of the data, but I cannot find a way, given a position, to retrieve the term at that position -- which is how I was hoping to create my contextual snippets. There are methods where, given a term, you can get its positions, but I see no method to achieve the reverse. Is there another class I need to use for this?

--JP

On 7/16/07, John Paul Sondag wrote:
> Some of the data sets that we will be using have about 2 TB of data (90
> million web pages). I would like the snippets I generate to include the
> words that are being queried, so I don't want to simply store the first
> 2 or 3 lines. I have looked at the HighlighterTest, and I do believe
> that it requires the entire text of the document. However, unlike the
> highlighter, I know the term offsets in the document. The input to my
> snippet generator will be a vector of query words and their offsets in
> the document (not their positions in the document).
>
> I'm reading about the "term vectors" option I can enable while indexing
> my data. It seems to be much more efficient than storing the entire
> document; I'm just not sure if a "term offset" is the same as a "token
> offset". Here's what I'm reading, in case I'm totally off the ball here
> and this is useless to me:
>
> http://lucene.apache.org/java/docs/fileformats.html#Term%20Vectors
>
> It seems like this has all the information that I would have if I
> tokenized the document anyway, or am I missing something?
>
> Thanks again for all the help!
>
> --JP
>
> On 7/16/07, Ard Schrijvers wrote:
> > Hello,
> >
> > > Ard,
> > >
> > > I do have access to the URLs of the documents, but because I will
> > > be making short snippets for many pages (suppose there are about 20
> > > hits per page and I need to make snippets for each of them), I was
> > > worried it would be inefficient to open each "hit", tokenize it,
> > > and then make the snippet, of
> >
> > Yes, getting all the documents over http just to get the snippet, for
> > example the first 2 lines, is really bad for your performance in
> > search overviews.
> >
> > Logically, what you want to show, you need to store in your index.
> > For example, if for search hits you need to show the title and
> > subtitle, just store these two in the index. If you want a
> > Google-like highlighter of text snippets where the term occurred, you
> > need to store the entire text IIRC (see HighlighterTest in Lucene).
> >
> > How many docs are you talking about, that you cannot store the entire
> > content?
> >
> > You could also just index the content and not store it, and in
> > another Lucene field store the first 2 or 3 lines of the document,
> > which would serve as the text snippet. Making correct extracts of
> > text snippets is very hard (see LingPipe, for example).
> >
> > Regards Ard
> >
> > > course the price of this may be worth the price of the increased
> > > index size. I have been looking into storing field vectors with
> > > positions in the index. It seems that by doing this I will have
> > > access to everything that the tokenizer is giving me, correct? Will
> > > I need to store "term text" in order to be able to access the
> > > actual term instead of stemmed words?
> > >
> > > Thanks for all your help,
> > >
> > > --JP
> > >
> > > On 7/13/07, Ard Schrijvers wrote:
> > > > Hello,
> > > >
> > > > > I'm wondering if after opening the index I can retrieve the
> > > > > tokens (not the terms) of a document, something akin to
> > > > > IndexReader.document(n).getTokenizer().
> > > >
> > > > It is obviously not possible to get the original tokens of the
> > > > document back when you haven't stored the document, because:
> > > >
> > > > 1) the analyzer might have removed stop words in the first place
> > > > 2) the terms in a Lucene index are perhaps stemmed words /
> > > >    synonyms / etc.
> > > > 3) how would you expect things like spaces, commas, dots, etc. to
> > > >    be restored?
> > > >
> > > > And I think what you want does not comply with an inverted index.
> > > > When you do not store the document, you always lose information
> > > > about the document during indexing/analyzing.
> > > >
> > > > How many documents are you talking about? They must be either
> > > > somewhere on the FS or accessible over http... when you need the
> > > > document, why not just provide a link to the original location?
> > > >
> > > > Regards Ard
> > > >
> > > > > In summary:
> > > > >
> > > > > My current (too wasteful) implementation is this:
> > > > >
> > > > > StandardTokenizer(BufferedReader(
> > > > >     IndexReader.document(n).getField("text")))
> > > > >
> > > > > I'm wondering if Lucene has a more efficient manner to retrieve
> > > > > the tokens of a document from an index, because it seems like
> > > > > it has information about
Re: getting problem while indexing pdf files with pdfbox
Hi Erick,

Before indexing, I have printed the doc, and I have given the output also; it is printing well. Kindly please check my post again, in particular:

    System.out.println(doc);
    // Following code is for making index

The corresponding output is:

    Document<... : RA0083
    99062620002100100220468148001102006PAYOUT : RA0083
    99062630002100100330468153601102006PAYOUT : RA0083
    99062647002100100440468155401102006PAYOUT : RA0083
    99062657002100100550468156201102006PAYOUT : RA0083
    ...>

which is as expected, but my problem is that the index file is not getting generated.

Please help.

Erick Erickson wrote:
> Offhand, I'd assume that your problem is in your use of PDFBox. Have you
> tried printing out the docText string you get back from
>
>     docText = stripper.getText(new PDDocument(cosDoc));
>
> I'd recommend you assure yourself that you get valid text back from
> the PDF document before worrying about indexing it.
> [...]
Re: getting problem while indexing pdf files with pdfbox
You have NOT supplied an example of the text you extracted from the document, but let's assume that the interesting string is exactly what you expect.

Have you looked at your index with Luke to see if the data is there? I *strongly* suggest you get a copy of Luke (google "lucene luke") to examine indexes with.

The existence of the write.lock file suggests that you haven't closed your index prior to searching it, although flushing it would probably work. Be aware that you cannot see changes to an index if the reader you use was opened before the indexing operation. Also, there is some period of time when the indexed data is buffered by the writer, and I'm unsure (but doubt) it's available until it's been flushed.

I suspect that your problem is not related to PDF, but rather to whether you've properly indexed data and closed your index prior to searching it.

The other possibility is that your analyzer is parsing things "interestingly". StandardAnalyzer does some interesting things when tokenizing, including lowercasing the input stream, although that shouldn't have been a problem since you use the same analyzer for indexing and searching.

Also, try query.toString() to see what is actually searched; that often gives insights. The aforementioned Luke will allow you to submit queries to the index, including explaining what the actual query produced is.

What are the file sizes of your index files?

Best
Erick

On 7/17/07, neetika wrote:
> Before indexing, I have printed the doc, and I have given the output
> also; it is printing well.
> [...]
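(A minimal sketch of the pattern Erick is pointing at, reusing the TestPDFParser code from earlier in this thread -- the key line is the close(), and the query.toString() call is the diagnostic he suggests:)

    public static void main(String[] args) throws Exception {
        // Index, close the writer, and only then open the searcher.
        Document doc = new TestPDFParser().getDocument(
                new FileInputStream(new File("D:\\lucenePdf\\DRra0026.pdf")));

        IndexWriter writer = new IndexWriter("D:\\lucenePdf",
                new StandardAnalyzer(), true);
        writer.addDocument(doc);
        writer.close();  // flushes buffered docs and releases write.lock

        IndexSearcher searcher = new IndexSearcher("D:\\lucenePdf");
        Query query = new QueryParser("body",
                new StandardAnalyzer()).parse("RA0083");
        System.out.println("Actual query: " + query.toString("body"));
        Hits hits = searcher.search(query);
        System.out.println("Found " + hits.length() + " hits");
        searcher.close();
    }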
WildcardQuery and SpanQuery
Hi everybody,

We recently need to support the wildcard search terms "*" and "?" together with SpanQuery. It seems that there's no SpanWildcardQuery available. After looking into the Lucene source code for a while, I guess we can either:

1. Use SpanRegexQuery, or

2. Write our own SpanWildcardQuery, and implement the rewrite(IndexReader) method to rewrite the query into a SpanOrQuery over some SpanTermQuerys.

Of the two approaches, option 1 seems to be easier, but I am rather concerned about the performance of using regular expressions. On the other hand, I am not sure if there are any other concerns I am not aware of for option 2 (i.e. is there a reason why there's no SpanWildcardQuery in the first place?).

Any advice?

Cedric
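(For concreteness, a minimal sketch of what option 2 could look like against the Lucene 2.x API, modeled on how the contrib SpanRegexQuery rewrites itself; the class is hypothetical, and a cap on the number of expanded terms, which Paul Elschot's reply later in this digest notes is necessary, is omitted:)

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.WildcardTermEnum;
    import org.apache.lucene.search.spans.SpanOrQuery;
    import org.apache.lucene.search.spans.SpanQuery;
    import org.apache.lucene.search.spans.SpanTermQuery;

    // Hypothetical SpanWildcardQuery: expands the wildcard term against
    // the index and rewrites to a SpanOrQuery over plain SpanTermQuerys.
    public class SpanWildcardQuery extends SpanTermQuery {
        public SpanWildcardQuery(Term term) {
            super(term);
        }

        public Query rewrite(IndexReader reader) throws IOException {
            List spanTerms = new ArrayList();
            WildcardTermEnum enumerator = new WildcardTermEnum(reader, getTerm());
            try {
                do {
                    Term t = enumerator.term();
                    if (t != null) {
                        spanTerms.add(new SpanTermQuery(t));
                    }
                } while (enumerator.next());
            } finally {
                enumerator.close();
            }
            return new SpanOrQuery((SpanQuery[]) spanTerms.toArray(
                    new SpanQuery[spanTerms.size()]));
        }
    }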
Re: getting problem while indexing pdf files with pdfbox
Hi Erick,

I am able to get the result fine now. The problem was that I forgot to close the writer, and so the index file (.cfs) was not getting generated.

Thanks a lot for the timely help.

Regards,
Neetika

Erick Erickson wrote:
> You have NOT supplied an example of the text you extracted from the
> document, but let's assume that the interesting string is exactly what
> you expect.
>
> Have you looked at your index with Luke to see if the data is there?
> I *strongly* suggest you get a copy of Luke (google "lucene luke") to
> examine indexes with.
>
> The existence of the write.lock file suggests that you haven't closed
> your index prior to searching it.
> [...]
Re: Does Index have a Tokenizer Built into it
: After indexing I have been able to retrieve the TermPositionVector from the
: index and it has all of the data, but I cannot find a way where given a
: position I can retrieve the term at that position. Which is how I was hoping
: to create my contextual snippets.

There is no easy way to go from a position to a term -- coincidentally, there is a very recent thread on this on java-dev...

http://www.nabble.com/Best-Practices-for-getting-Strings-from-a-position-range-tf4084187.html

...a new API may come out of it, but in the meantime you may be interested in taking the approach the current highlighter uses (as mentioned in that thread) of using the TermPositionVector to rebuild the original token stream, then skipping ahead to the positions you are interested in.

-Hoss
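(A rough sketch of inverting the vector by hand along those lines -- this assumes the field was indexed with term vectors and positions enabled, and the map-based lookup is an illustration, not a Lucene API:)

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.TermPositionVector;

    public class PositionLookup {
        // Builds a position -> term map from a document's TermPositionVector
        // so a snippet builder can ask "which term sits at position p?".
        public static Map invert(IndexReader reader, int docId, String field)
                throws IOException {
            TermPositionVector tpv =
                (TermPositionVector) reader.getTermFreqVector(docId, field);
            String[] terms = tpv.getTerms();
            Map positionToTerm = new HashMap();
            for (int i = 0; i < terms.length; i++) {
                int[] positions = tpv.getTermPositions(i);
                for (int j = 0; j < positions.length; j++) {
                    positionToTerm.put(new Integer(positions[j]), terms[i]);
                }
            }
            return positionToTerm;
        }
    }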
Re: WildcardQuery and SpanQuery
On Wednesday 18 July 2007 05:58, Cedric Ho wrote:
> We recently need to support the wildcard search terms "*" and "?"
> together with SpanQuery. It seems that there's no SpanWildcardQuery
> available.
> [...]
> Any advice?

The basic problem you are facing is that in Lucene the expansion of the terms is tightly coupled to the generation of a combination query using the expanded terms.

In contrib/surround, the term expansion and the query generation are decoupled using a visitor pattern for the terms. The code is here:

http://svn.apache.org/viewvc/lucene/java/trunk/contrib/surround/src/java/org/apache/lucene/queryParser/surround/query

In surround, a wildcard term can provide either an OR of normal term queries or a SpanOrQuery of span term queries. This query generation is in the class SimpleTerm, which has one method for a normal boolean OR query over the terms and one for a span query over the terms.

In both cases surround uses a regular expression to expand the matching terms, but that could be changed to use other wildcard expansion mechanisms than the ones in SrndPrefixQuery and SrndTruncQuery, which are subclasses of SimpleTerm.

With the term expansion and the query combination split, it is also necessary to limit the maximum number of expanded terms in another way than Lucene does. In surround, the classes BasicQueryFactory and TooManyBasicQueries are used for that.

Regards,
Paul Elschot