Parsing and Indexing XML Docs
I am having problems with the lucene-sandbox/contributions/XML-Indexing-Demo. I get the following error when I index my XML documents with the SAX parser in Java 1.4.1:

java.lang.StringIndexOutOfBoundsException: String index out of range: 200
    at org.apache.crimson.parser.Parser2.parseInternal(Parser2.java:524)
    at org.apache.crimson.parser.Parser2.parse(Parser2.java:305)
    at org.apache.crimson.parser.XMLReaderImpl.parse(XMLReaderImpl.java:442)
    at org.xml.sax.helpers.XMLReaderAdapter.parse(XMLReaderAdapter.java:223)
    at javax.xml.parsers.SAXParser.parse(SAXParser.java:314)
    at javax.xml.parsers.SAXParser.parse(SAXParser.java:253)
    at org.apache.lucenesandbox.xmlindexingdemo.XMLDocumentHandlerSAX.init(XMLDocumentHandlerSAX.java:34)

I thought it might be related to the deprecation warnings I get when I build the XML demo, so I replaced the deprecated calls, mostly by extending DefaultHandler instead of HandlerBase. Now my XML doc is parsed, but no events are generated that call startElement() and stopElement(). I need stopElement() to be called to add the field to my Lucene document.

Has anyone else had problems like this?

Thanks,
Dave Kendig
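One possible cause of the missing events, offered as a guess rather than a diagnosis of the demo code: the SAX2 DefaultHandler callbacks take different arguments than the SAX1 HandlerBase ones, and the end-of-element callback is named endElement() (the SAX interfaces have no stopElement()). Methods that keep the old one-argument or two-argument signatures compile fine but never override anything, so the parser never calls them. The sketch below uses only JDK classes; FieldCollector and its parse() helper are hypothetical names for illustration, and it uses newer Java syntax than 1.4.1 supports.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects "element=text" pairs for leaf elements.
class FieldCollector extends DefaultHandler {
    final List<String> fields = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();

    // SAX2 signature: four arguments, not HandlerBase's (String, AttributeList).
    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        text.setLength(0); // start collecting text for this element
    }

    // May be called several times per element, so append rather than assign.
    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    // SAX2 signature: three arguments, not HandlerBase's (String).
    @Override
    public void endElement(String uri, String localName, String qName) {
        String value = text.toString().trim();
        if (!value.isEmpty()) {
            // This is the point where a field would be added to the Lucene document.
            fields.add(qName + "=" + value);
        }
        text.setLength(0);
    }

    // Parses an XML string and returns the collected pairs.
    static List<String> parse(String xml) {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            FieldCollector handler = new FieldCollector();
            parser.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
            return handler.fields;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<Temporal_Coverage><Start_Date>1968-01-01</Start_Date>"
                   + "<Stop_Date>1997-12-31</Stop_Date></Temporal_Coverage>";
        // prints [Start_Date=1968-01-01, Stop_Date=1997-12-31]
        System.out.println(parse(xml));
    }
}
```

Marking each callback with @Override makes the compiler reject a method whose signature does not match DefaultHandler, which catches this kind of silent mismatch at build time.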
Re: Parsing and Indexing XML Docs
Bummer, I get the same thing with Xerces. I do not suspect the XML file itself, since it comes from a separate app that has been operational for over a year. Does anyone maintain the sandbox contributions?

Dave

Traceback (innermost last):
  File ./indexTest.py, line 22, in ?
java.lang.StringIndexOutOfBoundsException: String index out of range: 200
    at org.apache.xerces.framework.XMLParser.parse(XMLParser.java:)
    at org.xml.sax.helpers.XMLReaderAdapter.parse(XMLReaderAdapter.java:223)
    at javax.xml.parsers.SAXParser.parse(SAXParser.java:314)
    at javax.xml.parsers.SAXParser.parse(SAXParser.java:253)
    at org.apache.lucenesandbox.xmlindexingdemo.XMLDocumentHandlerSAX.init(XMLDocumentHandlerSAX.java:34)
    at org.apache.lucenesandbox.xmlindexingdemo.IndexFiles.indexDocs(IndexFiles.java:104)

Doesn't that look like an error in Crimson? If I were you I'd use Xerces instead. I always had a better feeling about Xerces, and I think that demo code doesn't have anything Crimson-specific hard-coded in it.

Otis

To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
Re: Book
Craig,

I do not subscribe to Java Developer's Journal. Are the articles online? Or could the article be posted here after it is published?

Thanks,
Dave Kendig

There is a book by Wrox called Professional JSP Site Design (I think) that has a chapter on searching and mentions Lucene, but its coverage of Lucene is *VERY* thin. I wouldn't recommend this book for learning Lucene.

I have an article on Lucene to appear in December's Java Developer's Journal. It's not as complete a coverage of Lucene as I would have liked, but with limited space in a magazine I couldn't go into much more than an introduction. I'd probably have written it differently if I had it to do over again. Oh well. Let me know what you think of the article when it comes out.

William W wrote:

I would like to buy a book about Lucene. Who could write it? :)
Re: Lucene and XML
Rob,

I found it under the Lucene 'contributions' page on the main web site. Apparently ISOGEN is a commercial company that open-sourced their XML extension to Lucene. It seems very nice and well thought out, but I do wonder who maintains the contributed code.

Dave

Hello all,

I did not know there were packages like ISOGEN that used Lucene to build a searchable index based on XML files. From visiting ISOGEN's website it looks like it is commercial software. Are there any open source extensions to Lucene that allow XML indexing and searching? Please let me know.

Thanks again,
Rob

To unsubscribe, e-mail: mailto:lucene-user-unsubscribe;jakarta.apache.org
For additional commands, e-mail: mailto:lucene-user-help;jakarta.apache.org
Date Search Problem
I have XML documents that I indexed using Lucene with ISOGEN's XML package. I am unable to get the date search working properly. First let me describe how I set things up. The document has these fields:

    <Temporal_Coverage>
      <Start_Date>1968-01-01</Start_Date>
      <Stop_Date>1997-12-31</Stop_Date>
    </Temporal_Coverage>

They are indexed and added to an org.apache.lucene.document.Document:

    contentDoc.add(new Field("Start_Date", startDate, false, true, false));

I build a query (in a Jython servlet that imports the Lucene packages):

    # if a date range is supplied, use a date filter
    dateFormat = SimpleDateFormat("yyyy-MM-dd");
    dateFilter = DateFilter.After("Start_Date", dateFormat.parse("2001-02-03"))
    hits = self.searcher.search(lucQuery, dateFilter)

When DateFilter.After() is called above, I print the value of the start attribute, which is declared as a string, and this is what I get:

    DateFilter.After().start=0ciqv3fk0

But in DateFilter.bits() it is comparing against this:

    Enum(0)=Term<Start_Date:1000-01-01>

So could someone please point me in the right direction? I must be missing something here, because it looks like it is comparing 0ciqv3fk0 to Term<Start_Date:1000-01-01>, and that is obviously wrong.

I scoured the FAQ and the mailing list archives, and the information on how to search using DateField is minimal. The API docs help, but it is not clear to me how to put the APIs together. Unfortunately, the demo isn't much better at showing how to search using arbitrary date formats.

Thanks,
Dave Kendig
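The printed value 0ciqv3fk0 is consistent with a millisecond timestamp written in base 36 and left-padded with zeros to 9 characters. The sketch below reproduces it with JDK classes only, under two assumptions that are not stated in the message: that the filter encodes its bounds this way (which appears to be what Lucene's DateField does), and that the date was parsed at midnight in a UTC-5 time zone. DateTermSketch, encode() and encodeIsoDate() are hypothetical names for illustration, not Lucene API.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.TimeZone;

// Hypothetical helper that mimics the apparent term encoding; not part of Lucene.
class DateTermSketch {
    static final int WIDTH = 9; // assumed fixed width, matching the 9-character value above

    // Base-36 milliseconds since the epoch, zero-padded so that string order equals time order.
    static String encode(long millis) {
        if (millis < 0) {
            throw new IllegalArgumentException("this sketch does not handle dates before 1970");
        }
        StringBuilder sb = new StringBuilder(Long.toString(millis, Character.MAX_RADIX));
        while (sb.length() < WIDTH) {
            sb.insert(0, '0');
        }
        return sb.toString();
    }

    // Parses a yyyy-MM-dd string at midnight in the given zone, then encodes it.
    static String encodeIsoDate(String isoDate, String zoneId) {
        SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd");
        format.setTimeZone(TimeZone.getTimeZone(zoneId));
        try {
            return encode(format.parse(isoDate).getTime());
        } catch (ParseException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String bound = encodeIsoDate("2001-02-03", "GMT-05:00");
        System.out.println(bound); // prints 0ciqv3fk0

        // A term indexed as the raw text "1000-01-01" lives in a different
        // encoding, so comparing it to the bound says nothing about the dates.
        System.out.println(bound.compareTo("1000-01-01") < 0); // prints true
    }
}
```

If that reading is right, the filter can only work on a field whose terms use the same encoding, so Start_Date would need to be indexed from a parsed Date (for example through DateField.dateToString, if the API docs are read correctly) rather than from the raw yyyy-MM-dd text. Dates before 1970, such as 1968-01-01, would need separate handling that this sketch leaves out.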
Lucene and Geographic Searching
Hi,

I'm very interested in migrating our current search engine to use Lucene. After evaluating Lucene, I have become very impressed and have been telling lots of people about it.

One requirement that we have is to be able to search our documents by specifying a geographical boundary. I searched everything I could find on Lucene, but I barely found any mention of anyone using it for such a purpose. My XML documents contain both temporal and spatial information that I would like my users to be able to search on.

Does such a thing exist for Lucene? Is there an easy way to do this with Lucene? Is there interest in adding this type of functionality to Lucene if it doesn't exist? Could something like GeoTools or some other Java toolkit be integrated into Lucene? I would even offer my help to make it so, if there is a need.

David Kendig
Global Change Master Directory
GSFC/NASA
http://globalchange.nasa.gov