On 6/6/05, Franz Ruebe <[EMAIL PROTECTED]> wrote:
> I set my CLASSPATH in Windows and la voilà, there's a lucene.log. Setting my
> classpath was not necessary before, because I had an ant 1.5.1 in my
> PATH-Variable, which I deleted after the new installation. Does this sound
> reasonable?
> >If it is in the correct directory, the errors go away and "lucene.log"
> >is created. (Like the filename? I invented it myself!)
> The login errors went away, and I like the name very muchacho (mirrors your
> genius ;-)) .
> But is the location sensible? Shouldn't it be a log dir (There are still not
> enough dirs in lenya).

Good. The logging is working. Now if it were only useful...
Since I was starting the indexing manually from the ant directory, it
made sense to have the log appear in the same directory, and that is
where it defaults. You should be able to specify a full filepath in
the log4j.properties.

> But the log is empty (creation date is crawl-try-date) and I still got a
> bunch of Java errors. So just the old problem...

Bad. It sounds like your Lucene has a bug in the crawler, but since
nobody else noticed it, it is more likely your configuration is
incorrect.

> So once again some questions:
> What to do when just want to use the 'normal way of searching' (I agree
> after understanding your way indexing the xml is much smarter, but one after
> the other...)?

Then my instructions are not what you want. I thought about combining
results from indexing crawled HTML, and indexing the XML. Combining
them would require many xsl:choose in the XSL for results because the
URLs do not need to be rewritten for HTML entries. Also, crawled HTML
does not have a language field, and the description is usually rather
messy. I decided the easiest method would be separate indexes and
search pages. I only need the XML indexed for my project, so I
stopped working on it when that was working.

> I made a crawler-live.xconf (In
> http://lenya.apache.org/1_2_x/components/search/lucene.html is a mistake, I
> think)
> "Note that there is a search.properties file in build/lenya/webapp/lenya/bin
> that you may have to change. crawler.xconf needs to have the following
> elements:"
> Shouldn't that be crawler-live.xconf as written in the ant command?
> Here is my crawler-live.xconf
> <?xml version="1.0"?>
> <crawler>
> <user-agent>lenya</user-agent>
> <base-url href="http://localhost:8888/mypub/live/index.html"/>
> <scope-url href="http://localhost:8888/mypub/live/"/>
> <uri-list src="work/search/lucene/uris.txt"/>
> <htdocs-dump-dir
> src="work/search/lucene/htdocs_dump/live/lenya.apache.org"/>
> </crawler>
> "Note that there is a search.properties..."
> Ok, in there's just the webapp.dir=../../ ; seems to be ok...

Can't open lenya.apache.org now. (Is it down, or is it Comcast?)
I'll look at it later, but I have no authority on that site.

> >org.apache.avalon.excalibur.io.FileUtil.catPath(FileUtil.java:509)
> >forgot to check the bounds before using substring. Might work if you
> >wrap the code:
> >if(lowerbound >= 0){ newString = oldString.substring(lowerbound [,
> >upperbound]);
> >}
> >but it would be better to read the code and figure out why the
> >lowerbound is -1. Usually the -1 comes from searching with indexOf()
> >or lastIndexOf() for a substring that is not there.
> Wouldn't that mean, that everybody would have the same probs in using the
> crawler?
> Does anybody else use the crawler?
> I'm not (yet??) the right guy to change java-classes.

I am the right guy for changing Java classes, but I tried to avoid it
for this project. The closest was rewriting search.xsp. It is
probably missing or incorrect configuration. I do not write software
without reasonable defaults for missing/bad configuration, or errors
that don't tell how to fix, but that's just me; most programmers think
more obscure errors are better. (I do not have the time to research
it; it does not affect my current project. Sorry.)

> > > Is there a way to get an index without crawling?
> >Yes. That was the point of the How-To. I wanted to filter the
> >results based on language and filepath, and that is easier working
> >from Lenya's XML than from HTML.
> >It seems silly to crawl the website
> >and put a copy on the hard drive, when all the contents are already on
> >the hard drive in a much better format.
> How embarrassing. I read the stuff several times, but obviously didn't
> understand it. Were the red lines on your site
> http://solprovider.com/lenya/search already there yesterday? Or is
> this a tribute to my reading-over-without-understanding? This line in the
> apache how-to would spare dudes like me a lot of time.

Most of the changes are a tribute to you! The only other change was
the link to the Linux version. Documentation should be an interactive
process (which is why Wikis are so popular.) I will ask Gregor to
update the official How-To when we are done.

> Sorry for BSE

Please define "BSE" (if it is allowed in polite company.)

---
Why do you need the crawled HTML version of Search? Should I add
reasons about when my version is not appropriate?

solprovider
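
P.S. For the substring bounds check quoted earlier in this message, here is a minimal sketch of the idea. It is not the Excalibur FileUtil.catPath code, which I have not read; the class name SafeSubstring and the method beforeLast are hypothetical, and returning the input unchanged when the separator is missing is only an assumed default. The point is to test the result of lastIndexOf() before handing it to substring().

```java
// Hypothetical helper, NOT the Excalibur FileUtil code. It only illustrates
// checking the index returned by lastIndexOf() before calling substring().
final class SafeSubstring {

    // Returns the part of value before the last occurrence of separator.
    // When the separator is absent, lastIndexOf() returns -1, and
    // value.substring(0, -1) would throw StringIndexOutOfBoundsException.
    // Assumed default for this sketch: return value unchanged in that case.
    static String beforeLast(String value, String separator) {
        int upperbound = value.lastIndexOf(separator);
        if (upperbound < 0) {
            return value;
        }
        return value.substring(0, upperbound);
    }

    public static void main(String[] args) {
        // Separator present: prints "work/search/lucene"
        System.out.println(beforeLast("work/search/lucene/uris.txt", "/"));
        // Separator absent: prints "uris.txt" instead of throwing
        System.out.println(beforeLast("uris.txt", "/"));
    }
}
```

A guard like this only hides the symptom. If the -1 comes from a missing or incorrect path in the configuration, fixing the configuration is still the better repair.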
