On 6/6/05, Franz Ruebe <[EMAIL PROTECTED]> wrote:
> I set my CLASSPATH in Windows and voilà, there's a lucene.log. Setting my
> classpath was not necessary before, because I had an ant 1.5.1 in my
> PATH variable, which I deleted after the new installation. Does this sound
> reasonable?
> >If it is in the correct directory, the errors go away and "lucene.log"
> >is created.  (Like the filename?  I invented it myself!)
> The login errors went away, and I like the name very muchacho (mirrors your
> genius ;-)) .
> But is the location sensible? Shouldn't it be in a log dir? (There are still
> not enough dirs in Lenya.)

Good.  The logging is working.  Now if only it were useful...

Since I was starting the indexing manually from the ant directory, it
made sense to have the log appear in the same directory, and that is
where it defaults.  You should be able to specify a full filepath in
the log4j.properties.

> But the log is empty (creation date is crawl-try-date) and I still got a
> bunch of Java errors. So just the old problem...

Bad.  It sounds like your Lucene has a bug in the crawler, but since
nobody else noticed it, it is more likely your configuration is
incorrect.

> So once again some questions:
> What should I do when I just want to use the 'normal way of searching'? (I
> agree that, now that I understand it, your way of indexing the XML is much
> smarter, but one thing after the other...)

Then my instructions are not what you want.  I thought about combining
results from indexing crawled HTML, and indexing the XML.  Combining
them would require many xsl:choose in the XSL for results because the
URLs do not need to be rewritten for HTML entries.  Also, crawled HTML
does not have a language field, and the description is usually rather
messy.  I decided the easiest method would be separate indexes and
search pages.  I only need the XML indexed for my project, so I
stopped working on it when that was working.

> I made a crawler-live.xconf (there is a mistake in
> http://lenya.apache.org/1_2_x/components/search/lucene.html, I
> think):
> "Note that there is a search.properties file in build/lenya/webapp/lenya/bin
> that you may have to change. crawler.xconf needs to have the following
> elements:"
> Shouldn't that be crawler-live.xconf as written in the ant command?
> Here is my crawler-live.xconf
> <?xml version="1.0"?>
> <crawler>
>   <user-agent>lenya</user-agent>
>   <base-url href="http://localhost:8888/mypub/live/index.html"/>
>   <scope-url href="http://localhost:8888/mypub/live/"/>
>   <uri-list src="work/search/lucene/uris.txt"/>
>   <htdocs-dump-dir src="work/search/lucene/htdocs_dump/live/lenya.apache.org"/>
> </crawler>
> "Note that there is a search.properties..."
> OK, there's just webapp.dir=../../ in there; seems to be OK...

Can't open lenya.apache.org now.  (Is it down, or is it Comcast?) 
I'll look at it later, but I have no authority on that site.

> >org.apache.avalon.excalibur.io.FileUtil.catPath(FileUtil.java:509)
> >forgot to check the bounds before using substring.  Might work if you
> >wrap the code:
> >if (lowerbound >= 0) {
> >    newString = oldString.substring(lowerbound [, upperbound]);
> >}
> >but it would be better to read the code and figure out why the
> >lowerbound is -1.  Usually the -1 comes from searching with indexOf()
> >or lastIndexOf() for a substring that is not there.
> Wouldn't that mean that everybody would have the same problems using the
> crawler?
> Does anybody else use the crawler?
> I'm not (yet??) the right guy to change Java classes.

I am the right guy for changing Java classes, but I tried to avoid it
for this project.  The closest was rewriting search.xsp.

It is probably missing or incorrect configuration.  I do not write
software without reasonable defaults for missing or bad configuration,
or with errors that do not say how to fix the problem, but that's just
me; most programmers think more obscure errors are better.  (I do not
have the time to research it; it does not affect my current project.
Sorry.)
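
For reference, here is a minimal sketch of the kind of bounds check I
described for the substring call.  It is not the actual code in
FileUtil.catPath (I have not read it); the class name SafeSubstring, the
method afterLast, and the fallback parameter are all hypothetical, made
up for illustration.  It only shows the general pattern: test the result
of lastIndexOf() before handing it to substring().

```java
final class SafeSubstring {
    private SafeSubstring() {}

    /**
     * Returns the part of path that follows the last occurrence of
     * separator, or fallback when the separator does not occur at all.
     * (Hypothetical helper, not part of Avalon, Lucene or Lenya.)
     */
    static String afterLast(String path, String separator, String fallback) {
        int index = path.lastIndexOf(separator);
        if (index < 0) {
            // lastIndexOf() returns -1 when nothing matches; passing -1
            // straight to substring() throws StringIndexOutOfBoundsException.
            return fallback;
        }
        return path.substring(index + separator.length());
    }

    public static void main(String[] args) {
        System.out.println(afterLast("work/search/uris.txt", "/", ""));
        System.out.println(afterLast("uris.txt", "/", "uris.txt"));
    }
}
```

Returning a fallback only hides the symptom.  The better fix is still to
find out why the separator is missing from the path in the first place,
which usually points back at the configuration.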

> > > Is there a way to get an index without crawling?
> >Yes.  That was the point of the How-To.  I wanted to filter the
> >results based on language and  filepath, and that is easier working
> >from Lenya's XML than from HTML.  It seems silly to crawl the website
> >and put a copy on the hard drive, when all the contents are already on
> >the hard drive in a much better format.
> How embarrassing. I read the stuff several times, but obviously didn't
> understand it. Were the red lines on your site
> http://solprovider.com/lenya/search already there yesterday? Or is
> this a tribute to my reading-over-without-understanding? This line in the
> Apache how-to would spare dudes like me a lot of time.

Most of the changes are a tribute to you!  The only other change was
the link to the Linux version.  Documentation should be an interactive
process (which is why wikis are so popular).  I will ask Gregor to
update the official How-To when we are done.

> Sorry for BSE
Please define "BSE" (if it is allowed in polite company.)

---
Why do you need the crawled HTML version of Search?  Should I add
reasons about when my version is not appropriate?

solprovider
