ntlm - options overview

2006-11-25 Thread Tomi NA

I came across an interesting overview of ntlm authentication
possibilities at http://www.oaklandsoftware.com/papers/ntlm.html

I thought I'd just mention it here in case anyone who knows how nutch
authentication works "under the hood" has anything to say about the
listed options.
The solution that's usually mentioned when talking about nutch and
ntlm authentication is using a ntlm proxy, but it's basically just a
make-do solution. I'm mostly interested in how the projects listed in
the Oakland Software paper could be employed to do the job.
Oh, and one more thing: is NTLM support just a matter of porting the
protocol to java? As I understand it, samba implements the protocol in C,
ntlmaps in python...

...anyway, enough of my ramblings...

Cheers,
t.n.a.


Re: depth limitation

2006-11-16 Thread Tomi NA

2006/11/16, [EMAIL PROTECTED] <[EMAIL PROTECTED]>:

I have added depth limitation for version 0.7.2. If anyone is interested, I
can contribute it.


I am using depth limitation in 0.8.1, but I'm looking at 0.7.2 as the next
version I'll work with, so I'm very interested.

t.n.a.


Re: Strategic Direction of Nutch

2006-11-13 Thread Tomi NA

2006/11/13, carmmello <[EMAIL PROTECTED]>:

Hi,
Nutch, from version 0.8 is, really, very, very slow, using a single machine,
to process data, after the crawling.  Compared with Nutch 0.7.2 I would say,
...
this series.  I don't believe that there are many Nutch users, in the real
world of searching, with a farm of computers.  I, for myself, have already


Ditto, on both points.
Furthermore, I'd say I'm much more likely to deliver 10 single machine
nutch setups than a single system with 10 nodes. I believe the same
goes for a number of other users.

I had a look at the hadoop code and, well, it'd take a week (probably
an optimistic estimate) just to get acquainted with selected points of
interest, leaving a lot unknown. And this is just to get started. At
the moment, I can't justify a possibly high-risk, multi-week effort to
investigate where the bottleneck is and find a workable solution - I
can only imagine how this problem would look to someone without any
prior knowledge of distributed systems and/or indexing
technology...
...in the meantime, I suspect we might see something much more
reasonable in the mid-term: a lot of useful code back-ported to
0.7.2, doing an excellent job on installations with one or a
handful of servers.

t.n.a.


Re: Nutch for dotNet

2006-11-12 Thread Tomi NA

2006/11/11, Ha ward <[EMAIL PROTECTED]>:

I'm a newbie. I wonder if there is a Nutch implementation for dotnet.
Can someone assist?


As far as I know, the only existing nutch implementation is in java.
Still, you can do a lot with nutch without going under the hood, i.e.
using the available configuration options.

Cheers,
t.n.a.


Re: .7x -> .8x

2006-11-03 Thread Tomi NA

2006/11/3, Josef Novak <[EMAIL PROTECTED]>:

Hi,

   Very short question (hopefully).  Is it possible to get bin/nutch
fetch to print a log of the pages being downloaded to the command
terminal?  I have been using 0.7.2 up until now; in that version the
fetch command outputs errors and the names of urls that the fetcher is
attempting to download.  Where is this info in .8.1?  (and why did
this change?)

   If I missed something, and in fact nothing has changed, apologies
for the inconvenience.


tail -f logs/hadoop.log (under your nutch directory) might be what you'd
like to see.

Cheers,
t.n.a.


Re: returning a description of a returned document

2006-10-29 Thread Tomi NA

2006/10/29, Cristina Belderrain <[EMAIL PROTECTED]>:

Hi Tomi,

please take a look at the following tutorial:

   http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html

Apparently, Nutch's search application already shows hit summaries...
Anyway, you can always retrieve each summary programmatically using a
NutchBean instance: please see the sample code towards the end of the
tutorial.


Silly me, I should have looked at the nutch UI .jsps right away: the
thing is, I've been working exclusively on intranet shared folder
searches for some time now and can't explain it (yet), but it seems
that none of the indexed documents have a summary. I only asked the
question in the first place because I've never really noticed a single
summary in the search hits.

I'll look into it and see what kind of explanation I come up with.
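
For reference, the programmatic route from the tutorial looks roughly like
this. This is only a sketch from memory of the 0.8 API (untested); the query
comes from the command line and the index location is taken from searcher.dir:

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.searcher.Hit;
import org.apache.nutch.searcher.HitDetails;
import org.apache.nutch.searcher.Hits;
import org.apache.nutch.searcher.NutchBean;
import org.apache.nutch.searcher.Query;
import org.apache.nutch.util.NutchConfiguration;

public class SummaryCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();
    NutchBean bean = new NutchBean(conf);      // uses searcher.dir
    Query query = Query.parse(args[0], conf);
    Hits hits = bean.search(query, 10);
    for (int i = 0; i < hits.getLength(); i++) {
      Hit hit = hits.getHit(i);
      HitDetails details = bean.getDetails(hit);
      System.out.println(details.getValue("url"));
      // this is where the (possibly empty) summary would show up
      System.out.println(bean.getSummary(details, query));
    }
  }
}

Running something like that against the intranet index should show quickly
whether the summaries are really empty or just not rendered by the .jsps.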

Thanks, Cristina.

t.n.a.


returning a description of a returned document

2006-10-28 Thread Tomi NA

Is there a way to have nutch return some hit context (a la google) to
better identify the hit?
For example, if I search for "nutch", a link pointing to
"http://lucene.apache.org/nutch/"; would be followed by the following
context:
"This is the first *Nutch* release as an Apache Lucene sub-project.
... *Nutch* is a two-year-old open source project, previously hosted
at Sourceforge and ..."

t.n.a.


Re: Fetching outside the domain ?

2006-10-25 Thread Tomi NA

2006/10/23, Andrzej Bialecki <[EMAIL PROTECTED]>:

Tomi NA wrote:
> 2006/10/18, [EMAIL PROTECTED] <[EMAIL PROTECTED]>:
>
>> Btw we have some virtual local hosts, how does the
>> db.ignore.external.links
>> deal with that ?
>
> Update:
> setting db.ignore.external.links to true in nutch-site (and later also
> in nutch-default as a sanity check) *doesn't work*: I feed the crawl
> process a handful of URLs and can only helplessly watch as the crawl
> spreads to dozens of other sites.

Could you give an example of a root URL, which leads to this symptom
(i.e. leaks outside the original site)?


I'll try to find out exactly where the crawler starts to run loose as
I have several web sites in my initial URL list.


> In answer to your question, it seems pointless to talk about virtual
> host handling if the elementary filtering logic doesn't seem to
> work... :-\

Well, if this logic doesn't work it needs to be fixed, that's all.


Won't argue with you there.

t.n.a.


Re: crawling sites which require authentication

2006-10-23 Thread Tomi NA

2006/10/14, Tomi NA <[EMAIL PROTECTED]>:

2006/10/14, Toufeeq Hussain <[EMAIL PROTECTED]>:

> From internal tests with ntlmaps + Nutch the conclusion we came to was
> that though it "kinda-works" it puts a huge load on the Nutch server
> as ntlmaps is a major memory-hog and the mixture of the two leads to
> performance issues. For a PoC this will do but for
> production-deployments I would not suggest one goes the ntlmaps way.
>
> An alternative would be to have a separate ntlmaps server, a dedicated
> machine acting as the NTLM proxy for the Nutch-box which sits behind
> it.

I haven't noticed the added resource drain, but then again, I haven't
really tested all that much: the constraints on the particular project I
implemented the approach in weren't very strict.
I'll keep an eye on the CPU usage.


* Update *

ntlmaps is really every bit as sluggish as Toufeeq led me to believe,
routinely taking up to 85% of the CPU. It doesn't appear deterministic,
though: right now it's barely noticeable, using less than 10% of the CPU
power.

Toufeeq, could you say anything more on the topic of nutch in-built
NTLM authentication support?

t.n.a.


Re: Fetching outside the domain ?

2006-10-23 Thread Tomi NA

2006/10/18, [EMAIL PROTECTED] <[EMAIL PROTECTED]>:


Btw we have some virtual local hosts, how does the db.ignore.external.links
deal with that ?


Update:
setting db.ignore.external.links to true in nutch-site (and later also
in nutch-default as a sanity check) *doesn't work*: I feed the crawl
process a handful of URLs and can only helplessly watch as the crawl
spreads to dozens of other sites.

In answer to your question, it seems pointless to talk about virtual
host handling if the elementary filtering logic doesn't seem to
work... :-\

t.n.a.


Re: Fetching outside the domain ?

2006-10-18 Thread Tomi NA

2006/10/18, Frederic Goudal <[EMAIL PROTECTED]>:


Hello,

I'm beginning to play with nutch to index our own web site.
I have done a first crawl and I have tried the recrawl script.
While fetching I have lines like this:

fetching http://www.yourdictionary.com/grammars.html
fetching http://www.cours.polymtl.ca/if540/hiv_00.htm
fetching http://www.maxim-ic.com/quick_view2.cfm/qv_pk/

but my crawl-urlfilter.txt is:

# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/.+?)/.*?\1/.*?\1/

# accept hosts in MY.DOMAIN.NAME
#+^http://([a-z0-9]*\.)*enseirb.fr/
+^http://www.enseirb.fr/

# skip everything else
-.

So... I think I'm missing something.


Frederic, what exactly is the problem? You'd like the recrawl not to
leave your web site? You can do that very easily: set the
"db.ignore.external.links" property in nutch-site.xml to "true" (you
can copy the xml property from nutch-default.xml and then change the
value to "true").


Btw, as a beginner, totally ignorant of java, and a time-starved system
engineer in charge of too many things, is there any doc that really explains
the behaviour of nutch ?


A good place to read about nutch is the nutch wiki:
http://wiki.apache.org/nutch/

Cheers,
t.n.a.


Re: crawling sites which require authentication

2006-10-14 Thread Tomi NA

2006/10/14, Toufeeq Hussain <[EMAIL PROTECTED]>:


From internal tests with ntlmaps + Nutch the conclusion we came to was
that though it "kinda-works" it puts a huge load on the Nutch server
as ntlmaps is a major memory-hog and the mixture of the two leads to
performance issues. For a PoC this will do but for
production-deployments I would not suggest one goes the ntlmaps way.

An alternative would be to have a separate ntlmaps server, a dedicated
machine acting as the NTLM proxy for the Nutch-box which sits behind
it.


I haven't noticed the added resource drain, but then again, I haven't
really tested all that much: the constraints on the particular project I
implemented the approach in weren't very strict.
I'll keep an eye on the CPU usage.


The right way would be to use the in-built authentication features of
Nutch for Auth based crawling.


Nutch supports ntlm authentication? I see I've got some reading to
catch up on...

t.n.a.


Re: crawling sites which require authentication

2006-10-13 Thread Tomi NA

2006/10/13, Guruprasad Iyer <[EMAIL PROTECTED]>:

Hi Tomi,

"using a ntlmaps proxy"
How do I get this proxy?

"You tell nutch to use the proxy and you provide the proxy with adequate
access priviledges."
How do I do this? Can you elaborate?

I am a new Nutch user and am very much in the learning phase. Thanks.

Cheers,
Guruprasad


Guruprasad,
please use "reply-all" so your messages end up on the list as well. As
far as ntlmaps is concerned, you can read about it here
http://ntlmaps.sourceforge.net/ or download it here
http://sourceforge.net/project/showfiles.php?group_id=69259&package_id=68110&release_id=303755.
If you're using linux, chances are all you need to do is issue a
command like "emerge ntlmaps" or "apt-get install ntlmaps".
Read the ntlmaps documentation on how you set it up or just follow the
comments in its config file: /etc/ntlmaps/server.cfg.
The only thing left for you to do is to edit the nutch-site.xml file
and set the http.proxy.host to (probably) "localhost" and
http.proxy.port to whatever port you set the proxy to listen on.

Looking at what I've written, I should have just said google is your
friend...ah well, what's done is done. :)

Hope this helps,
t.n.a.


Re: crawling sites which require authentication

2006-10-12 Thread Tomi NA

2006/10/12, Guruprasad Iyer <[EMAIL PROTECTED]>:

Hi,

I need to know how to crawl (intranet) sites which require authentication.
One suggestion was that I replace protocol-http with protocol-httpclient in
the value field of plugin.includes tag in the nutch-default.xml file.
However, this did not solve the problem.
Can you help me out on this? Thanks.


I don't know what kind of authentication scheme you're up against, but
recently I had to work with NTLM authentication in an intranet and
worked around it using an ntlmaps proxy. You tell nutch to use the
proxy and you provide the proxy with adequate access privileges. As
simple as that, and it works like a charm. I imagine the nutch proxy
support could be extended so that e.g. it selects a proxy based on
regexp matching of urls. That way it would be possible to provide all
the login/password pairs needed to crawl all of the sites you're
interested in.

t.n.a.


Re: Lucene query support in Nutch

2006-10-10 Thread Tomi NA

2006/10/10, Cristina Belderrain <[EMAIL PROTECTED]>:

On 10/9/06, Tomi NA <[EMAIL PROTECTED]> wrote:

> This is *exactly* what I was thinking. Like Stefan, I believe the
> nutch analyzer is a good foundation and should therefore be extended
> to support the "or" operator, and possibly additional capabilities
> when the need arises.
>
> t.n.a.

Tomi, why would you extend Nutch's analyzer when Lucene's analyzer,
which does exactly what you want, is already there?


Stefan basically answered that question, but my opinion is
that Nutch's analyzer does its job well and only lacks one obvious
query capability: the "or" search. The fact that several users here
need this kind of functionality suggests it's not the beginning of a
landslide of new required capabilities. Lucene's analyzer, on the
other hand, is completely inadequate in this respect if search is
necessarily bound to a single (content) field.
In conclusion, my position is pragmatic: I welcome the simplest
solution to implement the "or" search. I just believe that it'd be
easiest to do that by extending the nutch Analyzer.
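
For what it's worth, here's a minimal illustration of the Lucene side of the
argument (Lucene 1.9/2.0-era API; the "content" field name is just an
assumption). Lucene's QueryParser already understands "OR", but every
unqualified term is bound to the single default field it is constructed with:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

public class OrQuerySketch {
  public static void main(String[] args) throws Exception {
    // every bare term ends up in the "content" field
    QueryParser parser = new QueryParser("content", new StandardAnalyzer());
    Query q = parser.parse("nutch OR lucene");
    System.out.println(q); // prints: content:nutch content:lucene
  }
}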

t.n.a.


Re: Lucene query support in Nutch

2006-10-09 Thread Tomi NA

2006/10/8, Stefan Neufeind <[EMAIL PROTECTED]>:


if it's not the full feature-set, maybe most people could live with it.
But basic boolean queries I think were the root for this topic. Is there
an "easier" way to allow this in Nutch as well instead of throwing quite
a bit away and using the Lucene-syntax? As has just been pointed out: It


This is *exactly* what I was thinking. Like Stefan, I believe the
nutch analyzer is a good foundation and should therefore be extended
to support the "or" operator, and possibly additional capabilities
when the need arises.

t.n.a.


Re: Which Operating-System do you use for Nutch

2006-09-27 Thread Tomi NA

On 9/26/06, Jim Wilson <[EMAIL PROTECTED]> wrote:

I'd do it, but I'm too busy being consumed with worries about the lack of
support for HTTP/NTLM credentials and SMB fileshare indexing.

Arrrgg - tis another sad day in the life of this pirate.


We seem to share the same problems...they haven't gone and knocked me
down...yet, but I expect they might fairly soon.
For now, I'm placing the shares under an IIS umbrella: I direct the
crawl to the root of the web and serve http links to the files. IIS
(somehow) takes care of A/D authorization: once the user clicks on a
link, IIS checks the user's credentials and matches them to the file's ACL
(I suppose). The downsides? Even though I could theoretically allow
users with sufficient privileges to write files, I can only
provide WebDAV access. What's more, I'm stuck with
IIS/Windows/whatever. I'd much rather let the customer decide what he
wants to run on his servers. Finally, distributed network shares (i.e.
shares not shared from the server) make the problem/solution
significantly more complicated.
Alternatively, you could try the file protocol, generating
"browser unfriendly" file:// links, which opens up its own Pandora's box of
security issues...so, how do you go about it?

t.n.a.


Re: [ANNOUNCE] Nutch 0.8.1 available

2006-09-27 Thread Tomi NA

On 9/27/06, Sami Siren <[EMAIL PROTECTED]> wrote:

Nutch Project is pleased to announce the availability of 0.8.1 release
of Nutch - the open source web-search software based on lucene and hadoop.

The release is immediately available for download from:

 http://lucene.apache.org/nutch/release/

Nutch 0.8.1 is a maintenance release for 0.8 branch and fixes many
serious bugs discovered in previous release. For a list of changes see

http://www.apache.org/dist/lucene/nutch/CHANGES-0.8.1.txt


Haven't seen it in action yet, but it seems some serious errors got
fixed in this version.


A big thanks to everybody who participated and made this release possible.


Ditto!

t.n.a.


Re: Which Operating-System do you use for Nutch

2006-09-26 Thread Tomi NA

On 9/25/06, Jim Wilson <[EMAIL PROTECTED]> wrote:


You can get it working on Windows if you're willing to work for it.  To use
Nutch OOTB, you have to install Cygwin since the provided Nutch launcher is
written in Bash.

Members of the community have provided alternatives, such as this Python
launcher: http://wiki.apache.org/nutch/CrossPlatformNutchScripts



The way I see it, the existing shell scripts are not a permanent
solution. That said, python is better than (ba)sh, but java would be
even better (even though fs operations are not one-liners). My (very
superficial, but still) experience with bean shell suggests that it
might be a good long-term, platform-independent solution.
It will probably happen when someone scratches that particular itch,
though, meaning its author is going to be someone developing on
windows. :)

t.n.a.


Re: Forcing refetch and index of specified files

2006-09-22 Thread Tomi NA

On 9/21/06, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:

Benjamin Higgins wrote:
> How can I instruct Nutch to refetch specific files and then update the
> index
> entries for those files?
>
> I am indexing files on a fileserver and I am able to produce a report of
> changed files about every 30 minutes.
>
> I'd like to feed that into Nutch at approximately the same interval so
> I can
> keep the index up-to-date.
>
> Thanks.

Conceptually this should be easy - you just need to generate a fetchlist
directly from your list of changed files, and not through
injecting/generating from a crawldb.

I wrote a tool for 0.7 which does this - look at the NUTCH-68 issue in
JIRA. This would have to be ported to 0.8 - check how Injector does this
in the first stage, when it converts a simple text file to a MapFile.


Would an algorithm like this make any sense:
for each URL in txt file
 if URL in crawldb
   update the date to "now()+1" in it's crawl datum
 else
   use existing inject logic to inject the new url

After that, it's only a matter of running the recrawl script with -adddays 0.
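
To make the idea a bit more concrete, the merge decision could look something
like this (a sketch only; the crawldb lookup and the MapReduce plumbing are
left out, and the class name is made up):

import org.apache.nutch.crawl.CrawlDatum;

public class RefetchMergeSketch {
  // "existing" is the datum already in the crawldb (or null if the URL is
  // new); "injected" is a freshly created entry for a URL from the txt file
  static CrawlDatum merge(CrawlDatum existing, CrawlDatum injected) {
    if (existing != null) {
      // URL already known: make it due for fetching right away
      existing.setFetchTime(System.currentTimeMillis());
      return existing;
    }
    // new URL: keep the freshly injected entry
    return injected;
  }
}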

t.n.a.


Re: Nutch 0.8 - MS Word document parse failure : "Can't be handled as micrsosoft document. java.util.NoSuchElementException"

2006-09-22 Thread Tomi NA

On 9/22/06, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:


You are not the first one to consider using OO.org for Word conversion.
However, this solution brings with it a large dependency (ca 250MB
installed), which requires proper installation; and also the UNO
interface is reported to be relatively slow - I'm not sure if it's the
inherent slowness of the conversion, or the problem with the (lack of)
concurrency, i.e. a single OO instance may convert only a single
document at a time ...


250 MB for the complete office suite or just UNO?
As far as the concurrency problem is concerned, has anyone asked
OO.org developers what that's about?

t.n.a.


Re: Nutch 0.8 - MS Word document parse failure : "Can't be handled as micrsosoft document. java.util.NoSuchElementException"

2006-09-22 Thread Tomi NA

On 9/22/06, Trym B. Asserson <[EMAIL PROTECTED]> wrote:


Any other suggestions? Tomi, you said you'd had difficulties too with
certain MS documents, did you manage to find a work-around or did you
just have to ignore these documents? So far we've only concentrated on
using the plugins in Nutch 0.8 as they're provided, so we have no
experience with OO/UNO. Given that POI seems to deliver reasonably good
parsing features for MS formats, we're a bit reluctant to throw it out
just yet.


No, I haven't found a work-around yet: it seemed too much work at the
moment. Right now I'm thinking it may not be necessary to dump POI in
favour of UNO (although I believe it would be better in the long
term): maybe it would be possible to work around the exceptions and
still get (at least) most of the text content.

I'll probably have a look at it one of these days, although I'm a bit
sceptical: wouldn't the original plugin authors have already fixed it
if they could help it?

t.n.a.


Re: Automatic crawling

2006-09-21 Thread Tomi NA

On 9/21/06, Jacob Brunson <[EMAIL PROTECTED]> wrote:

On 9/21/06, Gianni Parini <[EMAIL PROTECTED]> wrote:
> -Is it possible to have an automatic recrawling? have i got to write
> my own application by myself? I need an application running in
> background that re-crawl my intranet site 2-3 times a week..

On the nutch wiki you will find an intranet recrawl script.  That
probably will work for you.  However, I think the script has a problem
with duplicating segment data during the mergesegs step, but I've
asked about it here and haven't had any confirmations.


Well, I can confirm my index grew to ~5 GB from ~1.5 GB after (if I
remember correctly) 2 recrawls.
It doesn't solve the problem I was after anyway, as it only indexes
pages according to the time of the last crawl, rather than crawling
everything, checking whether the content has a newer
modification/creation date and indexing only that (typical intranet
scenario). But I'm running like a madman in the opposite direction of
the topic: please ignore me. :)

t.n.a.


Re: Nutch 0.8 - MS Word document parse failure : "Can't be handled as micrsosoft document. java.util.NoSuchElementException"

2006-09-21 Thread Tomi NA

On 9/21/06, Jim Wilson <[EMAIL PROTECTED]> wrote:

I haven't had this particular problem, but here's something to consider:
After you remove the TextBox objects you have to re-save the document.  Is
the new document the same version as the previous one?  By this I mean, the
same Word version (97, 2000, etc).


I've had some difficulties with misc MS Office documents and it makes
me wonder: would using OpenOffice.org to parse the files make more
sense than using POI? OO.org uses the UNO framework which has a Java
API so conceivably, anything OO.org understands nutch would
understand.
The fact that OO.org is able to parse MS formats fairly well (better
than most other libraries/applications) suggests that it'd give the
best results if at some point nutch/lucene supported weighted
relations between a term and a field/document. It would make e.g.
words appearing in the headers more important than e.g. words in
footnotes.
Returning to the subject of parsing MS document formats at all, has
anyone considered/attempted using OO.org UNO to parse them? Are there
any major shortcomings to the approach?

t.n.a.


Re: Changing page injection behavior in Nutch 0.8

2006-09-20 Thread Tomi NA

On 9/20/06, Tomi NA <[EMAIL PROTECTED]> wrote:

On 9/20/06, Benjamin Higgins <[EMAIL PROTECTED]> wrote:
> In Nutch 0.7, I wanted to change Nutch's behavior such that when I inject a
> file it will add the page, even if it is already present.
>
> I did this because I can prepare a list of changed files that I have on my
> intranet and want Nutch to reindex them right away.
>
> I made a change (suggested by Howie Wang) to
> org.apache.nutch.db.WebDBInjector by changing the addPage method.  I
> replaced the line:
>
>   dbWriter.addPageIfNotPresent(page);
>
> with:
>
>   dbWriter.addPageWithScore(page);
>
> Question: I'm moving to Nutch 0.8 and I'd like similar behavior, but I don't
> know where to put them as a lot of code has changed (and there's no longer a
> WebDBInjector.java file).
>
> How can I accomplish this?  If there is a more appropriate way to do this
> please let me know that also.

I'm interested in this problem as well. Haven't had a chance yet to
look into it, though.


I think the crawl.Injector.InjectorReducer class is the one we're looking for.
Would this do the trick?

  // output.collect(key, (Writable)values.next()); // just collect first value
  while (values.hasNext()) {
    output.collect(key, (Writable) values.next());
  }

I can't verify it, as an IOException is giving me trouble (possibly because
I checked out 0.9-dev); someone else might have more luck with the
0.8(.1?) sources.

t.n.a.


Re: Changing page injection behavior in Nutch 0.8

2006-09-19 Thread Tomi NA

On 9/20/06, Benjamin Higgins <[EMAIL PROTECTED]> wrote:

In Nutch 0.7, I wanted to change Nutch's behavior such that when I inject a
file it will add the page, even if it is already present.

I did this because I can prepare a list of changed files that I have on my
intranet and want Nutch to reindex them right away.

I made a change (suggested by Howie Wang) to
org.apache.nutch.db.WebDBInjector by changing the addPage method.  I
replaced the line:

  dbWriter.addPageIfNotPresent(page);

with:

  dbWriter.addPageWithScore(page);

Question: I'm moving to Nutch 0.8 and I'd like similar behavior, but I don't
know where to put them as a lot of code has changed (and there's no longer a
WebDBInjector.java file).

How can I accomplish this?  If there is a more appropriate way to do this
please let me know that also.


I'm interested in this problem as well. Haven't had a chance yet to
look into it, though.

t.n.a.


Re: Stemming and Synonyms

2006-09-19 Thread Tomi NA

On 9/19/06, Gonçalo Gaiolas <[EMAIL PROTECTED]> wrote:

Hi everyone!



I'm using version 7.2 of Nutch and I'm very happy with it. Want to send a
big thumbs up for you guys behind it!


Welcome, our honoured guest from the future! :) 7.2 probably includes
natural language processing and spawns a great deal of controversy as
to whether it can be considered "intelligent" or just very good at
smalltalk. :)


Having said that, I'd like to make my users search experience as good as
possible. To do that, I need to solve two little "problems" :

-  Stemming – in my index I have lots of plurals and verbal forms
that prevent my users from sometimes finding the right results. I've been
looking around and it seems that the only stemming implementation available
for nutch is described in the wiki and requires extensive changes in Nutch
code, something I'd like to avoid. Can somebody help me ?

-  Synonyms – Ok, I don't really need synonyms. What I need is a way
to specify that Image Converter should be equal to ImageConverter, or
WebBlock should be the same as web block. How can I do this? This one is
really impacting the search quality :-)


I guess you need a different Analyzer. There's a list at
http://lucene.apache.org/java/docs/api/org/apache/lucene/analysis/Analyzer.html
You could also write your own to best represent the data you have.
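
If you end up writing your own, a stemming analyzer is only a few lines
against the Lucene 1.9/2.0-era API (English-only Porter stemming; a sketch,
untested, and wiring it into nutch is the part the wiki page describes):

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.PorterStemFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public class StemmingAnalyzer extends Analyzer {
  public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream stream = new StandardTokenizer(reader);
    stream = new LowerCaseFilter(stream);
    stream = new PorterStemFilter(stream); // "converters" and "converter" stem alike
    return stream;
  }
}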

Cheers,
t.n.a.


Re: how to combine two run's result for search

2006-09-18 Thread Tomi NA

On 9/18/06, Zaheed Haque <[EMAIL PROTECTED]> wrote:

Hi:

I have just checked your flash movie.. quick observation you are
running tomcat 4.1.31 and there is nothing you are doing that seems
wrong. Anyway after starting the servers can you search using the
following command

bin/nutch org.apache.nutch.search.NutchBean bobdocs

what do you get .. and what's in the logfile?

If you get something, then it's probably tomcat 4.1.31 that's the problem.


[EMAIL PROTECTED] ~/posao/nutch/novo/nutch-0.8 $ ./bin/nutch
org.apache.nutch.search.NutchBean bobdocs
Exception in thread "main" java.lang.NoClassDefFoundError:
org/apache/nutch/search/NutchBean
[EMAIL PROTECTED] ~/posao/nutch/novo/nutch-0.8 $

It doesn't really tell me if tomcat is the problem, does it? I've
added debug statements to the nutch script so I can check if my
CLASSPATH is correct. I have no idea why nutch can't find the
NutchBean class.
I have, however, checked out the nutch 0.8 and hadoop 0.5 sources from
the svn repository, imported them into an eclipse project and used the
DistributedSearch Client and Server "public static void main" methods.
My experiments showed that my problem is not with tomcat or the nutch
web UI, because the DistributedSearch.Client also returned 0 results
regardless of the query or combination of indexes. I've managed to
confirm that the Client sees all the search servers, but it simply
fails to return any results.
I also ran across something in the logs that I didn't see before. The
following is periodically output (regardless of what I'm doing in
eclipse, as long as the Client thread is active):

2006-09-18 13:55:30,352 INFO  searcher.DistributedSearch - STATS: 2
servers, 2 segments.
2006-09-18 13:55:40,539 INFO  searcher.DistributedSearch - Querying
segments from search servers...
2006-09-18 13:55:40,559 INFO  searcher.DistributedSearch - STATS: 2
servers, 2 segments.
2006-09-18 13:55:50,564 INFO  searcher.DistributedSearch - Querying
segments from search servers...

Going back to square one...am I building the crawls correctly?
./bin/nutch crawl urls -threads 15 -topN 10 -depth 3

Is it the fact that I'm doing an intranet crawl every time, instead of
the multi-step whole web crawl? What else, what am I missing?

t.n.a.


Re: how to combine two run's result for search

2006-09-18 Thread Tomi NA

On 9/16/06, Tomi NA <[EMAIL PROTECTED]> wrote:

On 9/15/06, Tomi NA <[EMAIL PROTECTED]> wrote:
> On 9/14/06, Zaheed Haque <[EMAIL PROTECTED]> wrote:
> > > Thats the way I set it up at first.
> > > This time, I started with a blank slate, unpacked nutch and tomcat,
> > > unpacked nutch-0.8.war into the webapps/ROOT and left the deployed app
> > > untouched.
> >
> > The above means that you have an empty nutch-site.xml under
> > webapps/ROOT and you have a nutch-default.xml with a searcher.dir
> > property = crawl. Am I correct? cos you left the deployed web app
> > untouched? no?
>
> You are correct, the searcher.dir property is set to "crawl".
>
> > > I then pointed the "crawl" symlink in the current dir to point to the
> > > "crawls" directory, where my search-servers.txt (with two "localhost
> > > port" entries). In the "crawls" dir I also have two nutch-built
> > > indexes.
> >
> > If I remember it correctly I had some trouble with symlink once but I
> > don't exactly remember why.. maybe you can try without symlink..
>
> I tried renaming the directory "crawls" to "crawl", then running the
> servers like so:
> ./nutch-0.8/bin/nutch server 8192 crawl/crawl1 &
> ./nutch-0.8/bin/nutch server 8193 crawl/crawl2 &
>
> > > Now, I start nutch distributed search servers on each index and start
> > > tomcat from the dir containing the "crawl" link. I get no results at
> > > all.
> > > If I change the link to point to "crawls/crawl1", the search works
> >
> > I am guessing the above is also a symlink.. hmm.. maybe it has
> > something to do with distributed search and symlink.. no?
>
> It doesn't appear to be the problem. I tried without symlinks without success.
>
> I'm going to document the problem better today, so maybe that will help.
> I'm having trouble believing what I'm trying to achieve is so
> problematic...nevertheless, I appreciate your  effort so far.

I don't think I can document the problem better than I have here:
http://tna.sharanet.org/problem.html

It's a 2-minute flash movie showing exactly what I'm doing. I'd very
much appreciate anyone taking a look at it, but especially Zaheed.
The only thing I forgot to display in the movie is my search-servers.txt:
localhost 8192
localhost 8193

Now, what am I doing wrong?

t.n.a.


Anyone? Renaud, Zaheed, Feng?

t.n.a.


Re: java.lang.NullPointerException

2006-09-18 Thread Tomi NA

On 9/18/06, NG-Marketing, M.Schneider <[EMAIL PROTECTED]> wrote:

I figured it out. I used in my nutch-site.xml the following config


<property>
  <name>searcher.max.hits</name>
  <value>2048</value>
</property>


If I change the value to nothing ("") it all works fine.  It took me a couple
of hours to figure it out. This might be a bug.


Is the specific value (2048) a problem or does nutch throw a NPE
regardless of the value you use?

t.n.a.


Re: java.lang.NullPointerException

2006-09-17 Thread Tomi NA

On 9/17/06, NG-Marketing, Matthias Schneider <[EMAIL PROTECTED]> wrote:

Hello List,



I installed nutch 0.8 and I can fetch and index documents, but I cannot
search them. I get the following error:



StandardWrapperValve[jsp]: Servlet.service() for servlet jsp threw exception

java.lang.NullPointerException

at org.apache.nutch.searcher.LuceneQueryOptimizer$LimitedCollector.<init>(LuceneQueryOptimizer.java:108)

at org.apache.nutch.searcher.LuceneQueryOptimizer.optimize(LuceneQueryOptimizer.java:244)

at org.apache.nutch.searcher.IndexSearcher.search(IndexSearcher.java:95)

at org.apache.nutch.searcher.NutchBean.search(NutchBean.java:249)
[...snip...]


Can't say I ran into a problem like that, but have you checked if your
index is valid, i.e. can you open the index with luke
(http://www.getopt.org/luke/) and run queries?
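
If you'd rather check from code than from Luke's GUI, something as small as
this (plain Lucene; the index path is just an example) will at least tell you
whether the index opens and how many documents it holds:

import org.apache.lucene.index.IndexReader;

public class IndexCheck {
  public static void main(String[] args) throws Exception {
    IndexReader reader = IndexReader.open("crawl/index");
    System.out.println("docs in index: " + reader.numDocs());
    reader.close();
  }
}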

t.n.a.


Re: how to combine two run's result for search

2006-09-16 Thread Tomi NA

On 9/15/06, Tomi NA <[EMAIL PROTECTED]> wrote:

On 9/14/06, Zaheed Haque <[EMAIL PROTECTED]> wrote:
> > Thats the way I set it up at first.
> > This time, I started with a blank slate, unpacked nutch and tomcat,
> > unpacked nutch-0.8.war into the webapps/ROOT and left the deployed app
> > untouched.
>
> The above means that you have an empty nutch-site.xml under
> webapps/ROOT and you have a nutch-default.xml with a searcher.dir
> property = crawl. Am I correct? cos you left the deployed web app
> untouched? no?

You are correct, the searcher.dir property is set to "crawl".

> > I then pointed the "crawl" symlink in the current dir to point to the
> > "crawls" directory, where my search-servers.txt (with two "localhost
> > port" entries). In the "crawls" dir I also have two nutch-built
> > indexes.
>
> If I remember it correctly I had some trouble with symlink once but I
> don't exactly remember why.. maybe you can try without symlink..

I tried renaming the directory "crawls" to "crawl", then running the
servers like so:
./nutch-0.8/bin/nutch server 8192 crawl/crawl1 &
./nutch-0.8/bin/nutch server 8193 crawl/crawl2 &

> > Now, I start nutch distributed search servers on each index and start
> > tomcat from the dir containing the "crawl" link. I get no results at
> > all.
> > If I change the link to point to "crawls/crawl1", the search works
>
> I am guessing the above is also a symlink.. hmm.. maybe it has
> something to do with distributed search and symlink.. no?

It doesn't appear to be the problem. I tried without symlinks without success.

I'm going to document the problem better today, so maybe that will help.
I'm having trouble believing what I'm trying to achieve is so
problematic...nevertheless, I appreciate your  effort so far.


I don't think I can document the problem better than I have here:
http://tna.sharanet.org/problem.html

It's a 2-minute flash movie showing exactly what I'm doing. I'd very
much appreciate anyone taking a look at it, but especially Zaheed.
The only thing I forgot to display in the movie is my search-servers.txt:
localhost 8192
localhost 8193

Now, what am I doing wrong?

t.n.a.


Re: how to combine two run's result for search

2006-09-15 Thread Tomi NA

On 9/14/06, Zaheed Haque <[EMAIL PROTECTED]> wrote:

> Thats the way I set it up at first.
> This time, I started with a blank slate, unpacked nutch and tomcat,
> unpacked nutch-0.8.war into the webapps/ROOT and left the deployed app
> untouched.

The above means that you have an empty nutch-site.xml under
webapps/ROOT and you have a nutch-default.xml with a searcher.dir
property = crawl. Am I correct? cos you left the deployed web app
untouched? no?


You are correct, the searcher.dir property is set to "crawl".


> I then pointed the "crawl" symlink in the current dir to point to the
> "crawls" directory, where my search-servers.txt (with two "localhost
> port" entries). In the "crawls" dir I also have two nutch-built
> indexes.

If I remember it correctly I had some trouble with symlink once but I
don't exactly remember why.. maybe you can try without symlink..


I tried renaming the directory "crawls" to "crawl", then running the
servers like so:
./nutch-0.8/bin/nutch server 8192 crawl/crawl1 &
./nutch-0.8/bin/nutch server 8193 crawl/crawl2 &


> Now, I start nutch distributed search servers on each index and start
> tomcat from the dir containing the "crawl" link. I get no results at
> all.
> If I change the link to point to "crawls/crawl1", the search works

I am guessing the above is also a symlink.. hmm.. maybe it has
something to do with distributed search and symlink.. no?


It doesn't appear to be the problem. I tried without symlinks without success.

I'm going to document the problem better today, so maybe that will help.
I'm having trouble believing what I'm trying to achieve is so
problematic...nevertheless, I appreciate your  effort so far.

t.n.a.


Re: 0.8 Intranet Crawl Output/Logging?

2006-09-14 Thread Tomi NA

On 9/14/06, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:

Everyone, thanks for the help with this.  I hope to return the
assistance, once I am more familiar with 0.8.  I am using tail -f now to
monitor my test crawls.  It also looks like you can use
conf/hadoop-env.sh to redirect log file output to a different location
for each of your configurations.

One follow up question:
Now that I can actually see the log, I am finding some of the output
rather annoying/noisy.  Specifically, I am referring to the Registered
Plugins and Registered Extension-Points output.  It's nice to see that
once at crawl start, but not with every step of the crawl.

So does any one know if I can disable that output?  Here's the output to
which I refer:

2006-09-14 14:03:42,852 INFO  plugin.PluginRepository - Plugins: looking
in: /var/nutch/nutch-0.8/plugins
2006-09-14 14:03:43,030 INFO  plugin.PluginRepository - Plugin
Auto-activation mode: [true]
2006-09-14 14:03:43,030 INFO  plugin.PluginRepository - Registered
Plugins:


watch -n 1 "grep -v PluginRepository
/home/wmelo/nutch-0.8/logs/hadoop.log | tail -n 20"

t.n.a.


Re: how to combine two run's result for search

2006-09-14 Thread Tomi NA

On 9/14/06, Zaheed Haque <[EMAIL PROTECTED]> wrote:

On 9/14/06, Tomi NA <[EMAIL PROTECTED]> wrote:
> On 9/5/06, Zaheed Haque <[EMAIL PROTECTED]> wrote:
> > Hi:
>
> I have a problem or two with the described procedure...
>
> > Assuming you have
> >
> > index 1 at /data/crawl1
> > index 2 at /data/crawl2
>
> Used ./bin/nutch crawl urls -dir /home/myhome/crawls/mycrawldir to
> generate an index: luke says the index is valid and I can query it
> using luke's interface.
>
> Does the "searcher.dir" value in nutch-(default|site).xml have any
> impact on the way indexes are created?

No it doesn't have any impact on index creation. searcher.dir value is
for searching only. nutch-site.xml is where you should change..
example...


<property>
  <name>searcher.dir</name>
  <value>/home/myhome/crawls</value>
  <description>
  Path to root of index directories.  This directory is searched (in
  order) for either the file search-servers.txt, containing a list of
  distributed search servers, or the directory "index" containing
  merged indexes, or the directory "segments" containing segment
  indexes.
  </description>
</property>

and the text file should be in this case ...

 /home/myhome/crawls/search-servers.txt


That's the way I set it up at first.
This time, I started with a blank slate, unpacked nutch and tomcat,
unpacked nutch-0.8.war into webapps/ROOT and left the deployed app
untouched.
I then pointed the "crawl" symlink in the current dir at the
"crawls" directory, where my search-servers.txt (with two "localhost
port" entries) lives. In the "crawls" dir I also have two nutch-built
indexes.
Now, I start nutch distributed search servers on each index and start
tomcat from the dir containing the "crawl" link. I get no results at
all.
If I change the link to point to "crawls/crawl1", the search works,
i.e. I get a couple of results. What seems to be the problem is
inserting the distributed search server between the index and tomcat.
Nothing I do makes the least bit of difference. :\

t.n.a.


Re: how to combine two run's result for search

2006-09-14 Thread Tomi NA

On 9/5/06, Zaheed Haque <[EMAIL PROTECTED]> wrote:

Hi:


I have a problem or two with the described procedure...


Assuming you have

index 1 at /data/crawl1
index 2 at /data/crawl2


Used ./bin/nutch crawl urls -dir /home/myhome/crawls/mycrawldir to
generate an index: luke says the index is valid and I can query it
using luke's interface.

Does the "searcher.dir" value in nutch-(default|site).xml have any
impact on the way indexes are created?


In nutch-site.xml
searcher.dir = /data


This is the nutch-site.xml of the web UI?


Under /data you have a text file called search-server.txt (I think do
check nutch-site search.dir description please)


/home/myhome/crawls/search-servers.txt


In the text file you will have the following

hostname1 portnumber
hostname2 portnumber

example
localhost 1234
localhost 5678


I placed
localhost 12567
(just one instance, to test)


Then you need to start

bin/nutch server 1234 /data/craw1 &

and

bin/nutch server 5678 /data/crawl2 &


did that, using port 12567
./bin/nutch server 12567 /home/mydir/crawls/mycrawldir &


bin/nutch org.apache.nutch.search.NutchBean www

you should see results :-)


I get:

Exception in thread "main" java.lang.NoClassDefFoundError:
org/apache/nutch/search/NutchBean


What's more, I get no results for any query I care to pass via the Web
UI, which suggests the UI isn't connected to the underlying
DistributedSearch server. :\

Any hints, anyone?

TIA,
t.n.a.


Re: 0.8 Intranet Crawl Output/Logging?

2006-09-13 Thread Tomi NA

On 9/13/06, wmelo <[EMAIL PROTECTED]> wrote:

I have the same original doubt.  I know that the log shows this information,
but how do I see things happening in real time, like in nutch 0.7.2, when
you use the crawl command in the terminal?


try something like this (assuming you know what's good for you so you
use a *n*x):
watch -n 1 "tail -n 20 /home/wmelo/nutch-0.8/logs/hadoop.log"

Please replace the path to your "logs" directory to match your
environment and report back if there's a problem.
Hope it helps.

t.n.a.


Re: Windows Native Launching?

2006-09-11 Thread Tomi NA

On 9/11/06, Jim Wilson <[EMAIL PROTECTED]> wrote:

Dear Nutch User Community,

Does anyone have a nutch.bat file to use in the bin directory?  I find it
bemusing that Java (cross-platform) was chosen as the development language,
but the launcher is written in Bash.


As much as I hate to say it, you're right: it doesn't make any
sense to hobble such a great body of platform-independent code with a
couple of short but vital *n*x-only scripts. ...even if we are talking
about windows.
Would it make sense to go java all the way and use groovy or
beanshell? My knowledge of these projects is rather superficial, but
someone else might know more...

t.n.a.


Re: Nutch-site.xml vs Nutch-default.xml

2006-09-09 Thread Tomi NA

On 9/9/06, victor_emailbox <[EMAIL PROTECTED]> wrote:


Hi all,
  I spent a lot of time to figure out why Nutch didn't respond to my
configuration in nutch-site.xml.  I set db.ignore.external.links to true.
It didn't work.  Then I realized that Nutch-default.xml also has same
db.ignore.external.links but it was set to false.  So I set it to true too,
and it works.  Isn't nutch-site.xml supposed to override the setting in
Nutch-default.xml?
Many thanks.


I have no idea what's wrong with your nutch configuration, but yes,
nutch-site overrides nutch-default. Maybe someone else has an
explanation to offer.

t.n.a.


Re: Fetching past Authentication

2006-09-09 Thread Tomi NA

On 9/8/06, Jim Wilson <[EMAIL PROTECTED]> wrote:

Dear Nutch User List,

I am desperately trying to index an Intranet with the following
characteristics

1) Some sites require no authentication - these already work great!
2) Some sites require basic HTTP Authentication.
3) Some sites require NTLM Authentication.
4) No sites require both HTTP and NTLM (only one or the other).
5) The same Username/Password should work on all sites which require either
type of Authentication.
6) For sites requiring NTLM Authentication, the same Domain is always used.
7) If a site requires authentication, but the Username/Password mentioned
above fails, the site doesn't matter and does not need to be fetched/indexed.

My question is this: How can I provide a default Username/Password/Domain
for Nutch to use when answering HTTP or NTLM challenges?

(I really hope all I need is a couple of <property> tags in my
nutch-site.xml, but I'm beginning to doubt it).

I love Nutch, and really want to use it.  Please help if you know the
answer.  Thanks!


I'm also very interested in hearing more on the topic.
The only mention of a solution to (a part of) this problem I found is
http://www.dehora.net/journal/2005/11/nutch_with_basic_authentication.html

t.n.a.


Re: Recrawling (Tomi NA)

2006-09-08 Thread Tomi NA

On 9/8/06, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:

Tomi NA wrote:
> On 9/7/06, David Wallace <[EMAIL PROTECTED]> wrote:
>> Just guessing, but could this be caused by session ids in the URL?  Or
>> some other unimportant piece of data?  If this is the case, then every
>> page would be added to the index when it's crawled, regardless of
>> whether it's already in there, with a different session id.  If this is
>> what's causing your problem, then you need to use the regexp URL
>> normaliser to strip out the session ids.
>
> Nice try but no luck, I'm afraid.
> The complete web is absolutely static. The reason is that we've set up
> IIS (I'm not too happy choosing IIS over apache) to serve files from a
> shared directory on the same server, the rationale beeing that we'd
> rather have http://-type links than file://.
>> From what I've seen in the logs, I don't see URLs varying so I'm still
> at square one. Still, thanks for the effort. If you have any other
> ideas, I'm eager to hear them.

The best way to discover what's going on is to start from a small subset
of injected urls, and do the following:

* inject

* dump the db to a text file

* generate / fetch / updatedb

* dump the db again to a second text file

* compare the files.


I'll see if I'm able to reproduce those steps here, thanks.

t.n.a.


Re: Indexing MS Powerpoint files with Lucene

2006-09-08 Thread Tomi NA

On 9/8/06, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:

(moved to nutch-user)

Tomi NA wrote:
> On 9/7/06, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
>> Tomi NA wrote:
>> > On 9/7/06, Nick Burch <[EMAIL PROTECTED]> wrote:
>> >> On Thu, 7 Sep 2006, Tomi NA wrote:
>> >> > On 9/7/06, Venkateshprasanna <[EMAIL PROTECTED]> wrote:
>> >> >> Is there any filter available for extracting text from MS
>> >> Powerpoint files
>> >> >> and indexing them?
>> >> >> The lucene website suggests the POI project, which, it seems
>> does not
>> >> >> support PPT files as of now.
>> >> >
>> >> > http://jakarta.apache.org/poi/hslf/index.html
>> >> >
>> >> > It doesn't say poi doesn't support ppt. It just says support is
>> >> limited.
>> >> > Don't know exactly how limited, but certainly not useless for
>> indexing
>> >> > purposes.
>> >>
>> >> Support for editing and adding things to PowerPoint files is
>> limited, as
>> >> is getting out the finer points of fonts and positioning.
>> >
>> > Which brings me to another (off)topic: can lucene/nutch assign
>> > different weights to tokens in the same document field? An obvious
>> > example would be: "this text seems to be in large, bold, blinking
>> > letters: I'll assume it's more important than the surrounding 8px
>> > text."
>>
>> No, it can't (at least not yet). As a workaround you can extract these
>> portions of text to another field (or multiple fields), and then add
>> them with a higher boost. Then, expand your queries so that they include
>> also this field. This way, if query matches these special tokens,
>> results will get higher rank because of matching on this boosted field.
>
> I thought a workaround like that would be needed. Still, it could give
> useful results...though as a nutch user, the possibility is mostly
> theoretical for me, as probably none of the existing parsers take into
> account the formatting information. I could be completely wrong here,
> so please, feel free to correct me.

You can write a HtmlParseFilter, which will extract these portions of
text and put them into ParseData.metadata. Then, during indexing you can
check if such metadata exists and if yes - add it as separate fields.
You will need also to modify the QueryFilters, to expand user queries to
also include clauses for these additional fields.


Thanks Andrzej, I understand the concepts involved now. If the need
arises, I'll see what I can do about making it work as intended.
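
For the record, the indexing-time half of the workaround is plain Lucene and
quite small. Something along these lines (the field name and boost value are
just assumptions; the heading text itself would come from the HtmlParseFilter
via the parse metadata, as Andrzej describes):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class BoostedFieldSketch {
  static void addHeadings(Document doc, String headingText) {
    Field f = new Field("headings", headingText,
                        Field.Store.NO, Field.Index.TOKENIZED);
    f.setBoost(2.0f); // matches on headings count more than matches on body text
    doc.add(f);
  }
}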

t.n.a.


Re: Recrawling (Tomi NA)

2006-09-08 Thread Tomi NA

On 9/7/06, David Wallace <[EMAIL PROTECTED]> wrote:

Just guessing, but could this be caused by session ids in the URL?  Or
some other unimportant piece of data?  If this is the case, then every
page would be added to the index when it's crawled, regardless of
whether it's already in there, with a different session id.  If this is
what's causing your problem, then you need to use the regexp URL
normaliser to strip out the session ids.


Nice try but no luck, I'm afraid.
The complete web is absolutely static. The reason is that we've set up
IIS (I'm not too happy choosing IIS over apache) to serve files from a
shared directory on the same server, the rationale being that we'd
rather have http://-type links than file://.
From what I've seen in the logs, I don't see URLs varying so I'm still
at square one. Still, thanks for the effort. If you have any other
ideas, I'm eager to hear them.

t.n.a.


Re: parse url and file attributes only - no content

2006-09-07 Thread Tomi NA

On 9/7/06, heack <[EMAIL PROTECTED]> wrote:

I have the same problem. I wonder if there is a way to store a
description for .mp3, .wmv or .avi files that could then be searched.


I believe the problem can't be solved by adding a new parse plugin to
parse "all other (binary) filetypes": this additional parser would
still get the complete (possibly very big) file from the remote host.
At which level are the http.content.limit and file.content.limit taken
into account?
I'm thinking a new configuration setting (say,
(http|file).unsupported.extensions) set to "mp3|iso|psd" etc. could
guide the fetch algorithm so that it doesn't fetch the file contents
for these files, but simply fetches information *about* the files in
question. How does that sound?

t.n.a.


Re: Recrawling

2006-09-07 Thread Tomi NA

On 9/6/06, Andrei Hajdukewycz <[EMAIL PROTECTED]> wrote:

Another problem I've noticed is that it seems the db grows *rapidly* with each 
successive recrawl. Mine started at 379MB, and it seems to increase by roughly 
350MB every time I run a recrawl, despite there not being anywhere near that 
many additional pages.

This seems like a pretty severe problem, honestly, obviously there's a lot of 
duplicated data in the segments.


I have the same problem: my index grew from 1.5GB after the original
crawl to over 5GB(!) after the recrawl...from the looks of it, I might
as well crawl anew every time. :\

t.n.a.


parse url and file attributes only - no content

2006-09-07 Thread Tomi NA

I'd like the user to be able to find "my three dogs.jpg" if he
searches for "three dogs", even though nutch doesn't have a .jpg
parser. What's more, I'd like the user to be able to search against any
other extrinsic file attribute: date, file size, even mime type, all
without reading a single bit of the actual file contents.
Can nutch be configured so that it indexes these external file
properties and completely skips file contents?
I thought maybe I could adapt an existing parser (parse-text?) to do
the job, but I guess I'd still be stuck with reading megabytes of
unparsable data, just to fill in the url, type, date and similar
attributes. I'd appreciate a comment or two.

TIA,
t.n.a.


Re: how to combine two run's result for search

2006-09-06 Thread Tomi NA

On 9/6/06, Zaheed Haque <[EMAIL PROTECTED]> wrote:

On 9/6/06, Tomi NA <[EMAIL PROTECTED]> wrote:
> On 9/5/06, Zaheed Haque <[EMAIL PROTECTED]> wrote:
> > Hi:
>
> > In the text file you will have the following
> >
> > hostname1 portnumber
> > hostname2 portnumber
> >
> > example
> > localhost 1234
> > localhost 5678
> >
>
> Does this work with nutch 0.7.2 or is it specific to the 0.8 release?

I don't really know I have never tried 0.7. From the CVS it seems like it does

http://svn.apache.org/viewvc/lucene/nutch/tags/release-0.7.2/conf/nutch-default.xml?revision=390479&view=markup

but I don't know if the command structures are the same..


Just thought you might know off the top of your head; I'll go try it out.

t.n.a.


Re: how to combine two run's result for search

2006-09-05 Thread Tomi NA

On 9/5/06, Zaheed Haque <[EMAIL PROTECTED]> wrote:

Hi:



In the text file you will have the following

hostname1 portnumber
hostname2 portnumber

example
localhost 1234
localhost 5678



Does this work with nutch 0.7.2 or is it specific to the 0.8 release?

t.n.a.


crawling frequently changing data on an intranet - how?

2006-09-05 Thread Tomi NA

The task
---

I have less than 100GB of diverse documents (.doc, .pdf, .ppt, .txt,
.xls, etc.) to index. Dozens, or even hundreds and thousands of
documents can change their content, be created or deleted every day.
The crawler will run on a HP DL380 G4 server - don't know the exact
specs yet.
I'd like to keep the index no more than 20 minutes out of date (5-10
would be ideal).
I'm currently sticking to nutch 0.7.2 because of crawl (especially
fetch) speed considerations.

Current idea
---

From what I've read so far, nutch relies on the date a certain
document was last crawled, rather than checking the live document's
last modification date (a reasonable way to behave on the Internet,
but could be better in an intranet). That's why I can't simply run the
wiki recrawl script and let it find the documents that changed since
the last index.
I'd therefore run a crawl overnight and use the produced index as a
"main index". During the day, however, I can traverse the whole
intranet web, see what's changed and crawl/index only the documents
that have changed, building a second, "helper index".
I'd set up the search application to use both of those indexes.

Problems
-
I don't know how to tell the search interface to use 2 separate indices.
I'm really not sure how I'll make the search interface reload the
"helper index" every 10 or 20 minutes.

I'd welcome an opinion from anyone with more experience with
nutch...which basically means anyone. :)

TIA,
t.n.a.


Re: Does Nutch index images?

2006-09-03 Thread Tomi NA

On 9/3/06, Sidney <[EMAIL PROTECTED]> wrote:


Does nutch index images? If not or/and if so how can I go about creating a
separate search category for searching for images like the major search
engines have? If anyone can give any information on this I would be very
grateful.


You could go format by format, writing nutch plugins to access image
metadata for .jpg, .gif, .png, .tiff etc.
Don't know about writing a nutch plugin, but I don't think reading
image metadata is too much of a problem in java.
This might be a good place to start:
http://schmidt.devlib.org/java/image-io-libraries.html
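
Basic attributes are easy to get at with the standard javax.imageio API, as in
the sketch below; anything richer (EXIF, IPTC) needs one of the libraries from
the link above:

import java.awt.image.BufferedImage;
import java.io.File;
import javax.imageio.ImageIO;

public class ImageInfoSketch {
  public static void main(String[] args) throws Exception {
    BufferedImage img = ImageIO.read(new File(args[0]));
    if (img == null) {
      System.out.println("no registered reader for this format");
      return;
    }
    System.out.println("width=" + img.getWidth() + " height=" + img.getHeight());
  }
}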

t.n.a.


Re: Could anyone teache me how to index the title or content of PDF?

2006-09-02 Thread Tomi NA

On 9/1/06, Frank Huang <[EMAIL PROTECTED]> wrote:


But when I execute ./nutch crawl there are some messages like "fetch okay,
but can't parse http://(omit...).pdf" reason: failed, content
truncated at 70709 bytes. Parser can't handle incomplete pdf file.


Haven't had time to go through the complete code (not sure I'd
understand it, anyway), but this looks like you need to set
file.content.limit to, say, 16777216. If you're crawling over http
rather than intranet shares, the property you need to set is
http.content.limit.

Hope it helps.


t.n.a.


Re: intranet crawl problems: mime types; .doc-related exceptions; really, really slow crawl + possible infinite loop

2006-08-31 Thread Tomi NA

On 8/30/06, Chris Mattmann <[EMAIL PROTECTED]> wrote:

Hi there Tomi,


On 8/30/06 12:25 PM, "Tomi NA" <[EMAIL PROTECTED]> wrote:

> I'm attempting to crawl a single samba mounted share. During testing,
> I'm crawling like this:
>
> ./bin/nutch crawl urls -dir crawldir4 -depth 2 -topN 20
>
> I'm using luke 0.6 to query and analyze the index.
>
> PROBLEMS
>
> 1.) search by file type doesn't work
> I expected that a search "file type:pdf" would have returned a list of
> files on the local filesystem, but it does not.

I believe that the keyword is "type", so your query should be "type:pdf"
(without the quotes). I'm not positive about this either, but I believe you
have to give the fully qualified mimeType, as in "application/pdf". Not
definitely sure about that though so you should experiment.


I should have emphasized that the string I queried with is without the
quotes. The "file" keyword was used because all the entries are
accessible via "file://"-type links and so searching only for "file"
would return all files. Filtering by type would then return all files
of the given type.
I tried the following query:
url:file type:application/pdf
but it seems I get the same set of hits regardless of what I use as
type, so if I search for "url:file type:application/pdf" I get the
same results as searching for "url:file type:whatever".


Additionally, in order for the mimeTypes to be indexed properly, you need to
have the index-more plugin enabled. Check your
$NUTCH_HOME/conf/nutch-site.xml, and look for the property "plugin.includes"
and make sure that the index-more plugin is enabled there.


I listed my nutch-site settings at the end of my mail: the index-more
plugin is enabled.


> 2.) invalid nutch file type detection
> I see the following in the hadoop.log:
> ---
> 2006-08-30 15:12:07,766 WARN  parse.ParseUtil - Unable to successfully
> parse content file:/mnt/bobdocs/acta.zip of type application/zip
> 2006-08-30 15:12:07,766 WARN  fetcher.Fetcher - Error parsing:
> file:/mnt/bobdocs/acta.zip: failed(2,202): Content truncated at
> 1024000 bytes. Parser can't handle incomplete pdf file.
> ---
> acta.zip is a .zip file, not a .pdf. Don't have any idea why this happens.

This may result from the contentType returned by the web server for
"acta.zip". Check the web server that the file is hosted on, and see what
the server responds for the contentType for that file.

Additionally, you may want to check if magic is enabled for mimeTypes. This
allows the mimeType to be sensed through the use of hex codes compared with
the beginning of each file.


I have mime.type.magic set to true. The files I index are served via
samba over the LAN rather than via a web server, so no, it's not a
contentType problem.


> 3.) Why is the TextParser mapped to application/pdf and what has that
> have to do with indexing a .txt file?
> -
> 2006-08-30 15:12:02,593 INFO  fetcher.Fetcher - fetching
> file:/mnt/bobdocs/popis-vg-procisceni.txt
> 2006-08-30 15:12:02,916 WARN  parse.ParserFactory -
> ParserFactory:Plugin: org.apache.nutch.parse.text.TextParser mapped to
> contentType application/pdf via parse-plugins.xml, but its plugin.xml
> file does not claim to support contentType: application/pdf
> -

The TextParser * was * enabled as a last resort sort of means of extracting
...


I understand, thanks. Still don't know what threw the pdf-parser off, though.


> 4.) Some .doc files can't be indexed, although I can open them via
> openoffice 2 with no problems
> -
> 2006-08-30 15:12:02,991 WARN  parse.ParseUtil - Unable to successfully
> parse content file:/mnt/bobdocs/cards2005.doc of type
> application/msword
> 2006-08-30 15:12:02,991 WARN  fetcher.Fetcher - Error parsing:
> file:/mnt/bobdocs/cards2005.doc: failed(2,0): Can't be handled as
> micrsosoft document. java.lang.StringIndexOutOfBoundsException: String
> index out of range: -1024
> -

What version of MS Word were you trying to index? I believe that the POI
library used by the word parser can only handle certain versions of MS Word
documents, although I'm not positive about this.


Oh, so POI doesn't use the same technology OO.org uses to access MS
Office documents? That's a shame... :(
So, does anyone know which Word versions it supports?


As for 5 and 6 I'm not entirely sure about those problems. I wish you luck
in solving both of them though, and hope what I said above helps you out.


Thanks for the effort, Chris. I know a little more, but still have a
long way to go.
Does anyone else know anything about the unsolved problems I'm facing?

t.n.a.


intranet crawl problems: mime types; .doc-related exceptions; really, really slow crawl + possible infinite loop

2006-08-30 Thread Tomi NA

I'm attempting to crawl a single samba mounted share. During testing,
I'm crawling like this:

./bin/nutch crawl urls -dir crawldir4 -depth 2 -topN 20

I'm using luke 0.6 to query and analyze the index.

PROBLEMS

1.) search by file type doesn't work
I expected that a search "file type:pdf" would have returned a list of
files on the local filesystem, but it does not.

2.) invalid nutch file type detection
I see the following in the hadoop.log:
---
2006-08-30 15:12:07,766 WARN  parse.ParseUtil - Unable to successfully
parse content file:/mnt/bobdocs/acta.zip of type application/zip
2006-08-30 15:12:07,766 WARN  fetcher.Fetcher - Error parsing:
file:/mnt/bobdocs/acta.zip: failed(2,202): Content truncated at
1024000 bytes. Parser can't handle incomplete pdf file.
---
acta.zip is a .zip file, not a .pdf. Don't have any idea why this happens.

3.) Why is the TextParser mapped to application/pdf and what has that
have to do with indexing a .txt file?
-
2006-08-30 15:12:02,593 INFO  fetcher.Fetcher - fetching
file:/mnt/bobdocs/popis-vg-procisceni.txt
2006-08-30 15:12:02,916 WARN  parse.ParserFactory -
ParserFactory:Plugin: org.apache.nutch.parse.text.TextParser mapped to
contentType application/pdf via parse-plugins.xml, but its plugin.xml
file does not claim to support contentType: application/pdf
-

4.) Some .doc files can't be indexed, although I can open them via
openoffice 2 with no problems
-
2006-08-30 15:12:02,991 WARN  parse.ParseUtil - Unable to successfully
parse content file:/mnt/bobdocs/cards2005.doc of type
application/msword
2006-08-30 15:12:02,991 WARN  fetcher.Fetcher - Error parsing:
file:/mnt/bobdocs/cards2005.doc: failed(2,0): Can't be handled as
micrsosoft document. java.lang.StringIndexOutOfBoundsException: String
index out of range: -1024
-

5.) MoreIndexingFilter doesn't seem to work
The relevant part of the hadoop.log file:
-
2006-08-30 15:13:40,235 WARN  more.MoreIndexingFilter -
file:/mnt/bobdocs/EU2007-2013.pdforg.apache.nutch.util.mime.MimeTypeException:
The type can not be null or empty
-
This happens with other file types, as well:
-
2006-08-30 15:13:54,697 WARN  more.MoreIndexingFilter -
file:/mnt/bobdocs/popis-vg-procisceni.txtorg.apache.nutch.util.mime.MimeTypeException:
The type can not be null or empty
-

6.) At the moment, I'm crawling the same directory (/mnt/bobdocs); the
crawl process seems to be stuck in an infinite loop, and I have no way
of knowing what's going on because the log isn't flushed until the
process finishes.


ENVIRONMENT

The hadoop.log excerpts quoted above show the errors I'm running into.
My (relevant) nutch-site.xml settings are:

-
db.max.anchor.length = 511
db.max.outlinks.per.page = -1
fetcher.server.delay = 0
fetcher.threads.fetch = 5
fetcher.verbose = true
file.content.limit = 10240
parser.character.encoding.default = iso8859-2
indexer.max.title.length = 511
indexer.mergeFactor = 5
indexer.minMergeDocs = 5
plugin.includes = nutch-extensionpoints|protocol-(file|http)|urlfilter-regex|parse-(text|html|msword|pdf|mspowerpoint|msexcel|rtf|js)|index-(basic|more)|query-(basic|site|url|more)|summary-basic|scoring-opic
searcher.max.hits = 100
-


MISC. SUGGESTIONS

Add the following configuration options to the nutch-*.xml files:
* allow search by date or extension (with no other criteria)
* always flush log to disk (at every log addition).

TIA,
t.n.a.


file access rights/permissions considerations - the least painful way

2006-08-10 Thread Tomi NA

I'm interested in crawling multiple shared folders (among other
things) on a corporate LAN.
It is a LAN of MS clients with Active Directory managed accounts.

The users routinely access the files based on ntfs-level (and
sharing?) permissions.

Ideally, I'd like to set up a central server (probably linux, but any
*n*x would do) where I'd mount all the shared folders.
I'd then set up apache so that the files are accessible via http and,
more importantly, webdav. I imagine apache could use mod_dav, mod_auth
and possibly one or two other modules to regulate access privileges -
I could very well be completely wrong here.
Finally, I'd like to set up nutch to crawl the shared documents
through the web server, so that the stored links are valid across the
whole LAN. Nutch would therefore need unrestricted access to all
documents, while the documents themselves would be served via a web
server that checks user identities and access rights.

Nutch users who've tackled the access rights problem themselves would
save me a world of time, effort and trouble with a couple of pointers
on how to go about the whole security issue.
If the setup I described is the worst possible way to go about it, I'd
appreciate a notice saying so and elaborating why. :)

TIA,
t.n.a.


Re: How do I write a nutch query.

2006-08-08 Thread Tomi NA

On 8/8/06, Björn Wilmsmann <[EMAIL PROTECTED]> wrote:

Hey,

I have run into the same problem, too. Sometimes nutch won't return
results for queries although there clearly are pages containing the
search term. I agree that this must have something to do with Nutch
scoring, but I have not yet found out how to change this behaviour.


I ran into the same problem, but I believe it has something to do
with the analyzer (probably StandardAnalyzer - I don't really know yet
what Nutch uses by default), with the plugins for those file types, or
something along those lines.
As far as scoring is concerned, wouldn't a scoring problem change the
result order rather than skip certain results altogether?

t.n.a.


max file size vs. available RAM size: crawl uses up all available memory

2006-07-31 Thread Tomi NA

I am trying to crawl/index a shared folder on the office LAN: that
means a lot of .zip files, a lot of big .pdfs (>5 MB) etc.
I sacrificed performance for memory efficiency where I found the
tradeoff ("indexer.mergeFactor" = 5, "indexer.minMergeDocs" = 5), but
the crawl process breaks if I set "file.content.limit" to, say, 10 MB,
even though I'm testing on a 1GB RAM machine. To be fair, some
300-400 MB are already taken by misc programs, but still...

I invoke nutch like so:
./bin/nutch crawl -local urldir -dir crawldir -depth 20 -topN 1000

What I'd like to know is:
1) where does all the memory go?
2) how can I reduce the peak memory requirements?

To reiterate, I'm just testing at the moment, but I need to index
documents at any tree depth and any document smaller than, say,
10-20MB and I hope I don't need 5+GB of RAM to do it.

TIA,
t.n.a.


Re: nutch 0.8 and luke

2006-07-31 Thread Tomi NA

On 7/29/06, Tomi NA <[EMAIL PROTECTED]> wrote:

On 7/29/06, Sami Siren <[EMAIL PROTECTED]> wrote:
> Not expert on this area but perhaps you need to upgrade lucene .jar
> files that are used by luke?

I believe I was a little bit hasty with the message I sent. I took a
second look and it just might be that luke was right and the index is
invalid - I'm going to check it out come Monday.

Thanks for the reply.
t.n.a.


Got it all sorted out: I didn't set a valid value for http.agent.name.
What made me sound the alarm was the fact that I got some kind of
crawl result (about 2.5 MB in size).
Anyway, I did a successful crawl on my laptop, crawling through a
directory exposed through an http server.

The problems I ran into there are new thread material.

Thanks again for the effort.
t.n.a.


Re: nutch 0.8 and luke

2006-07-29 Thread Tomi NA

On 7/29/06, Sami Siren <[EMAIL PROTECTED]> wrote:

Not expert on this area but perhaps you need to upgrade lucene .jar
files that are used by luke?


I believe I was a little bit hasty with the message I sent. I took a
second look and it just might be that luke was right and the index is
invalid - I'm going to check it out come Monday.

Thanks for the reply.
t.n.a.


nutch 0.8 and luke

2006-07-29 Thread Tomi NA

I successfully used luke with indexes created with nutch 0.7.2.
I tried the same with nutch 0.8, but luke sees it as a corrupt index.
Should this be happening?
I know this isn't the luke mailing list, but the information will
still be useful to people using nutch.

Thanks,
t.n.a.


Re: missing, but declared functionality

2006-07-28 Thread Tomi NA

Sorry for the long silence, and thanks for the help.
I've found the plugins you mentioned and set nutch up to use them. The
result is somewhat confusing, though. For one thing, my date: and
type: queries still returned no results. Weirder still, using luke to
inspect the index contents, I saw the new fields, luke would display
the top ranking terms for both the "date" and "type" fields, and a
search like "date:20051030" would yield dozens of results, but the
"string value" of the "date" and "type" fields was not available...
even though I found the documents in question using that exact field
as a key.

I'll see what I come up with using 0.8 as I need the .xls and .zip
support, anyway.

t.n.a.

On 7/20/06, Teruhiko Kurosaka <[EMAIL PROTECTED]> wrote:

You'd have to enable index-more and query-more plugins, I believe.

> -Original Message-
> From: Tomi NA [mailto:[EMAIL PROTECTED]
> Sent: 2006-7-19 10:01
> To: nutch-user@lucene.apache.org
> Subject: missing, but declared functionality
>
> These kinds of queries return no results:
>
> date:19980101-20061231
> type:pdf
> type:application/pdf
>
> From the release changes documents (0.7-0.7.2), I assumed
> these would work.
> Upon index inspection (using the luke tool), I see there are no fields
> marked "date" or "type" (althought I gather this is interpreted as
> url:*.pdf). The fields I have are:
> anchor
> boost
> content
> digest
> docNo
> host
> segment
> site
> title
> url
>
> I ran the index process with very little special configuration... some
> filetype filtering and the like.
> Am I missing something?
> The files are served over a samba share: I plan to serve them through
> a web server because of security implications of using the file://
> protocol. Can the creation and last modification date be retrieved
> over http:// at all?
>
> TIA,
> t.n.a.
>



missing, but declared functionality

2006-07-19 Thread Tomi NA

These kinds of queries return no results:

date:19980101-20061231
type:pdf
type:application/pdf


From the release changes documents (0.7-0.7.2), I assumed these would work.

Upon index inspection (using the luke tool), I see there are no fields
marked "date" or "type" (althought I gather this is interpreted as
url:*.pdf). The fields I have are:
anchor
boost
content
digest
docNo
host
segment
site
title
url

I ran the index process with very little special configuration... some
filetype filtering and the like.
Am I missing something?
The files are served over a samba share: I plan to serve them through
a web server because of security implications of using the file://
protocol. Can the creation and last modification date be retrieved
over http:// at all?

TIA,
t.n.a.