ntlm - options overview
I came across an interesting overview of NTLM authentication possibilities at http://www.oaklandsoftware.com/papers/ntlm.html. I thought I'd mention it here in case anyone who knows how nutch authentication works "under the hood" has anything to say about the listed options. The solution usually mentioned when talking about nutch and NTLM authentication is using an NTLM proxy, but that's basically just a make-do solution. I'm mostly interested in how the projects listed in the Oakland Software paper could be employed to do the job. Oh, and one more thing: is NTLM support just a matter of porting an existing implementation to Java? As I understand it, samba implements the protocol in C, ntlmaps in python... ...anyway, enough of my ramblings... Cheers, t.n.a.
Re: depth limitation
2006/11/16, [EMAIL PROTECTED] <[EMAIL PROTECTED]>: I have added depth limitation for version 0.7.2. If anyone is interested, I can contribute it. I am using depth limitation in 0.8.1, but am looking at 0.7.2 as the next version I'll work with, so I'm very interested. t.n.a.
Re: Strategic Direction of Nutch
2006/11/13, carmmello <[EMAIL PROTECTED]>: Hi, Nutch, from version 0.8, is really very, very slow, using a single machine, to process data after the crawling. Compared with Nutch 0.7.2 I would say, ... this series. I don't believe that there are many Nutch users, in the real world of searching, with a farm of computers. I, for myself, have already Ditto, on both points. Furthermore, I'd say I'm much more likely to deliver 10 single-machine nutch setups than a single system with 10 nodes. I believe the same goes for a number of other users. I had a look at the hadoop code and, well, it'd take a week (probably an optimistic estimate) just to get acquainted with selected points of interest, leaving a lot unknown. And this is just to get started. At the moment, I can't justify a possibly high-risk, multi-week effort to investigate where the bottleneck is and find a workable solution - I can only imagine how this problem would look to someone without any prior knowledge of distributed systems and/or indexing technology... ...in the meantime, I suspect we might see something that seems much more reasonable in the mid-term: a lot of useful code back-ported to 0.7.2, doing an excellent job on installations of one or a handful of servers. t.n.a.
Re: Nutch for dotNet
2006/11/11, Ha ward <[EMAIL PROTECTED]>: I'm a newbie. I wonder if there is Nutch implementation for dotnet version. Can someone assist? As far as I know, the only existing nutch implementation is in java. Still, you can do a lot with nutch without going under the hood, i.e. using the available configuration options. Cheers, t.n.a.
Re: .7x -> .8x
2006/11/3, Josef Novak <[EMAIL PROTECTED]>: Hi, Very short question (hopefully). Is it possible to get bin/nutch fetch to print a log of the pages being downloaded to the command terminal? I have been using 0.7.2 up until now; in that version the fetch command outputs errors and the names of urls that the fetcher is attempting to download. Where is this info in .8.1? (and why did this change?) If I missed something, and in fact nothing has changed, apologies for the inconvenience. tail -f /logs/hadoop.log might be what you'd like to see. Cheers, t.n.a.
Re: returning a description of a returned document
2006/10/29, Cristina Belderrain <[EMAIL PROTECTED]>: Hi Tomi, please take a look at the following tutorial: http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html Apparently, Nutch's search application already shows hit summaries... Anyway, you can always retrieve each summary programmatically using a NutchBean instance: please see the sample code towards the end of the tutorial. Silly me, I should have looked at the nutch UI .jsps right away: the thing is, I've been working exclusively on intranet shared folder searches for some time now and can't explain it (yet), but it seems that none of the indexed documents have a summary. I only asked the question in the first place because I've never really noticed a single summary in the search hits. I'll look into it and see what kind of explanation I come up with. Thanks, Cristina. t.n.a.
returning a description of a returned document
Is there a way to have nutch return some hit context (a la google) to better identify the hit? For example, if I search for "nutch", a link pointing to "http://lucene.apache.org/nutch/" would be followed by the following context: "This is the first *Nutch* release as an Apache Lucene sub-project. ... *Nutch* is a two-year-old open source project, previously hosted at Sourceforge and ..." t.n.a.
Re: Fetching outside the domain ?
2006/10/23, Andrzej Bialecki <[EMAIL PROTECTED]>: Tomi NA wrote: > 2006/10/18, [EMAIL PROTECTED] <[EMAIL PROTECTED]>: > >> Btw we have some virtual local hosts, how does the >> db.ignore.external.links >> deal with that ? > > Update: > setting db.ignore.external.links to true in nutch-site (and later also > in nutch-default as a sanity check) *doesn't work*: I feed the crawl > process a handful of URLs and can only helplessly watch as the crawl > spreads to dozens of other sites. Could you give an example of a root URL which leads to this symptom (i.e. leaks outside the original site)? I'll try to find out exactly where the crawler starts to run loose, as I have several web sites in my initial URL list. > In answer to your question, it seems pointless to talk about virtual > host handling if the elementary filtering logic doesn't seem to > work... :-\ Well, if this logic doesn't work it needs to be fixed, that's all. Won't argue with you there. t.n.a.
Re: crawling sites which require authentication
2006/10/14, Tomi NA <[EMAIL PROTECTED]>: 2006/10/14, Toufeeq Hussain <[EMAIL PROTECTED]>: > From internal tests with ntlmaps + Nutch the conclusion we came to was > that though it "kinda-works" it puts a huge load on the Nutch server > as ntlmaps is a major memory-hog and the mixture of the two leads to > performance issues. For a PoC this will do but for > production-deployments I would not suggest one goes the ntlmaps way. > > An alternate would be to have a separate ntlmaps-server, a dedicated > machine acting as the NTLM proxy for the Nutch-box which sits behind > it. I haven't noticed the added resource drain, but then again, I haven't really tested all that much: the constraints on the particular project in which I implemented the approach weren't very strict. I'll keep my eye on the cpu usage. * Update * ntlmaps really is every bit as sluggish as Toufeeq led me to believe, routinely taking up to 85% of the CPU. It doesn't appear deterministic, though: right now it's barely noticeable, using less than 10% of the CPU power. Toufeeq, could you say anything more on the topic of nutch's in-built NTLM authentication support? t.n.a.
Re: Fetching outside the domain ?
2006/10/18, [EMAIL PROTECTED] <[EMAIL PROTECTED]>: Btw we have some virtual local hosts, how does the db.ignore.external.links deal with that ? Update: setting db.ignore.external.links to true in nutch-site (and later also in nutch-default as a sanity check) *doesn't work*: I feed the crawl process a handful of URLs and can only helplessly watch as the crawl spreads to dozens of other sites. In answer to your question, it seems pointless to talk about virtual host handling if the elementary filtering logic doesn't seem to work... :-\ t.n.a.
Re: Fetching outside the domain ?
2006/10/18, Frederic Goudal <[EMAIL PROTECTED]>: Hello, I'm beginning to play with nutch to index our own web site. I have done a first crawl and I have tried the recrawl script. While fetching I have lines like that : fetching http://www.yourdictionary.com/grammars.html fetching http://www.cours.polymtl.ca/if540/hiv_00.htm fetching http://www.maxim-ic.com/quick_view2.cfm/qv_pk/ but my crawl-urlfilter.txt is : # skip file:, ftp:, & mailto: urls -^(file|ftp|mailto): # skip image and other suffixes we can't yet parse -\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV| exe|png)$ # skip URLs containing certain characters as probable queries, etc. [EMAIL PROTECTED] # skip URLs with slash-delimited segment that repeats 3+ times, to break loops -.*(/.+?)/.*?\1/.*?\1/ # accept hosts in MY.DOMAIN.NAME #+^http://([a-z0-9]*\.)*enseirb.fr/ +^http://www.enseirb.fr/ # skip everything else -. So... I think I miss some point. Frederic, what exactly is the problem? You'd like the recrawl not to leave your web site? You can do that very easily: set the "db.ignore.external.links" property in nutch-site.xml to "true" (you can copy the xml property from nutch-default and then change the value to "true"). Btw, as a beginner, totally ignorant of java, and a timeless system engineer in charge of too many things, is there any doc that really explains the behaviour of nutch ? A good place to read about nutch is the nutch wiki: http://wiki.apache.org/nutch/ Cheers, t.n.a.
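For reference, here's roughly what that property block looks like once copied into nutch-site.xml (the description wording may differ slightly from what's in your nutch-default.xml):

```xml
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
  <description>If true, outlinks leading from a page to external hosts
  are ignored, keeping the crawl within the starting sites.</description>
</property>
```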
Re: crawling sites which require authentication
2006/10/14, Toufeeq Hussain <[EMAIL PROTECTED]>: From internal tests with ntlmaps + Nutch the conclusion we came to was that though it "kinda-works" it puts a huge load on the Nutch server as ntlmaps is a major memory-hog and the mixture of the two leads to performance issues. For a PoC this will do but for production-deployments I would not suggest one goes the ntlmaps way. An alternate would be to have a separate ntlmaps-server, a dedicated machine acting as the NTLM proxy for the Nutch-box which sits behind it. I haven't noticed the added resource drain, but then again, I haven't really tested all that much: the constraints on the particular project in which I implemented the approach weren't very strict. I'll keep my eye on the cpu usage. The right way would be to use the in-built authentication features of Nutch for Auth based crawling. Nutch supports ntlm authentication? I see I've got some reading to catch up on... t.n.a.
Re: crawling sites which require authentication
2006/10/13, Guruprasad Iyer <[EMAIL PROTECTED]>: Hi Tomi, "using a ntlmaps proxy" How do I get this proxy? "You tell nutch to use the proxy and you provide the proxy with adequate access privileges." How do I do this? Can you elaborate? I am a new Nutch user and am very much in the learning phase. Thanks. Cheers, Guruprasad Guruprasad, please use "reply-all" so your messages end up on the list as well. As far as ntlmaps is concerned, you can read about it here: http://ntlmaps.sourceforge.net/ or download it here: http://sourceforge.net/project/showfiles.php?group_id=69259&package_id=68110&release_id=303755. If you're using linux, chances are all you need to do is issue a command like "emerge ntlmaps" or "apt-get install ntlmaps". Read the ntlmaps documentation on how to set it up, or just follow the comments in its config file: /etc/ntlmaps/server.cfg. The only thing left for you to do is to edit the nutch-site.xml file and set http.proxy.host to (probably) "localhost" and http.proxy.port to whatever port you set the proxy to listen on. Looking at what I've written, I should have just said google is your friend... ah well, what's done is done. :) Hope this helps, t.n.a.
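To make that last step concrete, the two properties in nutch-site.xml would look something like this (5865 is, if I remember right, the ntlmaps default listen port; use whatever your server.cfg actually says):

```xml
<property>
  <name>http.proxy.host</name>
  <value>localhost</value>
</property>
<property>
  <name>http.proxy.port</name>
  <value>5865</value>
</property>
```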
Re: crawling sites which require authentication
2006/10/12, Guruprasad Iyer <[EMAIL PROTECTED]>: Hi, I need to know how to crawl (intranet) sites which require authentication. One suggestion was that I replace protocol-http with protocol-httpclient in the value field of the plugin.includes tag in the nutch-default.xml file. However, this did not solve the problem. Can you help me out on this? Thanks. I don't know what kind of authentication scheme you're up against, but recently I had to work with NTLM authentication in an intranet and worked around it using an ntlmaps proxy. You tell nutch to use the proxy and you provide the proxy with adequate access privileges. As simple as that, and it works like a charm. I imagine the nutch proxy support could be extended so that, e.g., it selects a proxy based on regexp matching of URLs. That way it would be possible to provide all the login/password pairs needed to crawl all of the sites you're interested in. t.n.a.
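Just to sketch the regexp-matching idea (purely hypothetical; nutch has no such feature as far as I know), the proxy selection could boil down to a first-match lookup:

```python
import re

# Hypothetical mapping: the first pattern that matches the URL wins.
# Each proxy could front a different set of credentials.
PROXY_RULES = [
    (r"^https?://intranet\.example\.com/", ("localhost", 5865)),
    (r"^https?://docs\.example\.com/",     ("localhost", 5866)),
]

def select_proxy(url, rules=PROXY_RULES, default=None):
    """Return the (host, port) of the first proxy whose pattern matches url,
    or the default (no proxy, i.e. fetch directly) when nothing matches."""
    for pattern, proxy in rules:
        if re.search(pattern, url):
            return proxy
    return default
```

The fetcher would then consult select_proxy() per URL instead of reading a single global http.proxy.host/port pair.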
Re: Lucene query support in Nutch
2006/10/10, Cristina Belderrain <[EMAIL PROTECTED]>: On 10/9/06, Tomi NA <[EMAIL PROTECTED]> wrote: > This is *exactly* what I was thinking. Like Stefan, I believe the > nutch analyzer is a good foundation and should therefore be extended > to support the "or" operator, and possibly additional capabilities > when the need arises. > > t.n.a. Tomi, why would you extend Nutch's analyzer when Lucene's analyzer, which does exactly what you want, is already there? Stefan already answered that question, but basically my opinion is this: Nutch's analyzer does its job well and lacks only one obvious query capability, the "or" search. The fact that several users here need this kind of functionality suggests it's not the beginning of a landslide of new required capabilities. Lucene's analyzer, on the other hand, is completely inadequate in this respect if search is necessarily bound to a single (content) field. In conclusion, my position is pragmatic: I welcome the simplest solution that implements the "or" search. I just believe it'd be easiest to do that by extending the nutch Analyzer. t.n.a.
Re: Lucene query support in Nutch
2006/10/8, Stefan Neufeind <[EMAIL PROTECTED]>: if it's not the full feature-set, maybe most people could live with it. But basic boolean queries, I think, were the root of this topic. Is there an "easier" way to allow this in Nutch as well, instead of throwing quite a bit away and using the Lucene syntax? As has just been pointed out: It... This is *exactly* what I was thinking. Like Stefan, I believe the nutch analyzer is a good foundation and should therefore be extended to support the "or" operator, and possibly additional capabilities when the need arises. t.n.a.
Re: Which Operating-System do you use for Nutch
On 9/26/06, Jim Wilson <[EMAIL PROTECTED]> wrote: I'd do it, but I'm too busy being consumed with worries about the lack of support for HTTP/NTLM credentials and SMB fileshare indexing. Arrrgg - tis another sad day in the life of this pirate. We seem to share the same problems... they haven't knocked me down... yet, but I expect they might fairly soon. For now, I'm placing the shares under an IIS umbrella: I direct the crawl to the root of the web and serve http links to the files. IIS (somehow) takes care of A/D authorization: once the user clicks on a link, IIS checks the user's credentials and matches them against the file's ACL (I suppose). The downsides? Even though I could theoretically allow users with sufficient privileges to write files, I can only provide WebDAV access. What's more, I'm stuck with IIS/Windows/whatever. I'd much rather let the customer decide what he wants to run on his servers. Finally, distributed network shares (i.e. shares not shared from the server) make the problem/solution significantly more complicated. Alternatively, you could try the file protocol, generating "browser unfriendly" file:// links, which opens up its own Pandora's box of security issues... so, how do you go about it? t.n.a.
Re: [ANNOUNCE] Nutch 0.8.1 available
On 9/27/06, Sami Siren <[EMAIL PROTECTED]> wrote: Nutch Project is pleased to announce the availability of 0.8.1 release of Nutch - the open source web-search software based on lucene and hadoop. The release is immediately available for download from: http://lucene.apache.org/nutch/release/ Nutch 0.8.1 is a maintenance release for 0.8 branch and fixes many serious bugs discovered in previous release. For a list of changes see http://www.apache.org/dist/lucene/nutch/CHANGES-0.8.1.txt Haven't seen it in action yet, but it seems some serious errors got fixed in this version. A big thanks to everybody who participated and made this release possible. Ditto! t.n.a.
Re: Which Operating-System do you use for Nutch
On 9/25/06, Jim Wilson <[EMAIL PROTECTED]> wrote: You can get it working on Windows if you're willing to work for it. To use Nutch OOTB, you have to install Cygwin since the provided Nutch launcher is written in Bash. Members of the community have provided alternatives, such as this Python launcher: http://wiki.apache.org/nutch/CrossPlatformNutchScripts The way I see it, the existing shell scripts are not a permanent solution. That said, python is better than (ba)sh, but java would be even better (even though fs operations are not one-liners). My (very superficial, but still) experience with BeanShell suggests that it might be a good long-term, platform-independent solution. It will probably happen when someone scratches that particular itch, though, meaning its author is going to be someone developing on windows. :) t.n.a.
Re: Forcing refetch and index of specified files
On 9/21/06, Andrzej Bialecki <[EMAIL PROTECTED]> wrote: Benjamin Higgins wrote: > How can I instruct Nutch to refetch specific files and then update the > index > entries for those files? > > I am indexing files on a fileserver and I am able to produce a report of > changed files about every 30 minutes. > > I'd like to feed that into Nutch at approximately the same interval so > I can > keep the index up-to-date. > > Thanks. Conceptually this should be easy - you just need to generate a fetchlist directly from your list of changed files, and not through injecting/generating from a crawldb. I wrote a tool for 0.7 which does this - look at the NUTCH-68 issue in JIRA. This would have to be ported to 0.8 - check how Injector does this in the first stage, when it converts a simple text file to a MapFile. Would an algorithm like this make any sense:

for each URL in the txt file:
    if URL is in the crawldb:
        update the date in its crawl datum to "now()+1"
    else:
        use the existing inject logic to inject the new URL

After that, it's only a matter of running the recrawl script with -adddays 0. t.n.a.
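Here's a toy sketch of that loop in Python, with a plain dict standing in for the crawldb and numbers for fetch times (none of this is actual nutch API; it only shows the inject-or-touch logic):

```python
import time

def mark_for_refetch(crawldb, urls, now=None):
    """For each URL from the changed-files report: if it's already known,
    push its next-fetch time into the past so the next generate step picks
    it up; otherwise inject it as a brand-new entry."""
    now = time.time() if now is None else now
    for url in urls:
        if url in crawldb:
            crawldb[url]["fetch_time"] = now - 1  # due immediately
        else:
            crawldb[url] = {"fetch_time": now - 1, "status": "injected"}
    return crawldb
```

In the real crawldb the same effect would come from rewriting the CrawlDatum fetch times before running generate/fetch/updatedb.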
Re: Nutch 0.8 - MS Word document parse failure : "Can't be handled as micrsosoft document. java.util.NoSuchElementException"
On 9/22/06, Andrzej Bialecki <[EMAIL PROTECTED]> wrote: You are not the first one to consider using OO.org for Word conversion. However, this solution brings with it a large dependency (ca 250MB installed), which requires proper installation; and also the UNO interface is reported to be relatively slow - I'm not sure if it's the inherent slowness of the conversion, or the problem with the (lack of) concurrency, i.e. a single OO instance may convert only a single document at a time ... 250 MB for the complete office suite or just UNO? As far as the concurrency problem is concerned, has anyone asked OO.org developers what that's about? t.n.a.
Re: Nutch 0.8 - MS Word document parse failure : "Can't be handled as micrsosoft document. java.util.NoSuchElementException"
On 9/22/06, Trym B. Asserson <[EMAIL PROTECTED]> wrote: Any other suggestions? Tomi, you said you'd had difficulties too with certain MS documents, did you manage to find a work-around or did you just have to ignore these documents? So far we've only concentrated on using the plugins in Nutch 0.8 as they're provided, so we have no experience with OO/UNO. Given that POI seems to deliver reasonably good parsing features for MS formats, we're a bit reluctant to throw it out just yet. No, I haven't found a work-around yet: it seemed like too much work at the moment. Right now I'm thinking it may not be necessary to dump POI in favour of UNO (although I believe it would be better in the long term): maybe it would be possible to work around the exceptions and still get (at least) most of the text content. I'll probably have a look at it one of these days, although I'm a bit sceptical: wouldn't the original plugin authors have already fixed it if they could help it? t.n.a.
Re: Automatic crawling
On 9/21/06, Jacob Brunson <[EMAIL PROTECTED]> wrote: On 9/21/06, Gianni Parini <[EMAIL PROTECTED]> wrote: > -Is it possible to have an automatic recrawling? Have I got to write > my own application myself? I need an application running in > background that re-crawls my intranet site 2-3 times a week.. On the nutch wiki you will find an intranet recrawl script. That probably will work for you. However, I think the script has a problem with duplicating segment data during the mergesegs step, but I've asked about it here and haven't had any confirmations. Well, I can confirm my index grew to ~5 GB from ~1.5 GB after (if I remember correctly) 2 recrawls. It doesn't solve the problem I was after anyway, as it only indexes pages according to the time of the last crawl, rather than crawling everything, checking if the new content has a newer modification/creation date and indexing only that (a typical intranet scenario). But I'm running like a madman in the opposite direction of the topic: please ignore me. :) t.n.a.
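For the 2-3-times-a-week requirement specifically, a crontab entry calling the wiki's recrawl script would do; the paths and the script's arguments below are hypothetical, so check the usage line of whichever version you grab from the wiki:

```
# m  h  dom mon dow  command -- recrawl Mon/Wed/Fri at 02:30
30   2  *   *   1,3,5  /opt/nutch/bin/recrawl /opt/nutch/crawl >> /var/log/nutch-recrawl.log 2>&1
```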
Re: Nutch 0.8 - MS Word document parse failure : "Can't be handled as micrsosoft document. java.util.NoSuchElementException"
On 9/21/06, Jim Wilson <[EMAIL PROTECTED]> wrote: I haven't had this particular problem, but here's something to consider: After you remove the TextBox objects you have to re-save the document. Is the new document the same version as the previous one? By this I mean, the same Word version (97, 2000, etc). I've had some difficulties with misc MS Office documents and it makes me wonder: would using OpenOffice.org to parse the files make more sense than using POI? OO.org uses the UNO framework, which has a Java API, so conceivably anything OO.org understands, nutch would understand. The fact that OO.org is able to parse MS formats fairly well (better than most other libraries/applications) suggests that it'd give the best results if at some point nutch/lucene supported weighted relations between a term and a field/document. That would make, e.g., words appearing in headers more important than words in footnotes. Returning to the subject of parsing MS document formats at all, has anyone considered/attempted using OO.org UNO to parse them? Are there any major shortcomings to the approach? t.n.a.
Re: Changing page injection behavior in Nutch 0.8
On 9/20/06, Tomi NA <[EMAIL PROTECTED]> wrote: On 9/20/06, Benjamin Higgins <[EMAIL PROTECTED]> wrote: > In Nutch 0.7, I wanted to change Nutch's behavior such that when I inject a > file it will add the page, even if it is already present. > > I did this because I can prepare a list of changed files that I have on my > intranet and want Nutch to reindex them right away. > > I made a change (suggested by Howie Wang) to > org.apache.nutch.db.WebDBInjector by changing the addPage method. I > replaced the line: > > dbWriter.addPageIfNotPresent(page); > > with: > > dbWriter.addPageWithScore(page); > > Question: I'm moving to Nutch 0.8 and I'd like similar behavior, but I don't > know where to put them as a lot of code has changed (and there's no longer a > WebDBInjector.java file). > > How can I accomplish this? If there is a more appropriate way to do this > please let me know that also. I'm interested in this problem as well. Haven't had a chance yet to look into it, though. I think the crawl.Injector.InjectorReducer class is the one we're looking for. Would this do the trick?

// output.collect(key, (Writable) values.next()); // just collect the first value
while (values.hasNext()) {
  output.collect(key, (Writable) values.next());
}

I can't verify, as an IOException's giving me trouble (possibly because I checked out 0.9-dev); someone else might have more luck with the 0.8(.1?) sources. t.n.a.
Re: Changing page injection behavior in Nutch 0.8
On 9/20/06, Benjamin Higgins <[EMAIL PROTECTED]> wrote: In Nutch 0.7, I wanted to change Nutch's behavior such that when I inject a file it will add the page, even if it is already present. I did this because I can prepare a list of changed files that I have on my intranet and want Nutch to reindex them right away. I made a change (suggested by Howie Wang) to org.apache.nutch.db.WebDBInjector by changing the addPage method. I replaced the line: dbWriter.addPageIfNotPresent(page); with: dbWriter.addPageWithScore(page); Question: I'm moving to Nutch 0.8 and I'd like similar behavior, but I don't know where to put them as a lot of code has changed (and there's no longer a WebDBInjector.java file). How can I accomplish this? If there is a more appropriate way to do this please let me know that also. I'm interested in this problem as well. Haven't had a chance yet to look into it, though. t.n.a.
Re: Stemming and Synonyms
On 9/19/06, Gonçalo Gaiolas <[EMAIL PROTECTED]> wrote: Hi everyone! I'm using version 7.2 of Nutch and I'm very happy with it. Want to send a big thumbs up to you guys behind it! Welcome, our honoured guest from the future! :) 7.2 probably includes natural language processing and spawns a great deal of controversy as to whether it can be considered "intelligent" or just very good at smalltalk. :) Having said that, I'd like to make my users' search experience as good as possible. To do that, I need to solve two little "problems": - Stemming – in my index I have lots of plurals and verbal forms that prevent my users from sometimes finding the right results. I've been looking around and it seems that the only stemming implementation available for nutch is described in the wiki and requires extensive changes to Nutch code, something I'd like to avoid. Can somebody help me? - Synonyms – OK, I don't really need synonyms. What I need is a way to specify that Image Converter should be equal to ImageConverter, or WebBlock should be the same as web block. How can I do this? This one is really impacting the search quality :-) I guess you need a different Analyzer. There's a list at http://lucene.apache.org/java/docs/api/org/apache/lucene/analysis/Analyzer.html You could also write your own to best represent the data you have. Cheers, t.n.a.
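For the ImageConverter / "image converter" case specifically, what's usually wanted is a token filter that splits camel-case words the same way at index and query time, so both forms produce identical tokens. Here's a sketch of just the splitting step (plain Python to show the idea, not an actual Lucene Analyzer):

```python
import re

def split_camel_case(token):
    """Split 'ImageConverter' -> ['image', 'converter']; plain words pass through."""
    parts = re.findall(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+", token)
    return [p.lower() for p in parts] or [token.lower()]

def analyze(text):
    """Whitespace-tokenize, then expand camel-case tokens. A real fix would
    do this inside a custom Analyzer so the index and queries stay in sync."""
    tokens = []
    for word in text.split():
        tokens.extend(split_camel_case(word))
    return tokens
```

With that in place, both "WebBlock" and "web block" analyze to the tokens web and block, so either query matches either spelling.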
Re: how to combine two run's result for search
On 9/18/06, Zaheed Haque <[EMAIL PROTECTED]> wrote: Hi: I have just checked your flash movie.. quick observation you are running tomcat 4.1.31 and there is nothing you are doing that seems wrong. Anyway after starting the servers can you search using the following command bin/nutch org.apache.nutch.search.NutchBean bobdocs what do you get .. and what's in the logfile? If you get something then probably its tomcat 4.1.31 is the problem. [EMAIL PROTECTED] ~/posao/nutch/novo/nutch-0.8 $ ./bin/nutch org.apache.nutch.search.NutchBean bobdocs Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/nutch/search/NutchBean [EMAIL PROTECTED] ~/posao/nutch/novo/nutch-0.8 $ It doesn't really tell me if tomcat is the problem, does it? I've added debug statements to the nutch script so I can check if my CLASSPATH is correct. I have no idea why nutch can't find the NutchBean class. I have, however, checked out the nutch 0.8 and hadoop 0.5 sources from the svn repository, imported them into an eclipse project and used the DistributedSearch Client and Server "public static void main" methods. My experiments showed that my problem is not with tomcat or the nutch web UI, because the DistributedSearch.Client also returned 0 results regardless of the query or combination of indexes. I've managed to confirm that the Client sees all the search servers, but it simply fails to return any results. I also ran across something in the logs that I didn't see before. The following is periodically output (regardless of what I'm doing in eclipse, as long as the Client thread is active): 2006-09-18 13:55:30,352 INFO searcher.DistributedSearch - STATS: 2 servers, 2 segments. 2006-09-18 13:55:40,539 INFO searcher.DistributedSearch - Querying segments from search servers... 2006-09-18 13:55:40,559 INFO searcher.DistributedSearch - STATS: 2 servers, 2 segments. 2006-09-18 13:55:50,564 INFO searcher.DistributedSearch - Querying segments from search servers... 
Going back to square one...am I building the crawls correctly? ./bin/nutch crawl urls -threads 15 -topN 10 -depth 3 Is it the fact that I'm doing an intranet crawl every time, instead of the multi-step whole web crawl? What else, what am I missing? t.n.a.
Re: how to combine two run's result for search
On 9/16/06, Tomi NA <[EMAIL PROTECTED]> wrote: On 9/15/06, Tomi NA <[EMAIL PROTECTED]> wrote: > On 9/14/06, Zaheed Haque <[EMAIL PROTECTED]> wrote: > > > Thats the way I set it up at first. > > > This time, I started with a blank slate, unpacked nutch and tomcat, > > > unpacked nutch-0.8.war into the webapps/ROOT and left the deployed app > > > untouched. > > > > The above means that you have an empty nutch-site.xml under > > webapps/ROOT and you have a nutch-default.xml with a searcher.dir > > property = crawl. Am I correct? cos you left the deployed web app > > untouched? no? > > You are correct, the searcher.dir property is set to "crawl". > > > > I then pointed the "crawl" symlink in the current dir to point to the > > > "crawls" directory, where my search-servers.txt (with two "localhost > > > port" entries). In the "crawls" dir I also have two nutch-built > > > indexes. > > > > If I remember it correctly I had some trouble with symlink once but I > > don't exactly remember why.. maybe you can try without symlink.. > > I tried renaming the directory "crawls" to "crawl", then running the > servers like so: > ./nutch-0.8/bin/nutch server 8192 crawl/crawl1 & > ./nutch-0.8/bin/nutch server 8193 crawl/crawl2 & > > > > Now, I start nutch distributed search servers on each index and start > > > tomcat from the dir containing the "crawl" link. I get no results at > > > all. > > > If I change the link to point to "crawls/crawl1", the search works > > > > I am guessing the above is also a symlink.. hmm.. maybe it has > > something to do with distributed search and symlink.. no? > > It doesn't appear to be the problem. I tried without symlinks without success. > > I'm going to document the problem better today, so maybe that will help. > I'm having trouble believing what I'm trying to achieve is so > problematic...nevertheless, I appreciate your effort so far. 
I don't think I can document the problem better than I have here: http://tna.sharanet.org/problem.html It's a 2-minute flash movie showing exactly what I'm doing. I'd very much appreciate anyone taking a look at it, but especially Zaheed. The only thing I forgot to display in the movie is my search-servers.txt: localhost 8192 localhost 8193 Now, what am I doing wrong? t.n.a. Anyone? Renaud, Zaheed, Feng? t.n.a.
Re: java.lang.NullPointerException
On 9/18/06, NG-Marketing, M.Schneider <[EMAIL PROTECTED]> wrote: I figured it out. I used in my nutch-site.xml the following config searcher.max.hits 2048 If I change the value to nothing "" it works all fine. It took me a couple of hours to figure it out. This might be a bug. Is the specific value (2048) a problem, or does nutch throw an NPE regardless of the value you use? t.n.a.
Re: java.lang.NullPointerException
On 9/17/06, NG-Marketing, Matthias Schneider <[EMAIL PROTECTED]> wrote: Hello List, i installed nutch 0.8 and i can fetch and index documents, but I can not search them. I get the following error: StandardWrapperValve[jsp]: Servlet.service() for servlet jsp threw exception java.lang.NullPointerException at org.apache.nutch.searcher.LuceneQueryOptimizer$LimitedCollector.(LuceneQueryOptimizer.java:108) at org.apache.nutch.searcher.LuceneQueryOptimizer.optimize(LuceneQueryOptimizer.java:244) at org.apache.nutch.searcher.IndexSearcher.search(IndexSearcher.java:95) at org.apache.nutch.searcher.NutchBean.search(NutchBean.java:249) [...snip...] Can't say I ran into a problem like that, but have you checked if your index is valid, i.e. can you open the index with luke (http://www.getopt.org/luke/) and run queries? t.n.a.
Re: how to combine two run's result for search
On 9/15/06, Tomi NA <[EMAIL PROTECTED]> wrote: On 9/14/06, Zaheed Haque <[EMAIL PROTECTED]> wrote: > > Thats the way I set it up at first. > > This time, I started with a blank slate, unpacked nutch and tomcat, > > unpacked nutch-0.8.war into the webapps/ROOT and left the deployed app > > untouched. > > The above means that you have an empty nutch-site.xml under > webapps/ROOT and you have a nutch-default.xml with a searcher.dir > property = crawl. Am I correct? cos you left the deployed web app > untouched? no? You are correct, the searcher.dir property is set to "crawl". > > I then pointed the "crawl" symlink in the current dir to point to the > > "crawls" directory, where my search-servers.txt (with two "localhost > > port" entries). In the "crawls" dir I also have two nutch-built > > indexes. > > If I remember it correctly I had some trouble with symlink once but I > don't exactly remember why.. maybe you can try without symlink.. I tried renaming the directory "crawls" to "crawl", then running the servers like so: ./nutch-0.8/bin/nutch server 8192 crawl/crawl1 & ./nutch-0.8/bin/nutch server 8193 crawl/crawl2 & > > Now, I start nutch distributed search servers on each index and start > > tomcat from the dir containing the "crawl" link. I get no results at > > all. > > If I change the link to point to "crawls/crawl1", the search works > > I am guessing the above is also a symlink.. hmm.. maybe it has > something to do with distributed search and symlink.. no? It doesn't appear to be the problem. I tried without symlinks without success. I'm going to document the problem better today, so maybe that will help. I'm having trouble believing what I'm trying to achieve is so problematic...nevertheless, I appreciate your effort so far. I don't think I can document the problem better than I have here: http://tna.sharanet.org/problem.html It's a 2-minute flash movie showing exactly what I'm doing. 
I'd very much appreciate anyone taking a look at it, but especially Zaheed. The only thing I forgot to display in the movie is my search-servers.txt: localhost 8192 localhost 8193 Now, what am I doing wrong? t.n.a.
Re: how to combine two run's result for search
On 9/14/06, Zaheed Haque <[EMAIL PROTECTED]> wrote: > Thats the way I set it up at first. > This time, I started with a blank slate, unpacked nutch and tomcat, > unpacked nutch-0.8.war into the webapps/ROOT and left the deployed app > untouched. The above means that you have an empty nutch-site.xml under webapps/ROOT and you have a nutch-default.xml with a searcher.dir property = crawl. Am I correct? cos you left the deployed web app untouched? no? You are correct, the searcher.dir property is set to "crawl". > I then pointed the "crawl" symlink in the current dir to point to the > "crawls" directory, where my search-servers.txt (with two "localhost > port" entries). In the "crawls" dir I also have two nutch-built > indexes. If I remember it correctly I had some trouble with symlink once but I don't exactly remember why.. maybe you can try without symlink.. I tried renaming the directory "crawls" to "crawl", then running the servers like so: ./nutch-0.8/bin/nutch server 8192 crawl/crawl1 & ./nutch-0.8/bin/nutch server 8193 crawl/crawl2 & > Now, I start nutch distributed search servers on each index and start > tomcat from the dir containing the "crawl" link. I get no results at > all. > If I change the link to point to "crawls/crawl1", the search works I am guessing the above is also a symlink.. hmm.. maybe it has something to do with distributed search and symlink.. no? It doesn't appear to be the problem. I tried without symlinks without success. I'm going to document the problem better today, so maybe that will help. I'm having trouble believing what I'm trying to achieve is so problematic...nevertheless, I appreciate your effort so far. t.n.a.
Re: 0.8 Intranet Crawl Output/Logging?
On 9/14/06, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote: Everyone, thanks for the help with this. I hope to return the assistance, once I am more familiar with 0.8. I am using tail -f now to monitor my test crawls. It also look like you can use conf/hadoop-env.sh to redirect log file output to a different location for each of your configurations. One follow up question: Now that I can actually see the log, I am finding some of the output rather annoying/noisy. Specially, I am referring to the Registered Plugins and Registered Extension-Points output. It's nice to see that once at crawl start, but not with every step of the crawl. So does any one know if I can disable that output? Here's the output to which I refer: 2006-09-14 14:03:42,852 INFO plugin.PluginRepository - Plugins: looking in: /var/nutch/nutch-0.8/plugins 2006-09-14 14:03:43,030 INFO plugin.PluginRepository - Plugin Auto-activation mode: [true] 2006-09-14 14:03:43,030 INFO plugin.PluginRepository - Registered Plugins: watch -n 1 "grep -v PluginRepository /home/wmelo/nutch-0.8/logs/hadoop.log | tail -n 20" t.n.a.
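Another option, instead of filtering the log after the fact: nutch 0.8 logs through log4j, so raising the threshold for the plugin repository's logger in conf/log4j.properties should hide the "Registered Plugins" INFO block on every step while keeping warnings and errors. This is a hedged sketch — the logger name below is inferred from the `plugin.PluginRepository` prefix in the log lines, so verify it against your build before relying on it.

```properties
# Assumed file: $NUTCH_HOME/conf/log4j.properties
# Show only WARN and above from the plugin repository, silencing the
# per-step "Plugins: looking in ..." / "Registered Plugins:" output.
log4j.logger.org.apache.nutch.plugin.PluginRepository=WARN
```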
Re: how to combine two run's result for search
On 9/14/06, Zaheed Haque <[EMAIL PROTECTED]> wrote: On 9/14/06, Tomi NA <[EMAIL PROTECTED]> wrote: > On 9/5/06, Zaheed Haque <[EMAIL PROTECTED]> wrote: > > Hi: > > I have a problem or two with the described procedure... > > > Assuming you have > > > > index 1 at /data/crawl1 > > index 2 at /data/crawl2 > > Used ./bin/nutch crawl urls -dir /home/myhome/crawls/mycrawldir to > generate an index: luke says the index is valid and I can query it > using luke's interface. > > Does the "searcher.dir" value in nutch-(default|site).xml have any > impact on the way indexes are created? No it doesn't have any impact on index creation. searcher.dir value is for searching only. nutch-site.xml is where you should change.. example... searcher.dir /home/myhome/crawls Path to root of index directories. This directory is searched (in order) for either the file search-servers.txt, containing a list of distributed search servers, or the directory "index" containing merged indexes, or the directory "segments" containing segment indexes. and the text file should be in this case ... /home/myhome/crawls/search-servers.txt Thats the way I set it up at first. This time, I started with a blank slate, unpacked nutch and tomcat, unpacked nutch-0.8.war into the webapps/ROOT and left the deployed app untouched. I then pointed the "crawl" symlink in the current dir to point to the "crawls" directory, where my search-servers.txt (with two "localhost port" entries). In the "crawls" dir I also have two nutch-built indexes. Now, I start nutch distributed search servers on each index and start tomcat from the dir containing the "crawl" link. I get no results at all. If I change the link to point to "crawls/crawl1", the search works i.e. I get a couple of results. What seems to be the problem is inserting the distributed search server between the index and tomcat. Nothing I do makes the least bit of difference. :\ t.n.a.
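For reference, the searcher.dir property Zaheed describes would look like this in webapps/ROOT/WEB-INF/classes/nutch-site.xml — the value is illustrative, taken from the example in this thread:

```xml
<property>
  <name>searcher.dir</name>
  <value>/home/myhome/crawls</value>
  <description>Path to root of index directories. Searched (in order) for
  the file search-servers.txt (list of distributed search servers), the
  directory "index" (merged indexes), or "segments" (segment indexes).
  </description>
</property>
```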
Re: how to combine two run's result for search
On 9/5/06, Zaheed Haque <[EMAIL PROTECTED]> wrote: Hi: I have a problem or two with the described procedure... Assuming you have index 1 at /data/crawl1 index 2 at /data/crawl2 Used ./bin/nutch crawl urls -dir /home/myhome/crawls/mycrawldir to generate an index: luke says the index is valid and I can query it using luke's interface. Does the "searcher.dir" value in nutch-(default|site).xml have any impact on the way indexes are created? In nutch-site.xml searcher.dir = /data This is the nutch-site.xml of the web UI? Under /data you have a text file called search-server.txt (I think do check nutch-site search.dir description please) /home/myhome/crawls/search-servers.txt In the text file you will have the following hostname1 portnumber hostname2 portnumber example localhost 1234 localhost 5678 I placed localhost 12567 (just one instance, to test) Then you need to start bin/nutch server 1234 /data/craw1 & and bin/nutch server 5678 /data/crawl2 & did that, using port 12567 ./bin/nutch server 12567 /home/mydir/crawls/mycrawldir & bin/nutch org.apache.nutch.search.NutchBean www you should see results :-) I get: Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/nutch/search/NutchBean Whats more, I get no results to any query I care to pass by the Web UI, which suggests the UI isn't connected to the underlying DistributedSearch server. :\ Any hints, anyone? TIA, t.n.a.
Re: 0.8 Intranet Crawl Output/Logging?
On 9/13/06, wmelo <[EMAIL PROTECTED]> wrote: I have the same original doubt. I know that the log shows information, but how do you see things happening in real time, like in nutch 0.7.2, when you use the crawl command in the terminal? try something like this (assuming you know what's good for you, so you use a *n*x): watch -n 1 "tail -n 20 /home/wmelo/nutch-0.8/logs/hadoop.log" Please adjust the path to the "logs" directory to match your environment and report back if there's a problem. Hope it helps. t.n.a.
Re: Windows Native Launching?
On 9/11/06, Jim Wilson <[EMAIL PROTECTED]> wrote: Dear Nutch User Community, Does anyone have a nutch.bat file to use in the bin directory? I find it bemusing that Java (cross-platform) was chosen as the development language, but the launcher is written in Bash. As much as I hate to say it, you're right: it doesn't make any sense to hobble such a great body of platform-independent code with a couple of short but vital *n*x-only scripts. ...even if we are talking about windows. Would it make sense to go Java all the way and use groovy or beanshell? My knowledge of these projects is rather superficial, but someone else might know more... t.n.a.
Re: Nutch-site.xml vs Nutch-default.xml
On 9/9/06, victor_emailbox <[EMAIL PROTECTED]> wrote: Hi all, I spent a lot of time to figure out why Nutch didn't respond to my configuration in nutch-site.xml. I set db.ignore.external.links to true. It didn't work. Then I realized that Nutch-default.xml also has same db.ignore.external.links but it was set to false. So I set it to true too, and it works. Isn't nutch-site.xml supposed to override the setting in Nutch-default.xml? Many thanks. I have no idea what's wrong with your nutch configuration, but yes, nutch-site overrides nutch-default. Maybe someone else has an explanation to offer. t.n.a.
Re: Fetching past Authentication
On 9/8/06, Jim Wilson <[EMAIL PROTECTED]> wrote: Dear Nutch User List, I am desperately trying to index an Intranet with the following characteristics 1) Some sites require no authentication - these already work great! 2) Some sites require basic HTTP Authentication. 3) Some sites require NTLM Authentication. 4) No sites require both HTTP and NTLM (only one or the other). 5) The same Username/Password should work on all sites which require either type of Authentication. 6) For sites requiring NTLM Authentication, the same Domain is always used. 7) If a site requires authentication, but the Username/Password mentioned above fails, the site doesn't matter and does not need fetched/indexed. My question is this: How can I provide a default Username/Password/Domain for Nutch to use when answering HTTP or NTLM challenges? (I really hope all I need is a couple of tags in my nutch-site.xml, but I'm beginning to doubt it). I love Nutch, and really want to use it. Please help if you know the answer. Thanks! I'm also very interested in hearing more on the topic. The only mention of a solution to (a part of) this problem I found is http://www.dehora.net/journal/2005/11/nutch_with_basic_authentication.html t.n.a.
Re: Recrawling (Tomi NA)
On 9/8/06, Andrzej Bialecki <[EMAIL PROTECTED]> wrote: Tomi NA wrote: > On 9/7/06, David Wallace <[EMAIL PROTECTED]> wrote: >> Just guessing, but could this be caused by session ids in the URL? Or >> some other unimportant piece of data? If this is the case, then every >> page would be added to the index when it's crawled, regardless of >> whether it's already in there, with a different session id. If this is >> what's causing your problem, then you need to use the regexp URL >> normaliser to strip out the session ids. > > Nice try but no luck, I'm afraid. > The complete web is absolutely static. The reason is that we've set up > IIS (I'm not too happy choosing IIS over apache) to serve files from a > shared directory on the same server, the rationale beeing that we'd > rather have http://-type links than file://. >> From what I've seen in the logs, I don't see URLs varying so I'm still > at square one. Still, thanks for the effort. If you have any other > ideas, I'm eager to hear them. The best way to discover what's going on is to start from a small subset of injected urls, and do the following: * inject * dump the db to a text file * generate / fetch / updatedb * dump the db again to a second text file * compare the files. I'll see if I'm able to reproduce those steps here, thanks. t.n.a.
Re: Indexing MS Powerpoint files with Lucene
On 9/8/06, Andrzej Bialecki <[EMAIL PROTECTED]> wrote: (moved to nutch-user) Tomi NA wrote: > On 9/7/06, Andrzej Bialecki <[EMAIL PROTECTED]> wrote: >> Tomi NA wrote: >> > On 9/7/06, Nick Burch <[EMAIL PROTECTED]> wrote: >> >> On Thu, 7 Sep 2006, Tomi NA wrote: >> >> > On 9/7/06, Venkateshprasanna <[EMAIL PROTECTED]> wrote: >> >> >> Is there any filter available for extracting text from MS >> >> Powerpoint files >> >> >> and indexing them? >> >> >> The lucene website suggests the POI project, which, it seems >> does not >> >> >> support PPT files as of now. >> >> > >> >> > http://jakarta.apache.org/poi/hslf/index.html >> >> > >> >> > It doesn't say poi doesn't support ppt. It just says support is >> >> limited. >> >> > Don't know exactly how limited, but certainly not useless for >> indexing >> >> > purposes. >> >> >> >> Support for editing and adding things to PowerPoint files is >> limited, as >> >> is getting out the finer points of fonts and positioning. >> > >> > Which brings me to another (off)topic: can lucene/nutch assign >> > different weights to tokens in the same document field? An obvious >> > example would be: "this text seems to be in large, bold, blinking >> > letters: I'll assume it's more important than the surrounding 8px >> > text." >> >> No, it can't (at least not yet). As a workaround you can extract these >> portions of text to another field (or multiple fields), and then add >> them with a higher boost. Then, expand your queries so that they include >> also this field. This way, if query matches these special tokens, >> results will get higher rank because of matching on this boosted field. > > I thought a workaround like that would be needed. Still, it could give > useful results...though as a nutch user, the possibility is mostly > theoretical for me, as probably none of the existing parsers take into > account the formatting information. I could be completely wrong here, > so please, feel free to correct me. 
You can write a HtmlParseFilter, which will extract these portions of text and put them into ParseData.metadata. Then, during indexing you can check if such metadata exists and if yes - add it as separate fields. You will need also to modify the QueryFilters, to expand user queries to also include clauses for these additional fields. Thanks Andrzej, I understand the concepts involved now. If the need arises, I'll see what I can do about making it work as intended. t.n.a.
Re: Recrawling (Tomi NA)
On 9/7/06, David Wallace <[EMAIL PROTECTED]> wrote: Just guessing, but could this be caused by session ids in the URL? Or some other unimportant piece of data? If this is the case, then every page would be added to the index when it's crawled, regardless of whether it's already in there, with a different session id. If this is what's causing your problem, then you need to use the regexp URL normaliser to strip out the session ids. Nice try but no luck, I'm afraid. The complete web is absolutely static. The reason is that we've set up IIS (I'm not too happy choosing IIS over apache) to serve files from a shared directory on the same server, the rationale being that we'd rather have http://-type links than file://. From what I've seen in the logs, I don't see URLs varying so I'm still at square one. Still, thanks for the effort. If you have any other ideas, I'm eager to hear them. t.n.a.
Re: parse url and file attributes only - no content
On 9/7/06, heack <[EMAIL PROTECTED]> wrote: I have the same problem as you. I think there should be a way to store a description for .mp3, .wmv or .avi files, one that could be searched. I believe the problem can't be solved by adding a new parse plugin to parse "all other (binary) filetypes": this additional parser would still get the complete (possibly very big) file from the remote host. At which level are the http.content.limit and file.content.limit taken into account? I'm thinking a new configuration setting (say, (http|file).unsupported.extensions) set to "mp3|iso|psd" etc. could guide the fetch algorithm so that it doesn't fetch the file contents for these files, but simply fetches information *about* the files in question. How does that sound? t.n.a.
Re: Recrawling
On 9/6/06, Andrei Hajdukewycz <[EMAIL PROTECTED]> wrote: Another problem I've noticed is that it seems the db grows *rapidly* with each successive recrawl. Mine started at 379MB, and it seems to increase by roughly 350MB every time I run a recrawl, despite there not being anywhere near that many additional pages. This seems like a pretty severe problem, honestly, obviously there's a lot of duplicated data in the segments. I have the same problem: my index grew from 1.5GB after the original crawl to over 5GB(!) after the recrawl...from the looks of it, I might as well crawl anew every time. :\ t.n.a.
parse url and file attributes only - no content
I'd like the user to be able to find "my three dogs.jpg" if he searches for "three dogs", even though nutch doesn't have a .jpg parser. What's more, I'd like the user to be able to search against any other extrinsic file attribute: date, file size, even mime type, all without reading a single bit of the actual file contents. Can nutch be configured so that it indexes these external file properties and completely skips file contents? I thought maybe I could adapt an existing parser (parse-text?) to do the job, but I guess I'd still be stuck with reading megabytes of unparsable data, just to fill in the url, type, date and similar attributes. I'd appreciate a comment or two. TIA, t.n.a.
Re: how to combine two run's result for search
On 9/6/06, Zaheed Haque <[EMAIL PROTECTED]> wrote: On 9/6/06, Tomi NA <[EMAIL PROTECTED]> wrote: > On 9/5/06, Zaheed Haque <[EMAIL PROTECTED]> wrote: > > Hi: > > > In the text file you will have the following > > > > hostname1 portnumber > > hostname2 portnumber > > > > example > > localhost 1234 > > localhost 5678 > > > > Does this work with nutch 0.7.2 or is it specific to the 0.8 release? I don't really know I have never tried 0.7. From the CVS it seems like it does http://svn.apache.org/viewvc/lucene/nutch/tags/release-0.7.2/conf/nutch-default.xml?revision=390479&view=markup but I don't know if the command structures are the same.. Just thought you might know of the top of your head, I'll go try it out. t.n.a.
Re: how to combine two run's result for search
On 9/5/06, Zaheed Haque <[EMAIL PROTECTED]> wrote: Hi: In the text file you will have the following hostname1 portnumber hostname2 portnumber example localhost 1234 localhost 5678 Does this work with nutch 0.7.2 or is it specific to the 0.8 release? t.n.a.
crawling frequently changing data on an intranet - how?
The task --- I have less than 100GB of diverse documents (.doc, .pdf, .ppt, .txt, .xls, etc.) to index. Dozens, or even hundreds and thousands of documents can change their content, be created or deleted every day. The crawler will run on a HP DL380 G4 server - don't know the exact specs yet. I'd like to keep the index no more than 20 minutes out of date (5-10 would be ideal). I'm currently sticking to nutch 0.7.2 because of crawl (especially fetch) speed considerations. Current idea --- From what I've read so far, nutch relies on the date a certain document was last crawled, rather than checking the live document's last modification date (a reasonable way to behave on the Internet, but it could do better on an intranet). That's why I can't simply run the wiki recrawl script and let it find the documents that changed since the last index. I'd therefore run a crawl overnight and use the produced index as a "main index". During the day, however, I can traverse the whole intranet web, see what's changed and crawl/index only the documents that have changed, building a second, "helper index". I'd set up the search application to use both of those indexes. Problems --- I don't know how to tell the search interface to use 2 separate indices. I'm really not sure how I'll make the search interface reload the "helper index" every 10 or 20 minutes. I'd welcome an opinion from anyone with more experience with nutch...which basically means anyone. :) TIA, t.n.a.
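The "see what's changed" step above could be sketched in shell, assuming the documents are reachable on a mounted filesystem: a stamp file records when the last helper crawl ran, and `find -newer` lists everything modified since, which can then seed the helper crawl's URL list. All paths here are made up for illustration, and the temp directory stands in for the mounted share.

```shell
#!/bin/sh
# Hypothetical sketch: collect files changed since the last helper crawl.
DOCS=$(mktemp -d)                     # stands in for the mounted share
STAMP="$DOCS/.last_helper_crawl"

touch "$STAMP"                        # first run: everything after this counts
sleep 1
echo "draft" > "$DOCS/changed.doc"    # simulate a document being edited

# Files modified after the stamp become candidates for the helper index
# (-name '*.doc' also keeps the stamp file itself out of the list).
find "$DOCS" -type f -newer "$STAMP" -name '*.doc' > changed_urls.txt
cat changed_urls.txt
touch "$STAMP"                        # reset for the next 10-20 minute cycle
```

In real use the `find` output would be rewritten into file:// or http:// URLs and fed to the incremental fetch; the open questions about merging and hot-reloading the helper index remain.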
Re: Does Nutch index images?
On 9/3/06, Sidney <[EMAIL PROTECTED]> wrote: Does nutch index images? If not or/and if so how can I go about creating a separate search category for searching for images like the major search engines have? If anyone can give any information on this I would be very grateful. You could go format by format, writing nutch plugins to access image metadata for .jpg, .gif, .png, .tiff etc. Don't know about writing a nutch plugin, but I don't think reading image metadata is too much of a problem in java. This might be a good place to start: http://schmidt.devlib.org/java/image-io-libraries.html t.n.a.
Re: Could anyone teache me how to index the title or content of PDF?
On 9/1/06, Frank Huang <[EMAIL PROTECTED]> wrote: But when I execute ./nutch crawl there show some messages like "fetch okay ,but can`t parse http://(omit...).pdf " reason:failed content truncated at 70709 bytes.Parse can`t handle incomplete pdf file. Haven't had time to go through the complete code (not sure I'd understand it, anyway), but this looks like you need to set file.content.limit to, say, 16777216. If you're crawling over http rather than intranet shares, the property you need to set is http.content.limit. Hope it helps. t.n.a.
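In nutch-site.xml that would look like the fragment below; the 16 MB value is just a suggestion, so pick a limit that covers your largest PDFs (and mind the memory cost of large limits, discussed elsewhere on this list). The description paraphrases the one in nutch-default.xml.

```xml
<property>
  <name>file.content.limit</name>
  <value>16777216</value>
  <description>Maximum number of bytes to read per file; content longer
  than this is truncated, which breaks the PDF parser. Use -1 to disable
  truncation. For http:// crawls, set http.content.limit instead.
  </description>
</property>
```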
Re: intranet crawl problems: mime types; .doc-related exceptions; really, really slow crawl + possible infinite loop
On 8/30/06, Chris Mattmann <[EMAIL PROTECTED]> wrote: Hi there Tomi, On 8/30/06 12:25 PM, "Tomi NA" <[EMAIL PROTECTED]> wrote: > I'm attempting to crawl a single samba mounted share. During testing, > I'm crawling like this: > > ./bin/nutch crawl urls -dir crawldir4 -depth 2 -topN 20 > > I'm using luke 0.6 to query and analyze the index. > > PROBLEMS > > 1.) search by file type doesn't work > I expected that a search "file type:pdf" would have returned a list of > files on the local filesystem, but it does not. I believe that the keyword is "type", so your query should be "type:pdf" (without the quotes). I'm not positive about this either, but I believe you have to give the fully qualified mimeType, as in "application/pdf". Not definitely sure about that though so you should experiment. I should have emphasized that the string I queried with is without the quotes. The "file" keyword was used because all the entries are accessible via "file://"-type links and so searching only for "file" would return all files. Filtering by type would then return all files of the given type. I tried the following query: url:file type:application/pdf but it seems I get the same set of hits regardless of what I use as type, so if I search for "url:file type:application/pdf" I get the same results as searching for "url:file type:whatever". Additionally, in order for the mimeTypes to be indexed properly, you need to have the index-more plugin enabled. Check your $NUTCH_HOME/conf/nutch-site.xml, and look for the property "plugin.includes" and make sure that the index-more plugin is enabled there. I listed my nutch-site settings at the end of my mail: the index-more plugin is enabled. > 2.) 
invalid nutch file type detection > I see the following in the hadoop.log: > --- > 2006-08-30 15:12:07,766 WARN parse.ParseUtil - Unable to successfully > parse content file:/mnt/bobdocs/acta.zip of type application/zip > 2006-08-30 15:12:07,766 WARN fetcher.Fetcher - Error parsing: > file:/mnt/bobdocs/acta.zip: failed(2,202): Content truncated at > 1024000 bytes. Parser can't handle incomplete pdf file. > --- > acta.zip is a .zip file, not a .pdf. Don't have any idea why this happens. This may result from the contentType returned by the web server for "acta.zip". Check the web server that the file is hosted on, and see what the server responds for the contentType for that file. Additionally, you may want to check if magic is enabled for mimeTypes. This allows the mimeType to be sensed through the use of hex codes compared with the beginning of each file. I have mime.type.magic set to true. The files I index are served via samba over the LAN rather then via a web server, so no, it's not a problem of contentType. > 3.) Why is the TextParser mapped to application/pdf and what has that > have to do with indexing a .txt file? > - > 2006-08-30 15:12:02,593 INFO fetcher.Fetcher - fetching > file:/mnt/bobdocs/popis-vg-procisceni.txt > 2006-08-30 15:12:02,916 WARN parse.ParserFactory - > ParserFactory:Plugin: org.apache.nutch.parse.text.TextParser mapped to > contentType application/pdf via parse-plugins.xml, but its plugin.xml > file does not claim to support contentType: application/pdf > - The TextParser * was * enabled as a last resort sort of means of extracting ... I understand, thanks. Still don't know what threw the pdf-parser off, though. > 4.) 
Some .doc files can't be indexed, although I can open them via > openoffice 2 with no problems > - > 2006-08-30 15:12:02,991 WARN parse.ParseUtil - Unable to successfully > parse content file:/mnt/bobdocs/cards2005.doc of type > application/msword > 2006-08-30 15:12:02,991 WARN fetcher.Fetcher - Error parsing: > file:/mnt/bobdocs/cards2005.doc: failed(2,0): Can't be handled as > micrsosoft document. java.lang.StringIndexOutOfBoundsException: String > in > dex out of range: -1024 > - What version of MS Word were you trying to index? I believe that the POI library used by the word parser can only handle certain versions of MS Word documents, although I'm not positive about this. Oh, so POI doesn't use the same technology OO.org uses to access MS Office created docs? That's a shame... :( So, does anyone know which Word versions does it support? As for 5 and 6 I'm not entirely sure about those problems. I wish you luck in solving both of them though, and hope what I said above helps you out. Thanks for the effort, Chris. I know a little more, but still have a long way to go. Does anyone else know anything about the unsolved problems I'm facing? t.n.a.
intranet crawl problems: mime types; .doc-related exceptions; really, really slow crawl + possible infinite loop
I'm attempting to crawl a single samba mounted share. During testing, I'm crawling like this: ./bin/nutch crawl urls -dir crawldir4 -depth 2 -topN 20 I'm using luke 0.6 to query and analyze the index. PROBLEMS 1.) search by file type doesn't work I expected that a search "file type:pdf" would have returned a list of files on the local filesystem, but it does not. 2.) invalid nutch file type detection I see the following in the hadoop.log: --- 2006-08-30 15:12:07,766 WARN parse.ParseUtil - Unable to successfully parse content file:/mnt/bobdocs/acta.zip of type application/zip 2006-08-30 15:12:07,766 WARN fetcher.Fetcher - Error parsing: file:/mnt/bobdocs/acta.zip: failed(2,202): Content truncated at 1024000 bytes. Parser can't handle incomplete pdf file. --- acta.zip is a .zip file, not a .pdf. Don't have any idea why this happens. 3.) Why is the TextParser mapped to application/pdf and what has that have to do with indexing a .txt file? - 2006-08-30 15:12:02,593 INFO fetcher.Fetcher - fetching file:/mnt/bobdocs/popis-vg-procisceni.txt 2006-08-30 15:12:02,916 WARN parse.ParserFactory - ParserFactory:Plugin: org.apache.nutch.parse.text.TextParser mapped to contentType application/pdf via parse-plugins.xml, but its plugin.xml file does not claim to support contentType: application/pdf - 4.) Some .doc files can't be indexed, although I can open them via openoffice 2 with no problems - 2006-08-30 15:12:02,991 WARN parse.ParseUtil - Unable to successfully parse content file:/mnt/bobdocs/cards2005.doc of type application/msword 2006-08-30 15:12:02,991 WARN fetcher.Fetcher - Error parsing: file:/mnt/bobdocs/cards2005.doc: failed(2,0): Can't be handled as micrsosoft document. java.lang.StringIndexOutOfBoundsException: String in dex out of range: -1024 - 5.) 
MoreIndexingFilter doesn't seem to work The relevant part of the hadoop.log file: - 2006-08-30 15:13:40,235 WARN more.MoreIndexingFilter - file:/mnt/bobdocs/EU2007-2013.pdforg.apache.nutch.util.mime.MimeTypeException: The type can not be null or empty - This happens with other file types, as well: - 2006-08-30 15:13:54,697 WARN more.MoreIndexingFilter - file:/mnt/bobdocs/popis-vg-procisceni.txtorg.apache.nutch.util.mime.MimeTypeException: The type can not be null or empty - 6.) At the moment, I'm crawling the same directory (/mnt/bobdocs), the crawl process seems to be stuck in an infinite loop and I have no way of knowing what's going on as the .log isn't flushed until the process finishes. ENVIRONMENT My (relevant) crawl settings are: -
db.max.anchor.length = 511
db.max.outlinks.per.page = -1
fetcher.server.delay = 0
fetcher.threads.fetch = 5
fetcher.verbose = true
file.content.limit = 10240
parser.character.encoding.default = iso8859-2
indexer.max.title.length = 511
indexer.mergeFactor = 5
indexer.minMergeDocs = 5
plugin.includes = nutch-extensionpoints|protocol-(file|http)|urlfilter-regex|parse-(text|html|msword|pdf|mspowerpoint|msexcel|rtf|js)|index-(basic|more)|query-(basic|site|url|more)|summary-basic|scoring-opic
searcher.max.hits = 100
- MISC. SUGGESTIONS Add the following configuration options to the nutch-*.xml files: * allow search by date or extension (with no other criteria) * always flush the log to disk (at every log addition). TIA, t.n.a.
file access rights/permissions considerations - the least painful way
I'm interested in crawling multiple shared folders (among other things) on a corporate LAN. It is a LAN of MS clients with Active Directory managed accounts. The users routinely access the files based on ntfs-level (and sharing?) permissions. Idealy, I'd like to set up a central server (probably linux, but any *n*x would do) where I'd mount all the shared folders. I'd then set up apache so that the files are accessible via http and, more importantly, webdav. I imagine apache could use mod_dav, mod_auth and possibly one or two other modules to regulate access priviledges - I could very well be completely wrong here. Finally, I'd like to set up nutch to crawl the shared documents through the web server, so that the stored links are valid in the whole LAN. Nutch would therefore require absolute access to all documents, but the documents would be served via a web server who checks user identities and access rights. Nutch users who've tackled the access rights problem themselves would save me a world of time, effort and trouble with a couple of pointers on how to go about the whole security issue. If the setup I described is the worst possible way to go about it, I'd appreciate a notice saying so and elaborating why. :) TIA, t.n.a.
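A minimal sketch of the Apache side of that setup, with made-up paths and basic authentication standing in for whatever provider you end up with (mapping Active Directory accounts onto the password store, e.g. via an LDAP auth module, is the real open question and is not shown here):

```apacheconf
# Assumed layout: shares mounted under /mnt/shares, served as /docs.
Alias /docs /mnt/shares
<Directory /mnt/shares>
    Options Indexes
    AuthType Basic
    AuthName "Intranet documents"
    AuthUserFile /etc/apache2/htpasswd   # or an LDAP/AD-backed provider
    Require valid-user
</Directory>
```

Nutch would then crawl with a credential that can read everything, while end users clicking result links get challenged per the config above.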
Re: How do I write a nutch query.
On 8/8/06, Björn Wilmsmann <[EMAIL PROTECTED]> wrote: -BEGIN PGP SIGNED MESSAGE- Hash: SHA1 Hey, I have run into the same problem, too. Sometimes nutch won't return results for queries although there clearly are pages containing the search term. I agree that this must have something to do with Nutch scoring however I have not yet found out how to change this behaviour I ran into the same problem but I believe it has something to do with the analyzer (probably StandardAnalyzer, I don't really know what Nutch uses by default, yet), plugins for those files or something along those lines. As far as grading is concerned, wouldn't a grading problem change the result order, rather than skip certain results altogether? t.n.a.
max file size vs. available RAM size: crawl uses up all available memory
I am trying to crawl/index a shared folder in the office LAN: that means a lot of .zip files, a lot of big .pdfs (>5 MB) etc. I sacrificed performance for memory effectiveness where I found the tradeoff ("indexer.mergeFactor" = 5, "indexer.minMergeDocs" = 5), but the crawl process breaks if I set "file.content.limit" to, say, 10 MB, even though I'm testing on a 1GB RAM machine. To be fair, some 300-400 MB are already taken by misc programs, but still... I invoke nutch like so: ./bin/nutch crawl -local urldir -dir crawldir -depth 20 -topN 1000 What I'd like to know is: 1) where does all the memory go? 2) how can I reduce the peak memory requirements? To reiterate, I'm just testing at the moment, but I need to index documents at any tree depth and any document smaller than, say, 10-20MB, and I hope I don't need 5+GB of RAM to do it. TIA, t.n.a.
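On (2): the JVM's default heap cap is usually the first wall you hit, well before physical RAM. The 0.8 bin/nutch launcher reads a NUTCH_HEAPSIZE environment variable (in megabytes) to build its -Xmx flag; whether 0.7.2's launcher honors the same variable is something to verify in your copy of the script. The snippet below just mimics that mapping:

```shell
#!/bin/sh
# Sketch of how bin/nutch-style launchers turn NUTCH_HEAPSIZE (MB) into -Xmx.
NUTCH_HEAPSIZE=512                       # e.g. export NUTCH_HEAPSIZE=512
JAVA_HEAP_MAX="-Xmx${NUTCH_HEAPSIZE}m"
echo "$JAVA_HEAP_MAX"
# Then launch as usual:
#   ./bin/nutch crawl -local urldir -dir crawldir -depth 20 -topN 1000
```

If the crawl still dies with the heap raised, the per-document buffering implied by file.content.limit is the next suspect.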
Re: nutch 0.8 and luke
On 7/29/06, Tomi NA <[EMAIL PROTECTED]> wrote: On 7/29/06, Sami Siren <[EMAIL PROTECTED]> wrote: > Not expert on this area but perhaps you need to upgrade lucene .jar > files that are used by luke? I believe I was a little bit hasty with the message I sent. I took a second look and it just might be that luke was right and the index is invalid - I'm going to check it out come Monday. Thanks for the reply. t.n.a. Got it all sorted out: I didn't set a valid value for http.agent.name. What made me sound the alarm was the fact that I got some kind of crawl result (about 2.5 MB in size). Anyway, I did a successful crawl on my laptop, crawling through a directory exposed via an http server. The problems I ran into there are new thread material. Thanks again for the effort. t.n.a.
Re: nutch 0.8 and luke
On 7/29/06, Sami Siren <[EMAIL PROTECTED]> wrote: Not an expert in this area, but perhaps you need to upgrade the lucene .jar files that are used by luke? I believe I was a little bit hasty with the message I sent. I took a second look and it just might be that luke was right and the index is invalid - I'm going to check it out come Monday. Thanks for the reply. t.n.a.
nutch 0.8 and luke
I successfully used luke with indexes created with nutch 0.7.2. I tried the same with nutch 0.8, but luke sees it as a corrupt index. Should this be happening? I know this isn't the luke mailing list, but the information will still be useful to people using nutch. Thanks, t.n.a.
Re: missing, but declared functionality
Sorry for the long silence and thanks for the help. I've found the plugins you mentioned and set up nutch to use them. The result is somewhat confusing, though. For one thing, my date: and type: queries still returned no results. Weirder still, using luke to inspect the index contents, I saw the new fields, luke would display the top ranking terms for both the "date" and "type" fields, and a search like "date:20051030" would yield dozens of results, but the "string value" of the "date" and "type" fields was not available, even though I found the documents in question using that exact field as a key. I'll see what I come up with using 0.8, as I need the .xls and .zip support anyway. t.n.a. On 7/20/06, Teruhiko Kurosaka <[EMAIL PROTECTED]> wrote: You'd have to enable the index-more and query-more plugins, I believe. > -Original Message- > From: Tomi NA [mailto:[EMAIL PROTECTED] > Sent: 2006-7-19 10:01 > To: nutch-user@lucene.apache.org > Subject: missing, but declared functionality > > These kinds of queries return no results: > > date:19980101-20061231 > type:pdf > type:application/pdf > > From the release changes documents (0.7-0.7.2), I assumed > these would work. > Upon index inspection (using the luke tool), I see there are no fields > marked "date" or "type" (although I gather this is interpreted as > url:*.pdf). The fields I have are: > anchor > boost > content > digest > docNo > host > segment > site > title > url > > I ran the index process with very little special configuration: some > filetype filtering and the like. > Am I missing something? > The files are served over a samba share: I plan to serve them through > a web server because of security implications of using the file:// > protocol. Can the creation and last modification date be retrieved > over http:// at all? > > TIA, > t.n.a. >
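For reference, enabling the plugins Teruhiko mentions is done through the plugin.includes property in conf/nutch-site.xml. The value below is only a sketch: the exact baseline regex differs between Nutch versions, so start from the default in your own nutch-default.xml and add index-more and query-more to it:

```xml
<!-- Sketch: extend the default plugin list with index-more and
     query-more, which add the "date" and "type" index fields and the
     matching date:/type: query filters. Check nutch-default.xml for
     your version's actual default list before copying this. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html|pdf)|index-(basic|more)|query-(basic|site|url|more)</value>
</property>
```

The index-more plugin only records what it can see at fetch time, which also bears on the last question in the quoted message: over http:// the date comes from the Last-Modified response header, so if the web server doesn't send one, there is no date for the plugin to index.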
missing, but declared functionality
These kinds of queries return no results: date:19980101-20061231 type:pdf type:application/pdf From the release changes documents (0.7-0.7.2), I assumed these would work. Upon index inspection (using the luke tool), I see there are no fields marked "date" or "type" (although I gather this is interpreted as url:*.pdf). The fields I have are: anchor boost content digest docNo host segment site title url I ran the index process with very little special configuration: some filetype filtering and the like. Am I missing something? The files are served over a samba share: I plan to serve them through a web server because of security implications of using the file:// protocol. Can the creation and last modification date be retrieved over http:// at all? TIA, t.n.a.