Re: Improvement of Nutch 0.7.2
I have created a 0.7.3 label in JIRA and I am willing to commit useful patches in this branch. I do not have time to develop new code myself (and if I did, I would rather spend it on trunk). So if you have anything to submit I would be willing to commit it. Regards Piotr On 2/12/07, carmmello [EMAIL PROTECTED] wrote: About 3 or 4 months ago, there was some discussion and, seemingly, a consensus that Nutch 0.7.2 was a better choice for those using a single computer and aiming for indexed sites of just a few thousand or, at most, a few million pages. There were plans to develop a new version, Nutch 0.7.3. Since then, I have not heard anything more about the subject. Does anyone know something about this? Thanks Carmmello
Re: 0.7.3 version
As no objections were raised I created a 0.7.3 version in JIRA so we can start assigning current JIRA issues to it. Regards Piotr Piotr Kosiorowski wrote: Hello committers, Based on a recent discussion on the nutch user list (Strategic Direction of Nutch) I would like to prepare a 0.7.3 release. The idea is to allow people who still use 0.7.2 to get rid of the most important bugs and to add some small features they need, as the claim is that 0.8.1 is not good for small crawls at the moment. It will also allow us to work on the 0.8 branch to make it more friendly to small installations. I would like to approach it this way: if no one objects I will create a 0.7.3 release in JIRA and ask people to assign issues with patches to it. I do not have a lot of time personally so I do not plan to do any development myself - just taking care of high quality patches and committing them. After some time, when we have gathered some amount of bugfixes/issues, I will prepare the 0.7.3 release. Any objections/comments? Regards Piotr
Re: Strategic Direction of Nutch
We should use JIRA for it, as Arun said. I will send a separate email to committers with a better title to get acceptance for preparation of the 0.7.3 version and create a 0.7.3 version in JIRA so you can assign issues to it. Regards Piotr On 11/16/06, Arun Kaundal [EMAIL PROTECTED] wrote: Hi Nitin, As far as I understand, the following answers the things you are looking for: On 11/16/06, Nitin Borwankar [EMAIL PROTECTED] wrote: Andrzej Bialecki wrote: Nitin Borwankar wrote: Hi all, First an intro. I am another Nutch newbie and am finding 0.7.2 to be quite an effective single machine crawler. [..] The ability to keep db formats compatible would be nice to allow reuse of existing results but is not necessary. That's probably not going to happen - each branch has specific requirements for the db and segment formats, which are incompatible. However, given enough interest we could implement converters, even bi-directional. As a potential developer I would like to volunteer for the ongoing maintenance and evolution of 0.7.2 as an effective single machine crawler. That's excellent! I imagine the procedure to get you involved would be something like this: * start collecting issues related to maintenance, bugfixes or improvements of that branch, What is the mechanism for this collection process - do we create a separate email list, a separate alias ... or does everyone just send me email (this may get messy fast)? Well, once you create your JIRA user, you can choose which projects you want to contribute to. You can filter requests to see all the issues (bug fixes, improvements, patches, etc.), or else you can browse the Nutch project to see its issues. There you can choose version 0.7.2. * create JIRA issues, plus start collecting patches, tested and ready for committing. One of the existing developers will commit them on your behalf. Sounds good. * after a while we would consider giving you committer rights so that you could work directly with the code. Fair enough. Do we take this offline for further thrashing out? Or continue here? Nitin Borwankar Consider this a proposal to maintain two separate versions by continuing bug fix versions of 0.7 until one of two things happens: a) 0.8 evolves to something satisfactory for use also as a single machine search engine and everyone is happy moving to it; b) a critical mass of developers steps forward to support the ongoing development of 0.7.2 into, say, Nutch-lite, always and only meant for single machine use. I do hope that option a) becomes a reality sooner rather than later. But if there is sufficient interest (and enough developers) in developing the 0.7 branch, then go for it - keeping in mind, though, that eventually these code bases will diverge so much that maintaining them will require two mostly separate teams ... Well, I have also been working with nutch 0.7.2 occasionally for an in-house product for the last year. As nutch 0.8.x is heading in a different direction, we want to continue to enhance and upgrade the functionality of 0.7.2. I would like to offer my help in this regard and want to work with you to improve it. My JIRA id is: sharma_arun_se. I have created an issue for you. Keep it up!!!
Fwd: 0.7.3 version
Hello, I am forwarding an email sent to committers so nutch users are also aware of this initiative. Regards, Piotr -- Forwarded message -- From: Piotr Kosiorowski [EMAIL PROTECTED] Date: Nov 16, 2006 10:09 PM Subject: 0.7.3 version To: nutch-dev@lucene.apache.org Hello committers, Based on a recent discussion on the nutch user list (Strategic Direction of Nutch) I would like to prepare a 0.7.3 release. The idea is to allow people who still use 0.7.2 to get rid of the most important bugs and to add some small features they need, as the claim is that 0.8.1 is not good for small crawls at the moment. It will also allow us to work on the 0.8 branch to make it more friendly to small installations. I would like to approach it this way: if no one objects I will create a 0.7.3 release in JIRA and ask people to assign issues with patches to it. I do not have a lot of time personally so I do not plan to do any development myself - just taking care of high quality patches and committing them. After some time, when we have gathered some amount of bugfixes/issues, I will prepare the 0.7.3 release. Any objections/comments? Regards Piotr
Re: Strategic Direction of Nutch
I agree with Andrzej. For my part, if someone takes the effort of preparing and testing patches, I as a committer (not a very active one recently) may focus on 0.7.2 issues and commit the patches, and in the future prepare the 0.7.3 release. Regards, Piotr On 11/15/06, Andrzej Bialecki [EMAIL PROTECTED] wrote: Nitin Borwankar wrote: Hi all, First an intro. I am another Nutch newbie and am finding 0.7.2 to be quite an effective single machine crawler. [..] The ability to keep db formats compatible would be nice to allow reuse of existing results but is not necessary. That's probably not going to happen - each branch has specific requirements for the db and segment formats, which are incompatible. However, given enough interest we could implement converters, even bi-directional. As a potential developer I would like to volunteer for the ongoing maintenance and evolution of 0.7.2 as an effective single machine crawler. That's excellent! I imagine the procedure to get you involved would be something like this: * start collecting issues related to maintenance, bugfixes or improvements of that branch, * create JIRA issues, plus start collecting patches, tested and ready for committing. One of the existing developers will commit them on your behalf. * after a while we would consider giving you committer rights so that you could work directly with the code. Consider this a proposal to maintain two separate versions by continuing bug fix versions of 0.7 until one of two things happens: a) 0.8 evolves to something satisfactory for use also as a single machine search engine and everyone is happy moving to it; b) a critical mass of developers steps forward to support the ongoing development of 0.7.2 into, say, Nutch-lite, always and only meant for single machine use. I do hope that option a) becomes a reality sooner rather than later. But if there is sufficient interest (and enough developers) in developing the 0.7 branch, then go for it - keeping in mind, though, that eventually these code bases will diverge so much that maintaining them will require two mostly separate teams ... -- Best regards, Andrzej Bialecki (Information Retrieval, Semantic Web; Embedded Unix, System Integration) http://www.sigram.com Contact: info at sigram dot com
Re: Strategic Direction of Nutch
Anthony, I do not think nutch can forget about small implementations. It was one of its strong points and I do think we will want to support them. For any issues please report them in JIRA and I am sure they will be taken care of. Regards Piotr On 11/12/06, Anthony May [EMAIL PROTECTED] wrote: Greetings all, I have just been handed the administration of our nutch implementation. We are currently using nutch 0.7 and it very badly needs updating. However, we are evaluating several options, and I wanted to know where nutch is going as a project. I have not been able to find anything in the wiki or in the mailing list archives with this information (forgive me if I have missed it). The central issue is that our needs are for crawling our own website, with about 200,000 pages and documents, on a single machine running nutch - not for crawling the web with a massively scalable architecture. I have heard nutch is moving towards the latter and that the former usage has become very slow in 0.8 compared to 0.7 - is this correct? Thank you for helping me out. Regards, Anthony May Web Developer NZQA
Re: details: stackoverflow error
Hello Rajesh, I have run bin/nutch crawl urls -dir crawl.test -depth 3 on a standard nutch-0.7.2 setup. The urls file contains http://www.math.psu.edu/MathLists/Contents.html only. In crawl-urlfilter I have changed the url pattern to: # accept hosts in MY.DOMAIN.NAME +^http:// JVM: java version 1.4.2_06, Linux. It runs without problems. Please reinstall from the distribution, make only the required changes, and retest. If it still fails we will try to track it down. Regards Piotr Rajesh Munavalli wrote: Forgot to mention one more parameter. Modify the crawl-urlfilter to accept any URL. On 4/6/06, Rajesh Munavalli [EMAIL PROTECTED] wrote: Java version: JSDK 1.4.2_08 URL Seed: http://www.math.psu.edu/MathLists/Contents.html I even tried allocating more stack memory using the -Xss option and more process memory with -Xms. However, if I run the individual tools (fetchlisttool, fetcher, updatedb, etc.) separately from the shell, it works fine. Thanks, --Rajesh On 4/6/06, Piotr Kosiorowski [EMAIL PROTECTED] wrote: Which Java version do you use? Is it the same for all urls or only for a specific one? If the URL you are trying to crawl is public you can send it to me (off list if you wish) and I can check it on my machine. Regards Piotr Rajesh Munavalli wrote: I had earlier posted this message to the list but haven't gotten any response. Here are more details. Nutch version: nutch-0.7.2 URL File: contains a single URL. File name: urls Crawl-url-filter: is set to grab all URLs Command: bin/nutch crawl urls -dir crawl.test -depth 3 Error: java.lang.StackOverflowError The error occurs while it executes the UpdateDatabaseTool. One solution I can think of is to provide more stack memory. But is there a better solution to this? Thanks, Rajesh
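For reference, the reproduction described above boils down to the following shell session (a sketch assuming a standard nutch-0.7.2 install; the -Xss value is illustrative):

  # seed with the single URL from the report
  echo 'http://www.math.psu.edu/MathLists/Contents.html' > urls
  # in conf/crawl-urlfilter.txt, relax the accept pattern to:
  #   +^http://
  bin/nutch crawl urls -dir crawl.test -depth 3
  # if StackOverflowError persists, try a larger per-thread stack, e.g. by adding
  # -Xss2m to the java invocation in the bin/nutch script (a plain JVM flag, not a nutch option)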
Re: details: stackoverflow error
Which Java version do you use? Is it the same for all urls or only for a specific one? If the URL you are trying to crawl is public you can send it to me (off list if you wish) and I can check it on my machine. Regards Piotr Rajesh Munavalli wrote: I had earlier posted this message to the list but haven't gotten any response. Here are more details. Nutch version: nutch-0.7.2 URL File: contains a single URL. File name: urls Crawl-url-filter: is set to grab all URLs Command: bin/nutch crawl urls -dir crawl.test -depth 3 Error: java.lang.StackOverflowError The error occurs while it executes the UpdateDatabaseTool. One solution I can think of is to provide more stack memory. But is there a better solution to this? Thanks, Rajesh
Re: Nutch 0.7.2 release
Yes. The correct link is http://svn.apache.org/viewcvs.cgi/lucene/nutch/branches/branch-0.7/CHANGES.txt?rev=390158 It was used on the Web site but I made a mistake while pasting it into the email (I used the one for the 0.7.1 release). Thanks for spotting it. Regards Piotr On 4/1/06, TDLN [EMAIL PROTECTED] wrote: Is this the correct revision of the release notes? http://svn.apache.org/viewcvs.cgi/lucene/nutch/branches/branch-0.7/CHANGES.txt?rev=390158 Rgrds, Thomas On 4/1/06, TDLN [EMAIL PROTECTED] wrote: Yes! This is great news, thank you so much. By the way: in the revision of the release notes that you posted (292986), the changes for 0.7.2 are missing. Rgrds, Thomas On 4/1/06, Piotr Kosiorowski [EMAIL PROTECTED] wrote: Hello all, The 0.7.2 release of Nutch is now available. This is a bug fix release for the 0.7 branch. See CHANGES.txt ( http://svn.apache.org/viewcvs.cgi/lucene/nutch/branches/branch-0.7/CHANGES.txt?rev=292986 ) for details. The release is available on http://lucene.apache.org/nutch/release/ . Regards, Piotr
Re: Nutch 0.7.2 release | upgrading from 0.7.1?
The 0.7.2 release should work without problems with 0.7.1 data. Regards Piotr On 4/2/06, Håvard W. Kongsgård [EMAIL PROTECTED] wrote: What about upgrading from 0.7.1? Can I use my existing db and segments? Piotr Kosiorowski wrote: Hello all, The 0.7.2 release of Nutch is now available. This is a bug fix release for 0.7 branch. See CHANGES.txt ( http://svn.apache.org/viewcvs.cgi/lucene/nutch/branches/branch-0.7/CHANGES.txt?rev=292986 ) for details. The release is available on http://lucene.apache.org/nutch/release/. Regards, Piotr
Nutch 0.7.2 release
Hello all, The 0.7.2 release of Nutch is now available. This is a bug fix release for 0.7 branch. See CHANGES.txt (http://svn.apache.org/viewcvs.cgi/lucene/nutch/branches/branch-0.7/CHANGES.txt?rev=292986) for details. The release is available on http://lucene.apache.org/nutch/release/. Regards, Piotr
Re: A possible error in the tutorial
Thanks. Fixed in SVN. Will be deployed on the Web site with the 0.7.2 release. fabrizio silvestri wrote: Hi guys.. Just a quick question about the tutorial: the line bin/nutch org.apache.nutch.crawl.DmozParser content.rdf.u8 -subset 5000 dmoz/urls - shouldn't it be bin/nutch org.apache.nutch.tools.DmozParser content.rdf.u8 -subset 5000 dmoz/urls ? -- Dr. Fabrizio SILVESTRI High Performance Computing Laboratory Information Science and Technologies Institute (ISTI), Italian National Research Council (CNR) Via G. Moruzzi, 1, 56126 Pisa, Italy Phone: +39 050 315-3011 (Direct) Mobile: +39 328-9552152 FAX: +39 050 3138091 (G3) WWW: http://miles.isti.cnr.it/~silvestr
Re: Getting java.io.IOException: Couldn't rename \tmp\nutch\mapred\local\map_n68li2\part-0.out with Nutch 0.8
As I stated in a recent email on a similar subject - disable your antivirus software if you have one. I have seen many cases where AV was keeping a file locked on Windows. Regards Piotr On 1/6/06, Arun Kaundal [EMAIL PROTECTED] wrote: Anybody plz reply, I am waiting for it. On 1/6/06, Arun Kaundal [EMAIL PROTECTED] wrote: No, I don't have any explorer or file opened. I have been facing this error for the last two days with no good response. Anyone there? On 1/5/06, Stefan Groschupf [EMAIL PROTECTED] wrote: Windows: Do you still have explorer or any file open located in this directory? Linux: Did you maybe start nutch once as root and now as another user? Stefan Am 05.01.2006 um 08:08 schrieb Arun Kumar Sharma: Hi, I am facing this error while running Nutch on the local file system.
java.io.IOException: Couldn't rename \tmp\nutch\mapred\local\map_n68li2\part-0.out
at org.apache.nutch.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:81)
060105 122537 map 100%
java.io.IOException: Job failed!
at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:308)
at org.apache.nutch.crawl.CrawlDb.update(CrawlDb.java:52)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:115)
Exception in thread main
Regards, Arun Kumar Sharma (Tech Lead - Java/J2EE) Mob: +91.981.529.5761
Re: java.io.IOException: already exists
It looks like the majority of people who get it run it on Windows - is it the same in your case? Maybe some kind of antivirus software is preventing the folder from being deleted? Regards Piotr On 1/4/06, Nguyen Ngoc Giang [EMAIL PROTECTED] wrote: Hi all, I'd like to bring back this topic, which has been ignored several times on the Nutch mailing list as well as in JIRA ( http://issues.apache.org/jira/browse/NUTCH-94, http://issues.apache.org/jira/browse/NUTCH-96, http://issues.apache.org/jira/browse/NUTCH-117 ). Here is my error stack:
060104 110314 Finishing update
060104 110314 Processing pagesByURL: Sorted 11 instructions in 0.016 seconds.
060104 110314 Processing pagesByURL: Sorted 687.5 instructions/second
java.io.IOException: already exists: C:\tomcat\webapps\ROOT\data\db\webdb.new\pagesByURL
at org.apache.nutch.io.MapFile$Writer.init(MapFile.java:86)
at org.apache.nutch.db.WebDBWriter$CloseProcessor.closeDown(WebDBWriter.java:549)
at org.apache.nutch.db.WebDBWriter.close(WebDBWriter.java:1544)
at org.apache.nutch.tools.UpdateDatabaseTool.close(UpdateDatabaseTool.java:375)
This error happens not only at update time, but also at fetchlist time. And the weird thing is that it happens nondeterministically. I debugged around and it seems the problem is that some CloseProcessors didn't terminate correctly, leaving webdb.new undeletable. I then tried reducing to only 1 thread, with a lightweight load (as suggested in the JIRA discussion), but it doesn't help. Yet when I run step by step in the debugging mode of the IDE, there is no problem. Can anyone help me figure out this issue? Thanks very much. Regards, Giang
Re: build instructions?
It is a known bug in the 0.7.1 distribution. You can get the sources directly from svn and they build fine. It is also fixed in preparation for the 0.7.2 release and in trunk. Or you can fix it locally by creating an empty src/java folder. I am not sure if it is the only empty folder missing in the nutch-extensionpoints folder, but there should not be many of them. Regards Piotr Teruhiko Kurosaka wrote: Where can I find the build instructions for Nutch? Just typing ant ended with an error complaining that there is no such directory as ...\src\plugin\nutch-extensionpoints\src\java This is the Nutch 0.7.1 download and I'm trying to build on Windows XP Professional with Cygwin and JDK 1.5. (I tried JDK 1.4.1 but I saw the same failure.) -Kuro
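In shell terms the local fix is (path taken from the error message above; any other missing empty folders, if they exist, can be created the same way):

  mkdir -p src/plugin/nutch-extensionpoints/src/java
  ant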
Re: try to restart aborted crawl
Hi, I had the same problems with JVM crashes and it was in fact a hardware problem (memory). It can also be a problem with your software config (but as far as I remember you are using a quite standard configuration). I doubt it has anything to do with nutch (except that nutch stresses the JVM/whole box, so the problem shows up more easily than during normal system usage). Regards, Piotr wmelo wrote: The biggest problem is not to restart the crawl, but the problem that led to the failure itself, more precisely: Exception in thread main java.io.IOException: key out of order: http://web.mit.edu/is/about/index.html after http://web.mit.edu/is/?ut/index.html; This kind of problem occurs, for me, almost all the time (together with another that says that there is some problem with Java HotSpot), preventing me from really using Nutch. I have reported those two problems before, without any answer. I don't know whether this is a bug in Nutch (or in Lucene, I don't have any idea). The only thing I know is that both issues are very big non-conformities that should be corrected as soon as possible. Wmelo
Re: Nutch returns irrelevant site
You can use the explain page to find out why this page is scored the way it is. I would expect anchor text to be the main component of it. Regards Piotr Aled Jones wrote: Hi I'm currently setting up a nutch search engine that searches travel websites. It works quite well but sometimes returns odd results. One good example: One of the 100 or so sites I've asked it to crawl is http://www.hfholidays.co.uk/ . This site is mainly about walking holidays and has many pages with the word walking in it, so when I type walking into nutch I'd expect it to turn up; however, the first result I get back from using the keyword walking is http://www.hfholidays.co.uk/email.asp . This page doesn't have the word walking in it anywhere. Could someone please explain if this is a bug or the way nutch works. I've got an idea of how google works; if nutch works in a similar fashion, does this page appear because it is linked from many pages with the word walking in them? Thanks Aled
Re: Why Nutch 0.7.1 Does Not Compile???
I compiled it for the release - linux, jdk 1.4.2, ant 1.6.2. No problems - otherwise I would be unable to release it :). Please report your environment for such problems. Regards Piotr, Victor Lee wrote: Ok, it's weird now. If I use the command ant jar, it builds successfully. If I use ant tar, it has tons of errors. Maybe something is broken in package, war, or javadoc? Victor Lee [EMAIL PROTECTED] wrote: Hi, I extracted a fresh copy of the original nutch 0.7.1 and tried to compile it, but it doesn't compile at all. Why? Why can't an official release even compile? I ran ant tar in the outermost nutch directory where the build.xml is located. Please let me know what I did wrong. There are a whole bunch of errors besides warnings. Has anyone tried to compile it before? Please let me know what you did to make it compile. Many thanks.
Re: Using FetchListEntry -dumpurls
Hi, I think this is the reason: Exception in thread main java.lang.NoClassDefFoundError: net/nutch/pagedb/FetchListEntry In the 0.7 branch all classes were moved to the org.apache.nutch package structure and the scripts were updated, so you are probably using an old script with the new release. Regards Piotr Bryan Woliner wrote: Hi, I am trying to dump all of the URLs from a segment to a text file. I was able to do this successfully under Nutch 0.6 but am not able to do so under 0.7.1. Please take a look at the line below and let me know if you can figure out why I'm getting an error. Perhaps it is due to the change from version 0.6 to 0.7.1, or maybe I just have the wrong syntax. Note: the segments/20051107233629 directory is a valid segments directory, as is evidenced by the ls statement below.
-bash-2.05b$ bin/nutch net.nutch.pagedb.FetchListEntry -dumpurls segments/20051107233629 foo.txt
Exception in thread main java.lang.NoClassDefFoundError: net/nutch/pagedb/FetchListEntry
-bash-2.05b$ ls -la segments/20051107233629
total 8
drwxr-xr-x 8 bryan bryan 1024 Nov 7 23:36 .
drwxr-xr-x 3 bryan bryan 1024 Nov 7 23:36 ..
drwxr-xr-x 2 bryan bryan 1024 Nov 7 23:36 content
drwxr-xr-x 2 bryan bryan 1024 Nov 7 23:36 fetcher
drwxr-xr-x 2 bryan bryan 1024 Nov 7 23:36 fetchlist
drwxr-xr-x 2 bryan bryan 1024 Nov 7 23:36 index
-rw-r--r-- 1 bryan bryan 0 Nov 7 23:36 index.done
drwxr-xr-x 2 bryan bryan 1024 Nov 7 23:36 parse_data
drwxr-xr-x 2 bryan bryan 1024 Nov 7 23:36 parse_text
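The fix is just the updated package name in the command, keeping everything else the same:

  bin/nutch org.apache.nutch.pagedb.FetchListEntry -dumpurls segments/20051107233629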
Re: Wrapping Nutch
Yes. You are right - the first exception indicates that some required plugins are missing. And this is probably because you do not have the plugins directory with all plugins in the classpath. P. On 10/10/05, Matt Clark [EMAIL PROTECTED] wrote: Sorry, one more clue. Here is the console output - 051010 143748 10 Plugins: directory not found: plugins My guess is that Nutch wants the plugins directory on my classpath. Am I on the right path? matt
Re: Dedup won't actually dedup
Hello Jon, As far as I remember dedup only marks the records as deleted, without physically removing them. And the first action of dedup is to clear old deletions (as written in the log). So if you repeat it you will get the same number of deleted records each time. Regards Piotr Jon Shoberg wrote: Any idea why dedup won't actually remove the items? Thoughts?
*** First Pass ***
051008 144843 Clearing old deletions in
051008 144843 Reading url hashes...
051008 144902 Sorting url hashes...
051008 144905 Deleting url duplicates...
051008 144907 Deleted 4082 url duplicates.
051008 144907 Reading content hashes...
051008 144918 Sorting content hashes...
051008 144923 Deleting content duplicates...
051008 144925 Deleted 228430 content duplicates.
051008 144925 Duplicate deletion complete locally. Now returning to NFS...
051008 144925 DeleteDuplicates complete
*** Second Pass ***
051008 144932 Reading url hashes...
051008 144949 Sorting url hashes...
051008 144953 Deleting url duplicates...
051008 144955 Deleted 4082 url duplicates.
051008 144955 Reading content hashes...
051008 145005 Sorting content hashes...
051008 145011 Deleting content duplicates...
051008 145012 Deleted 228430 content duplicates.
051008 145012 Duplicate deletion complete locally. Now returning to NFS...
051008 145012 DeleteDuplicates complete
Re: Fwd: problem about the fetch of dinamic page
You can use the nutch readdb command to check if the urls you are interested in were added to the WebDB - if yes, check whether the segments contain these urls. Please review the logs from the fetch to check if there was an attempt to fetch these urls (you might have some problem with authentication). Right now the description is too generic for me to help with more details. Regards Piotr [EMAIL PROTECTED] wrote: Hi, I have a question about the nutch crawler: I want to provide document search on a site that requires authentication (user/password). After the login, the first page displayed by the application is composed of two frames:
<HTML>
<HEAD><TITLE>Sistema Provvedimenti - SUPER</TITLE></HEAD>
<FRAMESET ROWS="14%,*">
<FRAME NORESIZE NAME="MENU" SRC="Servlet1?menu=1" SCROLLING="AUTO">
<FRAME NAME="PAGE" SRC="../a.html" SCROLLING="AUTO">
</FRAMESET>
</HTML>
The servlet Servlet1 publishes a table with 1 row and N columns, where every column contains an href with the URL of another servlet (Servlet2-ServletN). DESCRIPTION OF THE PROBLEM: My problem is that the crawler fetches the login page, the static page a.html, and the servlet Servlet1, but does not fetch any of the other servlets (Servlet2-ServletN). If I put the hrefs in the page a.html instead, Nutch manages to fetch the URLs and everything works. DESCRIPTION OF OUR CONFIGURATION OF NUTCH: I installed Nutch 0.6. I launch nutch like this: /usr/nutch-0.6/bin/nutch crawl url -dir index -depth 10 -threads 8 crawl.log where the url file contains only the url of the site with the login and password. I modified the Nutch configuration file crawl-urlfilter.txt like this:
-^(ftp|mailto):
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|rtf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)$
+[?=]
+.
Please, somebody help me!!! It is very important for me. Adriano Palombo
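As a concrete starting point, the WebDB check suggested above looks roughly like this (the db path follows the -dir index layout from the report; the -stats and -pageurl options are from the WebDBReader tool of this era - run bin/nutch readdb without arguments to confirm the flags in your version):

  bin/nutch readdb index/db -stats
  bin/nutch readdb index/db -pageurl 'http://yoursite/Servlet1?menu=1'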
Re: a few questions
Earl Cahill wrote: Tempted to do each question as a separate email, but here you go. 1. Does nutch use pure lucene for its indexing? Does the nutch index = lucene + potentially ndfs? If I am going to run a search web service, I am just wondering what advantages nutch would have over lucene. Yes, it uses lucene for indexing. It is not only lucene + ndfs, as it also contains a fetcher, WebDB maintenance, map/reduce and plenty of plugins. 2. Turns out I am going to write a web service for search. I have played with the nutch search example, but if I want to do rather arbitrary key/value pairs and have a web service return xml, I am guessing I am going to have to write my own. Is that right? Is there an easy way to get results in xml format? Guessing I need to build it all myself. There is a servlet, OpenSearchServlet, that returns XML. 3. In another project, I want to use ndfs to store two+ distinct copies of a file, but I really don't want anything else to do with nutch on the project. Is that possible? Is there a clean break? I want to make a list of servers, then have an api call that takes a file and stores 2+ copies across my servers, and an api call that reads a file, with appropriate failover. I think that it is possible. Take a look at TestClient (not the best name) for an example of API usage. 4. Guessing I write a plugin, but I want to interject some code during the nutch crawl process that does some analysis and actually does the index insertion itself. Are there any good docs on how to do such a thing? Plugin writing is easy - you can take a look at one of the existing ones as an example. But it all depends on what and where you want to change, as plugins are executed at well specified points of processing, so it might happen that there is no possibility of using a plugin for the thing you want to achieve. Regards, Piotr
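As a quick check of the XML interface mentioned in point 2, something like the following should work against a running nutch webapp (host and port are hypothetical; /opensearch is the servlet mapping in the stock web.xml - verify it in your deployment):

  curl 'http://localhost:8080/opensearch?query=nutch'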
Nutch 0.7.1 release
The 0.7.1 release of Nutch is now available. This is a bug fix release. See CHANGES.txt (http://svn.apache.org/viewcvs.cgi/lucene/nutch/branches/branch-0.7/CHANGES.txt?rev=292986) for details. The release is available here (http://lucene.apache.org/nutch/release/). Please report all problems in JIRA setting appropriate version number in bug description. Regards, Piotr
Re: link analysis and update segments
UpdateDB copies link information and score from the WebDB to segments, so it is important to have the score calculated before updatedb is run. One can use the current standard nutch score (based on the number of inlinks) or try to use analyze - I committed a patch for it some time ago that might help a bit with its disk space requirements, so the best approach would be to test it (it worked ok for me) and, if it is ok for you, report it so others can also try it out. Regards Piotr AJ Chen wrote: In a whole-web or vertical crawling setting, is it right that link analysis and updating segments from the DB should be performed in the right order before indexing the segments? There's not much talk about updating segments from the DB. I think it should be an important step. Could someone point out when it should be run and what the benefits are? I remember it was mentioned some time ago that the link analysis tool does not work yet and the number of in-links should be used instead. Any update? If it's still not working, how do I set it to use link numbers? Thanks, AJ
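In command form the ordering is: run the link analysis over the WebDB first, then the segment update, then indexing. The analyze invocation below matches the one used elsewhere on this list, with the trailing number giving the iteration count (the exact segment-update command varies by version, so check your bin/nutch script for it):

  bin/nutch analyze db 5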
Re: Re-indexing segments to add more field information
Hello, I think it is enough to delete the index.done file and the index folder. I did it this way some time ago. Regards Piotr Mike Berrow wrote: I would like to re-build the indexes I have in existing segments using a custom index filter plug-in (adds more field information to assist with a custom sort). Should that be just a matter of deleting the index.done files to let it go through? Or is there more to be done than that? Any known pitfalls? Thanks much, -- Mike Berrow
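Per segment, that amounts to something like the following sketch (segment name hypothetical; the index command re-runs segment indexing, now with your custom plugin enabled):

  rm segments/20051107233629/index.done
  rm -r segments/20051107233629/index
  bin/nutch index segments/20051107233629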
Re: Link Analysis Score..
There are many ways nutch can boost a document in the index. But I suspect you are referring to the analyze process - it uses a PageRank computation for the page score. For details read DistributedAnalysisTool - especially the computeRound method. Regards Piotr Rozina Sorathia wrote: I wanted to know where exactly the Link Analysis Score is calculated. Is there any code snippet available? How does the Link Analysis Score affect the overall final score of the document? Rozina Sorathia, Systems Executive, KPIT Cummins Infosystems Ltd., [EMAIL PROTECTED]
Re: Analyser error
I was never doing it this way - creating WebDB content based on segments only. So I do not know if it works - I do not have time at the moment to test it myself - sorry. Regards Piotr EM wrote: The problem is still there; maybe I'm doing something wrong?
1. 'rm -r db'
2. 'mkdir db'
3. 'bin/nutch admin db -create'
4. I'll then updatedb db from a fetched segment; this should fill it up with links?
5. 'bin/nutch analyze db 7'
And it fails here with three 'tmpsomething' directories and webdb.new -Original Message- From: Piotr Kosiorowski [mailto:[EMAIL PROTECTED] Sent: Tuesday, August 30, 2005 3:07 PM To: nutch-user@lucene.apache.org Subject: Re: Analyser error It looks like you have temporary results from a previous run (probably killed or terminated unsuccessfully). It should be safe to remove the db\webdb.new directory and start again. Regards Piotr EM wrote: What does it mean if bin/nutch analyze db 7 fails with:
050830 024914 Target pages from init(): 27419
050830 024914 Processing pagesByURL: Sorted 27419 instructions in 0.172 seconds.
050830 024914 Processing pagesByURL: Sorted 159412.79069767444 instructions/second
Finished at Tue Aug 30 02:49:14 EDT 2005
Exception in thread main java.io.IOException: already exists: db\webdb.new\pagesByURL
at org.apache.nutch.io.MapFile$Writer.init(MapFile.java:86)
at org.apache.nutch.db.WebDBWriter$CloseProcessor.closeDown(WebDBWriter.java:549)
at org.apache.nutch.db.WebDBWriter.close(WebDBWriter.java:1544)
at org.apache.nutch.tools.DistributedAnalysisTool.completeRound(DistributedAnalysisTool.java:562)
at org.apache.nutch.tools.LinkAnalysisTool.iterate(LinkAnalysisTool.java:60)
at org.apache.nutch.tools.LinkAnalysisTool.main(LinkAnalysisTool.java:81)
Re: PDF support? Does crawl parse p
Hello Diane, There is a plugin to parse pdf files. You have to enable it in nutch-site.xml (just copy the entry from nutch-default.xml). You have to change the plugin.includes property to include the parse-pdf plugin: [...] parse-(text|html) [...] to [...] parse-(text|html|pdf) [...] Regards Piotr Diane Palla wrote: Does Nutch have a way to parse pdf files, that is, application/pdf content type files? I noticed a plugin variable setting in default.properties: plugin.pdf=org.apache.nutch.parse.pdf* I never changed this file. Is that the right value? I am using Nutch 0.7. What do I have to do to make it parse pdf files? When I do the crawl, I get this error with application/pdf files: 050831 145126 fetch okay, but can't parse mainurl/research/126900/126969/126969.pdf, reason: failed(2,203): Content-Type not text/html: application/pdf If it's not possible, what future version of Nutch do developers expect to support application/pdf types and have such parsing of pdf files available? Diane Palla Web Services Developer Seton Hall University 973 313-6199 [EMAIL PROTECTED] Bryan Woliner [EMAIL PROTECTED] 08/23/2005 05:22 PM Please respond to nutch-user@lucene.apache.org To nutch-user@lucene.apache.org cc Subject: Adding small batches of fetched URLs to a larger aggregate segment/index Hi, I have a number of sites that I want to crawl, then merge their segments and create a single index. One of the main reasons I want to do this is that I want some of the sites in my index to be crawled on a daily basis, others on a weekly basis, etc. Each time I re-crawl a site, I want to add the fetched URLs to a single aggregate segment/index. I have a couple of questions about doing this: 1. Is it possible to use a different regex.urlfilter.txt file for each site that I am crawling? If so, how would I do this? 2. If I have a very large segment that is indexed (my aggregate index) and I want to add another (much smaller) set of fetched URLs to this index, what is the best way to do this? It seems like merging the small and large segments and then re-indexing the whole thing would be very time consuming -- especially if I wanted to add new small sets of fetched URLs frequently. Thanks for any suggestions you have to offer, Bryan
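In nutch-site.xml the override looks like the following (the value shown is illustrative - copy the actual plugin.includes value from your nutch-default.xml and just add pdf to the parse group):

  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(text|html|pdf)|index-basic|query-(basic|site|url)</value>
  </property>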
Re: permissions error with nutch 0.7
It looks like it has a problem creating the lucene lock file - I think it is usually created in /tmp if you are running it on Unix. Can you check if you can correctly access it? Regards Piotr On 8/25/05, Jason Martens [EMAIL PROTECTED] wrote: On Wed, 2005-08-24 at 18:20 -0700, Michael Ji wrote: did you try switching your user mode to su (Super User) mode? In nutch itself? I configured this all as root, and I would rather not run tomcat as root. Is there some way I can find out what files it is trying to write to when it gets the permission denied message? Jason Martens
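A quick check from the account tomcat runs under (the /tmp location is the usual default for lucene of this era; some versions let you move it with the org.apache.lucene.lockDir system property - treat both the path and the property name as things to verify for your setup):

  ls -ld /tmp
  ls -l /tmp/*.lock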
Re: FetchedSegments.getSummary() for a PDF
You can try it out, but I think parsing separately expects some directories in the segment to have different names than you have after a standard fetch with parsing. Regards Piotr Lucas Rockwell wrote: Hi Piotr, Thanks for the response. So, I can't use: bin/nutch parse segment directory and then reindex? -lucas On Aug 25, 2005, at 11:28 AM, Piotr Kosiorowski wrote: As I understand it, if you had parse-pdf disabled you have to reparse (and then reindex) the segments. There is no standard way to do it (I think it might be done with some tricks). The easiest way would be to refetch it with pdf parsing enabled. Piotr Lucas Rockwell wrote: Hi all, I have enabled the parse-pdf and index-more plugins and reindexed my segments, and then enabled those plus the query-more plugin in my front-end application, and when I do a query I still can not get at the contents of the PDFs in the index. And even when I search for pdf -- which gets me all PDF files because of the url -- and use FetchedSegments.getSummary(), there is nothing there. Any idea what I am doing wrong? Thanks. -lucas
Re: Nutch 0.7 released
Oops, you are right - we have not updated README.txt after changing our documentation format to Apache Forrest. Right now these files are deployed on the nutch site: http://lucene.apache.org/nutch/tutorial.html etc., in the Documentation section. Thanks for catching it. Piotr Lukas Vlcek wrote: Hi, Maybe I'm not understanding correctly, but according to the README.txt file included in the release pack I can't find the docs/en/tutorial.html and docs/en/developers.html files. Lukas On 8/17/05, Piotr Kosiorowski [EMAIL PROTECTED] wrote: Hi, A new Nutch release was prepared today. This is the first Nutch release as an Apache Lucene sub-project. You can download it from http://lucene.apache.org/nutch/release/nutch-0.7.tar.gz. There was a package name change from net.nutch.* to org.apache.nutch.* for this release, so local modifications and configuration files containing class names may require an update. Release numbers were created in JIRA too, so please use them while reporting a bug. Regards, Piotr
Re: webdb - orphaned pages?
Hello, Pages are not deleted from the WebDB automatically. Nutch does not check if a page has inlinks during fetchlist generation - so an orphaned page will be refetched. It will stop refetching the page if the page becomes unavailable for some number of fetch attempts. Regards Piotr On 8/10/05, Raymond Creel [EMAIL PROTECTED] wrote: I have a question about the webdb and fetching. When a page that used to have incoming links is found to be orphaned (i.e. there are no longer any pages that link to it), is it deleted from the webdb? Or is it left in the webdb but set not to be refetched? Or will it continue to be refetched anyway (this doesn't seem right to me)? Conversely, what will happen when a link to it reappears later? One more thing - are pages injected with the webdb injector treated any differently (I see them as being sort of the root nodes of the webdb - they should never be deleted)? Thanks much for any clarity on this! raymond
Re: Problem in Incremental crawling with 4GB segment directories
Hello Ayyanar, Please be more specific with your setup and problem description. I recently fetched a segment that contains 73GB of data, so I do not think the size of your segment is the problem. Regards Piotr On 8/8/05, Ayyanar Inbamohan [EMAIL PROTECTED] wrote: Hi, I have 4 GB of existing data. I want to do incremental crawling on this 4 GB of data with a new URL, but the crawl does not succeed... why? Thanks in advance, Ayyanar...
Re: Is it possible to have multiple search.dir in nutch-site.xml, Please reply immediately
No, it is not possible. But you can search in both if you merge the segments or simply put them in one location. Regards Piotr On 8/8/05, Ayyanar Inbamohan [EMAIL PROTECTED] wrote: Hi all, Is it possible to have multiple search.dir entries in nutch-site.xml? Please reply immediately. I want to search the text in both of the directories /home/oss/docs/Kmportal_repository_search_data and /home/oss/spikesearch/src/nutch/var/nutch/dir-url.extranet Can we give something like the following?
<nutch-conf>
<property>
<name>searcher.dir</name>
<value>/home/oss/docs/Kmportal_repository_search_data</value>
<value>/home/oss/spikesearch/src/nutch/var/nutch/dir-url.extranet</value>
<description>Location of the Nutch Index.</description>
</property>
</nutch-conf>
But the above is not working... thanks, Ayyanar..
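The put them in one location route is the simpler of the two; roughly (combined path hypothetical - searcher.dir then points at the single parent directory holding the merged segments and index):

  mkdir -p /home/oss/combined/segments
  cp -r /home/oss/docs/Kmportal_repository_search_data/segments/* /home/oss/combined/segments/
  cp -r /home/oss/spikesearch/src/nutch/var/nutch/dir-url.extranet/segments/* /home/oss/combined/segments/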
Re: regex-url filter
Hello, I am not sure which way is better, but I would look at the dot:
original: http://([a-z0-9]*\.)*.(com|org|net|biz|edu|biz|mil|us|info|cc)/
modified: http://([a-z0-9]*\.)*(com|org|net|biz|edu|biz|mil|us|info|cc)/
In my opinion the dot before com, org etc. is already included in ([a-z0-9]*\.)* and the additional one (not escaped) means any character, so it would match e.g. http://www.abc.xcom/ but not http://www.abc.com/. Regards, P. Chirag Chaman wrote: Here's a better way: http://([a-z0-9]*\.)*.(com|org|net|biz|edu|biz|mil|us|info|cc)/ FYI, this will not remove non-English sites -- but international sites that follow the two-letter convention. CC- -Original Message- From: Jay Pound [mailto:[EMAIL PROTECTED] Sent: Monday, August 08, 2005 2:37 PM To: nutch-dev@lucene.apache.org; nutch-user@lucene.apache.org Subject: regex-url filter I would like a confirmation from someone that this will work. I've edited the regex filter in hopes of weeding out non-English sites from my search results. I'll be testing pruning on my current 40mil index to see if it works there, or maybe there is a way to set the search to return only English results, but I'm trying it this way now. Is this the right way to add just extensions without sites? I'll try it soon but just wanted to not waste my time if it's not correct!!! Thanks, -Jay Pound
# The default url filter.
# Better for whole-internet crawling.
# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.
# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|rtf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)$
# skip URLs containing certain characters as probable queries, etc.
[EMAIL PROTECTED]
# accept US only sites
+^http://([a-z0-9]*\.)*.com/
+^http://([a-z0-9]*\.)*.org/
+^http://([a-z0-9]*\.)*.edu/
+^http://([a-z0-9]*\.)*.net/
+^http://([a-z0-9]*\.)*.mil/
+^http://([a-z0-9]*\.)*.us/
+^http://([a-z0-9]*\.)*.info/
+^http://([a-z0-9]*\.)*.cc/
+^http://([a-z0-9]*\.)*.biz/
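For reference, the accept lines with the stray dot removed (the same fix as the modified pattern above, applied per extension; note that biz appears twice in the original alternation and only needs to be listed once):

  +^http://([a-z0-9]*\.)*com/
  +^http://([a-z0-9]*\.)*org/
  +^http://([a-z0-9]*\.)*edu/
  +^http://([a-z0-9]*\.)*net/
  +^http://([a-z0-9]*\.)*mil/
  +^http://([a-z0-9]*\.)*us/
  +^http://([a-z0-9]*\.)*info/
  +^http://([a-z0-9]*\.)*cc/
  +^http://([a-z0-9]*\.)*biz/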
Re: distributed search
If you have two search servers, search1.mydomain.com and search2.mydomain.com, then on each of them run:
./bin/nutch server 1234 /index
Now go to your tomcat box. In the directory where you used to have the segments dir (either the tomcat startup directory or the directory specified in the nutch config xml), create a search-servers.txt file containing:
search1.mydomain.com 1234
search2.mydomain.com 1234
And move your old segment/index directories somewhere else so they are not used by accident. You should now see search activity in your search servers' logs. Regards Piotr On 8/2/05, webmaster [EMAIL PROTECTED] wrote: I read the wiki on the server option; how does it talk with tomcat for the search? It says ./bin/nutch server port index_dir, e.g. ./bin/nutch server 1234 /index. How do the servers talk with each other to find the other servers in the cluster? -Jay
Re: Preventing the fetch command from going to certain URLs
Hello Joe, If you are using whole web crawling you should change regex-urlfilter.txt instead of crawl-urlfilter.txt. Piotr On 7/28/05, Vacuum Joe [EMAIL PROTECTED] wrote: I have a simple question: I'm using Nutch to do some whole-web crawling (just a small dataset). Somehow Nutch has gotten a lot of URLs from af.wikipedia.org into its segments, and when I generate another segment (using -topN 2) it wants to crawl a bunch more urls from af.wikipedia.org. I don't want to crawl any of the Afrikaans Wikipedia. Is there a way to block that? Also, I want to block it from ever crawling domains like 33.44.55.66, because those are usually very badly configured servers with worthless content. I tried to put those things into the crawl-urlfilter.txt file and the banned-hosts.txt file, but it seems that the fetch command doesn't pay attention to those two files. Should I be using crawl instead of fetch?
Re: [Nutch-general] number of indexed pages
Hello, The first one will give you the number of pages in the WebDB, and not all of them are indexed. Regards, Piotr On 7/29/05, Erik Hatcher [EMAIL PROTECTED] wrote: Two options: bin/nutch readdb crawl/db -stats or use Luke (Google for luke lucene) to open the Lucene index. Erik On Jul 28, 2005, at 9:44 PM, blackwater dev wrote: After I finish a crawl... what is the best way to go into my crawl directory and get the number of indexed pages? Thanks!
Re: Problem Starting Nutch (Tutorial like)
Hello, I would rather suspect some misconfiguration of networking. According to the JavaDoc, InetAddress.getLocalHost() throws UnknownHostException if no IP address for the host could be found. Regards Piotr On 7/28/05, blackwater dev [EMAIL PROTECTED] wrote: Are you sure your urls file doesn't have an extension? I had a similar problem and found my urls file was .rtf, which I didn't see until I viewed the file via the command line. On 8/28/05, Nils Hoeller [EMAIL PROTECTED] wrote: Hi, my problem is: I've done everything as described in the Getting Started tutorial at nutch.org. When I now run the command bin/nutch crawl urls -dir crawl.test -depth 3 crawl.log I get this exception in the log file:
run java in /usr/java/jdk1.5.0_04
050828 104004 parsing file:/home/nils/Studienarbeit/nutch-nightly/conf/nutch-default.xml
050828 104004 parsing file:/home/nils/Studienarbeit/nutch-nightly/conf/crawl-tool.xml
050828 104004 parsing file:/home/nils/Studienarbeit/nutch-nightly/conf/nutch-site.xml
050828 104004 No FS indicated, using default:local
050828 104004 crawl started in: crawl.test
050828 104004 rootUrlFile = urls
050828 104004 threads = 10
050828 104004 depth = 3
Exception in thread main java.lang.RuntimeException: java.net.UnknownHostException: linux: linux
at org.apache.nutch.io.SequenceFile$Writer.init(SequenceFile.java:67)
at org.apache.nutch.io.MapFile$Writer.init(MapFile.java:94)
at org.apache.nutch.db.WebDBWriter.init(WebDBWriter.java:1507)
at org.apache.nutch.db.WebDBWriter.createWebDB(WebDBWriter.java:1438)
at org.apache.nutch.tools.WebDBAdminTool.main(WebDBAdminTool.java:172)
at org.apache.nutch.tools.CrawlTool.main(CrawlTool.java:133)
Caused by: java.net.UnknownHostException: linux: linux
at java.net.InetAddress.getLocalHost(InetAddress.java:1308)
at org.apache.nutch.io.SequenceFile$Writer.init(SequenceFile.java:64)
... 5 more
My urls file looks like this: http://www.nutch.org/ I've also tried http://www.ifis.uni-luebeck.de/ which I'd like to get nutched. Also in the urlfilter conf is written +^http://([a-z0-9]*\.)*ifis.uni-luebeck.de/ +^http://([a-z0-9]*\.)*nutch.org/ Can anyone give me a hint? Where is the error? Thanks Nils
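A common fix for the UnknownHostException above is to make the machine's hostname resolvable locally, e.g. via /etc/hosts (the name linux is taken from the exception message; substitute your actual hostname):

  127.0.0.1   localhost linux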
Re: total pages
Hello, I assume the second counts are printed by some tool accessing the WebDB. Right? If so, 2 250 000 is the number of pages generated to be fetched (all fetched pages plus fetch attempts with errors) - simply the total number of pages in the segments. The second number is the count of pages/links in the WebDB - pages/links known to nutch, gathered by extracting links from already fetched pages. Some of these pages have already been fetched, but some of them are still to be fetched in the future. Regards Piotr Ilia S. Yatsenko wrote: Hello. Sorry for my poor English. How does nutch count documents in the search index? I have 90 segments with 25000 pages in each segment. The total is 2 250 000 pages in the index (this number I see when I execute mergesegs). But at the same time nutch reports: Number of pages: 4318557 Number of links: 5541456 Why do I see twice as many pages as I have in the real index?
Re: Merge Crawl results
Hello, You can merge segments for these two crawls using nutch mergesegs; in fact, you can simply copy all segment directories to one place. But it would not be a full merge of the crawls, as right now there is no way to merge the WebDBs of the two crawls. You can deduplicate using nutch dedup (nutch mergesegs does deduplication for you too). But in fact you should probably try a different approach - intranet crawling was meant as an easy way to crawl small sites, starting every time from scratch. If you do not want to start from scratch, you should follow the whole-web crawling tutorial, limiting it to your site/sites only in the config file. Regards Piotr Benny Lin wrote: Hi, I am looking to see if there are ways to merge different crawl results. Let's say I have two url sets in two different files, and I use the following commands: bin/nutch crawl URLs1.txt -dir test1 -depth 0 test1.log bin/nutch crawl URLs2.txt -dir test2 -depth 0 test2.log Then I have two folders, test1 and test2. My questions are: 1. Is there a way to merge the two sets of results above? If so, what's the command string? 2. If the above two sets have duplicate urls, how do I make the merged results unique? Or maybe I can do it a different way - I want to do accumulative indexing without doing everything again and again from the beginning. Can someone help out? Thanks a lot. Benny
Re: Searching by content type
Hello, Please have a look at index-more and query-more plugins for content-type handling. Regards Piotr Vacuum Joe wrote: I have been looking through the API docs and I can't figure this out. Here is my question: Is there a way to search based on meta-information, such as content type, or even the value of header fields? For example, let's say I would like to find only PDFs, or perhaps put higher weight on PDFs vs. other kinds of documents. Can this be done? I looked at the query interface. It looks like NutchBean allows me to specify a Query, and a Query is basically made up of Strings which are in the content. I can't find any way to specify meta-information I'm looking for. Any ideas on this? Thanks
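With index-more and query-more enabled (via plugin.includes, as in the PDF thread above), content type becomes queryable from the search box; e.g. to restrict results to PDFs (the type field is the one the query-more plugin of this era exposes - verify against your plugin version):

  type:application/pdf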
Re: prioritizing newly injected urls for fetching
Hello Kamil, Do you want to generate a fetchlist with urls that are present in the WebDB but were not fetched until now? I am not sure what you are trying to achieve, but you can generate any fetchlist you want using the latest tool by Andrzej Bialecki (http://issues.apache.org/jira/browse/NUTCH-68) (I have not tried it myself). There was also (some time ago) a discussion on the nutch mailing list about a refetchonly param for the fetchlist generator - some ideas are still not implemented but you can read how it works currently. Regards Piotr Kamil Wnuk wrote: Hi All, I have recently started using nutch and I am looking for a method of prioritizing urls injected during an ongoing crawl process (similar to the whole-web crawl scenario described in the tutorial) so that they are guaranteed to be included at the top of the next fetchlist generated. The purpose of this is so that I can give nutch the urls of newly created web pages that I want indexed as quickly as possible. I have looked through the nutch documentation and the mailing list archives and have not been able to find a solution. Does a good method for doing this exist? Thanks in advance, Kamil
Re: Skipping the final indexing step?
Hello Otis, If you are only reading ParseData and FetcherOutput from a nutch segment you do not need the lucene index at all. So you can safely skip the -i switch. Regards Piotr On 7/21/05, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote: Hello, I'm using SegmentMergeTool to merge some large segments, and I see that the final index optimization (below) takes a looong time. I think this index creation and optimization is triggered by the -i param to SegmentMergeTool. From what I saw in SegmentMergeTool.java, this is an optional parameter, but I'm wondering if I can just skip this final indexing and index optimization step altogether. Right now I'm not making use of this final Nutch index, as I'm just reading from ParseData.DIR_NAME and FetcherOutput.DIR_NAME using ArrayFile.Reader.
Re: How to view the URLs stored in a segment
Hello, Bryan Woliner wrote: A couple more (basic) questions: When FetchListEntry is called with the -dumpurls option, where does the fetchlist get dumped, in what format, and how do I access it? The list of urls is dumped to stdout (System.out). The format of a single line is: Recno record_number: url On a related note: How do I know what method of FetchListEntry.java is called (and with what parameters) when I type bin/nutch net.nutch.pagedb.FetchListEntry -dumplist $s1? You have to read the source code of FetchListEntry. In general, how do I find out what methods (and parameters) are called by command line calls to nutch java files? In the bin/nutch shell script there is a mapping from command names to the java classes that are executed for a command. You have to read the source code of the java class (mainly the parsing of command line options) to find out what methods are called and their params. Regards Piotr
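Since the dump goes to stdout, capturing it to a file is just a shell redirect (segment path hypothetical; package name as used in this pre-0.7 thread):

  bin/nutch net.nutch.pagedb.FetchListEntry -dumpurls segments/20051107233629 > urls.txt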
Re: ndfs stuff
Hello Ferenc, Some documentation on running ndfs can be found on wiki: http://wiki.apache.org/nutch/NutchDistributedFileSystem Regards, Piotr [EMAIL PROTECTED] wrote: Have any location the ndfs usage documentation? Regards, Ferenc