OT: Can't get unsubscribed from the wiki notifications
Somehow I got subscribed to the emails that go out whenever the wiki gets updated, and I can't figure out how to unsubscribe from them. The password recovery form never seems to send me the email I need to recover or reset my password, which makes me suspect I've forgotten which wiki name or email address I used. Do I have any hope of getting unsubscribed, or should I just filter out those messages? -- http://www.linkedin.com/in/paultomblin http://careers.stackoverflow.com/ptomblin
RE: recrawl.sh stopped at depth 7/10 without error
Try starting it with nohup. 'man nohup' for details. -- Sent from my Palm Pre

BELLINI ADAM wrote:
hi, maybe I found my problem, and it's not a nutch mistake. I believed that when running the crawl command as a background process, closing my console would not stop the process, but it seems that it really does kill the process. I launched the process like this:

  ./bin/nutch crawl urls -dir crawl -depth 10 > crawl.log &

but even with the '&' character, closing my console kills the process. thx

Date: Mon, 7 Dec 2009 19:00:37 +0800
Subject: Re: recrawl.sh stopped at depth 7/10 without error
From: yea...@gmail.com
To: nutch-user@lucene.apache.org
I still want to know the reason.

2009/12/2 BELLINI ADAM mbel...@msn.com
hi, any idea guys?? thanks

From: mbel...@msn.com
To: nutch-user@lucene.apache.org
Subject: RE: recrawl.sh stopped at depth 7/10 without error
Date: Fri, 27 Nov 2009 20:11:12 +0000
hi, this is the main loop of my recrawl.sh:

  do
    echo "--- Beginning crawl at depth `expr $i + 1` of $depth ---"
    $NUTCH_HOME/bin/nutch generate $crawl/crawldb $crawl/segments $topN \
        -adddays $adddays
    if [ $? -ne 0 ]
    then
      echo "runbot: Stopping at depth $depth. No more URLs to fetch."
      break
    fi
    segment=`ls -d $crawl/segments/* | tail -1`
    $NUTCH_HOME/bin/nutch fetch $segment -threads $threads
    if [ $? -ne 0 ]
    then
      echo "runbot: fetch $segment at depth `expr $i + 1` failed."
      echo "runbot: Deleting segment $segment."
      rm $RMARGS $segment
      continue
    fi
    $NUTCH_HOME/bin/nutch updatedb $crawl/crawldb $segment
  done
  echo "- Merge Segments (Step 3 of $steps) -"

in my log file I never find the message "- Merge Segments (Step 3 of $steps) -"! so it breaks the loop and stops the process. I don't understand why it stops at depth 7 without any errors!

From: mbel...@msn.com
To: nutch-user@lucene.apache.org
Subject: recrawl.sh stopped at depth 7/10 without error
Date: Wed, 25 Nov 2009 15:43:33 +0000
hi, I'm running recrawl.sh and it stops every time at depth 7/10 without any error! but when I run bin/nutch crawl with the same crawl-urlfilter and the same seeds file it finishes cleanly in 1h50. I checked the hadoop.log and don't find any error there... I just find the last url it was parsing. Does fetching or crawling have a timeout? my recrawl takes 2 hours before it stops. I set the fetch interval to 24 hours and I'm running the generate with adddays = 1. best regards
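(A minimal sketch of the nohup suggestion above. The crawl arguments simply mirror the ones in the message, and the log file name is an assumption; adjust both for your setup.)

  # nohup detaches the command from the terminal's hangup signal,
  # so closing the console no longer kills the crawl.
  # stdout and stderr are redirected to crawl.log for later inspection.
  nohup ./bin/nutch crawl urls -dir crawl -depth 10 > crawl.log 2>&1 &

  # follow progress without tying the crawl to this terminal
  tail -f crawl.log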
Nutch frozen but not exiting
My nutch crawl just stopped. The process is still there, and doesn't respond to a kill -TERM or a kill -HUP, but it hasn't written anything to the log file in the last 40 minutes. The last thing it logged was some calls to my custom url filter. Nothing has been written in the hadoop directory or the crawldir/crawldb or the segments dir in that time. How can I tell what's going on and why it's stopped? -- http://www.linkedin.com/in/paultomblin http://careers.stackoverflow.com/ptomblin
Re: Nutch frozen but not exiting
On Sat, Nov 28, 2009 at 4:45 PM, Andrzej Bialecki a...@getopt.org wrote: Paul Tomblin wrote: How can I tell what's going on and why it's stopped? Try to generate a thread dump to see what code is being executed.

I didn't do any sort of distributed mode because I've only got one core. I had to do a jstack -F to force a stack dump, and here's what it says:

-bash-3.2$ jstack -F 32507
Attaching to process ID 32507, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 14.3-b01
Deadlock Detection:
No deadlocks found.

Thread 21558: (state = IN_NATIVE_TRANS)
 - java.lang.UNIXProcess.forkAndExec(byte[], byte[], int, byte[], int, byte[], boolean, java.io.FileDescriptor, java.io.FileDescriptor, java.io.FileDescriptor) @bci=0 (Interpreted frame)
 - java.lang.UNIXProcess.access$500(java.lang.UNIXProcess, byte[], byte[], int, byte[], int, byte[], boolean, java.io.FileDescriptor, java.io.FileDescriptor, java.io.FileDescriptor) @bci=18, line=20 (Interpreted frame)
 - java.lang.UNIXProcess$1$1.run() @bci=93, line=109 (Interpreted frame)

Thread 21548: (state = BLOCKED)
 - sun.misc.Unsafe.park(boolean, long) @bci=0 (Interpreted frame)
 - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, line=158 (Interpreted frame)
 - java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await() @bci=42, line=1925 (Interpreted frame)
 - org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run() @bci=55, line=882 (Interpreted frame)

Thread 21545: (state = BLOCKED_TRANS)
 - java.lang.Thread.sleep(long) @bci=0 (Interpreted frame)
 - org.apache.hadoop.mapred.Task$1.run() @bci=31, line=403 (Interpreted frame)
 - java.lang.Thread.run() @bci=11, line=619 (Interpreted frame)

Thread 21540: (state = BLOCKED)
 - java.lang.Object.wait(long) @bci=0 (Interpreted frame)
 - java.lang.Object.wait() @bci=2, line=485 (Interpreted frame)
 - java.lang.UNIXProcess$Gate.waitForExit() @bci=10, line=64 (Interpreted frame)
 - java.lang.UNIXProcess.<init>(byte[], byte[], int, byte[], int, byte[], boolean) @bci=74, line=145 (Interpreted frame)
 - java.lang.ProcessImpl.start(java.lang.String[], java.util.Map, java.lang.String, boolean) @bci=182, line=65 (Interpreted frame)
 - java.lang.ProcessBuilder.start() @bci=112, line=452 (Interpreted frame)
 - org.apache.hadoop.util.Shell.runCommand() @bci=52, line=149 (Interpreted frame)
 - org.apache.hadoop.util.Shell.run() @bci=23, line=134 (Interpreted frame)
 - org.apache.hadoop.fs.DF.getAvailable() @bci=1, line=73 (Interpreted frame)
 - org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(java.lang.String, long, org.apache.hadoop.conf.Configuration) @bci=187, line=321 (Interpreted frame)
 - org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(java.lang.String, long, org.apache.hadoop.conf.Configuration) @bci=16, line=124 (Interpreted frame)
 - org.apache.hadoop.mapred.MapOutputFile.getSpillFileForWrite(org.apache.hadoop.mapred.TaskAttemptID, int, long) @bci=50, line=107 (Interpreted frame)
 - org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill() @bci=78, line=930 (Compiled frame)
 - org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush() @bci=104, line=842 (Interpreted frame)
 - org.apache.hadoop.mapred.MapTask.run(org.apache.hadoop.mapred.JobConf, org.apache.hadoop.mapred.TaskUmbilicalProtocol) @bci=391, line=343 (Interpreted frame)
 - org.apache.hadoop.mapred.LocalJobRunner$Job.run() @bci=282, line=138 (Interpreted frame)

Thread 32521: (state = BLOCKED_TRANS)
 - java.lang.Object.wait(long) @bci=0 (Interpreted frame)
 - java.lang.ref.ReferenceQueue.remove(long) @bci=44, line=118 (Interpreted frame)
 - org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$ReferenceQueueThread.run() @bci=9, line=1082 (Interpreted frame)

Thread 32516: (state = BLOCKED_TRANS)

Thread 32515: (state = BLOCKED)
 - java.lang.Object.wait(long) @bci=0 (Interpreted frame)
 - java.lang.ref.ReferenceQueue.remove(long) @bci=44, line=118 (Interpreted frame)
 - java.lang.ref.ReferenceQueue.remove() @bci=2, line=134 (Compiled frame)
 - java.lang.ref.Finalizer$FinalizerThread.run() @bci=3, line=159 (Compiled frame)

Thread 32514: (state = BLOCKED)
 - java.lang.Object.wait(long) @bci=0 (Interpreted frame)
 - java.lang.Object.wait() @bci=2, line=485 (Compiled frame)
 - java.lang.ref.Reference$ReferenceHandler.run() @bci=46, line=116 (Compiled frame)

Thread 32508: (state = IN_VM_TRANS)
 - org.apache.hadoop.mapred.JobStatus.getRunState() @bci=0, line=199 (Interpreted frame)
 - org.apache.hadoop.mapred.JobClient$NetworkedJob.isComplete() @bci=8, line=278 (Interpreted frame)
 - org.apache.hadoop.mapred.JobClient.runJob(org.apache.hadoop.mapred.JobConf) @bci=149, line=1155 (Interpreted frame)
 - org.apache.nutch.crawl.CrawlDb.update(org.apache.hadoop.fs.Path, org.apache.hadoop.fs.Path[], boolean, boolean, boolean, boolean) @bci=363, line=94 (Interpreted frame)
Re: Nutch frozen but not exiting
On Sat, Nov 28, 2009 at 8:25 PM, Andrzej Bialecki a...@getopt.org wrote: Hm, the curious thing here is that the java process is sleeping, and 99% of cpu is in system time ... usually this would indicate swapping, but since there is no swap in your setup I'm stumped. Still, this may be related to the weird memory/swap setup on that machine - try decreasing the heap size and see what happens. When I decrease the heap size, it dies pretty early on. -- http://www.linkedin.com/in/paultomblin http://careers.stackoverflow.com/ptomblin
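(For reference, a sketch of one way to try a smaller heap: the bin/nutch launcher script honours a NUTCH_HEAPSIZE environment variable, given in megabytes. The variable name and the 512 MB value here are assumptions to illustrate the idea - check your copy of bin/nutch before relying on it.)

  # assumes bin/nutch reads NUTCH_HEAPSIZE (MB) to size -Xmx; verify in your script
  NUTCH_HEAPSIZE=512 bin/nutch crawl urls -dir crawl -depth 10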
Re: Problem with Indexing Local Filesystem.
On Sun, Nov 15, 2009 at 2:45 AM, prashant ullegaddi prashullega...@gmail.com wrote:
-activeThreads=0
Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1232)
        at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:969)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:122)

When that happened to me, it meant that the temporary hadoop files had filled up the /tmp file system. I had to configure hadoop to put its files somewhere else by putting the following in conf/hadoop-site.xml:

<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/var/tmp</value>
  </property>
</configuration>

-- http://www.linkedin.com/in/paultomblin http://careers.stackoverflow.com/ptomblin
Re: Hadoop wants to do whoami?
On Fri, Nov 6, 2009 at 11:44 PM, Ken Krugler kkrugler_li...@transpac.com wrote: Normally it works fine, but it will fail if you don't have swap space allocated because that's factored into the free space calc when the fork happens. What's the swap space setup for your VPS setup? There's no swap space. -- http://www.linkedin.com/in/paultomblin http://careers.stackoverflow.com/ptomblin
Why is nutch writing files in /tmp?
Why is nutch writing /tmp/hadoop-[userid] files, and how can I stop it doing that? -- http://www.linkedin.com/in/paultomblin http://careers.stackoverflow.com/ptomblin
Re: Redirect handling
There are two different types of redirect. When a web site returns a 301 status (redirect permanent), it means the url you requested is no longer valid - don't ask for it again. When it returns a 307 status (temporary redirect), it means keep asking for the url you asked for, and I'll tell you where to go from there. In the first case, Nutch should remove the first URL from its database and put the redirection target in its place. In the second case, Nutch should leave the original URL in its database, but also go to the redirection target. I don't know if that's actually what Nutch does, but I assume so. On Tue, Oct 27, 2009 at 11:30 AM, caezar caeza...@gmail.com wrote: Hi All, I've done some googling but found different answers, so I would appreciate it if you tell me which is the correct one: - when a page is redirected, the content of the target page is fetched and associated with the source (initial) page URL - when a page is redirected, a new entry with the redirect target url and contents is added to the db If the second option is the correct one, then one more question. When I have a NutchDocument instance which represents the target URL, is it possible to retrieve its redirect source URL somehow? Thanks -- View this message in context: http://www.nabble.com/Redirect-handling-tp26079767p26079767.html Sent from the Nutch - User mailing list archive at Nabble.com. -- http://www.linkedin.com/in/paultomblin
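(A related knob, for what it's worth: nutch-default.xml has an http.redirect.max property; with the default of 0 the fetcher records redirect targets for a later fetch rather than following them immediately. A sketch of overriding it in conf/nutch-site.xml - the value of 3 is just an example:)

  <property>
    <name>http.redirect.max</name>
    <!-- follow up to 3 redirects in the same fetch instead of queueing them -->
    <value>3</value>
  </property>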
Re: Recrawling Nutch
nutch doesn't do a good job on storing or testing the Last-Modified time of pages it's crawled. I made the following changes which seem to help a lot: snowbird:~/src/nutch/trunk svn diff Index: src/java/org/apache/nutch/fetcher/Fetcher.java === --- src/java/org/apache/nutch/fetcher/Fetcher.java (revision 817382) +++ src/java/org/apache/nutch/fetcher/Fetcher.java (working copy) @@ -21,6 +21,7 @@ import java.net.MalformedURLException; import java.net.URL; import java.net.UnknownHostException; +import java.text.ParseException; import java.util.*; import java.util.Map.Entry; import java.util.concurrent.atomic.AtomicInteger; @@ -42,6 +43,7 @@ import org.apache.nutch.metadata.Metadata; import org.apache.nutch.metadata.Nutch; import org.apache.nutch.net.*; +import org.apache.nutch.net.protocols.HttpDateFormat; import org.apache.nutch.protocol.*; import org.apache.nutch.parse.*; import org.apache.nutch.scoring.ScoringFilters; @@ -742,6 +744,23 @@ datum.setStatus(status); datum.setFetchTime(System.currentTimeMillis()); + LOG.debug(metadata = + (content != null ? content.getMetadata() : content-null)); + LOG.debug(modified? = + ((content != null content.getMetadata() != null) ? content.getMetadata().get(Last-Modified) : content-null)); + if (content != null content.getMetadata() != null content.getMetadata().get(Last-Modified) != null) + { + String lastModifiedStr = content.getMetadata().get(Last-Modified); + + try + { + long lastModifiedDate = HttpDateFormat.toLong(lastModifiedStr); + LOG.debug(last modified = + lastModifiedStr + = + lastModifiedDate); + datum.setModifiedTime(lastModifiedDate); + } + catch (ParseException e) + { + LOG.error(unable to parse + lastModifiedStr, e); + } + } if (pstatus != null) datum.getMetaData().put(Nutch.WRITABLE_PROTO_STATUS_KEY, pstatus); ParseResult parseResult = null; Index: src/java/org/apache/nutch/indexer/IndexerMapReduce.java === --- src/java/org/apache/nutch/indexer/IndexerMapReduce.java (revision 817382) +++ src/java/org/apache/nutch/indexer/IndexerMapReduce.java (working copy) @@ -84,8 +84,10 @@ if (CrawlDatum.hasDbStatus(datum)) dbDatum = datum; else if (CrawlDatum.hasFetchStatus(datum)) { - // don't index unmodified (empty) pages - if (datum.getStatus() != CrawlDatum.STATUS_FETCH_NOTMODIFIED) + /* + * Where did this person get the idea that unmodified pages are empty? 
+ // don't index unmodified (empty) pages + if (datum.getStatus() != CrawlDatum.STATUS_FETCH_NOTMODIFIED) */ fetchDatum = datum; } else if (CrawlDatum.STATUS_LINKED == datum.getStatus() || CrawlDatum.STATUS_SIGNATURE == datum.getStatus()) { @@ -108,7 +110,7 @@ } if (!parseData.getStatus().isSuccess() || -fetchDatum.getStatus() != CrawlDatum.STATUS_FETCH_SUCCESS) { +(fetchDatum.getStatus() != CrawlDatum.STATUS_FETCH_SUCCESS fetchDatum.getStatus() != CrawlDatum.STATUS_FETCH_NOTMODIFIED)) { return; } Index: src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java === --- src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java (revision 817382) +++ src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java (working copy) @@ -124,11 +124,14 @@ reqStr.append(\r\n); } - reqStr.append(\r\n); if (datum.getModifiedTime() 0) { -reqStr.append(If-Modified-Since: + HttpDateFormat.toString(datum.getModifiedTime())); + String httpDate = + HttpDateFormat.toString(datum.getModifiedTime()); + Http.LOG.debug(modified time: + httpDate); +reqStr.append(If-Modified-Since: + httpDate); reqStr.append(\r\n); } + reqStr.append(\r\n); byte[] reqBytes= reqStr.toString().getBytes(); On Wed, Oct 14, 2009 at 9:40 AM, sprabhu_PN shreekanth.pra...@pinakilabs.com wrote: We are looking at picking up updates in a recrawl - How do I get the the fetcher to read the recently built segment, get to the url and decide whether to get the content based on whether the url has been updated since? Shreekanth Prabhu -- View this message in context: http://www.nabble.com/Recrawling--Nutch-tp25891294p25891294.html Sent from the Nutch - User mailing list archive at Nabble.com. -- http://www.linkedin.com/in/paultomblin
Re: Incremental Whole Web Crawling
Don't change options in nutch-default.xml - copy the option into nutch-site.xml and change it there. That way the change will (hopefully) survive an upgrade.

On Tue, Oct 6, 2009 at 1:01 AM, Gaurang Patel gaurangtpa...@gmail.com wrote: Hey, Never mind. I got *generate.update.db* in *nutch-default.xml* and set it to true. Regards, Gaurang

2009/10/5 Gaurang Patel gaurangtpa...@gmail.com Hey Andrzej, Can you tell me where to set this property (generate.update.db)? I am trying to run a similar kind of crawl scenario to the one Eric is running. -Gaurang

2009/10/5 Andrzej Bialecki a...@getopt.org Eric wrote: Andrzej, Just to make sure I have this straight: set the generate.update.db property to true, then run bin/nutch generate crawl/crawldb crawl/segments -topN 10 sixteen times? Yes. When this property is set to true, then each fetchlist will be different, because the records for those pages that are already on another fetchlist will be temporarily locked. Please note that this lock holds only for 1 week, so you need to fetch all segments within one week of generating them. You can fetch and updatedb in arbitrary order, so once you have fetched some segments you can run the parsing and updatedb just from these segments, without waiting for all 16 segments to be processed. -- Best regards, Andrzej Bialecki -- Information Retrieval, Semantic Web; Embedded Unix, System Integration -- http://www.sigram.com Contact: info at sigram dot com

-- http://www.linkedin.com/in/paultomblin
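(Putting the two pieces of advice together, the override would go in conf/nutch-site.xml rather than nutch-default.xml, roughly like this sketch:)

  <configuration>
    <property>
      <name>generate.update.db</name>
      <!-- mark generated URLs in the crawldb so successive generate runs
           produce disjoint fetchlists (the lock lasts about a week) -->
      <value>true</value>
    </property>
  </configuration>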
Re: how to upgrade a java application with nutch?
2009/10/1 Jaime Martín james...@gmail.com Hi! I've a java application that I would like to upgrade with nutch. What jars should I add to my application's lib to make it possible to use nutch features from some of my app pages and business logic classes? I've tried with the nutch-1.0.jar generated by the war target without success. I wonder what is the proper nutch build.xml target I should execute for this and which of the generated jars are to be included in my app. Maybe apart from nutch-1.0.jar all the nutch-1.0\lib jars are compulsory, or just a few of them? Maybe I'm doing it wrong, but I used the nutch-1.0.job file instead of the jar. -- http://www.linkedin.com/in/paultomblin
Re: Something wrong with nutch.wiki
2009/10/1 Kirby Bohling kirby.bohl...@gmail.com: 2009/9/29 Ольга Пескова opesk...@mail.ru: Hello! Please check the url: http://wiki.apache.org/nutch/ I can't find any content there. Just as a point of reference, I got the FrontPage to pull up just prior to sending this e-mail. I'm not sure what is wrong with your connection to it, but I don't believe it is the server. It was down for a number of hours today, but evidently it's back up now. -- http://www.linkedin.com/in/paultomblin
Re: Why Nutch is not crawling all links from web page
On Tue, Sep 22, 2009 at 4:17 AM, Pravin Karne pravin_ka...@persistent.co.in wrote: Hi, I am using nutch to crawl particular site. But I found that Nutch is not crawling all links from every pages. Is there any tuning parameter for nutch to crawl all links? There are a number of reasons why it might not follow a link, and in my opinion Nutch really needs a way to provide the information on why it's doing or not doing what it does, without turning on DEBUG level logging. One reason might be if the links go to a new site (or redirect internally - if you follow links within www.prnewswire.com, they often do a silent redirect to news.prnewswire.com) and you've got the property db.ignore.external.links set to true. Another reason might be the robots.txt file for the site you're crawling forbids you from crawling those parts of the site. Another reason might be that you're using one or more url filters, and the url filters forbid the urls in question. There are probably other reasons, but those are the ones that have bitten me so far. -- http://www.linkedin.com/in/paultomblin
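(For the first reason above, this is roughly what the relevant override looks like in conf/nutch-site.xml; set it to false if you do want links that leave the seed hosts to be followed. The value shown is only an example.)

  <property>
    <name>db.ignore.external.links</name>
    <!-- true = outlinks pointing to other hosts are dropped during updatedb -->
    <value>false</value>
  </property>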
Where should I do this?
I want to output to a file or database every url/filename that's crawled, along with the status. I figure I can do this with a plugin, but I'm not sure where to slot it into the plugin hierarchy. Any suggestions? -- http://www.linkedin.com/in/paultomblin
Re: Difference between Dieselpoint and Nutch?
On Fri, Sep 18, 2009 at 12:06 PM, David M. Cole d...@colegroup.com wrote: At 11:30 AM -0400 9/18/09, Paul Tomblin wrote: Is anybody here familiar with how Dieselpoint (DP) works? Dieselpoint is designed specifically for intranets and therefore doesn't take robots.txt into account, because the Dieselpoint administrator and the web administrator (theoretically) work toward the same goals (see the thread from last Friday, Ignoring Robots.txt, for an instance where that wasn't the case). Nutch is designed specifically for all-web crawling (like Google or Bing) and respects robots.txt because Nutch needs to be polite when indexing sites over which it has no control. Your client has a robots.txt file to control Google and/or Bing, so Nutch is respecting it the same way Google or Bing would. I'm afraid I wasn't clear. The site that the client is indexing with DP is an external site, not hers. Nutch is, I think, doing the right thing by not crawling it, but I can't convince her of this because she's convinced that DP is commercial and Nutch is only Open Source, so obviously DP is right. The site in question does have several sitemaps. Can Nutch do anything with sitemaps? (By the way, what does it mean when the robots.txt file lists more than one sitemap?) -- http://www.linkedin.com/in/paultomblin
What to do about sites with Disallow: * and a sitemap?
Is there a way to make Nutch look at the pages of a site based on its Sitemap? -- http://www.linkedin.com/in/paultomblin
Changing the filter rules?
If I change the filter rules, during a recrawl will URLs that are no longer valid according to the new rules be removed from the segment database? -- http://www.linkedin.com/in/paultomblin
Re: taking a look into a nutch segment
On Fri, Sep 4, 2009 at 4:29 PM, Lowell Kirsh low...@carbonfive.com wrote: I'd like to poke around in my nutch segment and see what data is there. I don't want to write any (or much) code. Are there any utilities out there that could help me with what I'm trying? bin/nutch readseg -- http://www.linkedin.com/in/paultomblin
Re: Help me, No urls to fetch.
On Wed, Sep 2, 2009 at 6:36 AM, zo tiger zo.ti...@hotmail.com wrote: At last I ran the bin/nutch crawl command, but it gives a "No urls to fetch - check your filter and seed list" error. I am sure there is no problem in the crawl-url filter and the other configuration xml files. Does anyone know any possible problem? What's in your url directory? -- http://www.linkedin.com/in/paultomblin
Isn't this a bug?
If I crawl a page with a url like: http://localhost/Documents/pharma/DocSamples/?C=N;O=A (which is what you get when you have a directory without an index.*, you've configured Options Indexes, and you click one of the sorting options), and it presents all the files in the directory as relative links like foo.html, Nutch ends up trying to fetch the files with the second part of that same parameter on the end, like foo.html;O=A, which ends up getting a 404. Look at the parse data for http://localhost/Documents/pharma/DocSamples/?C=D;O=A
...
 [java] outlink: toUrl: http://localhost/Documents/pharma/DocSamples/15%20minutes.htm;O=A anchor: 15 minutes.htm
 [java] outlink: toUrl: http://localhost/Documents/pharma/DocSamples/18whistle.html;O=A anchor: 18whistle.html
 [java] outlink: toUrl: http://localhost/Documents/pharma/DocSamples/2010%20brings%20changes.doc;O=A anchor: 2010 brings changes.doc
...
-- http://www.linkedin.com/in/paultomblin
Getting Can't be handled as Microsoft document - java.util.NoSuchElementException
Is there something special I have to do to parse MS Word documents? I've got parse-msword included as one of my plugins, but I'm getting this error. (This is Nutch-1.0) -- http://www.linkedin.com/in/paultomblin
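(For comparison, a plugin.includes value that pulls in parse-msword looks roughly like the sketch below, placed in conf/nutch-site.xml. The surrounding list of plugins is only an example based on a typical default setup, not a recommendation - keep whatever your configuration already names and just add msword to the parse group.)

  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(text|html|js|msword)|index-(basic|anchor)|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>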
Nutch bug: can't handle urls with spaces in them
In my browser, I can see a URL with spaces in it, but when I hover over it, the browser has replaced the spaces with %20s, and when I click on it I get the document. However, when Nutch attempts to follow the link, it doesn't do that, and so it gets a 404. It should do the same thing that web browsers do, or else I'm going to be facing questions from my users about why certain documents aren't indexed even though they can see them just fine. If I do a view source, I can see the URLs with spaces in them: <a href="http://localhost/Documents/pharma/DocSamples/Leg blood clots.htm">Leg blood clots.htm</a><br /> But when I click on them, the URL gets converted to: http://localhost/Documents/pharma/DocSamples/Leg%20blood%20clots.htm -- http://www.linkedin.com/in/paultomblin
Memory cost of extra threads?
Has anybody quantified what the memory cost is per extra fetch thread? My fetches are taking way too long, and since I'm spending hours at a time staring at

 [java] [DEBUG] 22:40 (Fetcher.java:run:482)
 [java] FetcherThread spin-waiting ...
 [java] [DEBUG] 22:40 (Fetcher.java:run:482)
 [java] FetcherThread spin-waiting ...

over and over again, I'm thinking maybe I should give it more to chew on. -- http://www.linkedin.com/in/paultomblin
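(For reference, the number of fetcher threads is controlled by the fetcher.threads.fetch property, which can be overridden in conf/nutch-site.xml as in the sketch below; the value of 40 is just an example. Per-thread cost is mostly stack space plus connection buffers, so it is usually small compared to the heap, but that claim should be measured rather than trusted.)

  <property>
    <name>fetcher.threads.fetch</name>
    <!-- number of concurrent FetcherThreads; the stock default is 10 -->
    <value>40</value>
  </property>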
Re: Keywords?
On Fri, Aug 21, 2009 at 4:20 AM, Julien Nioche lists.digitalpeb...@gmail.com wrote: You'll need to write a custom parser implementing HtmlParseFilter and get it to store the keywords found in the Metadata, then write a custom Indexer. By default the HTML parser does not do anything about meta tags. That's unfortunate, because org.apache.nutch.parse.html.HtmlParser actually extracts all the meta tags, and then takes a few and throws the rest away. It's mildly annoying that I'm going to have to re-implement all of HtmlParser just to add two lines to take that data out of metaTags and put it in content.getMetaData(). -- http://www.linkedin.com/in/paultomblin
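(To make the suggestion concrete, here is an untested sketch of an HtmlParseFilter that copies the keywords meta tag into the parse metadata. The package, class, and field names are made up for illustration, and it walks the DOM itself rather than relying on whatever HtmlParser keeps in its metaTags structure. It would still need the usual plugin.xml and build glue, plus an IndexingFilter to move the value into the index.)

  // Hypothetical plugin class; package and names are illustrative only.
  package org.example.parse.keywords;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.nutch.parse.HTMLMetaTags;
  import org.apache.nutch.parse.HtmlParseFilter;
  import org.apache.nutch.parse.Parse;
  import org.apache.nutch.parse.ParseResult;
  import org.apache.nutch.protocol.Content;
  import org.w3c.dom.DocumentFragment;
  import org.w3c.dom.NamedNodeMap;
  import org.w3c.dom.Node;

  public class KeywordsParseFilter implements HtmlParseFilter {
    private Configuration conf;

    public ParseResult filter(Content content, ParseResult parseResult,
                              HTMLMetaTags metaTags, DocumentFragment doc) {
      String keywords = findMeta(doc, "keywords");
      if (keywords != null) {
        // store it on the parse so an indexing filter can pick it up later
        Parse parse = parseResult.get(content.getUrl());
        if (parse != null) {
          parse.getData().getParseMeta().add("keywords", keywords);
        }
      }
      return parseResult;
    }

    // depth-first search for <meta name="..." content="...">
    private String findMeta(Node node, String name) {
      if ("meta".equalsIgnoreCase(node.getNodeName())) {
        NamedNodeMap attrs = node.getAttributes();
        if (attrs != null) {
          Node n = attrs.getNamedItem("name");
          Node c = attrs.getNamedItem("content");
          if (n != null && c != null && name.equalsIgnoreCase(n.getNodeValue())) {
            return c.getNodeValue();
          }
        }
      }
      for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
        String found = findMeta(child, name);
        if (found != null) return found;
      }
      return null;
    }

    public void setConf(Configuration conf) { this.conf = conf; }
    public Configuration getConf() { return conf; }
  }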
Keywords?
Is there a way to extract the keywords from an html page? I can't find it in ParseData or CrawlDatum anywhere. -- http://www.linkedin.com/in/paultomblin
Nutch.SIGNATURE_KEY
Is SIGNATURE_KEY (aka nutch.content.digest) a valid way to check if my page has changed since the last time I crawled it? I patched Nutch to properly handle modification dates, and then discovered that my web site doesn't send a Last-Modified header because it uses shtml (server-parsed HTML). I assume it's some sort of cryptographic hash of the entire page? Another question: is Nutch smart enough to use that signature to determine that, say, http://xcski.com/ and http://xcski.com/index.html are the same page? -- http://www.linkedin.com/in/paultomblin
Re: Nutch.SIGNATURE_KEY
On Wed, Aug 19, 2009 at 1:00 PM, Ken Kruglerkkrugler_li...@transpac.com wrote: Another question: is Nutch smart enough to use that signature to determine that, say, http://xcski.com/ and http://xcski.com/index.html are the same page? I believe the hashes would be the same for either raw MD5 or text signature, yes. So on the search side these would get collapsed. Don't know about what else you mean as far as same page - e.g. one entry in the CrawlDB? If so, then somebody else with more up-to-date knowledge of Nutch would need to chime in here. Older versions of Nutch would still have these as separate entries, FWIR. Actually, I just checked some of my own pages, and http://xcski.com/ and http://xcski.com/index.html have different signatures, in spite of them being the same page. So I guess the answer to that is no, even if there were logic to make them the same page in CrawlDB, it wouldn't work. -- http://www.linkedin.com/in/paultomblin
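(If you want to check this yourself, the crawldb reader can print the stored datum, including the signature, for a single URL - roughly like the commands below, assuming your version supports readdb's -url option; the paths are just examples.)

  bin/nutch readdb crawl/crawldb -url http://xcski.com/
  bin/nutch readdb crawl/crawldb -url http://xcski.com/index.html
  # compare the Signature lines in the two dumps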
Which versions?
Which versions of Lucene, Nutch and Solr work together? I've discovered that the Nutch trunk and the Solr trunk use wildly different versions of the Lucene jars, and it's causing me problems. -- http://www.linkedin.com/in/paultomblin
Re: How do I get all the documents in the index without searching?
On Tue, Aug 11, 2009 at 2:10 PM, Paul Tomblin ptomb...@xcski.com wrote: I want to iterate through all the documents that are in the crawl, programmatically. The only code I can find does searches. I don't want to search for a term, I want everything. Is there a way to do this? To answer my own question, what I ended up doing was:

  IndexReader reader = IndexReader.open(indexDir.getAbsolutePath());
  for (int i = 0; i < reader.numDocs(); i++) {
      Document doc = reader.document(i);
  }

Now that I have the Document, I have to figure out how to process it further to get the actual contents, but I assume that I need to go back to the segment for that. -- http://www.linkedin.com/in/paultomblin
How do I get all the documents in the index without searching?
I want to iterate through all the documents that are in the crawl, programattically. The only code I can find does searches. I don't want to search for a term, I want everything. Is there a way to do this? -- http://www.linkedin.com/in/paultomblin
Why isn't fetcher sending the last fetch time when it does a GET?
I'm watching my server logs as I do a second crawl of the site I crawled yesterday, and it's getting HTTP response code 200 on every page. Since none of those pages have changed, ideally the fetcher should send the last retrieval time in an If-Modified-Since header, and the server would then respond with a 304 Not Modified, so it wouldn't have to re-fetch and reparse the same page. Wouldn't this be a major win in terms of bandwidth consumed? Certainly GoogleBot does it that way. I'm doing the crawl using a slightly modified version of the script on the Wiki http://wiki.apache.org/nutch/Crawl -- http://www.linkedin.com/in/paultomblin
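(For illustration, the conditional-GET exchange being described looks roughly like this; the host, path, and date are made up:)

  GET /some/page.html HTTP/1.1
  Host: example.com
  If-Modified-Since: Tue, 04 Aug 2009 12:00:00 GMT

  HTTP/1.1 304 Not Modified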
Re: Print out a list of every URL fetched?
Not quite what I want - that will show me every url that's ever been crawled, not just the ones fetched this time, nor is it real-time. On Fri, Aug 7, 2009 at 3:23 AM, Sebastian Nagelsebastian.na...@exorbyte.com wrote: Hi Paul, you can use $NUTCH_HOME/bin/nutch readdb my_crawl/crawldb/ -dump dump_crawldb/ -format csv then in dump_crawldb you'll find a CSV file with all URLs in your crawlDb. One column indicates the status. Select only those records with db_fetched and you'll have your list. Sebastian -- http://www.linkedin.com/in/paultomblin
Why did it think /style was part of the URL?
I am crawling my own site, which includes an ancient MovableType installation. When it gets to http://xcski.com/movabletype/mt.cgi, it produces an invalid outlink (seen by an exception in the crawl, and in the following readseg dump):

Outlinks: 8
  outlink: toUrl: http://xcski.com/movabletype/text/css anchor:
  outlink: toUrl: http://xcski.com/movabletype//style anchor:
  outlink: toUrl: http://xcski.com/movabletype/mt.cgi?__mode=start_recover anchor:
  outlink: toUrl: http://xcski.com/movabletype/styles.css anchor:
  outlink: toUrl: http://xcski.com/movabletype/images/mt-logo.gif anchor:
  outlink: toUrl: http://xcski.com/movabletype/images/spacer.gif anchor:
  outlink: toUrl: http://xcski.com/movabletype/images/spacer.gif anchor:
  outlink: toUrl: http://xcski.com/movabletype/mt.cgi# anchor: Forgot your password?

Looking through the text returned by just doing a wget on that URL, I don't see any href that's anywhere near a /style, so I can't figure out why it's doing that. -- http://www.linkedin.com/in/paultomblin
Print out a list of every URL fetched?
If I want to print out a list of every URL as it's fetched, or better yet write that list to a file, is there a good plugin to implement? I'm guessing URLFilter isn't the best because it might see urls that don't actually get fetched as well as ones that return 304, 4xx or 5xx response codes. Ideally, it should only print ones that are being re-indexed. -- http://www.linkedin.com/in/paultomblin
Re: Added plugins not visible
On Wed, Aug 5, 2009 at 2:51 AM, Saurabh Suman saurabhsuman...@rediff.com wrote: Hi, I have created a plugin for an indexing filter. I put it in [nutch_folder]\src\plugin\ and then built it with the ant command. The build was successful and it created a jar. I also added that jar to the classpath. In nutch-default.xml, in the value of the property <name>plugin.includes</name>, I added that plugin too, like <value>...parse-(text|html|js)|index-(basic|anchor|germinait)|...</value>. My plugin is index-germinait. But when I run the crawl, it is not detecting index-germinait. Where am I wrong? Which step am I missing? Is the plugin.xml also in the classpath? -- http://www.linkedin.com/in/paultomblin
Re: Added plugins not visible
On Wed, Aug 5, 2009 at 8:08 AM, Saurabh Sumansaurabhsuman...@rediff.com wrote: No plugin.xml is not in classpath. I think it needs to be. Or at least, it needs to be in either build/plugins/plugin-name or in plugins/plugin-name. -- http://www.linkedin.com/in/paultomblin
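(For reference, a minimal plugin.xml for an indexing-filter plugin looks roughly like the sketch below; the ids, class name, and jar name are placeholders borrowed from the thread. It sits next to the plugin's jar in the plugin's own directory under plugins/ or build/plugins/.)

  <plugin id="index-germinait" name="Germinait Indexing Filter"
          version="1.0.0" provider-name="example.com">
    <runtime>
      <library name="index-germinait.jar">
        <export name="*"/>
      </library>
    </runtime>
    <requires>
      <import plugin="nutch-extensionpoints"/>
    </requires>
    <extension id="com.example.index.germinait"
               name="Germinait Indexing Filter"
               point="org.apache.nutch.indexer.IndexingFilter">
      <implementation id="GerminaitIndexingFilter"
                      class="com.example.index.GerminaitIndexingFilter"/>
    </extension>
  </plugin>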
Re: Nutch in C++
On Tue, Aug 4, 2009 at 1:35 PM, reinhard schwab reinhard.sch...@aon.at wrote: And why? I guess you may see some performance improvement, but it would be a LOT cheaper to throw hardware at the problem (and you may not see much if any). performance improvement? can you prove that c++ will be faster? Considering that Nutch is mostly network-I/O bound, rewriting it in a different language isn't going to make the Internet serve up your pages faster. -- http://www.linkedin.com/in/paultomblin
Re: Plugin development
That assumes that you're going to be putting the plugin in the Nutch source tree. I'm looking for guidance on what to do differently if you don't put it in the nutch source tree. On Fri, Jul 31, 2009 at 12:48 AM, Alexander Aristovalexander.aris...@gmail.com wrote: This is a simple HowTo http://wiki.apache.org/nutch/WritingPluginExample-0.9 Best Regards Alexander Aristov 2009/7/31 Paul Tomblin ptomb...@xcski.com How do I develop a plugin that isn't in the nutch source tree? I want to keep all my project's source code together, and not put the project specific plugin in with the nutch code. Do I just have my plugin's build.xml include $NUTCH_HOME/src/home/build-plugin.xml? (I'm a little shakey on ant syntax, I'm used to make.) Other than that, and making sure my plugin's jar file ends up in nutch's CLASSPATH, is there anything special I need to know? Should I be asking this on the developer list? -- http://www.linkedin.com/in/paultomblin -- http://www.linkedin.com/in/paultomblin
Re: Plugin development
On Fri, Jul 31, 2009 at 4:33 AM, Alexander Aristovalexander.aris...@gmail.com wrote: What do you mean under putting it in the nutch source tree. I mean those instructions you linked to (which I had already seen) only show you how to compile your plugin if you're willing to put it in $NUTCH_HOME/src/nutch/plugin, which I am not. I want to be able to compile it in my own source tree. I don't need to put my servlet code in the Tomcat source code tree to compile it, and I don't need to put my Swing code in com/javax, so I shouldn't need the source code tree of Nutch just to compile a plugin. -- http://www.linkedin.com/in/paultomblin
Nutch and Solr
I'm trying to follow the example in the Wiki, but it's corrupt. It has a bunch of garbage in the part you're supposed to paste into solrconfig.xml - I don't know if something got interpreted as wiki markup when it shouldn't have, or what, but I doubt superscripts are a normal part of the configuration. Can somebody please tell me what I'm supposed to do there? -- http://www.linkedin.com/in/paultomblin
Re: how to exclude some external links
On Thu, Jul 30, 2009 at 9:15 PM, alx...@aim.com wrote: I would like to know how can I modify nutch code to exclude external links with certain extensions. For example, if have in urls mydomain.com and my domain.com has a lot of links like mydomain.com/mylink.shtml, then I want nutch not to fetch(crawl) these kind of urls at all. Can't you do this with the existing RegexURLFilter plugin? Make sure urlfilter-regex is listed in plugin.includes, and that you've got the property urlfilter.regex.file is set to a file (probably regex-urlfilter.txt). Then you can list the extensions you want to skip in that file. -- http://www.linkedin.com/in/paultomblin
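(Something like this in the file named by urlfilter.regex.file, usually regex-urlfilter.txt, should do it; rules are tried top-down and the first matching rule wins. The .shtml extension here is just the example from the question.)

  # skip any URL ending in .shtml
  -\.shtml$
  # accept everything else
  +.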
Plugin development
How do I develop a plugin that isn't in the nutch source tree? I want to keep all my project's source code together, and not put the project-specific plugin in with the nutch code. Do I just have my plugin's build.xml include $NUTCH_HOME/src/home/build-plugin.xml? (I'm a little shaky on ant syntax; I'm used to make.) Other than that, and making sure my plugin's jar file ends up in nutch's CLASSPATH, is there anything special I need to know? Should I be asking this on the developer list? -- http://www.linkedin.com/in/paultomblin
Include/exclude lists
Is there any way other than the config files to specify the url filter parameters? I have a few dozen sites to crawl, and for each site I want to specify its own includes and excludes. I don't want to have to go into the config file and change the <name>urlfilter.regex.file</name> property each time. Can I specify that on the command line to bin/nutch generate or something? -- http://www.linkedin.com/in/paultomblin
Dumping what I have?
The nutch data files are pretty opaque, and even strings can't extract anything except the occasional URL. Is there any code to dump the contents of the various files in a human readable form? -- http://www.linkedin.com/in/paultomblin
Re: Dumping what I have?
Awesome! Thanks.

On Tue, Jul 28, 2009 at 12:26 PM, reinhard schwab reinhard.sch...@aon.at wrote: yes, there are tools which you can use to dump the content of the crawl db, link db and segments.

  dump=./crawl/dump
  bin/nutch readdb $crawl/crawldb -dump $dump/crawldb
  bin/nutch readlinkdb $crawl/linkdb -dump $dump/linkdb
  bin/nutch readseg -dump $1 $dump/segments/$1

you will get more info if you call
  bin/nutch readdb
  bin/nutch readlinkdb
  bin/nutch readseg

Paul Tomblin schrieb: The nutch data files are pretty opaque, and even strings can't extract anything except the occasional URL. Is there any code to dump the contents of the various files in a human readable form? -- http://www.linkedin.com/in/paultomblin
Re: How to index other fields in solr
Wouldn't that be using facets, as per http://wiki.apache.org/solr/SimpleFacetParameters ? On Mon, Jul 27, 2009 at 2:34 AM, Saurabh Suman saurabhsuman...@rediff.com wrote: I am using solr for searching. I used the class SolrIndexer, but I can search on content only. I want to search on author also. How do I index on author? -- View this message in context: http://www.nabble.com/How-to-index-other-fields-in-solr-tp24674208p24674208.html Sent from the Nutch - User mailing list archive at Nabble.com. -- http://www.linkedin.com/in/paultomblin
Re: Why did my crawl fail?
Actually, I got that error the first time I used it, and then again when I blew away the downloaded nutch and grabbed the latest trunk from Subversion. On Mon, Jul 27, 2009 at 1:11 AM, xiao yang yangxiao9...@gmail.com wrote: You must have crawled for several times, and some of them failed before the parse phase. So the parse data was not generated. You'd better delete the whole directory file:/Users/ptomblin/nutch-1.0/crawl.blog, and recrawl it, then you will know the exact reason why it failed in the parse phase from the output information. Xiao On Fri, Jul 24, 2009 at 10:53 PM, Paul Tomblinptomb...@xcski.com wrote: I installed nutch 1.0 on my laptop last night and set it running to crawl my blog with the command: bin/nutch crawl urls -dir crawl.blog -depth 10 it was still running strong when I went to bed several hours later, and this morning I woke up to this: activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0 -activeThreads=0 Fetcher: done CrawlDb update: starting CrawlDb update: db: crawl.blog/crawldb CrawlDb update: segments: [crawl.blog/segments/20090724010303] CrawlDb update: additions allowed: true CrawlDb update: URL normalizing: true CrawlDb update: URL filtering: true CrawlDb update: Merging segment data into db. CrawlDb update: done LinkDb: starting LinkDb: linkdb: crawl.blog/linkdb LinkDb: URL normalize: true LinkDb: URL filter: true LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723154530 LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155106 LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155122 LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155303 LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155812 LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723161808 LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723171215 LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723193543 LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723224936 LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090724004250 LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090724010303 Exception in thread main org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723154530/parse_data at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:179) at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:39) at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:190) at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:797) at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1142) at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:170) at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:147) at org.apache.nutch.crawl.Crawl.main(Crawl.java:129) -- http://www.linkedin.com/in/paultomblin -- http://www.linkedin.com/in/paultomblin
Re: Why did my crawl fail?
Unfortunately I blew away those particular logs when I fetched the svn trunk. I just tried it again (well, I started it again at noon and it just finished) and this time it worked fine, so it seems kind of heisenbug-like. Maybe it has something to do with which pages are types it can't handle? On Mon, Jul 27, 2009 at 11:27 AM, xiao yang yangxiao9...@gmail.com wrote: Hi, Paul Can you post the error messages in the log file (file:/Users/ptomblin/nutch-1.0/logs)? On Mon, Jul 27, 2009 at 6:55 PM, Paul Tomblinptomb...@xcski.com wrote: Actually, I got that error the first time I used it, and then again when I blew away the downloaded nutch and grabbed the latest trunk from Subversion. On Mon, Jul 27, 2009 at 1:11 AM, xiao yang yangxiao9...@gmail.com wrote: You must have crawled for several times, and some of them failed before the parse phase. So the parse data was not generated. You'd better delete the whole directory file:/Users/ptomblin/nutch-1.0/crawl.blog, and recrawl it, then you will know the exact reason why it failed in the parse phase from the output information. Xiao On Fri, Jul 24, 2009 at 10:53 PM, Paul Tomblinptomb...@xcski.com wrote: I installed nutch 1.0 on my laptop last night and set it running to crawl my blog with the command: bin/nutch crawl urls -dir crawl.blog -depth 10 it was still running strong when I went to bed several hours later, and this morning I woke up to this: activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0 -activeThreads=0 Fetcher: done CrawlDb update: starting CrawlDb update: db: crawl.blog/crawldb CrawlDb update: segments: [crawl.blog/segments/20090724010303] CrawlDb update: additions allowed: true CrawlDb update: URL normalizing: true CrawlDb update: URL filtering: true CrawlDb update: Merging segment data into db. 
CrawlDb update: done LinkDb: starting LinkDb: linkdb: crawl.blog/linkdb LinkDb: URL normalize: true LinkDb: URL filter: true LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723154530 LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155106 LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155122 LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155303 LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155812 LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723161808 LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723171215 LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723193543 LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723224936 LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090724004250 LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090724010303 Exception in thread main org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723154530/parse_data at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:179) at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:39) at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:190) at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:797) at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1142) at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:170) at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:147) at org.apache.nutch.crawl.Crawl.main(Crawl.java:129) -- http://www.linkedin.com/in/paultomblin -- http://www.linkedin.com/in/paultomblin -- http://www.linkedin.com/in/paultomblin
Re: Why did my crawl fail?
No, it fetched thousands of pages - my blog and picture gallery. It just never finished indexing them because as well as looking at the 11 segments that exist, it's also trying to look at a segment that doesn't. On Sun, Jul 26, 2009 at 9:06 PM, arkadi.kosmy...@csiro.au wrote: This is a very interesting issue. I guess that absence of parse_data means that no content has been fetched. Am I wrong? This happened in my crawls a few times. Theoretically (I am guessing again) this may happen if all urls selected for fetching on this iteration are either blocked by the filters, or failed to be fetched, for whatever reason. I got around this problem by checking for presence of parse_data, and if it is absent, deleting the segment. This seems to be working, but I am not 100% sure that this is a good thing to do. Can I do this? Is it safe to do? Would appreciate if someone with expert knowledge commented on this issue. Regards, Arkadi -Original Message- From: ptomb...@gmail.com [mailto:ptomb...@gmail.com] On Behalf Of Paul Tomblin Sent: Saturday, July 25, 2009 12:54 AM To: nutch-user Subject: Why did my crawl fail? I installed nutch 1.0 on my laptop last night and set it running to crawl my blog with the command: bin/nutch crawl urls -dir crawl.blog -depth 10 it was still running strong when I went to bed several hours later, and this morning I woke up to this: activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0 -activeThreads=0 Fetcher: done CrawlDb update: starting CrawlDb update: db: crawl.blog/crawldb CrawlDb update: segments: [crawl.blog/segments/20090724010303] CrawlDb update: additions allowed: true CrawlDb update: URL normalizing: true CrawlDb update: URL filtering: true CrawlDb update: Merging segment data into db. CrawlDb update: done LinkDb: starting LinkDb: linkdb: crawl.blog/linkdb LinkDb: URL normalize: true LinkDb: URL filter: true LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723154530 LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155106 LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155122 LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155303 LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155812 LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723161808 LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723171215 LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723193543 LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723224936 LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090724004250 LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090724010303 Exception in thread main org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/Users/ptomblin/nutch- 1.0/crawl.blog/segments/20090723154530/parse_data at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:1 79) at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileIn putFormat.java:39) at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:19 0) at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:797) at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1142) at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:170) at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:147) at 
org.apache.nutch.crawl.Crawl.main(Crawl.java:129) -- http://www.linkedin.com/in/paultomblin -- http://www.linkedin.com/in/paultomblin
Can I chunk during the crawl?
Forgive me if this is a bit of a n00b question. I've been tasked with taking some other person's code and replacing all the DieselPoint code with Lucene/Nutch. What they do in DieselPoint is crawl specific parts of the web, then perform some proprietary splitting up of the returned pages into chunks, and then the chunks themselves are indexed. Actually, I think they do it in a kind of a naive way, because it appears that DieselPoint crawls and indexes, and then this code goes through the index and creates chunk files, possibly several from any given initial page, and then DieselPoint is set loose to crawl and index those chunk files. Then the app uses *that* index in proprietary searches. I'm trying to learn my way around Nutch, and I'm wondering if there might be a way to get rid of the chunking stage by doing it directly in the initial crawl, possibly by writing a plugin. Unfortunately I'm under NDA so I can't give away too much of what the chunking process does, but I hope I've given enough information on what I'm trying to do. Is what I'm doing possible? -- http://www.linkedin.com/in/paultomblin
Why did my crawl fail?
I installed nutch 1.0 on my laptop last night and set it running to crawl my blog with the command: bin/nutch crawl urls -dir crawl.blog -depth 10 it was still running strong when I went to bed several hours later, and this morning I woke up to this: activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0 -activeThreads=0 Fetcher: done CrawlDb update: starting CrawlDb update: db: crawl.blog/crawldb CrawlDb update: segments: [crawl.blog/segments/20090724010303] CrawlDb update: additions allowed: true CrawlDb update: URL normalizing: true CrawlDb update: URL filtering: true CrawlDb update: Merging segment data into db. CrawlDb update: done LinkDb: starting LinkDb: linkdb: crawl.blog/linkdb LinkDb: URL normalize: true LinkDb: URL filter: true LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723154530 LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155106 LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155122 LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155303 LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155812 LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723161808 LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723171215 LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723193543 LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723224936 LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090724004250 LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090724010303 Exception in thread main org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723154530/parse_data at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:179) at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:39) at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:190) at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:797) at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1142) at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:170) at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:147) at org.apache.nutch.crawl.Crawl.main(Crawl.java:129) -- http://www.linkedin.com/in/paultomblin