OT: Can't get unsubscribed from the wiki notifications

2010-01-17 Thread Paul Tomblin
Somehow I got subscribed to the emails whenever the wiki gets updated,
and I can't figure out how to unsubscribe from them.  The password
recovery form never seems to send me the email it's supposed to, which
makes me suspect I forgot which wiki name or email address I used.  Do I
have any hope of getting unsubscribed, or should I just filter out those
messages?

-- 
http://www.linkedin.com/in/paultomblin
http://careers.stackoverflow.com/ptomblin



RE: recrawl.sh stopped at depth 7/10 without error

2009-12-07 Thread Paul Tomblin
Try starting it with nohup.  'man nohup' for details.
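
For example (a sketch; the crawl arguments are taken from your command below):

  nohup ./bin/nutch crawl urls -dir crawl -depth 10 > crawl.log 2>&1 &

nohup detaches the command from the terminal's hangup signal, so closing the
console no longer kills the crawl.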

-- Sent from my Palm Prē
BELLINI ADAM wrote:

hi,



maybe I found my problem; it's not a Nutch mistake. I believed that when running the
crawl command as a background process, closing my console would not stop
the process, but it seems that it really does kill the process.





i launched the process like this: ./bin/nutch crawl urls -dir crawl -depth 10 > crawl.log &



but even with the '&' character, closing my console kills the
process.



thx



 Date: Mon, 7 Dec 2009 19:00:37 +0800

 Subject: Re: recrawl.sh stopped at depth 7/10 without error

 From: yea...@gmail.com

 To: nutch-user@lucene.apache.org

 

 I sill want to  know the reason.

 

 2009/12/2 BELLINI ADAM <mbel...@msn.com>

 

 

  hi,

 

  any idea guys??

 

 

 

  thanx

 

   From: mbel...@msn.com

   To: nutch-user@lucene.apache.org

   Subject: RE: recrawl.sh stopped at depth 7/10 without error

   Date: Fri, 27 Nov 2009 20:11:12 +

  

  

  

   hi,

  

   this is the main loop of my recrawl.sh

  

  

   do

 echo "--- Beginning crawl at depth `expr $i + 1` of $depth ---"
 $NUTCH_HOME/bin/nutch generate $crawl/crawldb $crawl/segments $topN \
 -adddays $adddays
 if [ $? -ne 0 ]
 then
   echo "runbot: Stopping at depth $depth. No more URLs to fetch."
   break
 fi
 segment=`ls -d $crawl/segments/* | tail -1`

 $NUTCH_HOME/bin/nutch fetch $segment -threads $threads
 if [ $? -ne 0 ]
 then
   echo "runbot: fetch $segment at depth `expr $i + 1` failed."
   echo "runbot: Deleting segment $segment."
   rm $RMARGS $segment
   continue
 fi

 $NUTCH_HOME/bin/nutch updatedb $crawl/crawldb $segment

   done

  

   echo - Merge Segments (Step 3 of $steps) -

  

  

  

   in my log file I never find the message "- Merge Segments (Step 3 of
  $steps) -", so it breaks the loop and stops the process.

  

   I don't understand why it stops at depth 7 without any errors!

  

  

From: mbel...@msn.com

To: nutch-user@lucene.apache.org

Subject: recrawl.sh stopped at depth 7/10 without error

Date: Wed, 25 Nov 2009 15:43:33 +

   

   

   

hi,

   

i'm running recrawl.sh and it stops every time at depth 7/10 without
  any error! but when I run the bin/nutch crawl command with the same crawl-urlfilter and the
  same seeds file it finishes smoothly in 1h50

   

I checked the hadoop.log and don't find any error there... I just find
  the last url it was parsing.
Does fetching or crawling have a timeout?
My recrawl takes 2 hours before it stops. I set the fetch time interval to
  24 hours and I'm running the generate with adddays = 1.

   

best regards

   


 



Nutch frozen but not exiting

2009-11-28 Thread Paul Tomblin
My nutch crawl just stopped.  The process is still there, and doesn't
respond to a kill -TERM or a kill -HUP, but it hasn't written
anything to the log file in the last 40 minutes.  The last thing it
logged was some calls to my custom url filter.  Nothing has been
written in the hadoop directory or the crawldir/crawldb or the
segments dir in that time.

How can I tell what's going on and why it's stopped?

-- 
http://www.linkedin.com/in/paultomblin
http://careers.stackoverflow.com/ptomblin


Re: Nutch frozen but not exiting

2009-11-28 Thread Paul Tomblin
On Sat, Nov 28, 2009 at 4:45 PM, Andrzej Bialecki a...@getopt.org wrote:
 Paul Tomblin wrote:

 How can I tell what's going on and why it's stopped?

 Try to generate a thread dump to see what code is being executed.

I didn't do any sort of distributed mode because I've only got one
core.  I had to do a jstack -F to force a stack dump, and here's
what it says:

-bash-3.2$ jstack -F 32507
Attaching to process ID 32507, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 14.3-b01
Deadlock Detection:

No deadlocks found.

Thread 21558: (state = IN_NATIVE_TRANS)
 - java.lang.UNIXProcess.forkAndExec(byte[], byte[], int, byte[], int,
byte[], boolean, java.io.FileDescriptor, java.io.FileDescriptor,
java.io.FileDescriptor) @bci=0 (Interpreted frame)
 - java.lang.UNIXProcess.access$500(java.lang.UNIXProcess, byte[],
byte[], int, byte[], int, byte[], boolean, java.io.FileDescriptor,
java.io.FileDescriptor, java.io.FileDescriptor) @bci=18, line=20
(Interpreted frame)
 - java.lang.UNIXProcess$1$1.run() @bci=93, line=109 (Interpreted frame)


Thread 21548: (state = BLOCKED)
 - sun.misc.Unsafe.park(boolean, long) @bci=0 (Interpreted frame)
 - java.util.concurrent.locks.LockSupport.park(java.lang.Object)
@bci=14, line=158 (Interpreted frame)
 - java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await()
@bci=42, line=1925 (Interpreted frame)
 - org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run()
@bci=55, line=882 (Interpreted frame)


Thread 21545: (state = BLOCKED_TRANS)
 - java.lang.Thread.sleep(long) @bci=0 (Interpreted frame)
 - org.apache.hadoop.mapred.Task$1.run() @bci=31, line=403 (Interpreted frame)
 - java.lang.Thread.run() @bci=11, line=619 (Interpreted frame)


Thread 21540: (state = BLOCKED)
 - java.lang.Object.wait(long) @bci=0 (Interpreted frame)
 - java.lang.Object.wait() @bci=2, line=485 (Interpreted frame)
 - java.lang.UNIXProcess$Gate.waitForExit() @bci=10, line=64 (Interpreted frame)
 - java.lang.UNIXProcess.init(byte[], byte[], int, byte[], int,
byte[], boolean) @bci=74, line=145 (Interpreted frame)
 - java.lang.ProcessImpl.start(java.lang.String[], java.util.Map,
java.lang.String, boolean) @bci=182, line=65 (Interpreted frame)
 - java.lang.ProcessBuilder.start() @bci=112, line=452 (Interpreted frame)
 - org.apache.hadoop.util.Shell.runCommand() @bci=52, line=149
(Interpreted frame)
 - org.apache.hadoop.util.Shell.run() @bci=23, line=134 (Interpreted frame)
 - org.apache.hadoop.fs.DF.getAvailable() @bci=1, line=73 (Interpreted frame)
 - 
org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(java.lang.String,
long, org.apache.hadoop.conf.Configuration) @bci=187, line=321
(Interpreted frame)
 - org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(java.lang.String,
long, org.apache.hadoop.conf.Configuration) @bci=16, line=124
(Interpreted frame)
 - 
org.apache.hadoop.mapred.MapOutputFile.getSpillFileForWrite(org.apache.hadoop.mapred.TaskAttemptID,
int, long) @bci=50, line=107 (Interpreted frame)
 - org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill()
@bci=78, line=930 (Compiled frame)
 - org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush() @bci=104,
line=842 (Interpreted frame)
 - org.apache.hadoop.mapred.MapTask.run(org.apache.hadoop.mapred.JobConf,
org.apache.hadoop.mapred.TaskUmbilicalProtocol) @bci=391, line=343
(Interpreted frame)
 - org.apache.hadoop.mapred.LocalJobRunner$Job.run() @bci=282,
line=138 (Interpreted frame)


Thread 32521: (state = BLOCKED_TRANS)
 - java.lang.Object.wait(long) @bci=0 (Interpreted frame)
 - java.lang.ref.ReferenceQueue.remove(long) @bci=44, line=118
(Interpreted frame)
 - 
org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$ReferenceQueueThread.run()
@bci=9, line=1082 (Interpreted frame)


Thread 32516: (state = BLOCKED_TRANS)


Thread 32515: (state = BLOCKED)
 - java.lang.Object.wait(long) @bci=0 (Interpreted frame)
 - java.lang.ref.ReferenceQueue.remove(long) @bci=44, line=118
(Interpreted frame)
 - java.lang.ref.ReferenceQueue.remove() @bci=2, line=134 (Compiled frame)
 - java.lang.ref.Finalizer$FinalizerThread.run() @bci=3, line=159
(Compiled frame)


Thread 32514: (state = BLOCKED)
 - java.lang.Object.wait(long) @bci=0 (Interpreted frame)
 - java.lang.Object.wait() @bci=2, line=485 (Compiled frame)
 - java.lang.ref.Reference$ReferenceHandler.run() @bci=46, line=116
(Compiled frame)


Thread 32508: (state = IN_VM_TRANS)
 - org.apache.hadoop.mapred.JobStatus.getRunState() @bci=0, line=199
(Interpreted frame)
 - org.apache.hadoop.mapred.JobClient$NetworkedJob.isComplete()
@bci=8, line=278 (Interpreted frame)
 - org.apache.hadoop.mapred.JobClient.runJob(org.apache.hadoop.mapred.JobConf)
@bci=149, line=1155 (Interpreted frame)
 - org.apache.nutch.crawl.CrawlDb.update(org.apache.hadoop.fs.Path,
org.apache.hadoop.fs.Path[], boolean, boolean, boolean, boolean)
@bci=363, line=94 (Interpreted frame

Re: Nutch frozen but not exiting

2009-11-28 Thread Paul Tomblin
On Sat, Nov 28, 2009 at 8:25 PM, Andrzej Bialecki a...@getopt.org wrote:

 Hm, the curious thing here is that the java process is sleeping, and 99% of
 cpu is in system time ... usually this would indicate swapping, but since
 there is no swap in your setup I'm stumped. Still, this may be related to
 the weird memory/swap setup on that machine - try decreasing the heap size
 and see what happens.

When I decrease the heap size, it dies pretty early on.

-- 
http://www.linkedin.com/in/paultomblin
http://careers.stackoverflow.com/ptomblin


Re: Problem with Indexing Local Filesystem.

2009-11-15 Thread Paul Tomblin
On Sun, Nov 15, 2009 at 2:45 AM, prashant ullegaddi
prashullega...@gmail.com wrote:

 -activeThreads=0
 Exception in thread main java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1232)
    at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:969)
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:122)

When that happened to me, it meant that the temporary hadoop files had
filled up the /tmp file system.  I had to configure hadoop to put its
files somewhere else by putting the following in conf/hadoop-site.xml

<configuration>

<property>
  <name>hadoop.tmp.dir</name>
  <value>/var/tmp</value>
</property>

</configuration>
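
If you want to confirm first that a full /tmp really is the cause, a quick
check is enough (a sketch; the hadoop-* pattern matches the per-user temp
directories Hadoop creates):

  df -h /tmp /var/tmp
  du -sh /tmp/hadoop-* 2>/dev/null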

-- 
http://www.linkedin.com/in/paultomblin
http://careers.stackoverflow.com/ptomblin


Re: Hadoop wants to do whoami?

2009-11-07 Thread Paul Tomblin
On Fri, Nov 6, 2009 at 11:44 PM, Ken Krugler
kkrugler_li...@transpac.com wrote:

 Normally it works fine, but it will fail if you don't have swap space
 allocated because that's factored into the free space calc when the fork
 happens.

 What's the swap space setup for your VPS setup?

There's no swap space.

-- 
http://www.linkedin.com/in/paultomblin
http://careers.stackoverflow.com/ptomblin


Why is nutch writing files in /tmp?

2009-11-02 Thread Paul Tomblin
Why is nutch writing /tmp/hadoop-[userid] files, and how can I stop it
doing that?


-- 
http://www.linkedin.com/in/paultomblin
http://careers.stackoverflow.com/ptomblin


Re: Redirect handling

2009-10-27 Thread Paul Tomblin
There are two different types of redirect.  When a web site returns a
301 status (permanent redirect), it means the URL you requested is no
longer valid: don't ask for it again.  When it returns a 307 status
(temporary redirect), it means keep asking for the URL you asked for,
and I'll tell you where to go from there.  In the first case, Nutch
should remove the first URL from its database and put the redirection
target in its place.  In the second case, Nutch should leave the
original URL in its database, but also go to the redirection target.
I don't know if that's actually what Nutch does, but I assume so.
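
To see which status a particular URL actually returns, you can check the
headers directly (a sketch; the URL is a placeholder):

  curl -sI http://www.example.com/some-page | head -n 1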

On Tue, Oct 27, 2009 at 11:30 AM, caezar caeza...@gmail.com wrote:

 Hi All,

 I've done some googling, but found different answers, so I would appreciate
 if you tell me which is the correct one:
 - when page redirected, content of target page is fetched and associated
 with the source (initial) page URL
 - when page redirected, new entry with the redirect target url and contents
 added to the db

 If the second option is the correct one, then one more question. When I have
 a NutchDocument instance which represents the target URL, is it possible to
 retrieve its redirect source URL somehow?

 Thanks





-- 
http://www.linkedin.com/in/paultomblin


Re: Recrawling Nutch

2009-10-14 Thread Paul Tomblin
nutch doesn't do a good job of storing or testing the Last-Modified
time of pages it's crawled.  I made the following changes which seem
to help a lot:

snowbird:~/src/nutch/trunk svn diff
Index: src/java/org/apache/nutch/fetcher/Fetcher.java
===
--- src/java/org/apache/nutch/fetcher/Fetcher.java  (revision 817382)
+++ src/java/org/apache/nutch/fetcher/Fetcher.java  (working copy)
@@ -21,6 +21,7 @@
 import java.net.MalformedURLException;
 import java.net.URL;
 import java.net.UnknownHostException;
+import java.text.ParseException;
 import java.util.*;
 import java.util.Map.Entry;
 import java.util.concurrent.atomic.AtomicInteger;
@@ -42,6 +43,7 @@
 import org.apache.nutch.metadata.Metadata;
 import org.apache.nutch.metadata.Nutch;
 import org.apache.nutch.net.*;
+import org.apache.nutch.net.protocols.HttpDateFormat;
 import org.apache.nutch.protocol.*;
 import org.apache.nutch.parse.*;
 import org.apache.nutch.scoring.ScoringFilters;
@@ -742,6 +744,23 @@

   datum.setStatus(status);
   datum.setFetchTime(System.currentTimeMillis());
+  LOG.debug("metadata = " + (content != null ? content.getMetadata() : "content-null"));
+  LOG.debug("modified? = " + ((content != null && content.getMetadata() != null) ? content.getMetadata().get("Last-Modified") : "content-null"));
+  if (content != null && content.getMetadata() != null && content.getMetadata().get("Last-Modified") != null)
+  {
+  String lastModifiedStr = content.getMetadata().get("Last-Modified");
+
+  try
+  {
+  long lastModifiedDate = HttpDateFormat.toLong(lastModifiedStr);
+  LOG.debug("last modified = " + lastModifiedStr + " = " + lastModifiedDate);
+  datum.setModifiedTime(lastModifiedDate);
+  }
+  catch (ParseException e)
+  {
+  LOG.error("unable to parse " + lastModifiedStr, e);
+  }
+  }
   if (pstatus != null)
datum.getMetaData().put(Nutch.WRITABLE_PROTO_STATUS_KEY, pstatus);

   ParseResult parseResult = null;
Index: src/java/org/apache/nutch/indexer/IndexerMapReduce.java
===
--- src/java/org/apache/nutch/indexer/IndexerMapReduce.java (revision 
817382)
+++ src/java/org/apache/nutch/indexer/IndexerMapReduce.java (working copy)
@@ -84,8 +84,10 @@
 if (CrawlDatum.hasDbStatus(datum))
   dbDatum = datum;
 else if (CrawlDatum.hasFetchStatus(datum)) {
-  // don't index unmodified (empty) pages
-  if (datum.getStatus() != CrawlDatum.STATUS_FETCH_NOTMODIFIED)
+  /*
+   * Where did this person get the idea that unmodified pages
are empty?
+   // don't index unmodified (empty) pages
+  if (datum.getStatus() != CrawlDatum.STATUS_FETCH_NOTMODIFIED) */
 fetchDatum = datum;
 } else if (CrawlDatum.STATUS_LINKED == datum.getStatus() ||
CrawlDatum.STATUS_SIGNATURE == datum.getStatus()) {
@@ -108,7 +110,7 @@
 }

 if (!parseData.getStatus().isSuccess() ||
-fetchDatum.getStatus() != CrawlDatum.STATUS_FETCH_SUCCESS) {
+(fetchDatum.getStatus() != CrawlDatum.STATUS_FETCH_SUCCESS &&
fetchDatum.getStatus() != CrawlDatum.STATUS_FETCH_NOTMODIFIED)) {
   return;
 }

Index: 
src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java
===
--- 
src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java
  (revision
817382)
+++ 
src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java
  (working
copy)
@@ -124,11 +124,14 @@
 reqStr.append("\r\n");
   }

-  reqStr.append("\r\n");
   if (datum.getModifiedTime() > 0) {
-reqStr.append("If-Modified-Since: " +
HttpDateFormat.toString(datum.getModifiedTime()));
+   String httpDate =
+ HttpDateFormat.toString(datum.getModifiedTime());
+   Http.LOG.debug("modified time: " + httpDate);
+reqStr.append("If-Modified-Since: " + httpDate);
 reqStr.append("\r\n");
   }
+  reqStr.append("\r\n");

   byte[] reqBytes= reqStr.toString().getBytes();



On Wed, Oct 14, 2009 at 9:40 AM, sprabhu_PN
shreekanth.pra...@pinakilabs.com wrote:

 We are looking at picking up updates in a recrawl - How do I get the
 fetcher to read the recently built segment, get to the url and decide
 whether to get the content based on whether the url has been updated since?
 

 Shreekanth Prabhu





-- 
http://www.linkedin.com/in/paultomblin


Re: Incremental Whole Web Crawling

2009-10-06 Thread Paul Tomblin
Don't change options in nutch-default.xml - copy the option into
nutch-site.xml and change it there.  That way the change will
(hopefully) survive an upgrade.

On Tue, Oct 6, 2009 at 1:01 AM, Gaurang Patel gaurangtpa...@gmail.com wrote:
 Hey,

 Never mind. I got *generate.update.db* in *nutch-default.xml* and set it
 true.

 Regards,
 Gaurang

 2009/10/5 Gaurang Patel gaurangtpa...@gmail.com

 Hey Andrzej,

 Can you tell me where to set this property (generate.update.db)? I am
 trying to run similar kind of crawl scenario that Eric is running.

 -Gaurang

 2009/10/5 Andrzej Bialecki a...@getopt.org

 Eric wrote:

 Andrzej,

 Just to make sure I have this straight, set the generate.update.db
 property to true then

 bin/nutch generate crawl/crawldb crawl/segments -topN 10: 16 times?


 Yes. When this property is set to true, then each fetchlist will be
 different, because the records for those pages that are already on another
 fetchlist will be temporarily locked. Please note that this lock holds only
 for 1 week, so you need to fetch all segments within one week from
 generating them.

 You can fetch and updatedb in arbitrary order, so once you fetched some
 segments you can run the parsing and updatedb just from these segments,
 without waiting for all 16 segments to be processed.
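
 A minimal sketch of that pattern (paths and the -topN value are placeholders):

   # generate 16 fetchlists up front (relies on generate.update.db=true)
   for i in `seq 1 16`; do
     bin/nutch generate crawl/crawldb crawl/segments -topN 100000
   done
   # then fetch, parse and updatedb each segment, in any order
   # (skip the parse step if your fetcher is configured to parse)
   for seg in crawl/segments/*; do
     bin/nutch fetch $seg
     bin/nutch parse $seg
     bin/nutch updatedb crawl/crawldb $seg
   done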



 --
 Best regards,
 Andrzej Bialecki     
  ___. ___ ___ ___ _ _   __
 [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
 ___|||__||  \|  ||  |  Embedded Unix, System Integration
 http://www.sigram.com  Contact: info at sigram dot com







-- 
http://www.linkedin.com/in/paultomblin


Re: how to upgrade a java application with nutch?

2009-10-01 Thread Paul Tomblin
2009/10/1 Jaime Martín james...@gmail.com

 Hi!
 I've a java application that I would like to upgrade with nutch. What
 jars
 should I add to my application's lib to make it possible to use nutch
 features from some of my app pages and business logic classes?
 I've tried with nutch-1.0.jar generated by the war target, without success.
 I wonder what is the proper nutch build.xml target I should execute for
 this
 and which of the generated jars are to be included in my app. Maybe apart
 from nutch-1.0.jar all the nutch-1.0\lib jars are compulsory, or just a few of
 them?


Maybe I'm doing it wrong, but I used the nutch-1.0.job file instead of the
jar.

-- 
http://www.linkedin.com/in/paultomblin


Re: Something wrong with nutch.wiki

2009-10-01 Thread Paul Tomblin
2009/10/1 Kirby Bohling kirby.bohl...@gmail.com:
 2009/9/29 Ольга Пескова opesk...@mail.ru:
 Hello!

 Please check the url:
 http://wiki.apache.org/nutch/
 I can't find any content there.

 Just as a point of reference, I got the FrontPage to pull up just
 prior to sending this e-mail.  I'm not sure what is wrong with your
 connection to it, but I don't believe it is the server.

It was down for a number of hours today, but evidently it's back up now.



-- 
http://www.linkedin.com/in/paultomblin


Re: Why Nutch is not crawling all links from web page

2009-09-22 Thread Paul Tomblin
On Tue, Sep 22, 2009 at 4:17 AM, Pravin Karne pravin_ka...@persistent.co.in
 wrote:

 Hi,
 I am using nutch to crawl particular site. But I found that Nutch is not
 crawling all  links from every pages.
 Is there any tuning parameter  for nutch to crawl all links?


There are a number of reasons why it might not follow a link, and in my
opinion Nutch really needs a way to provide the information on why it's
doing or not doing what it does, without turning on DEBUG level logging.

One reason might be if the links go to a new site (or redirect internally -
if you follow links within www.prnewswire.com, they often do a silent
redirect to news.prnewswire.com) and you've got the property
db.ignore.external.links set to true.

Another reason might be the robots.txt file for the site you're crawling
forbids you from crawling those parts of the site.

Another reason might be that you're using one or more url filters, and the
url filters forbid the urls in question.

There are probably other reasons, but those are the ones that have bitten me
so far.
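
For the first two causes there are quick checks from the command line (a
sketch; the hostname is a placeholder):

  # does robots.txt disallow the paths you expect to be crawled?
  curl -s http://www.example.com/robots.txt | head -n 20

  # is external-link following switched off in your config?
  grep -A1 db.ignore.external.links conf/nutch-site.xml conf/nutch-default.xml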



-- 
http://www.linkedin.com/in/paultomblin


Where should I do this?

2009-09-22 Thread Paul Tomblin
I want to output to a file or database every url/filename that's crawled,
along with the status.  I figure I can do this with a plugin, but I'm not
sure where to slot it into the plugin hierarchy.  Any suggestions?

-- 
http://www.linkedin.com/in/paultomblin


Re: Difference between Deiselpoint and Nutch?

2009-09-18 Thread Paul Tomblin
On Fri, Sep 18, 2009 at 12:06 PM, David M. Cole d...@colegroup.com wrote:

 At 11:30 AM -0400 9/18/09, Paul Tomblin wrote:

  Is anybody here familiar with how Dieselpoint (DP) works?


 Dieselpoint is designed specifically for intranets and therefore doesn't
 take robots.txt into account because the Dieselpoint administrator and the
 web administrator (theoretically) work toward the same goals (see the thread
 from last Friday, "Ignoring Robots.txt", for an instance where that wasn't
 the case).

 Nutch is designed specifically for all-web crawling (like Google or Bing)
 and respects robots.txt because Nutch needs to be polite when indexing sites
 over which it has no control.

 Your client has a robots.txt file to control Google and/or Bing, so Nutch
 is respecting it the same way Google or Bing would.


I'm afraid I wasn't clear.  The site that the client is indexing with DP is
an external site, not hers.  Nutch is, I think, doing the right thing by not
crawling it, but I can't convince her of this because she's convinced that
DP is commercial and Nutch is only Open Source, so obviously DP is right.

The site in question does have several sitemaps.  Can Nutch do anything with
sitemaps?  (By the way, what does it mean when the robots.txt file lists
more than one sitemap?)

-- 
http://www.linkedin.com/in/paultomblin


What to do about sites with Disallow: * and a sitemap?

2009-09-17 Thread Paul Tomblin
Is there a way to make Nutch look at the pages of a site based on its
Sitemap?

-- 
http://www.linkedin.com/in/paultomblin


Changing the filter rules?

2009-09-14 Thread Paul Tomblin
If I change the filter rules, during a recrawl will URLs that are no
longer valid according to the new rules be removed from the segment
database?

-- 
http://www.linkedin.com/in/paultomblin


Re: taking a look into a nutch segment

2009-09-04 Thread Paul Tomblin
On Fri, Sep 4, 2009 at 4:29 PM, Lowell Kirsh low...@carbonfive.com wrote:
 I'd like to poke around in my nutch segment and see what data is there. I
 don't want to write any (or muhc) code. Are there any utilities out there
 that could help me with what I'm trying?

bin/nutch readseg
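
For example, to dump one segment in readable form (a sketch; the segment path
is a placeholder):

  bin/nutch readseg -dump crawl/segments/20090904123456 dump/segment_20090904123456

Running bin/nutch readseg with no arguments prints the other modes (-list and -get).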



-- 
http://www.linkedin.com/in/paultomblin


Re: Help me, No urls to fetch.

2009-09-02 Thread Paul Tomblin
On Wed, Sep 2, 2009 at 6:36 AM, zo tiger zo.ti...@hotmail.com wrote:


 At last I ran the bin/nutch crawl command but it gives a

 "No urls to fetch check your filter and seed list" error

 I am sure there is no problem in crawl-urlfilter and other configuration
 xml files

 Does anyone know of any possible problem?


What's in your url directory?
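
For example (a sketch; 'urls' is whatever seed directory you pass to the
crawl command):

  cat urls/*
  # every line should be a fully-qualified URL, e.g. http://www.example.com/,
  # and each one has to pass your crawl-urlfilter.txt patterns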


-- 
http://www.linkedin.com/in/paultomblin


Isn't this a bug?

2009-09-01 Thread Paul Tomblin
If I crawl a page with a url like:
http://localhost/Documents/pharma/DocSamples/?C=N;O=A
(which is what you get when you have a directory without an index.*,
and you've configured Options Indexes, and you click one of the
sorting options)
and it presents all the files in the directory as relative links like
foo.html, Nutch ends up trying to fetch the files with the second
part of that same parameter on the end, like foo.htmlO=A, which ends
up getting a 404.

Look at the parse data for http://localhost/Documents/pharma/DocSamples/?C=D;O=A
...
 [java]   outlink: toUrl:
http://localhost/Documents/pharma/DocSamples/15%20minutes.htm;O=A
anchor: 15 minutes.htm
 [java]   outlink: toUrl:
http://localhost/Documents/pharma/DocSamples/18whistle.html;O=A
anchor: 18whistle.html
 [java]   outlink: toUrl:
http://localhost/Documents/pharma/DocSamples/2010%20brings%20changes.doc;O=A
anchor: 2010 brings changes.doc
...

-- 
http://www.linkedin.com/in/paultomblin


Getting Can't be handled as Microsoft document - java.util.NoSuchElementException

2009-08-30 Thread Paul Tomblin
Is there something special I have to do to parse MS Word documents?
I've got parse-msword included as one of my plugins, but I'm getting
this error.

(This is Nutch-1.0)


-- 
http://www.linkedin.com/in/paultomblin


Nutch bug: can't handle urls with spaces in them

2009-08-25 Thread Paul Tomblin
In my browser, I can see a URL with spaces in it, but when I hover
over it, the browser has replaced the spaces with %20s, and when I
click on it I get the document.  However, when Nutch attempts to
follow the link, it doesn't do that, and so it gets a 404.  It should
do the same thing that web browsers do, or else I'm going to be facing
questions from my users about why certain documents aren't indexed
even though they can see them just fine.

If I do a view source, I can see the URLs with spaces in them:
<a href="http://localhost/Documents/pharma/DocSamples/Leg blood
clots.htm">Leg blood clots.htm</a><br />

But when I click on them, the URL got converted to:
http://localhost/Documents/pharma/DocSamples/Leg%20blood%20clots.htm


-- 
http://www.linkedin.com/in/paultomblin


Memory cost of extra threads?

2009-08-24 Thread Paul Tomblin
Has anybody quantified what the memory cost per extra fetch thread?
My fetches are taking way too long, and since I'm spending hours at a
time staring at

 [java]
 [java] [DEBUG] 22:40 (Fetcher.java:run:482)
 [java] FetcherThread spin-waiting ...
 [java]
 [java] [DEBUG] 22:40 (Fetcher.java:run:482)
 [java] FetcherThread spin-waiting ...

over and over again, I'm thinking maybe I should give it more to chew on.

-- 
http://www.linkedin.com/in/paultomblin


Re: Keywords?

2009-08-21 Thread Paul Tomblin
On Fri, Aug 21, 2009 at 4:20 AM, Julien
Nioche lists.digitalpeb...@gmail.com wrote:
 You'll need to write a custom parser implementing HtmlParseFilter and get it
 to store the keywords found in the Metadata, then write a custom Indexer.

 By default the HTML parser does not do anything about meta tags.

That's unfortunate, because org.apache.nutch.parse.html.HtmlParser
actually extracts all the meta tags, and then takes a few and throws
the rest away.  It's mildly annoying that I'm going to have to
re-implement all of HtmlParser just to add two lines to take that data
out of metaTags and put it in content.getMetaData().

-- 
http://www.linkedin.com/in/paultomblin


Keywords?

2009-08-20 Thread Paul Tomblin
Is there a way to extract the keywords from an html page?  I can't
find it in ParseData or CrawlDatum anywhere.

-- 
http://www.linkedin.com/in/paultomblin


Nutch.SIGNATURE_KEY

2009-08-19 Thread Paul Tomblin
Is SIGNATURE_KEY (aka nutch.content.digest) a valid way to check if my
page has changed since the last time I crawled it?  I patched Nutch to
properly handle modification dates, and then discovered that my web
site doesn't send Modification-Date because it uses shmtl
(Server-parsed HTML).  I assume it's some sort of cryptographic hash
of the entire page?

Another question: is Nutch smart enough to use that signature to
determine that, say, http://xcski.com/ and http://xcski.com/index.html
are the same page?


-- 
http://www.linkedin.com/in/paultomblin


Re: Nutch.SIGNATURE_KEY

2009-08-19 Thread Paul Tomblin
On Wed, Aug 19, 2009 at 1:00 PM, Ken Krugler kkrugler_li...@transpac.com wrote:
 Another question: is Nutch smart enough to use that signature to
 determine that, say, http://xcski.com/ and http://xcski.com/index.html
 are the same page?

 I believe the hashes would be the same for either raw MD5 or text signature,
 yes. So on the search side these would get collapsed. Don't know about what
 else you mean as far as same page - e.g. one entry in the CrawlDB? If so,
 then somebody else with more up-to-date knowledge of Nutch would need to
 chime in here. Older versions of Nutch would still have these as separate
 entries, FWIR.

Actually, I just checked some of my own pages, and http://xcski.com/
and http://xcski.com/index.html have different signatures, in spite of
them being the same page.  So I guess the answer to that is no, even
if there were logic to make them the same page in CrawlDB, it wouldn't
work.


-- 
http://www.linkedin.com/in/paultomblin


Which versions?

2009-08-16 Thread Paul Tomblin
Which versions of Lucene, Nutch and Solr work together?  I've
discovered that the Nutch trunk and the Solr trunk use wildly
different versions of the Lucene jars, and it's causing me problems.

-- 
http://www.linkedin.com/in/paultomblin


Re: How do I get all the documents in the index without searching?

2009-08-12 Thread Paul Tomblin
On Tue, Aug 11, 2009 at 2:10 PM, Paul Tomblin ptomb...@xcski.com wrote:
 I want to iterate through all the documents that are in the crawl,
 programattically.  The only code I can find does searches.  I don't
 want to search for a term, I want everything.  Is there a way to do
 this?

To answer my own question, what I ended up doing was
IndexReader reader = IndexReader.open(indexDir.getAbsolutePath());
for (int i = 0; i < reader.numDocs(); i++)
{
    Document doc = reader.document(i);
}

Now that I have the Document, I have to figure out how to process it
further to get the actual contents, but I assume that I need to go
back to the segment for that.
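
In the meantime, the stored content for a single URL can be pulled back out of
a segment from the command line (a sketch; the segment path and URL are
placeholders):

  bin/nutch readseg -get crawl/segments/20090812123456 http://www.example.com/somepage.html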



-- 
http://www.linkedin.com/in/paultomblin


How do I get all the documents in the index without searching?

2009-08-11 Thread Paul Tomblin
I want to iterate through all the documents that are in the crawl,
programattically.  The only code I can find does searches.  I don't
want to search for a term, I want everything.  Is there a way to do
this?

-- 
http://www.linkedin.com/in/paultomblin


Why isn't fetcher sending the last fetch time when it does a GET?

2009-08-08 Thread Paul Tomblin
I'm watching my server logs as I do a second crawl of the site I
crawled yesterday, and it's getting HTTP response code 200 on every
page.  Since none of those pages have changed, ideally the fetcher
should send the last retrieval time in the HTTP header, and the server
would then respond with a 304 code, so it wouldn't have to reparse the
same page.  Wouldn't this be a major win in terms of bandwidth
consumed?  Certainly GoogleBot does it that way.
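
The conditional request is easy to simulate by hand; a server that supports it
answers 304 Not Modified instead of 200 (a sketch; the URL and date are
placeholders):

  curl -sI -H 'If-Modified-Since: Fri, 07 Aug 2009 00:00:00 GMT' http://www.example.com/somepage.html | head -n 1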

I'm doing the crawl using a slightly modified version of the script on the Wiki
http://wiki.apache.org/nutch/Crawl


-- 
http://www.linkedin.com/in/paultomblin


Re: Print out a list of every URL fetched?

2009-08-07 Thread Paul Tomblin
Not  quite what I want - that will show me every url that's ever been
crawled, not just the ones fetched this time, nor is it real-time.


On Fri, Aug 7, 2009 at 3:23 AM, Sebastian
Nagel sebastian.na...@exorbyte.com wrote:
 Hi Paul,

 you can use

  $NUTCH_HOME/bin/nutch readdb my_crawl/crawldb/ -dump dump_crawldb/ -format csv

 then in dump_crawldb you'll find a CSV file with all URLs in your crawlDb.
 One column indicates the status. Select only those records with db_fetched
 and you'll have your list.

 Sebastian




-- 
http://www.linkedin.com/in/paultomblin


Why did it think /style was part of the URL?

2009-08-07 Thread Paul Tomblin
I am crawling my own site, which includes an ancient MovableType
installation.  When it gets to http://xcski.com/movabletype/mt.cgi, it
produces an invalid outlink (seen by an exception in the crawl, and
in the following readseg dump):
Outlinks: 8
  outlink: toUrl: http://xcski.com/movabletype/text/css anchor:
  outlink: toUrl: http://xcski.com/movabletype//style anchor:
  outlink: toUrl: http://xcski.com/movabletype/mt.cgi?__mode=start_recover ancho
r:
  outlink: toUrl: http://xcski.com/movabletype/styles.css anchor:
  outlink: toUrl: http://xcski.com/movabletype/images/mt-logo.gif anchor:
  outlink: toUrl: http://xcski.com/movabletype/images/spacer.gif anchor:
  outlink: toUrl: http://xcski.com/movabletype/images/spacer.gif anchor:
  outlink: toUrl: http://xcski.com/movabletype/mt.cgi# anchor: Forgot your passw
ord?

Looking through the text returned by just doing a wget on that URL, I
don't see any href that's anywhere near a /style, so I can't figure
out why it's doing that.


-- 
http://www.linkedin.com/in/paultomblin


Print out a list of every URL fetched?

2009-08-06 Thread Paul Tomblin
If I want to print out a list of every URL as it's fetched, or better
yet write that list to a file, is there a good plugin to implement?
I'm guessing URLFilter isn't the best because it might see urls that
don't actually get fetched as well as ones that return 304, 4xx or 5xx
response codes.  Ideally, it should only print ones that are being
re-indexed.

-- 
http://www.linkedin.com/in/paultomblin


Re: Added plugins not visible

2009-08-05 Thread Paul Tomblin
On Wed, Aug 5, 2009 at 2:51 AM, Saurabh Suman saurabhsuman...@rediff.com wrote:

 Hi
  I have created a plugin for the indexing filter.
 I put it in [nutch_folder]\src\plugin\ then I built it with the ant command.
 Build was successful and it created a jar. I also added that jar to the
 classpath. In nutch-default.xml,
 in the value of the property <name>plugin.includes</name> I added that plugin
 too, like
 <value>...parse-(text|html|js)|index-(basic|anchor|germinait)|...</value>
  My plugin is index-germinait.

 But when I run the crawl, it is not detecting index-germinait.
  Where am I wrong? Which step am I missing?

Is the plugin.xml also in the classpath?

-- 
http://www.linkedin.com/in/paultomblin


Re: Added plugins not visible

2009-08-05 Thread Paul Tomblin
On Wed, Aug 5, 2009 at 8:08 AM, Saurabh Suman saurabhsuman...@rediff.com wrote:

 No, plugin.xml is not in the classpath.

I think it needs to be.  Or at least, it needs to be in either
build/plugins/plugin-name or in plugins/plugin-name.
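
For example, the deployed layout Nutch looks for is roughly this (a sketch;
index-germinait is the plugin name from the earlier message):

  ls plugins/index-germinait/
  # expected: plugin.xml  index-germinait.jar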

-- 
http://www.linkedin.com/in/paultomblin


Re: Nutch in C++

2009-08-04 Thread Paul Tomblin
On Tue, Aug 4, 2009 at 1:35 PM, reinhard schwab reinhard.sch...@aon.at wrote:
 And why?  I guess you may see some performance improvement, but it would be
 a LOT cheaper to throw hardware at the problem (and you may not see much if
 any).

 performance improvement?
 can you prove that C++ will be faster?

Considering that Nutch is mostly network IO bound, rewriting it in a
different language isn't going to make the Internet serve up your
pages faster.

-- 
http://www.linkedin.com/in/paultomblin


Re: Plugin development

2009-07-31 Thread Paul Tomblin
That assumes that you're going to be putting the plugin in the Nutch
source tree.  I'm looking for guidance on what to do differently if
you don't put it in the nutch source tree.


On Fri, Jul 31, 2009 at 12:48 AM, Alexander
Aristov alexander.aris...@gmail.com wrote:
 This is a simple HowTo
 http://wiki.apache.org/nutch/WritingPluginExample-0.9


 Best Regards
 Alexander Aristov


 2009/7/31 Paul Tomblin ptomb...@xcski.com

 How do I develop a plugin that isn't in the nutch source tree?  I want
 to keep all my project's source code together, and not put the
 project specific plugin in with the nutch code.  Do I just have my
 plugin's build.xml include $NUTCH_HOME/src/home/build-plugin.xml?
 (I'm a little shakey on ant syntax, I'm used to make.)  Other than
 that, and making sure my plugin's jar file ends up in nutch's
 CLASSPATH, is there anything special I need to know?

 Should I be asking this on the developer list?

 --
 http://www.linkedin.com/in/paultomblin





-- 
http://www.linkedin.com/in/paultomblin


Re: Plugin development

2009-07-31 Thread Paul Tomblin
On Fri, Jul 31, 2009 at 4:33 AM, Alexander
Aristov alexander.aris...@gmail.com wrote:
 What do you mean under putting it in the nutch source tree.


I mean those instructions you linked to (which I had already seen)
only show you how to compile your plugin if you're willing to put it
in $NUTCH_HOME/src/nutch/plugin, which I am not.  I want to be able to
compile it in my own source tree.  I don't need to put my servlet code
in the Tomcat source code tree to compile it, and I don't need to put
my Swing code in com/javax, so I shouldn't need the source code tree
of Nutch just to compile a plugin.


-- 
http://www.linkedin.com/in/paultomblin


Nutch and Solr

2009-07-30 Thread Paul Tomblin
I'm trying to follow the example in the Wiki, but it's corrupt.  It
has a bunch of garbage in the part you're supposed to paste into
solrconfig.xml - I don't know if something got interpreted as wiki
markup when it shouldn't, or what, but I doubt superscripts are a
normal part of the configuration.

Can somebody please tell me what I'm supposed to do there?

-- 
http://www.linkedin.com/in/paultomblin


Re: how to exclude some external links

2009-07-30 Thread Paul Tomblin
On Thu, Jul 30, 2009 at 9:15 PM, alx...@aim.com wrote:

 I would like to know how I can modify nutch code to exclude external links
 with certain extensions. For example, if I have mydomain.com in urls and
 mydomain.com has a lot of links like mydomain.com/mylink.shtml, then I want
 nutch not to fetch (crawl) these kinds of urls at all.

Can't you do this with the existing RegexURLFilter plugin?  Make sure
urlfilter-regex is listed in plugin.includes, and that the
property urlfilter.regex.file is set to a file (probably
regex-urlfilter.txt).  Then you can list the extensions you want to
skip in that file.
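
For example (a sketch; the .shtml pattern is only an illustration, and it has
to appear before the final catch-all '+.' rule in the filter file):

  # find which filter file is in effect
  grep -A1 urlfilter.regex.file conf/nutch-site.xml conf/nutch-default.xml
  # then add a line like the following near the top of that file:
  #   -\.shtml$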

-- 
http://www.linkedin.com/in/paultomblin


Plugin development

2009-07-30 Thread Paul Tomblin
How do I develop a plugin that isn't in the nutch source tree?  I want
to keep all my project's source code together, and not put the
project specific plugin in with the nutch code.  Do I just have my
plugin's build.xml include $NUTCH_HOME/src/home/build-plugin.xml?
(I'm a little shakey on ant syntax, I'm used to make.)  Other than
that, and making sure my plugin's jar file ends up in nutch's
CLASSPATH, is there anything special I need to know?

Should I be asking this on the developer list?

-- 
http://www.linkedin.com/in/paultomblin


Include/exclude lists

2009-07-29 Thread Paul Tomblin
Is there any way other than the config files to specify the url filter
parameters?  I have a few dozen sites to crawl, and for each site I
want to specify its own includes and excludes.  I don't want to have
to go into the config file and change the
property <name>urlfilter.regex.file</name> each time.  Can I specify
that on the command line to bin/nutch generate or something?

-- 
http://www.linkedin.com/in/paultomblin


Dumping what I have?

2009-07-28 Thread Paul Tomblin
The nutch data files are pretty opaque, and even strings can't extract
anything except the occasional URL.  Is there any code to dump the contents
of the various files in a human readable form?

-- 
http://www.linkedin.com/in/paultomblin


Re: Dumping what I have?

2009-07-28 Thread Paul Tomblin
Awesome!  Thanks.

On Tue, Jul 28, 2009 at 12:26 PM, reinhard schwab reinhard.sch...@aon.at wrote:

 yes, there are tools which you can use to dump the content of crawl db,
 link db and segments.

 dump=./crawl/dump
 bin/nutch readdb $crawl/crawldb -dump $dump/crawldb
 bin/nutch readlinkdb $crawl/linkdb -dump $dump/linkdb
 bin/nutch readseg -dump $1 $dump/segments/$1

 you will get more info if you call

 bin/nutch readdb
 bin/nutch readlinkdb
 bin/nutch readseg

 Paul Tomblin schrieb:
  The nutch data files are pretty opaque, and even strings can't extract
  anything except the occasional URL.  Is there any code to dump the
 contents
  of the various files in a human readable form?
 
 




-- 
http://www.linkedin.com/in/paultomblin


Re: How to index other fields in solr

2009-07-27 Thread Paul Tomblin
Wouldn't that be using facets, as per
http://wiki.apache.org/solr/SimpleFacetParameters


On Mon, Jul 27, 2009 at 2:34 AM, Saurabh Suman
saurabhsuman...@rediff.com wrote:


 I am using solr for searching. I used the class SolrIndexer. But I can search
 on content only. I want to search on author also. How do I index on author?




-- 
http://www.linkedin.com/in/paultomblin


Re: Why did my crawl fail?

2009-07-27 Thread Paul Tomblin
Actually, I got that error the first time I used it, and then again when I
blew away the downloaded nutch and grabbed the latest trunk from Subversion.

On Mon, Jul 27, 2009 at 1:11 AM, xiao yang yangxiao9...@gmail.com wrote:

 You must have crawled for several times, and some of them failed
 before the parse phase. So the parse data was not generated.
 You'd better delete the whole directory
 file:/Users/ptomblin/nutch-1.0/crawl.blog, and recrawl it, then you
 will know the exact reason why it failed in the parse phase from the
 output information.

 Xiao

On Fri, Jul 24, 2009 at 10:53 PM, Paul Tomblin ptomb...@xcski.com wrote:
  I installed nutch 1.0 on my laptop last night and set it running to crawl
 my
  blog with the command:  bin/nutch crawl urls -dir crawl.blog -depth 10
  it was still running strong when I went to bed several hours later, and
 this
  morning I woke up to this:
 
  activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
  -activeThreads=0
  Fetcher: done
  CrawlDb update: starting
  CrawlDb update: db: crawl.blog/crawldb
  CrawlDb update: segments: [crawl.blog/segments/20090724010303]
  CrawlDb update: additions allowed: true
  CrawlDb update: URL normalizing: true
  CrawlDb update: URL filtering: true
  CrawlDb update: Merging segment data into db.
  CrawlDb update: done
  LinkDb: starting
  LinkDb: linkdb: crawl.blog/linkdb
  LinkDb: URL normalize: true
  LinkDb: URL filter: true
  LinkDb: adding segment:
  file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723154530
  LinkDb: adding segment:
  file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155106
  LinkDb: adding segment:
  file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155122
  LinkDb: adding segment:
  file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155303
  LinkDb: adding segment:
  file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155812
  LinkDb: adding segment:
  file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723161808
  LinkDb: adding segment:
  file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723171215
  LinkDb: adding segment:
  file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723193543
  LinkDb: adding segment:
  file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723224936
  LinkDb: adding segment:
  file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090724004250
  LinkDb: adding segment:
  file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090724010303
  Exception in thread main
 org.apache.hadoop.mapred.InvalidInputException:
  Input path does not exist:
 
 file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723154530/parse_data
  at
 
 org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:179)
  at
 
 org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:39)
  at
 
 org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:190)
  at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:797)
  at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1142)
  at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:170)
  at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:147)
  at org.apache.nutch.crawl.Crawl.main(Crawl.java:129)
 
 
  --
  http://www.linkedin.com/in/paultomblin
 




-- 
http://www.linkedin.com/in/paultomblin


Re: Why did my crawl fail?

2009-07-27 Thread Paul Tomblin
Unfortunately I blew away those particular logs when I fetched the svn
trunk.  I just tried it again (well, I started it again at noon and it just
finished) and this time it worked fine, so it seems kind of heisenbug-like.
 Maybe it has something to do with which pages are of types it can't handle?

On Mon, Jul 27, 2009 at 11:27 AM, xiao yang yangxiao9...@gmail.com wrote:

 Hi, Paul

 Can you post the error messages in the log file
 (file:/Users/ptomblin/nutch-1.0/logs)?

On Mon, Jul 27, 2009 at 6:55 PM, Paul Tomblin ptomb...@xcski.com wrote:
  Actually, I got that error the first time I used it, and then again when
 I
  blew away the downloaded nutch and grabbed the latest trunk from
 Subversion.
 
  On Mon, Jul 27, 2009 at 1:11 AM, xiao yang yangxiao9...@gmail.com
 wrote:
 
  You must have crawled for several times, and some of them failed
  before the parse phase. So the parse data was not generated.
  You'd better delete the whole directory
  file:/Users/ptomblin/nutch-1.0/crawl.blog, and recrawl it, then you
  will know the exact reason why it failed in the parse phase from the
  output information.
 
  Xiao
 
On Fri, Jul 24, 2009 at 10:53 PM, Paul Tomblin ptomb...@xcski.com
 wrote:
   I installed nutch 1.0 on my laptop last night and set it running to
 crawl
  my
   blog with the command:  bin/nutch crawl urls -dir crawl.blog -depth 10
   it was still running strong when I went to bed several hours later,
 and
  this
   morning I woke up to this:
  
   activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
   -activeThreads=0
   Fetcher: done
   CrawlDb update: starting
   CrawlDb update: db: crawl.blog/crawldb
   CrawlDb update: segments: [crawl.blog/segments/20090724010303]
   CrawlDb update: additions allowed: true
   CrawlDb update: URL normalizing: true
   CrawlDb update: URL filtering: true
   CrawlDb update: Merging segment data into db.
   CrawlDb update: done
   LinkDb: starting
   LinkDb: linkdb: crawl.blog/linkdb
   LinkDb: URL normalize: true
   LinkDb: URL filter: true
   LinkDb: adding segment:
   file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723154530
   LinkDb: adding segment:
   file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155106
   LinkDb: adding segment:
   file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155122
   LinkDb: adding segment:
   file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155303
   LinkDb: adding segment:
   file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155812
   LinkDb: adding segment:
   file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723161808
   LinkDb: adding segment:
   file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723171215
   LinkDb: adding segment:
   file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723193543
   LinkDb: adding segment:
   file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723224936
   LinkDb: adding segment:
   file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090724004250
   LinkDb: adding segment:
   file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090724010303
   Exception in thread main
  org.apache.hadoop.mapred.InvalidInputException:
   Input path does not exist:
  
 
 file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723154530/parse_data
   at
  
 
 org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:179)
   at
  
 
 org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:39)
   at
  
 
 org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:190)
   at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:797)
   at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1142)
   at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:170)
   at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:147)
   at org.apache.nutch.crawl.Crawl.main(Crawl.java:129)
  
  
   --
   http://www.linkedin.com/in/paultomblin
  
 
 
 
 
  --
  http://www.linkedin.com/in/paultomblin
 




-- 
http://www.linkedin.com/in/paultomblin


Re: Why did my crawl fail?

2009-07-26 Thread Paul Tomblin
No, it fetched thousands of pages - my blog and picture gallery.  It just
never finished indexing them because as well as looking at the 11 segments
that exist, it's also trying to look at a segment that doesn't.

On Sun, Jul 26, 2009 at 9:06 PM, arkadi.kosmy...@csiro.au wrote:

 This is a very interesting issue. I guess that absence of parse_data means
 that no content has been fetched. Am I wrong?

 This happened in my crawls a few times. Theoretically (I am guessing again)
 this may happen if all urls selected for fetching on this iteration are
 either blocked by the filters, or failed to be fetched, for whatever reason.

 I got around this problem by checking for presence of parse_data, and if it
 is absent, deleting the segment. This seems to be working, but I am not 100%
 sure that this is a good thing to do. Can I do this? Is it safe to do? Would
 appreciate if someone with expert knowledge commented on this issue.

 Regards,

 Arkadi


  -Original Message-
  From: ptomb...@gmail.com [mailto:ptomb...@gmail.com] On Behalf Of Paul
  Tomblin
  Sent: Saturday, July 25, 2009 12:54 AM
  To: nutch-user
  Subject: Why did my crawl fail?
 
  I installed nutch 1.0 on my laptop last night and set it running to crawl
  my
  blog with the command:  bin/nutch crawl urls -dir crawl.blog -depth 10
  it was still running strong when I went to bed several hours later, and
  this
  morning I woke up to this:
 
  activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
  -activeThreads=0
  Fetcher: done
  CrawlDb update: starting
  CrawlDb update: db: crawl.blog/crawldb
  CrawlDb update: segments: [crawl.blog/segments/20090724010303]
  CrawlDb update: additions allowed: true
  CrawlDb update: URL normalizing: true
  CrawlDb update: URL filtering: true
  CrawlDb update: Merging segment data into db.
  CrawlDb update: done
  LinkDb: starting
  LinkDb: linkdb: crawl.blog/linkdb
  LinkDb: URL normalize: true
  LinkDb: URL filter: true
  LinkDb: adding segment:
  file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723154530
  LinkDb: adding segment:
  file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155106
  LinkDb: adding segment:
  file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155122
  LinkDb: adding segment:
  file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155303
  LinkDb: adding segment:
  file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155812
  LinkDb: adding segment:
  file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723161808
  LinkDb: adding segment:
  file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723171215
  LinkDb: adding segment:
  file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723193543
  LinkDb: adding segment:
  file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723224936
  LinkDb: adding segment:
  file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090724004250
  LinkDb: adding segment:
  file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090724010303
  Exception in thread main
 org.apache.hadoop.mapred.InvalidInputException:
  Input path does not exist:
  file:/Users/ptomblin/nutch-
  1.0/crawl.blog/segments/20090723154530/parse_data
  at
 
 org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:1
  79)
  at
 
 org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileIn
  putFormat.java:39)
  at
 
 org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:19
  0)
  at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:797)
  at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1142)
  at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:170)
  at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:147)
  at org.apache.nutch.crawl.Crawl.main(Crawl.java:129)
 
 
  --
  http://www.linkedin.com/in/paultomblin




-- 
http://www.linkedin.com/in/paultomblin


Can I chunk during the crawl?

2009-07-24 Thread Paul Tomblin
Forgive me if this is a bit of a n00b question.
I've been tasked with taking some other person's code and replacing all the
DieselPoint code with Lucene/Nutch.  What they do in DieselPoint is crawl
specific parts of the web, then perform some proprietary splitting up of the
returned pages into chunks, and then the chunks themselves are
indexed.  Actually, I think they do it in a kind of a naive way,
because it appears that DieselPoint crawls and indexes, and then this
code goes through the index and creates
chunk files, possibly several from any given initial page, and then
DieselPoint is set loose to crawl and index those chunk files.  Then the app
uses *that* index in proprietary searches.
I'm trying to learn my way around Nutch, and I'm wondering if there might be
a way to get rid of the chunking stage by doing it directly in the initial
crawl, possibly by writing a plugin.  Unfortunately I'm under NDA so I can't
give away too much of what the chunking process does, but I hope I've given
enough information on what I'm trying to do.  Is what I'm doing possible?

-- 
http://www.linkedin.com/in/paultomblin


Why did my crawl fail?

2009-07-24 Thread Paul Tomblin
I installed nutch 1.0 on my laptop last night and set it running to crawl my
blog with the command:  bin/nutch crawl urls -dir crawl.blog -depth 10
it was still running strong when I went to bed several hours later, and this
morning I woke up to this:

activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawl.blog/crawldb
CrawlDb update: segments: [crawl.blog/segments/20090724010303]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
LinkDb: starting
LinkDb: linkdb: crawl.blog/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment:
file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723154530
LinkDb: adding segment:
file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155106
LinkDb: adding segment:
file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155122
LinkDb: adding segment:
file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155303
LinkDb: adding segment:
file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155812
LinkDb: adding segment:
file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723161808
LinkDb: adding segment:
file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723171215
LinkDb: adding segment:
file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723193543
LinkDb: adding segment:
file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723224936
LinkDb: adding segment:
file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090724004250
LinkDb: adding segment:
file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090724010303
Exception in thread main org.apache.hadoop.mapred.InvalidInputException:
Input path does not exist:
file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723154530/parse_data
at
org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:179)
at
org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:39)
at
org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:190)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:797)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1142)
at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:170)
at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:147)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:129)


-- 
http://www.linkedin.com/in/paultomblin