Re: Questions about upgrade to Nutch 1.3
Thanks for replying! I do still have a couple of questions:

Markus Jelsma <markus.jel...@openindex.io> wrote on 6/20/2011 11:34 AM:
> On Monday 20 June 2011 16:44:13 Chip Calhoun wrote:
> > Hi everyone, I'm a complete Nutch newbie. I installed Nutch 1.2 and Solr
> > 1.4.0 on my machine without any trouble. I've decided to try Nutch 1.3
> > as it's compatible with Solr 3.1.0, which includes Solritas. I hope you
> > can help with some problems I'm having.
>
> Solr 1.4.x has Velocity as a contrib.

Does it? Under 1.4.0 I could never get http://localhost:8983/solr/browse to work. I thought this was only added later.

> > I get an error saying solrurl is not set. This seems to be new to Nutch
> > 1.3. Where do I set this?
>
> According to the source you're using the crawl command.
> Usage: Crawl <urlDir> -solr <solrURL> [-dir d] [-threads n] [-depth i] [-topN N]

Thanks, I hadn't known about the solrURL argument at all. So would a valid usage be:

bin/nutch crawl urls -solr http://127.0.0.1:8983 -dir solrcrawl -depth 10 -topN 50

With the new solrURL argument, are there any steps I need to do after my crawl to get my content into Solr?

Thanks!
Re: Questions about upgrade to Nutch 1.3
Ahh, thanks again. Based on your advice, I'm going back to Nutch 1.2 / Solr 1.4 and adding the Velocity contrib. Once I get that working, I'll try Nutch 1.3 again.

When I try to use Velocity now, I get this message:

java.lang.RuntimeException: Can't find resource 'velocity.properties' in classpath or 'solr/conf/', cwd=C:\apache\apache-solr-1.4.0\example

This is despite velocity.properties very definitely being in my C:\apache\apache-solr-1.4.0\example\solr\conf directory. But I've veered completely into Solr territory now, so I guess that's off-topic.

Markus Jelsma <markus.jel...@openindex.io> wrote on 6/20/2011 12:43 PM:
> On Monday 20 June 2011 18:35:36 Chip Calhoun wrote:
> > Does it? Under 1.4.0 I could never get http://localhost:8983/solr/browse
> > to work. I thought this was only added later.
>
> Libs must be added manually from contrib, but it is shipped.
>
> > Thanks, I hadn't known about the solrURL argument at all. So would a
> > valid usage be:
> > bin/nutch crawl urls -solr http://127.0.0.1:8983 -dir solrcrawl -depth 10 -topN 50
> > With the new solrURL argument, are there any steps I need to do after my
> > crawl to get my content into Solr?
>
> I think so, but I don't use it. Please try.
>
> --
> Markus Jelsma - CTO - Openindex
> http://www.linkedin.com/in/markus17
> 050-8536620 / 06-50258350
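For later readers: with the 1.3 crawl command the -solr argument pushes documents to Solr as part of the crawl, so no separate step should be needed, but the Solr URL normally includes the /solr context path. A minimal sketch, assuming the stock Solr example server; run bin/nutch solrindex with no arguments to confirm the exact usage string in your release before trusting the second command:

bin/nutch crawl urls -solr http://127.0.0.1:8983/solr/ -dir solrcrawl -depth 10 -topN 50

# to (re)index an existing crawl afterwards
bin/nutch solrindex http://127.0.0.1:8983/solr/ solrcrawl/crawldb solrcrawl/linkdb solrcrawl/segments/*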
Deploying the web application in Nutch 1.2
I'm a newbie trying to set up a Nutch 1.2 web app, because it seems a bit better suited to my smallish site than the Nutch 1.3 / Solr connection. I'm going through the tutorial at http://wiki.apache.org/nutch/Nutch_-_The_Java_Search_Engine , and I've hit the following instruction:

Deploy the Nutch web application as the ROOT context

I'm not sure what I'm meant to do here. I get the idea that I'm supposed to replace the current contents of $CATALINA_HOME/webapps/ROOT/ with something from my Nutch directory, but I don't know which part of my Nutch directory I'm supposed to move. Can someone please explain what I need to move?

Thanks,
Chip
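For anyone hitting the same step: the piece to move is the nutch-1.2.war that ships in the release directory, and deploying it "as the ROOT context" amounts to a rename. A minimal sketch, assuming a standard Tomcat layout:

rm -rf $CATALINA_HOME/webapps/ROOT
cp nutch-1.2.war $CATALINA_HOME/webapps/ROOT.war
# Tomcat expands ROOT.war into webapps/ROOT/ on startup, so the search
# page is then served at http://localhost:8080/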
RE: Deploying the web application in Nutch 1.2
You've gotten me very close to a breakthrough. I've started over, and I've found that if I don't make any edits to nutch-site.xml, I get a working Nutch web app; I have no index and all of my searches fail, but I have Nutch. When I add my crawl location to nutch-site.xml and restart Tomcat, that's when I start getting the 404 with the "The requested resource () is not available" message. Clearly I'm doing something wrong when I edit nutch-site.xml. I'm going to paste the entire contents of my nutch-site.xml. Where am I screwing this up? Thanks for your help on this.

<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>nutch-solr-integration</value>
  </property>
  <property>
    <name>generate.max.per.host</name>
    <value>100</value>
  </property>
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
  <property>
    <name>searcher.dir</name>
    <value>C:/Apache/apache-nutch-1.2/crawl</value>
  </property>
</configuration>

-----Original Message-----
From: lewis john mcgibbney [mailto:lewis.mcgibb...@gmail.com]
Sent: Thursday, July 14, 2011 5:38 PM
To: user@nutch.apache.org
Subject: Re: Deploying the web application in Nutch 1.2

On Thu, Jul 14, 2011 at 8:01 PM, Chip Calhoun <ccalh...@aip.org> wrote:
> Thanks Lewis. I'm still having trouble. I've moved the war file to
> $CATALINA_HOME/webapps/nutch/ and unpacked it. I don't seem to have a
> catalina.sh file, so I've skipped that step.

From memory the catalina.sh file is used to start your Tomcat server instance... this has nothing to do with Nutch. Regardless of what kind of WAR files you have in your Tomcat webapps directory, starting your Tomcat server from the command line should be the same...

> And I've added the following to
> C:\Apache\Tomcat-5.5\webapps\nutch\WEB-INF\classes\nutch-site.xml :

As far as I can remember nutch-site.xml is already there; however, you need to specify various property values after it has been uploaded the first time. After rebooting Tomcat all of your property settings will be active.

> <property>
>   <name>searcher.dir</name>
>   <value>C:\Apache\apache-nutch-1.2\crawl</value>
>   <!-- There must be a crawl/index directory to run off -->
> </property>

Looks fine; however, please remove the <!-- ... --> comment as it is not required.

> However, when I go to http://localhost:8080/nutch/ I always get a 404
> with the message, "The requested resource () is not available." What am
> I missing?

As I said, the name of the WAR file needs to be identical to the webapp you specify in the Tomcat URL... can you confirm this? There should really be no problem starting up the Nutch web app if you follow the tutorial carefully.

-----Original Message-----
From: lewis john mcgibbney [mailto:lewis.mcgibb...@gmail.com]
Sent: Thursday, July 14, 2011 5:40 AM
To: user@nutch.apache.org
Subject: Re: Deploying the web application in Nutch 1.2

Hi Chip,

Please see this tutorial for 1.2 administration [1]; many people have been using it recently and as far as I'm aware it is working perfectly. Please post back if you have any troubles.

[1] http://wiki.apache.org/nutch/NutchTutorial

--
Lewis
RE: Deploying the web application in Nutch 1.2
I'm definitely changing the file in my webapp. I can tell I'm doing that much right because it makes a noticeable change to the function of my web app; unfortunately, the change is that it seems to break everything. I've tried playing with the actual value for this, but with no success. In the tutorial's example, <value>/somewhere/crawl</value>, what is that relative to? Where would that hypothetical /somewhere/ directory be, relative to $CATALINA_HOME/webapps/? It feels like this is my problem, because I can't think of anything else it could be.

-----Original Message-----
From: lewis john mcgibbney [mailto:lewis.mcgibb...@gmail.com]
Sent: Friday, July 15, 2011 3:19 PM
To: user@nutch.apache.org
Subject: Re: Deploying the web application in Nutch 1.2

Are you adding this to nutch-site.xml within your webapp, or just in your root Nutch installation? This needs to be included in your webapp version of nutch-site.xml. In my experience this was a small case of confusion at first.
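For what it's worth, searcher.dir expects a filesystem path to the directory holding the crawldb, linkdb, segments, and index that the crawl produced; it is not resolved relative to $CATALINA_HOME/webapps/, so an absolute path is the safe choice. A sketch, assuming the crawl lives under the Nutch install:

<property>
  <name>searcher.dir</name>
  <!-- absolute path; forward slashes also work on Windows -->
  <value>C:/Apache/apache-nutch-1.2/crawl</value>
</property>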
RE: Deploying the web application in Nutch 1.2
Success! I'm posting this not because I need further help, but in case someone with a similar issue finds this in the list archives.

First: I now know that if I make no changes to nutch-site.xml, Nutch will expect my crawl directory to be C:\Apache\Tomcat-5.5\crawl . So now I know that much.

Second: for some reason, when I add the searcher.dir property to nutch-site.xml it causes a "SEVERE: Error listenerStart" issue. The obvious solution for me is to just stop editing nutch-site.xml and live with /crawl/ being in my main Tomcat folder. Whatever's causing this listenerStart issue when I play with this on my own machine may very well not come up when I put this on the production server, so I'm not going to waste any time on it.

-----Original Message-----
From: lewis john mcgibbney [mailto:lewis.mcgibb...@gmail.com]
Sent: Friday, July 15, 2011 3:32 PM
To: user@nutch.apache.org
Subject: Re: Deploying the web application in Nutch 1.2

As a resource it would be wise to have a look at the list archives for an exact answer to this. Take a look at your catalina.out logs for more verbose info on where the error is. It has been a while since I have configured this now; sorry I can't be of more help in giving a definite answer.
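A note for anyone who does want to chase the listenerStart error: it usually just means an exception was thrown while the webapp's context listener read its configuration (a stray character in the edited nutch-site.xml is enough), and Tomcat hides the underlying stack trace by default. The usual trick for surfacing it, sketched under the assumption that Tomcat 5.5's JULI logging is in use, is to drop a logging.properties into the webapp:

# C:\Apache\Tomcat-5.5\webapps\nutch\WEB-INF\classes\logging.properties
handlers = org.apache.juli.FileHandler, java.util.logging.ConsoleHandler
org.apache.juli.FileHandler.level = FINE
org.apache.juli.FileHandler.directory = ${catalina.base}/logs
org.apache.juli.FileHandler.prefix = nutch.
java.util.logging.ConsoleHandler.level = FINE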
Nutch not indexing full collection
Hi,

I'm using Nutch 1.3 to crawl a section of our website, and it doesn't seem to crawl the entire thing. I'm probably missing something simple, so I hope somebody can help me.

My urls/nutch file contains a single URL: http://www.aip.org/history/ohilist/transcripts.html , which is an alphabetical listing of other pages. It looks like the indexer stops partway down this page, meaning that entries later in the alphabet aren't indexed.

My nutch-site.xml has the following content:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>OHI Spider</value>
  </property>
  <property>
    <name>db.max.outlinks.per.page</name>
    <value>-1</value>
    <description>The maximum number of outlinks that we'll process for a page.
    If this value is nonnegative (>=0), at most db.max.outlinks.per.page
    outlinks will be processed for a page; otherwise, all outlinks will be
    processed.</description>
  </property>
</configuration>

My regex-urlfilter.txt and crawl-urlfilter.txt both include the following, which should allow access to everything I want:

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*aip.org/history/ohilist/

# skip everything else
-.

I've crawled with the following command:

runtime/local/bin/nutch crawl urls -dir crawl -depth 15 -topN 50

Note that since we don't have NutchBean anymore, I can't tell whether this is actually a Nutch problem or whether something is failing when I port to Solr. What am I missing?

Thanks,
Chip
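If the URL filters are suspected, they can be exercised directly: the regex filter class has a test main that reads URLs on stdin and echoes each back prefixed with + (accepted) or - (rejected). A sketch via the plugin runner, using a hypothetical transcript URL; the class name is the one used by the 1.3 urlfilter-regex plugin:

echo "http://www.aip.org/history/ohilist/4403.html" | runtime/local/bin/nutch plugin urlfilter-regex org.apache.nutch.urlfilter.regex.RegexURLFilter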
RE: Nutch not indexing full collection
I've been working with $NUTCH_HOME/runtime/local/conf/nutch-site.xml, and I'm pretty sure that's the correct file. I run my commands from $NUTCH_HOME/ , which means all of my commands begin with runtime/local/bin/nutch... . That means my urls directory is $NUTCH_HOME/urls/ and my crawl directory ends up being $NUTCH_HOME/crawl/ (as opposed to $NUTCH_HOME/runtime/local/urls/ and so forth), but it does seem to at least be getting my urlfilters from $NUTCH_HOME/runtime/local/conf/ .

I get no output when I try runtime/local/bin/nutch readdb -stats , so that's weird. I dimly recall there being a total index size value somewhere in Nutch or Solr which has to be increased, but I can no longer find any reference to it.

Chip

-----Original Message-----
From: Julien Nioche [mailto:lists.digitalpeb...@gmail.com]
Sent: Wednesday, July 20, 2011 10:06 AM
To: user@nutch.apache.org
Subject: Re: Nutch not indexing full collection

I'd have suspected db.max.outlinks.per.page, but you seem to have set it up correctly. Are you running Nutch in runtime/local? In which case you modified nutch-site.xml in runtime/local/conf, right?

nutch readdb -stats will give you the total number of pages known etc.

Julien

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
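The empty readdb output has a mundane explanation: readdb takes the path to the crawldb as its first argument and prints only a usage message without it. With the crawl directory used above:

runtime/local/bin/nutch readdb crawl/crawldb -stats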
RE: Nutch not indexing full collection
I'm still having trouble. I've set a Windows environment variable, NUTCH_HOME, which for me is C:\Apache\nutch-1.3\runtime\local . I now have my urls and crawl directories in that C:\Apache\nutch-1.3\runtime\local folder. But I'm still not crawling files later in my urls list, and apparently I can't search for words or phrases toward the end of any of my documents. Am I misremembering that there was a total file size value somewhere in Nutch or Solr that needs to be increased?

-----Original Message-----
From: lewis john mcgibbney [mailto:lewis.mcgibb...@gmail.com]
Sent: Wednesday, July 20, 2011 5:23 PM
To: user@nutch.apache.org
Subject: Re: Nutch not indexing full collection

Hi Chip,

I would try running your scripts after setting the environment variable $NUTCH_HOME to nutch/runtime/local

--
Lewis
RE: Nutch not indexing full collection
Thanks! This has solved half of my problem. I am now indexing material from every document I want. However, I'm still not indexing words from toward the end of longer documents, and I'm not sure what else I could be missing. The current contents of my nutch-site.xml are:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>OHI Spider</value>
  </property>
  <property>
    <name>db.max.outlinks.per.page</name>
    <value>-1</value>
    <description>The maximum number of outlinks that we'll process for a page.
    If this value is nonnegative (>=0), at most db.max.outlinks.per.page
    outlinks will be processed for a page; otherwise, all outlinks will be
    processed.</description>
  </property>
  <property>
    <name>http.content.limit</name>
    <value>-1</value>
  </property>
</configuration>

And I'm still indexing with this command:

bin/nutch crawl urls -dir crawl -depth 15 -topN 50

-----Original Message-----
From: lewis john mcgibbney [mailto:lewis.mcgibb...@gmail.com]
Sent: Wednesday, July 27, 2011 12:18 PM
To: user@nutch.apache.org
Subject: Re: Nutch not indexing full collection

Has this been solved? If your http.content.limit has not been increased in nutch-site.xml then you will not be able to store this data and index with Solr.
RE: Nutch not indexing full collection
That did it! For the convenience of anyone who finds this in the list archives later on, I'll paste what it took:

$NUTCH_HOME/runtime/local/conf/nutch-site.xml (full contents):

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>OHI Spider</value>
  </property>
  <property>
    <name>db.max.outlinks.per.page</name>
    <value>-1</value>
    <description>The maximum number of outlinks that we'll process for a page.
    If this value is nonnegative (>=0), at most db.max.outlinks.per.page
    outlinks will be processed for a page; otherwise, all outlinks will be
    processed.</description>
  </property>
  <property>
    <name>http.content.limit</name>
    <value>-1</value>
    <description>The length limit for downloaded content, in bytes.
    If this value is nonnegative (>=0), content longer than it will be
    truncated; otherwise, no truncation at all.</description>
  </property>
</configuration>

$NUTCH_HOME/runtime/local/conf/schema.xml and $SOLR_HOME/example/solr/conf/schema.xml:

Replace this:
<field name="content" type="text" stored="false" indexed="true"/>
With this:
<field name="content" type="text" stored="true" indexed="true"/>

$SOLR_HOME/example/solr/conf/solrconfig.xml:

Replace this:
<maxFieldLength>10000</maxFieldLength>
With this:
<maxFieldLength>2147483647</maxFieldLength>

-----Original Message-----
From: Markus Jelsma [mailto:markus.jel...@openindex.io]
Sent: Monday, August 01, 2011 3:45 PM
To: user@nutch.apache.org
Cc: Chip Calhoun
Subject: Re: Nutch not indexing full collection

Nutch truncates content longer than configured and Solr truncates content exceeding max field length. Maybe check your limits.

> I'm still having trouble with this. In addition to the nutch-site.xml
> posted below, I have now modified my schema.xml (in both Nutch and Solr)
> to include the following important line:
>
> <field name="content" type="text" stored="true" indexed="true"/>
>
> Now, when I search, the full text of each document shows up under
> <str name="content">. I'm clearly getting everything. And yet, when I
> search for text toward the end of a long document, I still don't get that
> document in my search results. It sounds like this might be an issue with
> my Solr setup. Can anyone think of what I might be missing?
>
> Chip
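Note that the schema and solrconfig changes only take effect after a Solr restart and a full re-index. A smoke test for the fix, sketched with a hypothetical phrase taken from near the end of a long transcript:

curl "http://localhost:8983/solr/select?q=content:%22some+phrase+near+the+end%22&fl=url,title"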
Machine readable vs. human readable URLs.
Hi everyone, We'd like to use Nutch and Solr to replace an existing Verity search that's become a bit long in the tooth. In our Verity search, we have a hack which allows each document to have a machine-readable URL which is indexed (generally an xml document), and a human-readable URL which we actually send users to. Has anyone done the same with Nutch and Solr? Thanks, Chip
RE: Machine readable vs. human readable URLs.
Hi Julien,

Thanks, that's encouraging. I'm trying to make this work, and I'm definitely missing something; I hope I'm not too far off the mark. I've started with the instructions at http://wiki.apache.org/nutch/WritingPluginExample . If I understand this properly, the changes I needed to make were the following:

In Nutch:
- Paste the prescribed block of code into %NUTCH_HOME%\runtime\local\conf\nutch-site.xml. This tells Nutch to look for and run the urlmeta plugin.
- In %NUTCH_HOME%, run ant war.
- Add the tags to %NUTCH_HOME%\runtime\local\urls\nutch. A line in this file now looks like:
  http://www.aip.org/history/ead/20110369.xml\t humanURL=http://www.aip.org/history/ead/20110369.html

In Solr:
- Add my new tag to %SOLR_HOME%\example\solr\conf\schema.xml . The new line consists of:
  <field name="humanURL" type="string" stored="true" indexed="false"/>

I've redone the indexing, and my new field still doesn't show up in the search results. Can you tell where I'm going wrong?

Thanks,
Chip

-----Original Message-----
From: Julien Nioche [mailto:lists.digitalpeb...@gmail.com]
Sent: Friday, September 16, 2011 4:37 AM
To: user@nutch.apache.org
Subject: Re: Machine readable vs. human readable URLs.

Hi Chip,

Should simply be a matter of creating a custom field with an IndexingFilter; you can then use it in any way you want on the SOLR side.

Julien

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
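For reference, the "prescribed block" boils down to two properties; a sketch of how they might look for this field, assuming the stock Nutch 1.3 plugin list with urlmeta appended (match the plugin.includes value to your own setup):

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlmeta|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
<property>
  <!-- comma-separated list of metadata keys the plugin should carry
       through to the index -->
  <name>urlmeta.tags</name>
  <value>humanURL</value>
</property>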
RE: Machine readable vs. human readable URLs.
Hi Lewis,

My probably-wrong understanding was that I'm supposed to add the tags for my new field to my list of seed URLs. So if I have a seed URL followed by \t humanURL=http://www.aip.org/history/ead/20110369.html , I get a new field called humanURL which is populated with the string I've specified for that specific URL. I may just be greatly misunderstanding how this plugin works.

I've checked my Nutch logs now, and it looks like nothing happened. The new field does at least show up in the Solr admin UI's schema, but clearly my problem is on the Nutch end of things.

-----Original Message-----
From: lewis john mcgibbney [mailto:lewis.mcgibb...@gmail.com]
Sent: Monday, September 19, 2011 3:34 PM
To: user@nutch.apache.org
Subject: Re: Machine readable vs. human readable URLs.

Hi Chip,

There is no need to run ant war; there is no war target in the build.xml file for Nutch >= 1.3. Can you explain more about adding 'the tags' to %NUTCH_HOME% etc.? Do you mean you've added your seed URLs? Have you had a look at any of your log output as to whether the urlmeta plugin is loaded and used when fetching? You should be able to get info on your schema, fields etc. within the Solr admin UI.
RE: Machine readable vs. human readable URLs.
I thought it seemed too good to be true. I understood the part about this picking up metadata from tags within the actual documents; that seems like a feature a lot of people would need. But I thought the whole point of the tab-delimited tags in my URLs file was that I could also inject tags that aren't in the source documents. That doesn't seem like it would be a standard feature, but it's what I need. Most of the pages I need to index aren't owned by us, and I won't always be able to get other sites to add an extra meta tag to their pages.

It looks like I might need to write my own plugin, which is a little daunting for me. Can anyone think of an existing plugin that injects metadata into indexed documents after the fact? It would be nice to have some existing code I could examine and learn from.

Thanks,
Chip

-----Original Message-----
From: lewis john mcgibbney [mailto:lewis.mcgibb...@gmail.com]
Sent: Monday, September 19, 2011 4:56 PM
To: user@nutch.apache.org
Subject: Re: Machine readable vs. human readable URLs.

In addition, it looks like you are misinterpreting how the urlmeta plugin works, Chip. It is designed to pick up additional meta tags with name and content values respectively, e.g.

<meta name="humanURL" content="blahblahblah" />

The plugin then gets this data, as well as any additional values added in the urlmeta.tags property within nutch-site.xml, and adds this to the index, which can then be queried. Does this make sense?

On Mon, Sep 19, 2011 at 9:10 PM, Julien Nioche <lists.digitalpeb...@gmail.com> wrote:
> Hi,
>
> Since the info is available thanks to the injection, you can use the
> url-meta plugin as-is and won't need to have a custom version. See
> https://issues.apache.org/jira/browse/NUTCH-855
>
> Apart from that, do not modify the content of \runtime\local\conf\ before
> re-compiling with ant, as this will be overwritten. Either modify
> $NUTCH/conf/nutch-site.xml or recompile THEN modify. As Lewis suggested,
> check the logs and see if the plugin is activated etc...
>
> J.
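As NUTCH-855 describes, the metadata rides along in the seed file itself; a sketch of the expected format, with <TAB> standing in for a literal tab character (a real tab must separate the URL from each key=value pair):

http://www.aip.org/history/ead/20110369.xml<TAB>humanURL=http://www.aip.org/history/ead/20110369.html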
RE: Machine readable vs. human readable URLs.
Hi Julien,

Thanks for clarifying this! I've got it working now. Instead of seeding with a proper tab-delimited file created in Excel, I had been wrong-headedly seeding it with a text file that just had tabs in it. They look the same, but it makes a difference.

Thanks!
Chip

-----Original Message-----
From: Julien Nioche [mailto:lists.digitalpeb...@gmail.com]
Sent: Monday, September 19, 2011 5:23 PM
To: user@nutch.apache.org
Subject: Re: Machine readable vs. human readable URLs.

> In addition, it looks like you are misinterpreting how the urlmeta plugin
> works, Chip. It is designed to pick up additional meta tags with name and
> content values respectively, e.g.
> <meta name="humanURL" content="blahblahblah" />

Sorry Lewis, but it does not do that at all. See the link I gave earlier for a description of urlmeta. I agree that the name is misleading: it does not extract the content from the page but simply uses the crawldb metadata.

> The plugin then gets this data, as well as any additional values added in
> the urlmeta.tags property within nutch-site.xml, and adds this to the
> index, which can then be queried. Does this make sense?
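One way to verify the injection before involving Solr at all: readdb's -url mode prints the CrawlDatum for a single URL, and injected key=value pairs show up in its metadata block. A sketch, assuming the crawl layout used earlier in these threads:

runtime/local/bin/nutch readdb crawl/crawldb -url http://www.aip.org/history/ead/20110369.xml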
RE: Machine readable vs. human readable URLs.
For my own sake I wish I could think of a way in which it was unclear, but no; I just screwed up. I could maybe see reinforcing that the urls document has to be saved as a tab-delimited file, so a newbie like me won't look at the examples and think it is meant to be an ordinary text file. Otherwise, both the plugin and the documentation work great!

-----Original Message-----
From: lewis john mcgibbney [mailto:lewis.mcgibb...@gmail.com]
Sent: Wednesday, September 21, 2011 3:05 AM
To: user@nutch.apache.org
Subject: Re: Machine readable vs. human readable URLs.

Hi Chip,

Was there anything in particular you found misleading about the plugin example on the wiki? I am keen to make it as clear as possible.

Thank you
Lewis

On Tue, Sep 20, 2011 at 6:00 PM, Chip Calhoun <ccalh...@aip.org> wrote:
> Thanks for clarifying this! I've got it working now. Instead of seeding
> with a proper tab-delimited file created in Excel, I had been
> wrong-headedly seeding it with a text file that just had tabs in it. They
> look the same, but it makes a difference.
How can I figure out what my user-agent is?
I thought I understood how to set my user-agent, but after asking a few sites to add me to their robots.txt it looks like I'm missing something. My nutch-site.xml includes:

<property>
  <name>http.agent.name</name>
  <value>PHFAWS Spider</value>
</property>
<property>
  <name>http.robots.agents</name>
  <value>PHFAWS Spider,*</value>
  <description>The agent strings we'll look for in robots.txt files,
  comma-separated, in decreasing order of precedence. You should put the
  value of http.agent.name as the first agent name, and keep the default *
  at the end of the list. E.g.: BlurflDev,Blurfl,*</description>
</property>

A friendly site created a robots.txt which includes the following:

User-agent: PHFAWS Spider
Disallow:

User-agent: *
Disallow: /

Why doesn't this work?

Thanks,
Chip
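To answer the subject line directly: the fetcher logs its fully composed agent string at startup (an "http.agent =" line of exactly this kind is quoted later in this thread), so the quickest check is to grep the Nutch log. The path below assumes the 1.3 local runtime's default log location:

grep "http.agent" runtime/local/logs/hadoop.log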
What could be blocking me, if not robots.txt?
Hi everyone,

I'm using Nutch to crawl a few friendly sites, and am having trouble with some of them. One site in particular has created an exception for me in its robots.txt, and yet I can't crawl any of its pages. I've tried copying the files I want to index (3 XML documents) to my own server and crawling that, and it works fine that way; so something is keeping me from indexing any files on this other site.

I compared the logs of my attempt to crawl the friendly site with my attempt to crawl my own site, and I've found few differences. Most differences come from the fact that my own site requires a crawlDelay, so there are many log sections along the lines of:

2011-09-29 10:57:37,529 INFO fetcher.Fetcher - -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=2
2011-09-29 10:57:37,529 INFO fetcher.Fetcher - * queue: http://www.aip.org
2011-09-29 10:57:37,529 INFO fetcher.Fetcher -   maxThreads    = 1
2011-09-29 10:57:37,529 INFO fetcher.Fetcher -   inProgress    = 0
2011-09-29 10:57:37,529 INFO fetcher.Fetcher -   crawlDelay    = 5000
2011-09-29 10:57:37,529 INFO fetcher.Fetcher -   minCrawlDelay = 0
2011-09-29 10:57:37,529 INFO fetcher.Fetcher -   nextFetchTime = 1317308262122
2011-09-29 10:57:37,529 INFO fetcher.Fetcher -   now           = 1317308257529
2011-09-29 10:57:37,529 INFO fetcher.Fetcher -   0. http://www.aip.org/history/ead/umd/MdU.ead.histms.0067.xml
2011-09-29 10:57:37,529 INFO fetcher.Fetcher -   1. http://www.aip.org/history/ead/umd/MdU.ead.histms.0312.xml

That strikes me as probably irrelevant, but I figured I should mention it. The main difference I see in the logs is that the crawl of my own site (the crawl that worked) has the following two lines which do not appear in the log of my failed crawl:

2011-09-29 10:57:50,497 INFO parse.ParserFactory - The parsing plugins: [org.apache.nutch.parse.tika.TikaParser] are enabled via the plugin.includes system property, and all claim to support the content type application/xml, but they are not mapped to it in the parse-plugins.xml file
2011-09-29 10:58:23,559 INFO crawl.SignatureFactory - Using Signature impl: org.apache.nutch.crawl.MD5Signature

Also, while my successful crawl has three lines like the following, my failed one only has two:

2011-09-29 10:58:44,824 WARN regex.RegexURLNormalizer - can't find rules for scope 'crawldb', using default

Can anyone think of something I might have missed?

Chip
RE: What could be blocking me, if not robots.txt?
I apologize, but I haven't found much Nutch documentation that deals with the user-agent and robots.txt. Why am I being blocked when the user-agent I'm sending matches the user-agent in that robots.txt? Chip

-----Original Message-----
From: Markus Jelsma [mailto:markus.jel...@openindex.io]
Sent: Friday, September 30, 2011 6:28 PM
To: user@nutch.apache.org
Cc: Chip Calhoun
Subject: Re: What could be blocking me, if not robots.txt?

I've been able to run the ParserChecker now, but I'm not sure how to understand the results. Here's what I got:

# bin/nutch org.apache.nutch.parse.ParserChecker http://digital.lib.umd.edu/oclc/MdU.ead.histms.0094.xml
--- Url ---
http://digital.lib.umd.edu/oclc/MdU.ead.histms.0094.xml
--- ParseData ---
Version: 5
Status: success(1,0)
Title:
Outlinks: 1
  outlink: toUrl: GR:32:A:128 anchor:
Content Metadata: ETag=1fa962a-56f20-485df79c50980 Date=Fri, 30 Sep 2011 19:54:14 GMT Content-Length=356128 Last-Modified=Wed, 05 May 2010 21:26:14 GMT Content-Type=text/xml Connection=close Accept-Ranges=bytes Server=Apache/2.2.3 (Red Hat)
Parse Metadata: Content-Type=application/xml

This means almost everything is good to go but...

Curl also retrieves this file, and yet I can't get my crawl to pick it up. Could it be an issue with robots.txt? The robots file for this site reads as follows:

User-agent: PHFAWS/Nutch-1.3
Disallow:

User-agent: archive.org_bot
Disallow:

User-agent: *
Disallow: /

This is the problem.

That first user-agent is, as near as I can tell, what I'm sending. My log shows the following:

2011-09-30 15:54:17,712 INFO http.Http - http.agent = PHFAWS/Nutch-1.3 (American Institute of Physics: Physics History Finding Aids Web Site; http://www.aip.org/history/nbl/findingaids.html; ccalh...@aip.org)

Can anyone tell what I'm missing? Thanks. Chip

-----Original Message-----
From: Chip Calhoun [mailto:ccalh...@aip.org]
Sent: Thursday, September 29, 2011 4:12 PM
To: user@nutch.apache.org
Subject: RE: What could be blocking me, if not robots.txt?

Ah, sorry. I had already deleted the local copy from my server (aip.org) to avoid clutter. So yeah, that will definitely 404 now. Curl retrieves the whole file with no problems. I can't try the ParserChecker today as I'm stuck away from my own machine, but I will try it tomorrow. The fact that I can curl it at least tells me this is a problem I need to fix in Nutch. Chip

From: Markus Jelsma [markus.jel...@openindex.io]
Sent: Thursday, September 29, 2011 1:01 PM
To: user@nutch.apache.org
Cc: Chip Calhoun
Subject: Re: What could be blocking me, if not robots.txt?

Oh, it's a 404. That makes sense.
RE: What could be blocking me, if not robots.txt?
Aha! That's done it. Thanks! Incidentally, I only asked them to add the /Nutch-1.3 because originally I had a user-agent of PHFAWS Spider and had them add PHFAWS Spider to their robots.txt, and it didn't work. It seems that at least some sites have trouble with a user-agent that's more than one word. And I only went with multiple words because the tutorial gives <value>My Nutch Spider</value> as an example. This might be something to warn people about in the documentation. Chip

-----Original Message-----
From: Markus Jelsma [mailto:markus.jel...@openindex.io]
Sent: Monday, October 03, 2011 9:42 AM
To: user@nutch.apache.org
Subject: Re: What could be blocking me, if not robots.txt?

Oh i misread, your user agent is PHFAWS/Nutch-1.3? Are you sure that that's what is configured as your user agent name? If your name is PHFAWS then the robots.txt must list your name without /Nutch-1.3. Or maybe change the robots.txt to:

User-agent: PHFAWS/Nutch-1.3
Allow: /

On Monday 03 October 2011 15:31:46 Chip Calhoun wrote:
I apologize, but I haven't found much Nutch documentation that deals with the user-agent and robots.txt. Why am I being blocked when the user-agent I'm sending matches the user-agent in that robots.txt? Chip
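[The takeaway generalizes: keep http.agent.name to a single token so robots.txt matching is unambiguous. A minimal sketch of the single-word configuration, assuming the PHFAWS name used in this thread:

<property>
  <name>http.agent.name</name>
  <value>PHFAWS</value>
</property>
<property>
  <name>http.robots.agents</name>
  <value>PHFAWS,*</value>
</property>]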
Unable to parse large XML files.
Hi everyone, I've found that I'm unable to parse very large XML files. This doesn't seem to happen with other file formats. When I run any of the offending files through ParserChecker, I get something along the lines of:

# bin/nutch org.apache.nutch.parse.ParserChecker http://www.aip.org/history/ead/19990074.xml
--- Url ---
http://www.aip.org/history/ead/19990074.xml
--- ParseData ---
Version: 5
Status: failed(2,200): org.apache.nutch.parse.ParseException: Unable to successfully parse content
Title:
Outlinks: 0
Content Metadata:
Parse Metadata:

One thing which may or may not be relevant is that when I look XML files up in a browser, the http:// at the beginning tends to disappear. That seems relevant because it seems like it might defeat my file.content.limit, http.content.limit, and ftp.content.limit properties. Is there a way around this? Thanks, Chip
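[One quick sanity check before touching config: compare the document's size against the active content limit (the stock nutch-default.xml ships http.content.limit at 65536 bytes). A hedged one-liner in the spirit of the curl checks used elsewhere in these threads:

curl -sI http://www.aip.org/history/ead/19990074.xml | grep -i Content-Length]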
RE: Unable to parse large XML files.
Huh. It turns out my http.content.limit was fine, but I also needed a file.content.limit statement in nutch-site.xml to make this work. Thanks!

-----Original Message-----
From: Markus Jelsma [mailto:markus.jel...@openindex.io]
Sent: Tuesday, October 04, 2011 7:41 PM
To: user@nutch.apache.org
Subject: Re: Unable to parse large XML files.

Hi everyone, I've found that I'm unable to parse very large XML files. This doesn't seem to happen with other file formats. When I run any of the offending files through ParserChecker, I get something along the lines of:

# bin/nutch org.apache.nutch.parse.ParserChecker http://www.aip.org/history/ead/19990074.xml
--- Url ---
http://www.aip.org/history/ead/19990074.xml
--- ParseData ---
Version: 5
Status: failed(2,200): org.apache.nutch.parse.ParseException: Unable to successfully parse content
Title:
Outlinks: 0
Content Metadata:
Parse Metadata:

One thing which may or may not be relevant is that when I look XML files up in a browser the http:// at the beginning tends to disappear.

You're using some fancy new browser? Some seem to do that. Check your http.content.limit.

That seems relevant because it seems like it might defeat my file.content.limit, http.content.limit, and ftp.content.limit properties. Is there a way around this? Thanks, Chip
RE: Unable to parse large XML files.
Hrm. No, it turns out I was wrong; I'd misread an error message. I've got the following in my nutch-site.xml:

<property>
  <name>file.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content using the file:// protocol, in bytes. If this value is nonnegative (>=0), content longer than it will be truncated; otherwise, no truncation at all. Do not confuse this setting with the http.content.limit setting.</description>
</property>
<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content, in bytes. If this value is nonnegative (>=0), content longer than it will be truncated; otherwise, no truncation at all.</description>
</property>
<property>
  <name>ftp.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content, in bytes. If this value is nonnegative (>=0), content longer than it will be truncated; otherwise, no truncation at all. Caution: classical ftp RFCs never define partial transfer and, in fact, some ftp servers out there do not handle client side forced close-down very well. Our implementation tries its best to handle such situations smoothly.</description>
</property>
Truncated content despite my content.limit settings.
Hi everyone, I'm having issues with truncated content on some pages, despite what I believe to be solid content.limit settings. One page I have an issue with: http://www.canisius.edu/archives/ruddick.asp

When I run a search in Solr, the content I get is limited to:

<str name="content">Canisius College - Ruddick Collection Canisius College Archives Return to Home Admissions Academics Athletics Student Life Alumni and Friends News and Events Welcome to Canisius Department Index Archives Special Collections Ruddick Collection Collection of Fr. James J. Ruddick, S.J., 1924-2007 Welcome to the Collection of Rev. James J. Ruddick, S.J. chronicling the</str>

Here's what I have in my nutch-site.xml, which looks sufficient to me:

<property>
  <name>db.max.outlinks.per.page</name>
  <value>-1</value>
  <description>The maximum number of outlinks that we'll process for a page. If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks will be processed for a page; otherwise, all outlinks will be processed.</description>
</property>
<property>
  <name>file.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content using the file:// protocol, in bytes. If this value is nonnegative (>=0), content longer than it will be truncated; otherwise, no truncation at all. Do not confuse this setting with the http.content.limit setting.</description>
</property>
<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content, in bytes. If this value is nonnegative (>=0), content longer than it will be truncated; otherwise, no truncation at all.</description>
</property>
<property>
  <name>ftp.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content, in bytes. If this value is nonnegative (>=0), content longer than it will be truncated; otherwise, no truncation at all. Caution: classical ftp RFCs never define partial transfer and, in fact, some ftp servers out there do not handle client side forced close-down very well. Our implementation tries its best to handle such situations smoothly.</description>
</property>

Can anyone see what I'm missing? Thanks. Chip
RE: Truncated content despite my content.limit settings.
With ParserChecker it's similarly truncated. Could it be the fact that it's a .asp page? The output is as follows:

# bin/nutch org.apache.nutch.parse.ParserChecker -dumpText http://www.canisius.edu/archives/ruddick.asp
--- Url ---
http://www.canisius.edu/archives/ruddick.asp
--- ParseData ---
Version: 5
Status: success(1,0)
Title: Canisius College - Ruddick Collection
Outlinks: 20
  outlink: toUrl: http://www.canisius.edu/v2/SiteStyleClient.css anchor:
  outlink: toUrl: http://www.canisius.edu/v2/SiteStylePrint.css anchor:
  outlink: toUrl: http://www.google-analytics.com/urchin.js anchor:
  outlink: toUrl: http://www.canisius.edu/default.asp anchor: Return to Home
  outlink: toUrl: http://www.canisius.edu/admissions/rd/?PROP-PROADM anchor: Admissions
  outlink: toUrl: http://www.canisius.edu/academics/ anchor: Academics
  outlink: toUrl: http://www.gogriffs.com anchor: Athletics
  outlink: toUrl: http://www.canisius.edu/studentlife/ anchor: Student Life
  outlink: toUrl: http://www.canisius.edu/alumnifriends/ anchor: Alumni and Friends
  outlink: toUrl: http://www.canisius.edu/newsevents/ anchor: News and Events
  outlink: toUrl: http://www.canisius.edu/images/userImages/creans/Page_12509/ruddick_centerBanner.jpg anchor:
  outlink: toUrl: http://www.canisius.edu/images/userImages/creans/Page_12509/ruddick_HC.gif anchor:
  outlink: toUrl: http://www.canisius.edu/archives/mission.asp anchor: mission statement
  outlink: toUrl: http://www.canisius.edu/images/userImages/creans/Page_12509/mission_blue.gif anchor: mission statement
  outlink: toUrl: http://www.canisius.edu/archives/directory.asp anchor: archives directory
  outlink: toUrl: http://www.canisius.edu/images/userImages/creans/Page_12509/archives_gold.gif anchor: archives directory
  outlink: toUrl: http://www.canisius.edu/default.asp anchor: Welcome to Canisius
  outlink: toUrl: http://www.canisius.edu/about/departments.asp anchor: Department Index
  outlink: toUrl: http://www.canisius.edu/archives/default.asp anchor: Archives Special Collections
  outlink: toUrl: http://www.canisius.edu/images/userImages/libweb/Page_12509/Ruddick.jpg anchor:
Content Metadata: Cache-control=private Date=Tue, 18 Oct 2011 13:44:06 GMT Content-Length=10610 Set-Cookie=ASPSESSIONIDASSCBRRA=LNGICEKCBKDEAOFICKHLDHEL; path=/ Content-Type=text/html Connection=close X-Powered-By=ASP.NET Server=Microsoft-IIS/6.0
Parse Metadata: CharEncodingForConversion=windows-1252 OriginalCharEncoding=windows-1252
--- ParseText ---
Canisius College - Ruddick Collection Canisius College Archives Return to Home Admissions Academics Athletics Student Life Alumni and Friends News and Events Welcome to Canisius Department Index Archives Special Collections Ruddick Collection Collection of Fr. James J. Ruddick, S.J., 1924-2007 Welcome to the C

-----Original Message-----
From: Markus Jelsma [mailto:markus.jel...@openindex.io]
Sent: Monday, October 17, 2011 4:26 PM
To: user@nutch.apache.org
Subject: Re: Truncated content despite my content.limit settings.

What does parsechecker tell you?

nutch org.apache.nutch.parse.ParserChecker -dumpText URL

Keep in mind that your Solr may have a low value for max field length.

Hi everyone, I'm having issues with truncated content on some pages, despite what I believe to be solid content.limit settings.
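[On Markus's Solr-side caveat: in the Solr 1.4/3.x example config, indexing silently stops after maxFieldLength tokens per field, which truncates long content in exactly this way. A sketch of the solrconfig.xml setting to raise; the stock example ships with 10000:

<maxFieldLength>2147483647</maxFieldLength>]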
RE: Truncated content despite my content.limit settings.
Aha! It turns out that removing protocol-httpclient from my nutch-site.xml's plugin.includes value fixes this. If I'm remembering correctly, I only added it in the hope that it would fix something else that it didn't actually fix, so hopefully removing it won't break anything.

-----Original Message-----
From: Markus Jelsma [mailto:markus.jel...@openindex.io]
Sent: Tuesday, October 18, 2011 9:58 AM
To: user@nutch.apache.org
Subject: Re: Truncated content despite my content.limit settings.

Strange! I parsed it yesterday as well with parse-tika and the Boilerpipe patch enabled and got a lot of output. Can you try a different parser? Your settings look fine, but are there any other exotic settings you use, or custom code?

On Tuesday 18 October 2011 15:53:26 Chip Calhoun wrote:
With ParserChecker it's similarly truncated. Could it be the fact that it's a .asp page?
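[For reference, a hedged sketch of what plugin.includes looks like with protocol-httpclient left out, assuming the stock Nutch 1.3 plugin set plus the urlmeta plugin from the earlier thread:

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)|urlmeta</value>
  <!-- protocol-http instead of protocol-httpclient -->
</property>]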
Good workaround for timeout?
I'm getting a fairly persistent timeout on a particular page. Other, smaller pages in this folder do fine, but this one times out most of the time. When it fails, my ParserChecker results look like:

# bin/nutch org.apache.nutch.parse.ParserChecker -dumpText http://digital.lib.washington.edu/findingaids/view?docId=UA37_06_2932DonaldsonLauren.xml
Exception in thread "main" java.lang.NullPointerException
        at org.apache.nutch.parse.ParserChecker.main(ParserChecker.java:84)

I've stuck with the default value of 10 in my nutch-default.xml's fetcher.threads.fetch value, and I've added the following to nutch-site.xml:

<property>
  <name>db.max.outlinks.per.page</name>
  <value>-1</value>
  <description>The maximum number of outlinks that we'll process for a page. If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks will be processed for a page; otherwise, all outlinks will be processed.</description>
</property>
<property>
  <name>file.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content using the file:// protocol, in bytes. If this value is nonnegative (>=0), content longer than it will be truncated; otherwise, no truncation at all. Do not confuse this setting with the http.content.limit setting.</description>
</property>
<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content, in bytes. If this value is nonnegative (>=0), content longer than it will be truncated; otherwise, no truncation at all.</description>
</property>
<property>
  <name>ftp.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content, in bytes. If this value is nonnegative (>=0), content longer than it will be truncated; otherwise, no truncation at all. Caution: classical ftp RFCs never define partial transfer and, in fact, some ftp servers out there do not handle client side forced close-down very well. Our implementation tries its best to handle such situations smoothly.</description>
</property>
<property>
  <name>http.timeout</name>
  <value>999</value>
  <description>The default network timeout, in milliseconds.</description>
</property>

What else can I do? Thanks. Chip
RE: Good workaround for timeout?
If I'm reading the log correctly, it's the fetch:

2011-10-19 11:18:11,405 INFO fetcher.Fetcher - fetch of http://digital.lib.washington.edu/findingaids/view?docId=UA37_06_2932DonaldsonLauren.xml failed with: java.net.SocketTimeoutException: Read timed out

-----Original Message-----
From: Markus Jelsma [mailto:markus.jel...@openindex.io]
Sent: Wednesday, October 19, 2011 11:08 AM
To: user@nutch.apache.org
Subject: Re: Good workaround for timeout?

What is timing out, the fetch or the parse?

I'm getting a fairly persistent timeout on a particular page. Other, smaller pages in this folder do fine, but this one times out most of the time.
RE: Good workaround for timeout?
I'm using protocol-http, but I removed protocol-httpclient after you pointed out in another thread that it's broken. Unfortunately I'm not sure which properties are used by what, and I'm not sure how to find out. I added some more stuff to nutch-site.xml (I'll paste it at the end), and it seems to be working so far; but since this has been an intermittent problem, I can't be sure whether I've really fixed it or whether I'm getting lucky.

<property>
  <name>http.timeout</name>
  <value>999</value>
  <description>The default network timeout, in milliseconds.</description>
</property>
<property>
  <name>ftp.timeout</name>
  <value>99</value>
  <description>Default timeout for ftp client socket, in millisec. Please also see ftp.keep.connection below.</description>
</property>
<property>
  <name>ftp.server.timeout</name>
  <value>9</value>
  <description>An estimation of ftp server idle time, in millisec. Typically it is 120000 millisec for many ftp servers out there. Better be conservative here. Together with ftp.timeout, it is used to decide if we need to delete (annihilate) current ftp.client instance and force to start another ftp.client instance anew. This is necessary because a fetcher thread may not be able to obtain next request from queue in time (due to idleness) before our ftp client times out or remote server disconnects. Used only when ftp.keep.connection is true (please see below).</description>
</property>
<property>
  <name>parser.timeout</name>
  <value>300</value>
  <description>Timeout in seconds for the parsing of a document, otherwise treats it as an exception and moves on to the following documents. This parameter is applied to any Parser implementation. Set to -1 to deactivate, bearing in mind that this could cause the parsing to crash because of a very long or corrupted document.</description>
</property>

-----Original Message-----
From: Markus Jelsma [mailto:markus.jel...@openindex.io]
Sent: Wednesday, October 19, 2011 11:28 AM
To: user@nutch.apache.org
Subject: Re: Good workaround for timeout?

It is indeed. Tricky. Are you going through some proxy? Are you using protocol-http or httpclient? Are you sure the http.timeout value is actually used in lib-http?

If I'm reading the log correctly, it's the fetch:

2011-10-19 11:18:11,405 INFO fetcher.Fetcher - fetch of http://digital.lib.washington.edu/findingaids/view?docId=UA37_06_2932DonaldsonLauren.xml failed with: java.net.SocketTimeoutException: Read timed out
Is there a workaround for https?
I've noticed the recent posts about trouble with protocol-httpclient, which to my understanding is needed for https URLs. Is there another way to handle these? ParserChecker gives me the following when I try one of these URLs. Thanks.

# bin/nutch org.apache.nutch.parse.ParserChecker -dumpText https://libwebspace.library.cmu.edu:4430/Research/Archives/ead/generated/shull.xml
Exception in thread "main" org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=https
        at org.apache.nutch.protocol.ProtocolFactory.getProtocol(ProtocolFactory.java:80)
        at org.apache.nutch.parse.ParserChecker.main(ParserChecker.java:78)
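[The error itself points at the cause: with only protocol-http enabled, no plugin is registered for the https scheme. A hedged workaround sketch — protocol-httpclient registers both http and https, so enabling it (despite the problems discussed in the neighbouring threads) is one option:

<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <!-- protocol-httpclient handles http and https; protocol-http is http-only -->
</property>]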
RE: Good workaround for timeout?
I started out with a pretty high number in http.timeout, and I've increased it to the fairly ridiculous 999. Is there an upper limit at which it would stop working properly?

-----Original Message-----
From: Markus Jelsma [mailto:markus.jel...@openindex.io]
Sent: Wednesday, October 19, 2011 4:57 PM
To: user@nutch.apache.org
Cc: Chip Calhoun
Subject: Re: Good workaround for timeout?

I'm using protocol-http, but I removed protocol-httpclient after you pointed out in another thread that it's broken. Unfortunately I'm not sure which properties are used by what, and I'm not sure how to find out. I added some more stuff to nutch-site.xml (I'll paste it at the end), and it seems to be working so far; but since this has been an intermittent problem, I can't be sure whether I've really fixed it or whether I'm getting lucky.

http.timeout is used in lib-http so it should work unless there's a bug around. Does the problem persist for that one URL if you increase this value to a more reasonable number, say 300?
RE: Good workaround for timeout?
Good to know! I was definitely exceeding that, so I've changed my properties.

-----Original Message-----
From: Markus Jelsma [mailto:markus.jel...@openindex.io]
Sent: Thursday, October 20, 2011 10:00 AM
To: user@nutch.apache.org
Cc: Chip Calhoun
Subject: Re: Good workaround for timeout?

On Thursday 20 October 2011 15:56:01 Chip Calhoun wrote:
I started out with a pretty high number in http.timeout, and I've increased it to the fairly ridiculous 999. Is there an upper limit at which it would stop working properly?

It's interpreted as an Integer, so don't exceed Integer.MAX_VALUE. Don't know how hadoop will handle it, for sure.
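[A sketch of an http.timeout that respects the bound Markus describes; the 300000 (five minutes) is illustrative, not a value from this thread:

<property>
  <name>http.timeout</name>
  <!-- milliseconds; parsed as a Java int, so it must stay below Integer.MAX_VALUE (2147483647) -->
  <value>300000</value>
</property>]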
Extremely long parsing of large XML files (Was RE: Good workaround for timeout?)
|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)|urlmeta</value>
<description>Regular expression naming plugin directory names to include. Any plugin not matching this expression is excluded. In any case you need at least include the nutch-extensionpoints plugin. By default Nutch includes crawling just HTML and plain text via HTTP, and basic indexing and search plugins.</description>
</property>
<property>
  <name>urlmeta.tags</name>
  <value>humanurl</value>
</property>
RE: Extremely long parsing of large XML files (Was RE: Good workaround for timeout?)
Increasing parser.timeout to 3600 got me what I needed. I only have a few files this huge, so I'll live with that.

-----Original Message-----
From: Markus Jelsma [mailto:markus.jel...@openindex.io]
Sent: Wednesday, October 26, 2011 10:55 AM
To: user@nutch.apache.org
Subject: Re: Extremely long parsing of large XML files (Was RE: Good workaround for timeout?)

The actual parse which is producing time outs happens early in the process. There are, to my knowledge, no Nutch settings to make this faster or change its behaviour; it's all about the parser implementation. Try increasing your parser.timeout setting.

On Wednesday 26 October 2011 16:45:33 Chip Calhoun wrote:
I've got a few very large (upwards of 3 MB) XML files I'm trying to index, and I'm having trouble. Previously I'd had trouble with the fetch; now that seems to be okay, but due to the size of the files the parse takes much too long. Is there a good way to optimize this that I'm missing? Is lengthy parsing of XML a known problem? I recognize that part of my problem is that I'm doing my testing from my aging desktop PC, and it will run faster when I move things to the server, but it's still slow. I do get the following weird message in my log when I run ParserChecker or the crawler:

2011-10-26 09:51:47,729 INFO parse.ParserFactory - The parsing plugins: [org.apache.nutch.parse.tika.TikaParser] are enabled via the plugin.includes system property, and all claim to support the content type application/xml, but they are not mapped to it in the parse-plugins.xml file
2011-10-26 10:06:40,639 WARN parse.ParseUtil - TIMEOUT parsing http://www.aip.org/history/ead/19990074.xml with org.apache.nutch.parse.tika.TikaParser@18355aa
2011-10-26 10:06:40,639 WARN parse.ParseUtil - Unable to successfully parse content http://www.aip.org/history/ead/19990074.xml of type application/xml

My ParserChecker results look like this:

# bin/nutch org.apache.nutch.parse.ParserChecker -dumpText http://www.aip.org/history/ead/19990074.xml
--- Url ---
http://www.aip.org/history/ead/19990074.xml
--- ParseData ---
Version: 5
Status: failed(2,200): org.apache.nutch.parse.ParseException: Unable to successfully parse content
Title:
Outlinks: 0
Content Metadata:
Parse Metadata:
--- ParseText ---

And here's everything that might be relevant in my nutch-site.xml; I've tried it both with and without the urlmeta plugin, and that doesn't make a difference:
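[Chip's fix, expressed as the usual nutch-site.xml override; 3600 is the value from his message, and the comment is paraphrased:

<property>
  <name>parser.timeout</name>
  <value>3600</value>
  <!-- seconds per document; raised well above the default for a handful of very large XML files -->
</property>]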
Trouble running solrindexer from Nutch 1.4
This is probably just down to my not waiting for a 1.4 tutorial, but here goes. I've always used the following two commands to run my crawl and then index to Solr:

# bin/nutch crawl urls -dir crawl -depth 1 -topN 50
# bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*

In 1.3 that works great. But in 1.4, when I run solrindex I get this:

# bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*
SolrIndexer: starting at 2011-12-07 17:09:58
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/C:/apache/apache-nutch-1.4/runtime/local/crawl/linkdb/crawl_fetch
Input path does not exist: file:/C:/apache/apache-nutch-1.4/runtime/local/crawl/linkdb/crawl_parse
Input path does not exist: file:/C:/apache/apache-nutch-1.4/runtime/local/crawl/linkdb/parse_data
Input path does not exist: file:/C:/apache/apache-nutch-1.4/runtime/local/crawl/linkdb/parse_text

Sure enough, those directories don't exist. But they didn't exist in 1.3 either. What am I missing? Thanks, Chip
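[The error gives a strong hint: crawl_fetch, crawl_parse, parse_data and parse_text are segment subdirectories, so SolrIndexer is scanning crawl/linkdb as if it were a segment. One hedged guess is that the 1.4 SolrIndexer expects the linkdb behind a -linkdb flag rather than positionally; the invocation below is an assumption, to be verified against the usage string printed by bin/nutch solrindex:

# bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/*]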
Can't crawl a domain; can't figure out why.
I'm trying to crawl pages from a number of domains, and one of these domains has been giving me trouble. The really irritating thing is that it did work at least once, which led me to believe that I'd solved the problem. I can't think of anything at this point but to paste my log of a failed crawl and solrindex and hope that someone can think of anything I've overlooked. Does anything look strange here? Thanks, Chip

2011-12-19 16:31:01,010 WARN crawl.Crawl - solrUrl is not set, indexing will be skipped...
2011-12-19 16:31:01,404 INFO crawl.Crawl - crawl started in: mit-c-crawl
2011-12-19 16:31:01,420 INFO crawl.Crawl - rootUrlDir = mit-c-urls
2011-12-19 16:31:01,420 INFO crawl.Crawl - threads = 10
2011-12-19 16:31:01,420 INFO crawl.Crawl - depth = 1
2011-12-19 16:31:01,420 INFO crawl.Crawl - solrUrl=null
2011-12-19 16:31:01,420 INFO crawl.Crawl - topN = 50
2011-12-19 16:31:01,420 INFO crawl.Injector - Injector: starting at 2011-12-19 16:31:01
2011-12-19 16:31:01,420 INFO crawl.Injector - Injector: crawlDb: mit-c-crawl/crawldb
2011-12-19 16:31:01,420 INFO crawl.Injector - Injector: urlDir: mit-c-urls
2011-12-19 16:31:01,436 INFO crawl.Injector - Injector: Converting injected urls to crawl db entries.
2011-12-19 16:31:02,854 INFO plugin.PluginRepository - Plugins: looking in: C:\Apache\apache-nutch-1.4\runtime\local\plugins
2011-12-19 16:31:02,917 INFO plugin.PluginRepository - Plugin Auto-activation mode: [true]
2011-12-19 16:31:02,917 INFO plugin.PluginRepository - Registered Plugins:
2011-12-19 16:31:02,917 INFO plugin.PluginRepository -   the nutch core extension points (nutch-extensionpoints)
2011-12-19 16:31:02,917 INFO plugin.PluginRepository -   Basic URL Normalizer (urlnormalizer-basic)
2011-12-19 16:31:02,917 INFO plugin.PluginRepository -   Html Parse Plug-in (parse-html)
2011-12-19 16:31:02,917 INFO plugin.PluginRepository -   Basic Indexing Filter (index-basic)
2011-12-19 16:31:02,917 INFO plugin.PluginRepository -   Http / Https Protocol Plug-in (protocol-httpclient)
2011-12-19 16:31:02,917 INFO plugin.PluginRepository -   HTTP Framework (lib-http)
2011-12-19 16:31:02,917 INFO plugin.PluginRepository -   Regex URL Filter (urlfilter-regex)
2011-12-19 16:31:02,917 INFO plugin.PluginRepository -   Pass-through URL Normalizer (urlnormalizer-pass)
2011-12-19 16:31:02,917 INFO plugin.PluginRepository -   Http Protocol Plug-in (protocol-http)
2011-12-19 16:31:02,917 INFO plugin.PluginRepository -   Regex URL Normalizer (urlnormalizer-regex)
2011-12-19 16:31:02,917 INFO plugin.PluginRepository -   Tika Parser Plug-in (parse-tika)
2011-12-19 16:31:02,917 INFO plugin.PluginRepository -   OPIC Scoring Plug-in (scoring-opic)
2011-12-19 16:31:02,917 INFO plugin.PluginRepository -   CyberNeko HTML Parser (lib-nekohtml)
2011-12-19 16:31:02,917 INFO plugin.PluginRepository -   Anchor Indexing Filter (index-anchor)
2011-12-19 16:31:02,917 INFO plugin.PluginRepository -   URL Meta Indexing Filter (urlmeta)
2011-12-19 16:31:02,917 INFO plugin.PluginRepository -   Regex URL Filter Framework (lib-regex-filter)
2011-12-19 16:31:02,917 INFO plugin.PluginRepository - Registered Extension-Points:
2011-12-19 16:31:02,917 INFO plugin.PluginRepository -   Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2011-12-19 16:31:02,917 INFO plugin.PluginRepository -   Nutch Protocol (org.apache.nutch.protocol.Protocol)
2011-12-19 16:31:02,917 INFO plugin.PluginRepository -   Nutch Segment Merge Filter (org.apache.nutch.segment.SegmentMergeFilter)
2011-12-19 16:31:02,917 INFO plugin.PluginRepository -   Nutch URL Filter (org.apache.nutch.net.URLFilter)
2011-12-19 16:31:02,917 INFO plugin.PluginRepository -   Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2011-12-19 16:31:02,917 INFO plugin.PluginRepository -   HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2011-12-19 16:31:02,917 INFO plugin.PluginRepository -   Nutch Content Parser (org.apache.nutch.parse.Parser)
2011-12-19 16:31:02,917 INFO plugin.PluginRepository -   Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2011-12-19 16:31:02,964 INFO regex.RegexURLNormalizer - can't find rules for scope 'inject', using default
2011-12-19 16:31:05,722 INFO crawl.Injector - Injector: Merging injected urls into crawl db.
2011-12-19 16:31:07,014 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2011-12-19 16:31:07,897 INFO crawl.Injector - Injector: finished at 2011-12-19 16:31:07, elapsed: 00:00:06
2011-12-19 16:31:07,913 INFO crawl.Generator - Generator: starting at
RE: Can't crawl a domain; can't figure out why.
I just compared this against a similar crawl of a completely different domain which I know works, and you're right on both counts. The parser doesn't parse a file, and nothing is sent to the solrindexer. I tried a crawl with more documents and found that while I can get documents from mit.edu, I get absolutely nothing from libraries.mit.edu. I get the same effect using Nutch 1.3 as well. I don't think we're dealing with truncated files. I'm willing to believe it's a parse error, but how could I tell? I've spoken with some helpful people from MIT, and they don't see a reason why this wouldn't work. Chip

-Original Message- From: Markus Jelsma [mailto:markus.jel...@openindex.io] Sent: Monday, December 19, 2011 5:01 PM To: user@nutch.apache.org Subject: Re: Can't crawl a domain; can't figure out why.

Nothing peculiar, looks like Nutch 1.4 right? But you also didn't mention the domain you can't crawl. libraries.mit.edu seems to work, although the indexer doesn't seem to send a document in and the parser doesn't mention parsing that file. Either the file throws a parse error or is truncated or ...
RE: Can't crawl a domain; can't figure out why.
://libraries.mit.edu/archives/exhibits/wbr-honeymoon/index1.html anchor: [http://libraries.mit.edu/archives/exhibits/wbr-honeymoon/index1.html]
outlink: toUrl: http://libraries.mit.edu/archives/mithistory/pdf/account.pdf anchor: [http://libraries.mit.edu/archives/mithistory/pdf/account.pdf]
outlink: toUrl: http://libraries.mit.edu/archives/mithistory/pdf/account.pdf anchor: [http://libraries.mit.edu/archives/mithistory/pdf/account.pdf]
outlink: toUrl: http://libraries.mit.edu/archives/mithistory/pdf/objects-plan.pdf anchor: [http://libraries.mit.edu/archives/mithistory/pdf/objects-plan.pdf]
outlink: toUrl: http://libraries.mit.edu/archives/mithistory/pdf/account.pdf anchor: [http://libraries.mit.edu/archives/mithistory/pdf/account.pdf]
outlink: toUrl: http://libraries.mit.edu/archives/exhibits/andrew/index1.html#transcription anchor: [http://libraries.mit.edu/archives/exhibits/andrew/index1.html#transcription]
outlink: toUrl: http://libraries.mit.edu/archives/exhibits/MIT-birthday/act-of-association.html anchor: [http://libraries.mit.edu/archives/exhibits/MIT-birthday/act-of-association.html]
outlink: toUrl: http://libraries.mit.edu/archives/exhibits/MIT-birthday/charter.html#page1 anchor: [http://libraries.mit.edu/archives/exhibits/MIT-birthday/charter.html#page1]
outlink: toUrl: http://libraries.mit.edu/archives/mithistory/pdf/account.pdf anchor: [http://libraries.mit.edu/archives/mithistory/pdf/account.pdf]
outlink: toUrl: http://libraries.mit.edu/archives/mithistory/pdf/scope-plan.pdf anchor: [http://libraries.mit.edu/archives/mithistory/pdf/scope-plan.pdf]
outlink: toUrl: http://libraries.mit.edu/archives/exhibits/wbr-honeymoon/index1.html anchor: [http://libraries.mit.edu/archives/exhibits/wbr-honeymoon/index1.html]
outlink: toUrl: http://libraries.mit.edu/archives/research/collections/collections-mc/mc1.html#toc anchor: Return to Table of Contents »
outlink: toUrl: http://libraries.mit.edu anchor: [http://libraries.mit.edu]
outlink: toUrl: http://libraries.mit.edu/archives/mithistory/pdf/objects-plan.pdf anchor: [http://libraries.mit.edu/archives/mithistory/pdf/objects-plan.pdf]
outlink: toUrl: http://libraries.mit.edu/archives/mithistory/pdf/account.pdf anchor: [http://libraries.mit.edu/archives/mithistory/pdf/account.pdf]
outlink: toUrl: http://libraries.mit.edu/archives/exhibits/wbr/bibliography.html anchor: [http://libraries.mit.edu/archives/exhibits/wbr/bibliography.html]
outlink: toUrl: http://libraries.mit.edu/archives/research/collections/collections-mc/mc1.html#toc anchor: Return to Table of Contents »
outlink: toUrl: http://libraries.mit.edu/archives/mithistory/pdf/account.pdf anchor: [http://libraries.mit.edu/archives/mithistory/pdf/account.pdf]
outlink: toUrl: http://libraries.mit.edu/archives/exhibits/MIT-birthday/charter.html#page1 anchor: [http://libraries.mit.edu/archives/exhibits/MIT-birthday/charter.html#page1]
outlink: toUrl: http://libraries.mit.edu/archives/exhibits/MIT-birthday/act-of-association.html anchor: [http://libraries.mit.edu/archives/exhibits/MIT-birthday/act-of-association.html]
outlink: toUrl: http://libraries.mit.edu/archives/research/collections/collections-mc/mc1.html#toc anchor: Return to Table of Contents »

Content Metadata: Date=Tue, 20 Dec 2011 21:30:50 GMT Content-Length=191500 Via=1.0 barracuda.acp.org:8080 (http_scan/4.0.2.6.19) Connection=close Content-Type=text/html Accept-Ranges=bytes X-Cache=MISS from barracuda.acp.org Server=Apache/2.2.3 (Red Hat)
Parse Metadata: CharEncodingForConversion=utf-8 OriginalCharEncoding=utf-8
-Original Message- From: alx...@aim.com [mailto:alx...@aim.com] Sent: Tuesday, December 20, 2011 2:15 PM To: user@nutch.apache.org Subject: Re: Can't crawl a domain; can't figure out why.

It seems that robots.txt in libraries.mit.edu has a lot of restrictions. Alex.
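A quick way to check Alex's suggestion is to pull the robots.txt directly; Nutch honors it through its protocol plugins, so any Disallow rules covering the archives pages would explain URLs being silently skipped. A minimal sketch (plain curl, nothing Nutch-specific assumed):

  # inspect the rules Nutch would obey for this host
  curl -s http://libraries.mit.edu/robots.txt | head -n 40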
Indexing urlmeta fields into Solr 5.5.3 (Was RE: Failing to index from Nutch 1.12 to Solr 5.5.3)
We've found that the solrindex process chokes on the custom metadata fields I added to Nutch using the urlmeta plugin. A sample of the lengthy error messages:

java.lang.Exception: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://localhost:8983/solr/phfaws: ERROR: [doc=http://academics.wellesley.edu/lts/archives/3/3L_Astronomy.html] unknown field 'icosreposurl' at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)

As mentioned in my previous message, I've copied my Nutch schema.xml into my Solr's conf folder, but since my Solr instance didn't already have a schema.xml file, I'm not convinced it's being read. How do I set up my Solr to take these new fields? Chip
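One way to register such fields in a Solr 5.5 core that uses the managed schema is the Schema API, which edits the live schema over HTTP. A minimal sketch, assuming the core name from the error above; the field type and attributes are assumptions, and each custom urlmeta field would need its own add-field call:

  curl -X POST -H 'Content-type:application/json' \
    'http://localhost:8983/solr/phfaws/schema' -d '{
      "add-field": {
        "name": "icosreposurl",
        "type": "string",
        "indexed": true,
        "stored": true
      }
    }'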
Need help installing scoring-depth plugin
I'm upgrading from Nutch 1.4 to Nutch 1.12. I limit this crawl to my seeds, so my 1.4 command was: bin/nutch crawl phfaws -dir crawl -depth 1 -topN 5 My understanding is that the "crawl" command is deprecated, "-depth" went with it, and I need to install the scoring-depth plugin. I'm new to adding plugins. The instructions at https://wiki.apache.org/nutch/AboutPlugins give a sample command, but I don't know what the official PluginRepository for this plugin is, and the sample link for the HtmlParser plugin is dead. I'll appreciate any help. Thank you! Chip Calhoun Digital Archivist Niels Bohr Library & Archives American Institute of Physics One Physics Ellipse College Park, MD 20740 301-209-3180 https://www.aip.org/history-programs/niels-bohr-library
RE: Need help installing scoring-depth plugin
Thank you Julien! That's exactly what I needed. Chip

-Original Message- From: Julien Nioche [mailto:lists.digitalpeb...@gmail.com] Sent: Tuesday, January 31, 2017 1:09 PM To: user@nutch.apache.org Subject: Re: Need help installing scoring-depth plugin

You don't need to install scoring-depth. It's just that the term 'depth' in the old crawl class has been replaced by 'rounds', which is more accurate. The equivalent of the command you used to call should be: bin/crawl phfaws crawl 1

The value for topN needs setting in the crawl script; see sizeFetchlist in [https://github.com/apache/nutch/blob/master/src/bin/crawl#L117]

HTH

Julien

-- *Open Source Solutions for Text Engineering* http://www.digitalpebble.com http://digitalpebble.blogspot.com/ #digitalpebble <http://twitter.com/digitalpebble>
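For reference, a sketch of the sizeFetchlist change Julien points to (the default multiplier and line number vary by Nutch version; the hard-coded value below is only a stand-in for the old -topN 5):

  # by default bin/crawl derives the per-round fetch list size from the slave count, e.g.:
  #   sizeFetchlist=`expr $numSlaves \* 50000`
  # a hypothetical fixed limit to mimic "-topN 5":
  sizeFetchlist=5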
Queries in new Solr version not finding results I'd expect
I'm testing a new setup, Solr 5.5.3 indexed from Nutch 1.12. I'm comparing it against my production instance, Solr 3.3.0 and Nutch 1.4. The new search misses some results that the old one got. If I search for the word "optician", the old results include a result which the new one misses. The document in question is indexed by the new Solr; I can find it using other search terms. The content field for this document, stored in the new Solr, does clearly include the word "optician". Why wouldn't it turn up? Where do I start looking? As an aside, thank you to everyone who's replied to my questions the past few weeks. I don't want to clog the listserv with a lot of short "thank you" posts, but I do appreciate it. Chip Calhoun Digital Archivist Niels Bohr Library & Archives American Institute of Physics One Physics Ellipse College Park, MD 20740 301-209-3180 https://www.aip.org/history-programs/niels-bohr-library
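Two hypothetical checks that narrow this down (core and field names assumed from the earlier messages): query the field directly, and compare how each schema analyzes the content field, since a tokenizer or stemmer difference between the old and new field types is a common cause of "stored but not matched":

  # does a direct field query on the new core find the document?
  curl 'http://localhost:8983/solr/phfaws/select?q=content:optician&fl=id&wt=json'

If this also misses, the Analysis screen in the Solr admin UI shows token by token how "optician" is treated at index and query time under each field type.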
Failing to index from Nutch 1.12 to Solr 5.5.3
I'm switching to more recent Nutch/Solr, after years of using Nutch 1.4 and Solr 3.3.0. I get no results when I index into Solr. I can't tell where this breaks down. I use these commands:

cd /opt/apache-nutch-1.12/runtime/local
export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.121.x86_64
export NUTCH_CONF_DIR=/opt/apache-nutch-1.12/runtime/local/conf/phfaws
bin/crawl urls/phfaws crawl/phfaws 1
bin/nutch solrindex http://localhost:8983/solr/phfaws/ crawl/phfaws/crawldb -linkdb crawl/phfaws/linkdb crawl/phfaws/segments/*

I believe that Nutch is crawling properly, but I do find that the crawl folders end up about 25% as large as what I produced with Nutch 1.4. I suspect that the problem is with the Nutch/Solr integration. My Solr core didn't create a schema.xml, instead having a managed schema. I've copied my Nutch local conf's schema.xml into Solr, but I haven't seen that I'm supposed to do anything more with that. Chip Calhoun Digital Archivist Niels Bohr Library & Archives American Institute of Physics One Physics Ellipse College Park, MD 20740 301-209-3180 https://www.aip.org/history-programs/niels-bohr-library
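A copied schema.xml is ignored by default here: cores built from Solr 5.5's default configset use the managed schema. One possible approach (an assumption about the intended setup, not the only fix) is to switch the core to the classic schema so conf/schema.xml is actually read:

  <!-- in the core's solrconfig.xml: use conf/schema.xml instead of the managed schema -->
  <schemaFactory class="ClassicIndexSchemaFactory"/>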
No build.xml for Nutch 1.12
I'm upgrading to Nutch 1.12, and I have an extremely basic problem. I can't find a build.xml in apache-nutch-1.12-bin.zip, and therefore can't run ant. What am I missing? Chip Calhoun Digital Archivist Niels Bohr Library & Archives American Institute of Physics One Physics Ellipse College Park, MD 20740 301-209-3180 https://www.aip.org/history-programs/niels-bohr-library
RE: No build.xml for Nutch 1.12
Markus, Thank you! Chip

-Original Message- From: Markus Jelsma [mailto:markus.jel...@openindex.io] Sent: Wednesday, January 25, 2017 5:03 PM To: user@nutch.apache.org Subject: RE: No build.xml for Nutch 1.12

Hello,

A *-bin* file in ASF downloads is always a precompiled distribution. You are looking for the *-src* file, or neither, if you don't need to recompile or customize the sources.

Regards,
Markus
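The source route Markus describes, as a sketch (the download URL pattern is an assumption; "ant runtime" is the standard Nutch build target and writes the compiled runtime to runtime/local):

  wget https://archive.apache.org/dist/nutch/1.12/apache-nutch-1.12-src.zip
  unzip apache-nutch-1.12-src.zip
  cd apache-nutch-1.12
  ant runtime   # builds; bin/nutch and bin/crawl land in runtime/local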
RE: [MASSMAIL]Nutch not indexing all seed URLs
Thank you. The problem was right below that; I had the default "timeLimitFetch=180", and it stopped after 3 hours. I'll bump that up to something ridiculous and try again. Chip

-Original Message- From: Eyeris Rodriguez Rueda [mailto:eru...@uci.cu] Sent: Thursday, May 11, 2017 4:46 PM To: user@nutch.apache.org Subject: Re: [MASSMAIL]Nutch not indexing all seed URLs

Hi. Maybe one cause: have you seen the topN (fetchlist) parameter inside the bin/crawl script (line 117)?

sizeFetchlist=`expr $numSlaves \* 50`

This number could limit your url list. Also check your filters. Tell me if you have solved the problem.
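The fix described above, as a one-line sketch (the variable name comes straight from the message; the new value is arbitrary):

  # in bin/crawl: minutes allotted to each fetch phase; the default 180 is the 3-hour cutoff
  timeLimitFetch=1440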
Nutch not indexing all seed URLs
I'm using Nutch 1.12 to index a local site. To keep Nutch from indexing the uninteresting navigation pages on my site, I've made a URLs list of all the URLs I want crawled; the current list is 2522 URLs. However, the indexer stopped after just 1077 of these URLs. My generate.max.count is set to -1. What would cause my URLs to be skipped? Chip Calhoun Digital Archivist Niels Bohr Library & Archives American Institute of Physics One Physics Ellipse College Park, MD 20740-3840 USA Tel: +1 301-209-3180 Email: ccalh...@aip.org https://www.aip.org/history-programs/niels-bohr-library
Re: Nutch fetching times out at 3 hours, not sure why.
Hi Sebastian, Yes, that explains it! Now I wish I'd pasted my crawl command in the first place. I'll leave it alone for now, but if it becomes an issue again I know where to check. Thank you. Chip

From: Sebastian Nagel <wastl.na...@googlemail.com> Sent: Monday, April 30, 2018 4:53:20 PM To: user@nutch.apache.org Subject: Re: Nutch fetching times out at 3 hours, not sure why.

Hi Chip, got it, you probably run bin/crawl which has the option:

--time-limit-fetch    Number of minutes allocated to the fetching [default: 180]

It's good to have a time limit, in case a single server responds too slowly. Best, Sebastian
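This also explains why fetcher.timelimit.mins=-1 in nutch-site.xml appeared to have no effect: the crawl script passes its own limit on the fetch command line, and command-line -D properties override the config files. Roughly (a paraphrase of the relevant bin/crawl line; exact flags vary by version):

  # each fetch step is launched with the script's limit, overriding nutch-site.xml
  __bin_nutch fetch $commonOptions -D fetcher.timelimit.mins=$timeLimitFetch \
    "$CRAWL_PATH"/segments/$SEGMENT -noParsing -threads $numThreads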
RE: Nutch fetching times out at 3 hours, not sure why.
I'm still experimenting with this. I had been crawling with a depth of 1 because I don't need anything outside my URLs list, but I tried with a depth of 10. It went through a crawl loop that ended after 3 hours, then a second 3 hour crawl loop, then a third shorter loop. It still stopped 5 URLs short of crawling every URL in my list, though it crawled a few I hadn't included. Are these 3 hour loops standard for large crawls?
RE: Nutch fetching times out at 3 hours, not sure why.
Hi Sebastian, Thank you! Increasing my fetcher.threads.per.queue both fixed my crawl and saved me a lot of time. I'm still bewildered by the original problem, though. Both my fetcher.timelimit.mins and my fetcher.max.exceptions.per.queue are set to -1. I'll ignore it unless it causes a problem for my other cores. Chip

-Original Message- From: Sebastian Nagel [mailto:wastl.na...@googlemail.com] Sent: Monday, April 30, 2018 12:21 PM To: user@nutch.apache.org Subject: Re: Nutch fetching times out at 3 hours, not sure why.

Hi, if you still see the log message

fetcher.FetchItemQueues - * queue: https://history.aip.org >> dropping!

then it can be only
- fetcher.timelimit.mins
- fetcher.max.exceptions.per.queue

> I crawl a list of roughly 2600 URLs all on my local server

If this is the case you can crawl more aggressively, see fetcher.server.delay, or even fetch in parallel from your host, see fetcher.threads.per.queue. Best, Sebastian
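Sebastian's two knobs, as a sketch for conf/nutch-site.xml (both property names are real Nutch settings; the values are illustrative, and this aggressiveness is only reasonable because the crawled host is your own):

  <property>
    <name>fetcher.threads.per.queue</name>
    <value>5</value>
    <description>Fetch up to 5 URLs from the same host in parallel.</description>
  </property>
  <property>
    <name>fetcher.server.delay</name>
    <value>0.5</value>
    <description>Seconds to wait between requests to the same server.</description>
  </property>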
Nutch fetching times out at 3 hours, not sure why.
I crawl a list of roughly 2600 URLs all on my local server, and I'm only crawling around 1000 of them. The fetcher quits after exactly 3 hours (give or take a few milliseconds) with this message in the log: 2018-04-13 15:50:48,885 INFO fetcher.FetchItemQueues - * queue: https://history.aip.org >> dropping! I've seen that 3 hours is the default in some Nutch installations, but I've got my fetcher.timelimit.mins set to -1. I'm sure I'm missing something obvious. Any thoughts would be greatly appreciated. Thank you. Chip Calhoun Digital Archivist Niels Bohr Library & Archives American Institute of Physics One Physics Ellipse College Park, MD 20740-3840 USA Tel: +1 301-209-3180 Email: ccalh...@aip.org https://www.aip.org/history-programs/niels-bohr-library
RE: Nutch fetching times out at 3 hours, not sure why.
Hi Lewis, I'm using Nutch 1.12. Chip

-Original Message- From: lewis john mcgibbney [mailto:lewi...@apache.org] Sent: Wednesday, April 18, 2018 1:55 PM To: user@nutch.apache.org Subject: Re: Nutch fetching times out at 3 hours, not sure why.

Hi Chip, Which version of Nutch are you using?

-- http://home.apache.org/~lewismc/ http://people.apache.org/keys/committer/lewismc
RE: Nutch fetching times out at 3 hours, not sure why.
Hi Markus, I don't see an indication of the web server blocking me, though that sounds reasonable. Could there be a per-server limit in Nutch itself that we're overlooking, since this is all on the same server? Chip

-Original Message- From: Markus Jelsma [mailto:markus.jel...@openindex.io] Sent: Tuesday, April 17, 2018 3:58 PM To: user@nutch.apache.org Subject: RE: Nutch fetching times out at 3 hours, not sure why.

Hello Chip,

I have no clue where the three-hour limit could come from. Please take a further look at the last few minutes of the logs. The only thing I can think of is that a webserver would block you after some amount of requests per time window; that would be visible in the logs.

It is clear Nutch itself terminates the fetcher (the dropping line). That is only possible with an imposed time limit, or if you reached some number of exceptions (or one other variable I am forgetting).

Regards,
Markus
RE: Nutch fetching times out at 3 hours, not sure why.
I'm on 1.12, and mine also defaulted at -1. It does not fail at the same URL, or even at the same point in a URL's fetcher loop; it really seems to be time-based.

-Original Message- From: Sadiki Latty [mailto:sla...@uottawa.ca] Sent: Tuesday, April 17, 2018 1:43 PM To: user@nutch.apache.org Subject: RE: Nutch fetching times out at 3 hours, not sure why.

Which version are you running? That value is defaulted to -1 in my current version (1.14), so it shouldn't be something you needed to change. My crawls, by default, go for as much as 12 hours with little to no tweaking from the nutch-default. Something else is causing it. Is it always the same URL that it fails at?