Re: Questions about upgrade to Nutch 1.3

2011-06-20 Thread Chip Calhoun
Thanks for replying! I do still have a couple of questions: Markus Jelsma markus.jel...@openindex.io 6/20/2011 11:34 AM On Monday 20 June 2011 16:44:13 Chip Calhoun wrote: Hi everyone, I'm a complete Nutch newbie. I installed Nutch 1.2 and Solr 1.4.0 on my machine without any

Re: Questions about upgrade to Nutch 1.3

2011-06-21 Thread Chip Calhoun
markus.jel...@openindex.io 6/20/2011 12:43 PM On Monday 20 June 2011 18:35:36 Chip Calhoun wrote: Thanks for replying! I do still have a couple of questions: Markus Jelsma markus.jel...@openindex.io 6/20/2011 11:34 AM On Monday 20 June 2011 16:44:13 Chip Calhoun wrote: Hi everyone, I'm

Deploying the web application in Nutch 1.2

2011-07-13 Thread Chip Calhoun
I'm a newbie trying to set up a Nutch 1.2 web app, because it seems a bit better suited to my smallish site than the Nutch 1.3 / Solr connection. I'm going through the tutorial at http://wiki.apache.org/nutch/Nutch_-_The_Java_Search_Engine , and I've hit the following instruction: Deploy the

RE: Deploying the web application in Nutch 1.2

2011-07-15 Thread Chip Calhoun
5:38 PM To: user@nutch.apache.org Subject: Re: Deploying the web application in Nutch 1.2 On Thu, Jul 14, 2011 at 8:01 PM, Chip Calhoun ccalh...@aip.org wrote: Thanks Lewis. I'm still having trouble. I've moved the war file to $CATALINA_HOME/webapps/nutch/ and unpacked it. I don't seem

RE: Deploying the web application in Nutch 1.2

2011-07-15 Thread Chip Calhoun
webapp version of nutch-site.xml. In my experience this was a small case of confusion at first. On Fri, Jul 15, 2011 at 7:03 PM, Chip Calhoun ccalh...@aip.org wrote: You've gotten me very close to a breakthrough. I've started over, and I've found that If I don't make any edits to nutch

RE: Deploying the web application in Nutch 1.2

2011-07-15 Thread Chip Calhoun
, sorry I can't be of more help in giving a definite answer. On Fri, Jul 15, 2011 at 8:27 PM, Chip Calhoun ccalh...@aip.org wrote: I'm definitely changing the file in my webapp. I can tell I'm doing that much right because it makes a noticeable change to the function of my web app

Nutch not indexing full collection

2011-07-20 Thread Chip Calhoun
Hi, I'm using Nutch 1.3 to crawl a section of our website, and it doesn't seem to crawl the entire thing. I'm probably missing something simple, so I hope somebody can help me. My urls/nutch file contains a single URL: http://www.aip.org/history/ohilist/transcripts.html , which is an

RE: Nutch not indexing full collection

2011-07-20 Thread Chip Calhoun
of pages known etc Julien On 20 July 2011 14:51, Chip Calhoun ccalh...@aip.org wrote: Hi, I'm using Nutch 1.3 to crawl a section of our website, and it doesn't seem to crawl the entire thing. I'm probably missing something simple, so I hope somebody can help me. My urls/nutch file

RE: Nutch not indexing full collection

2011-07-25 Thread Chip Calhoun
, July 20, 2011 5:23 PM To: user@nutch.apache.org Subject: Re: Nutch not indexing full collection Hi Chip, I would try running your scripts after setting the environment variable $NUTCH_HOME to nutch/runtime/local/NUTCH_HOME On Wed, Jul 20, 2011 at 4:01 PM, Chip Calhoun ccalh...@aip.org wrote

RE: Nutch not indexing full collection

2011-07-28 Thread Chip Calhoun
has this been solved? If your http.content.limit has not been increased in nutch-site.xml then you will not be able to store this data and index with Solr. On Mon, Jul 25, 2011 at 6:18 PM, Chip Calhoun ccalh...@aip.org wrote: I'm still having trouble. I've set a windows environment variable
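The http.content.limit setting mentioned above goes in nutch-site.xml. A minimal sketch (the -1 value, meaning no limit, is one common choice; the stock Nutch default is 65536 bytes, which silently truncates larger pages before they ever reach Solr):

```xml
<!-- nutch-site.xml: raise the cap on fetched page size so large documents
     are stored and indexed in full. -1 disables the limit entirely. -->
<property>
  <name>http.content.limit</name>
  <value>-1</value>
</property>
```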

RE: Nutch not indexing full collection

2011-08-01 Thread Chip Calhoun
To: user@nutch.apache.org Cc: Chip Calhoun Subject: Re: Nutch not indexing full collection Nutch truncates content longer than configured and Solr truncates content exceeding max field length. Maybe check your limits. I'm still having trouble with this. In addition to the nutch-site.xml posted

Machine readable vs. human readable URLs.

2011-09-15 Thread Chip Calhoun
Hi everyone, We'd like to use Nutch and Solr to replace an existing Verity search that's become a bit long in the tooth. In our Verity search, we have a hack which allows each document to have a machine-readable URL which is indexed (generally an xml document), and a human-readable URL which

RE: Machine readable vs. human readable URLs.

2011-09-19 Thread Chip Calhoun
. human readable URLs. Hi Chip, Should simply be a matter of creating a custom field with an IndexingFilter, you can then use it in any way you want on the SOLR side Julien On 15 September 2011 21:50, Chip Calhoun ccalh...@aip.org wrote: Hi everyone, We'd like to use Nutch and Solr

RE: Machine readable vs. human readable URLs.

2011-09-19 Thread Chip Calhoun
the Solr admin UI On Mon, Sep 19, 2011 at 8:09 PM, Chip Calhoun ccalh...@aip.org wrote: Hi Julien, Thanks, that's encouraging. I'm trying to make this work, and I'm definitely missing something. I hope I'm not too far off the mark. I've started with the instructions at http

RE: Machine readable vs. human readable URLs.

2011-09-19 Thread Chip Calhoun
of \runtime\local\conf\ before re-compiling with ANT as this will be overwritten. Either modify $NUTCH/conf/nutch-site.xml or recompile THEN modify. As Lewis suggested check the logs and see if the plugin is activated etc... J. On 19 September 2011 21:03, Chip Calhoun ccalh...@aip.org

RE: Machine readable vs. human readable URLs.

2011-09-20 Thread Chip Calhoun
\conf\ before re-compiling with ANT as this will be overwritten. Either modify $NUTCH/conf/nutch-site.xml or recompile THEN modify. As Lewis suggested check the logs and see if the plugin is activated etc... J. On 19 September 2011 21:03, Chip Calhoun ccalh...@aip.org wrote

RE: Machine readable vs. human readable URLs.

2011-09-21 Thread Chip Calhoun
in particular you found misleading about the plugin example on the wiki? I am keen to make it as clear as possible. Thank you Lewis On Tue, Sep 20, 2011 at 6:00 PM, Chip Calhoun ccalh...@aip.org wrote: Hi Julien, Thanks for clarifying this! I've got it working now. Instead of seeding
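The custom-field approach that resolved this thread uses the urlmeta plugin. A hedged sketch of the relevant nutch-site.xml pieces — the "humanurl" tag name is the one used in this thread, but the plugin.includes value below is illustrative; extend whatever list you already have rather than copying this one:

```xml
<!-- nutch-site.xml: enable the urlmeta plugin so per-URL metadata from the
     seed list is carried through the crawl and indexed as a Solr field. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlmeta|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
<property>
  <name>urlmeta.tags</name>
  <value>humanurl</value>
</property>
```

The matching Solr schema then needs a field with the same name so solrindex has somewhere to put the value.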

How can I figure out what my user-agent is?

2011-09-23 Thread Chip Calhoun
I thought I understood how to set my user-agent, but after asking a few sites to add me to their robots.txt it looks like I'm missing something. My nutch-site.xml includes: <property> <name>http.agent.name</name> <value>PHFAWS Spider</value> </property> <property> <name>http.robots.agents</name>
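The preview above cuts off before the http.robots.agents value. A complete hedged sketch of the pairing (the second value is an assumption based on the usual convention: the agent name first, then "*" as the fallback matched against generic robots.txt rules):

```xml
<!-- nutch-site.xml: http.agent.name is sent in the User-Agent header;
     http.robots.agents lists the names checked, in order, against
     robots.txt rules, with "*" as the final catch-all. -->
<property>
  <name>http.agent.name</name>
  <value>PHFAWS Spider</value>
</property>
<property>
  <name>http.robots.agents</name>
  <value>PHFAWS Spider,*</value>
</property>
```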

What could be blocking me, if not robots.txt?

2011-09-29 Thread Chip Calhoun
Hi everyone, I'm using Nutch to crawl a few friendly sites, and am having trouble with some of them. One site in particular has created an exception for me in its robots.txt, and yet I can't crawl any of its pages. I've tried copying the files I want to index (3 XML documents) to my own server

RE: What could be blocking me, if not robots.txt?

2011-10-03 Thread Chip Calhoun
] Sent: Friday, September 30, 2011 6:28 PM To: user@nutch.apache.org Cc: Chip Calhoun Subject: Re: What could be blocking me, if not robots.txt? I've been able to run the ParserChecker now, but I'm not sure how to understand the results. Here's what I got: # bin/nutch

RE: What could be blocking me, if not robots.txt?

2011-10-03 Thread Chip Calhoun
-1.3. Or maybe change the robots.txt to User-agent: PHFAWS/Nutch-1.3 Allow: / On Monday 03 October 2011 15:31:46 Chip Calhoun wrote: I apologize, but I haven't found much Nutch documentation that deals with the user-agent and robots.txt. Why am I being blocked when the user-agent I'm
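The robots.txt change suggested above would look like this on the remote server. The token must match what Nutch actually sends, which (as this thread shows) can be the composite name/version string rather than the bare agent name:

```
User-agent: PHFAWS/Nutch-1.3
Allow: /
```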

Unable to parse large XML files.

2011-10-04 Thread Chip Calhoun
Hi everyone, I've found that I'm unable to parse very large XML files. This doesn't seem to happen with other file formats. When I run any of the offending files through ParserChecker, I get something along the lines of: # bin/nutch org.apache.nutch.parse.ParserChecker

RE: Unable to parse large XML files.

2011-10-05 Thread Chip Calhoun
Huh. It turns out my http.content.limit was fine, but I also needed a file.content.limit statement in nutch-site.xml to make this work. Thanks! -Original Message- From: Markus Jelsma [mailto:markus.jel...@openindex.io] Sent: Tuesday, October 04, 2011 7:41 PM To: user@nutch.apache.org
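The fix described above — both limits together — looks like this in nutch-site.xml. file.content.limit applies when content is fetched via the protocol-file plugin (local file:// URLs), which http.content.limit does not cover:

```xml
<!-- nutch-site.xml: lift both truncation limits, since file:// fetches
     are governed by file.content.limit, not http.content.limit. -->
<property>
  <name>http.content.limit</name>
  <value>-1</value>
</property>
<property>
  <name>file.content.limit</name>
  <value>-1</value>
</property>
```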

RE: Unable to parse large XML files.

2011-10-05 Thread Chip Calhoun
. </description> </property> -Original Message- From: Chip Calhoun [mailto:ccalh...@aip.org] Sent: Wednesday, October 05, 2011 9:34 AM To: 'user@nutch.apache.org'; 'markus.jel...@openindex.io' Subject: RE: Unable to parse large XML files. Huh. It turns out my http.content.limit was fine, but I also

Truncated content despite my content.limit settings.

2011-10-17 Thread Chip Calhoun
Hi everyone, I'm having issues with truncated content on some pages, despite what I believe to be solid content.limit settings. One page I have an issue with: http://www.canisius.edu/archives/ruddick.asp When I run a search in Solr, the content I get is limited to: <str name="content">Canisius

RE: Truncated content despite my content.limit settings.

2011-10-18 Thread Chip Calhoun
With ParserChecker it's similarly truncated. Could it be the fact that it's a .asp page? The output is as follows: # bin/nutch org.apache.nutch.parse.ParserChecker -dumpText http://www.canisius.edu/archives/ruddick.asp - Url ---

RE: Truncated content despite my content.limit settings.

2011-10-18 Thread Chip Calhoun
a lot of output. Can you try a different parser? Your settings look fine but are there any other exotic settings you use or custom code? On Tuesday 18 October 2011 15:53:26 Chip Calhoun wrote: With ParserChecker it's similarly truncated. Could it be the fact that it's a .asp page? The output

Good workaround for timeout?

2011-10-19 Thread Chip Calhoun
I'm getting a fairly persistent timeout on a particular page. Other, smaller pages in this folder do fine, but this one times out most of the time. When it fails, my ParserChecker results look like: # bin/nutch org.apache.nutch.parse.ParserChecker -dumpText

RE: Good workaround for timeout?

2011-10-19 Thread Chip Calhoun
If I'm reading the log correctly, it's the fetch: 2011-10-19 11:18:11,405 INFO fetcher.Fetcher - fetch of http://digital.lib.washington.edu/findingaids/view?docId=UA37_06_2932DonaldsonLauren.xml failed with: java.net.SocketTimeoutException: Read timed out -Original Message- From:

RE: Good workaround for timeout?

2011-10-19 Thread Chip Calhoun
I'm using protocol-http, but I removed protocol-httpclient after you pointed out in another thread that it's broken. Unfortunately I'm not sure which properties are used by what, and I'm not sure how to find out. I added some more stuff to nutch-site.xml (I'll paste it at the end), and it seems

Is there a workaround for https?

2011-10-19 Thread Chip Calhoun
I've noticed the recent posts about trouble with protocol-httpclient, which to my understanding is needed for https URLs. Is there another way to handle these? ParserChecker gives me the following when I try one of these URLs. Thanks. # bin/nutch org.apache.nutch.parse.ParserChecker -dumpText

RE: Good workaround for timeout?

2011-10-20 Thread Chip Calhoun
, 2011 4:57 PM To: user@nutch.apache.org Cc: Chip Calhoun Subject: Re: Good workaround for timeout? I'm using protocol-http, but I removed protocol-httpclient after you pointed out in another thread that it's broken. Unfortunately I'm not sure which properties are used by what, and I'm not sure

RE: Good workaround for timeout?

2011-10-20 Thread Chip Calhoun
Good to know! I was definitely exceeding that, so I've changed my properties. -Original Message- From: Markus Jelsma [mailto:markus.jel...@openindex.io] Sent: Thursday, October 20, 2011 10:00 AM To: user@nutch.apache.org Cc: Chip Calhoun Subject: Re: Good workaround for timeout

Extremely long parsing of large XML files (Was RE: Good workaround for timeout?)

2011-10-26 Thread Chip Calhoun
plugins. </description> </property> <property> <name>urlmeta.tags</name> <value>humanurl</value> </property> -Original Message- From: Chip Calhoun [mailto:ccalh...@aip.org] Sent: Thursday, October 20, 2011 10:23 AM To: 'markus.jel...@openindex.io'; user@nutch.apache.org Subject: RE: Good workaround

RE: Extremely long parsing of large XML files (Was RE: Good workaround for timeout?)

2011-10-26 Thread Chip Calhoun
parser.timeout setting. On Wednesday 26 October 2011 16:45:33 Chip Calhoun wrote: I've got a few very large (upwards of 3 MB) XML files I'm trying to index, and I'm having trouble. Previously I'd had trouble with the fetch; now that seems to be okay, but due to the size of the files
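The parser.timeout setting referenced above also lives in nutch-site.xml. Its stock default is 30 seconds, which a multi-megabyte XML document can easily exceed; a sketch of the relaxed setting:

```xml
<!-- nutch-site.xml: give the parser more time for very large XML files.
     Default is 30 (seconds); -1 disables the parser timeout entirely. -->
<property>
  <name>parser.timeout</name>
  <value>-1</value>
</property>
```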

Trouble running solrindexer from Nutch 1.4

2011-12-07 Thread Chip Calhoun
This is probably just down to my not waiting for a 1.4 tutorial, but here goes. I've always used the following two commands to run my crawl and then index to Solr: # bin/nutch crawl urls -dir crawl -depth 1 -topN 50 # bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb crawl/linkdb

Can't crawl a domain; can't figure out why.

2011-12-19 Thread Chip Calhoun
I'm trying to crawl pages from a number of domains, and one of these domains has been giving me trouble. The really irritating thing is that it did work at least once, which led me to believe that I'd solved the problem. I can't think of anything at this point but to paste my log of a failed

RE: Can't crawl a domain; can't figure out why.

2011-12-20 Thread Chip Calhoun
I just compared this against a similar crawl of a completely different domain which I know works, and you're right on both counts. The parser doesn't parse a file, and nothing is sent to the solrindexer. I tried a crawl with more documents and found that while I can get documents from mit.edu,

RE: Can't crawl a domain; can't figure out why.

2011-12-20 Thread Chip Calhoun
, December 20, 2011 2:15 PM To: user@nutch.apache.org Subject: Re: Can't crawl a domain; can't figure out why. It seems that robots.txt in libraries.mit.edu has a lot of restrictions. Alex. -Original Message- From: Chip Calhoun ccalh...@aip.org To: user user@nutch.apache.org

Indexing urlmeta fields into Solr 5.5.3 (Was RE: Failing to index from Nutch 1.12 to Solr 5.5.3)

2017-02-03 Thread Chip Calhoun
conf folder, but since my Solr instance didn't already have a schema.xml file I'm not convinced it's being read. How do I set up my Solr to take these new fields? Chip From: Chip Calhoun [ccalh...@aip.org] Sent: Friday, February 03, 2017 11:45 AM To: user

Need help installing scoring-depth plugin

2017-01-31 Thread Chip Calhoun
epth plugin. I'm new to adding plugins. The instructions at https://wiki.apache.org/nutch/AboutPlugins give a sample command, but I don't know what the official PluginRepository for this plugin is and the sample link for the HtmlParser plugin is dead. I'll appreciate any help. Thank you!

RE: Need help installing scoring-depth plugin

2017-01-31 Thread Chip Calhoun
://github.com/apache/nutch/blob/master/src/bin/crawl#L117] HTH Julien On 31 January 2017 at 16:49, Chip Calhoun <ccalh...@aip.org> wrote: > I'm upgrading from Nutch 1.4 to Nutch 1.12. I limit this crawl to my > seeds, so my 1.4 command was: > bin/nutch crawl phfaws -dir crawl -depth 1 -to
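The scoring-depth plugin is enabled through plugin.includes, with scoring.depth.max capping how far from the seeds the crawl may go — which matches the "-depth 1, seeds only" use case in this thread. A sketch (the plugin list is illustrative; add scoring-depth to whatever value you already use):

```xml
<!-- nutch-site.xml: activate the scoring-depth plugin and restrict the
     crawl to the seed URLs themselves (depth 1). -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-depth|urlnormalizer-(pass|regex|basic)</value>
</property>
<property>
  <name>scoring.depth.max</name>
  <value>1</value>
</property>
```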

Queries in new Solr version not finding results I'd expect

2017-02-08 Thread Chip Calhoun
who's replied to my questions the past few weeks. I don't want to clog the listserv with a lot of short "thank you" posts, but I do appreciate it. Chip Calhoun Digital Archivist Niels Bohr Library & Archives American Institute of Physics One Physics Ellipse College Park, MD 2074

Failing to index from Nutch 1.12 to Solr 5.5.3

2017-02-03 Thread Chip Calhoun
that I'm supposed to do anything more with that. Chip Calhoun Digital Archivist Niels Bohr Library & Archives American Institute of Physics One Physics Ellipse College Park, MD 20740 301-209-3180 https://www.aip.org/history-programs/niels-bohr-library

No build.xml for Nutch 1.12

2017-01-25 Thread Chip Calhoun
I'm upgrading to Nutch 1.12, and I have an extremely basic problem. I can't find a build.xml in apache-nutch-1.12-bin.zip , and therefore can't run ant. What am I missing? Chip Calhoun Digital Archivist Niels Bohr Library & Archives American Institute of Physics One Physics Ellipse College

RE: No build.xml for Nutch 1.12

2017-01-25 Thread Chip Calhoun
Subject: No build.xml for Nutch 1.12 > > I'm upgrading to Nutch 1.12, and I have an extremely basic problem. I can't > find a build.xml in apache-nutch-1.12-bin.zip , and therefore can't run ant. > What am I missing? > > Chip Calhoun > Digital Archivist > Niels Bohr Library

RE: [MASSMAIL]Nutch not indexing all seed URLs

2017-05-12 Thread Chip Calhoun
me if you have solved the problem - Original Message - From: "Chip Calhoun" <ccalh...@aip.org> To: user@nutch.apache.org Sent: Thursday, May 11, 2017 16:30:34 Subject: [MASSMAIL]Nutch not indexing all seed URLs I'm using Nutch 1.12 to index a local site. T

Nutch not indexing all seed URLs

2017-05-11 Thread Chip Calhoun
to -1. What would cause my URLs to be skipped? Chip Calhoun Digital Archivist Niels Bohr Library & Archives American Institute of Physics One Physics Ellipse College Park, MD 20740-3840 USA Tel: +1 301-209-3180 Email: ccalh...@aip.org https://www.aip.org/history-programs/niels-bohr-library

Re: Nutch fetching times out at 3 hours, not sure why.

2018-05-01 Thread Chip Calhoun
a time limit, in case a single server responds too slowly. Best, Sebastian On 04/30/2018 09:04 PM, Chip Calhoun wrote: > Hi Sebastian, > > Thank you! Increasing my fetcher.threads.per.queue both fixed my crawl and > saved me a lot of time. > > I'm still bewildered by the original
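The fetcher.threads.per.queue fix credited above is a single nutch-site.xml property. A hedged sketch — the value 5 is arbitrary; note that values above 1 bypass the per-host politeness delay, so this is only appropriate for servers you control or have permission to hit hard:

```xml
<!-- nutch-site.xml: fetch several pages from the same host in parallel
     instead of one at a time. Default is 1 (strictly polite). -->
<property>
  <name>fetcher.threads.per.queue</name>
  <value>5</value>
</property>
```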

RE: Nutch fetching times out at 3 hours, not sure why.

2018-04-30 Thread Chip Calhoun
short of crawling every URL in my list, though it crawled a few I hadn't included. Are these 3 hour loops standard for large crawls? -Original Message- From: Chip Calhoun [mailto:ccalh...@aip.org] Sent: Tuesday, April 17, 2018 3:27 PM To: user@nutch.apache.org Subject: RE: Nutch

RE: Nutch fetching times out at 3 hours, not sure why.

2018-04-30 Thread Chip Calhoun
even fetch in parallel from your host, see fetcher.threads.per.queue Best, Sebastian On 04/30/2018 04:44 PM, Chip Calhoun wrote: > I'm still experimenting with this. I had been crawling with a depth of 1 > because I don't need anything outside my URLs list, but I tried with a depth > of 1

Nutch fetching times out at 3 hours, not sure why.

2018-04-17 Thread Chip Calhoun
http://history.aip.org >> dropping! I've seen that 3 hours is the default in some Nutch installations, but I've got my fetcher.timelimit.mins set to -1. I'm sure I'm missing something obvious. Any thoughts would be greatly appreciated. Thank you. Chip Calhoun Digital Archivist Niels Bohr Library & Archive
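For reference, the property in question as it would appear in nutch-site.xml. One likely explanation for the 3-hour cutoff despite this setting (an assumption worth verifying against your own setup): the bundled bin/crawl script passes its own limit of 180 minutes on the command line via -D, which overrides the config file:

```xml
<!-- nutch-site.xml: -1 removes the wall-clock cap on a fetch cycle.
     Check whether your crawl script overrides this with
     -D fetcher.timelimit.mins=180 on the command line. -->
<property>
  <name>fetcher.timelimit.mins</name>
  <value>-1</value>
</property>
```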

RE: Nutch fetching times out at 3 hours, not sure why.

2018-04-19 Thread Chip Calhoun
? On Tue, Apr 17, 2018 at 7:45 AM, <user-digest-h...@nutch.apache.org> wrote: > From: Chip Calhoun <ccalh...@aip.org> > To: "user@nutch.apache.org" <user@nutch.apache.org> > Cc: > Bcc: > Date: Tue, 17 Apr 2018 14:45:01 + > Subject: Nutch fetching

RE: Nutch fetching times out at 3 hours, not sure why.

2018-04-19 Thread Chip Calhoun
s even 12 hours with little to no > tweaking necessary from the nutch-default. Something else is causing it. Is > it always the same URL that it fails at? > > -Original Message- > From: Chip Calhoun [mailto:ccalh...@aip.org] > Sent: April-17-18 10:45 AM > To: user@nutc

RE: Nutch fetching times out at 3 hours, not sure why.

2018-04-17 Thread Chip Calhoun
to no tweaking necessary from the nutch-default. Something else is causing it. Is it always the same URL that it fails at? -Original Message- From: Chip Calhoun [mailto:ccalh...@aip.org] Sent: April-17-18 10:45 AM To: user@nutch.apache.org Subject: Nutch fetching times out at 3 hours, not sure