Thanks for replying! I do still have a couple of questions:
Markus Jelsma markus.jel...@openindex.io 6/20/2011 11:34 AM
On Monday 20 June 2011 16:44:13 Chip Calhoun wrote:
Hi everyone,
I'm a complete Nutch newbie. I installed Nutch 1.2 and Solr 1.4.0 on my
machine without any
I'm a newbie trying to set up a Nutch 1.2 web app, because it seems a bit
better suited to my smallish site than the Nutch 1.3 / Solr connection. I'm
going through the tutorial at
http://wiki.apache.org/nutch/Nutch_-_The_Java_Search_Engine, and I've hit the
following instruction:
Deploy the
5:38 PM
To: user@nutch.apache.org
Subject: Re: Deploying the web application in Nutch 1.2
On Thu, Jul 14, 2011 at 8:01 PM, Chip Calhoun ccalh...@aip.org wrote:
Thanks Lewis.
I'm still having trouble. I've moved the war file to
$CATALINA_HOME/webapps/nutch/ and unpacked it. I don't seem
webapp version of
nutch-site.xml. In my experience this was a small case of confusion at first.
On Fri, Jul 15, 2011 at 7:03 PM, Chip Calhoun ccalh...@aip.org wrote:
You've gotten me very close to a breakthrough. I've started over, and
I've found that if I don't make any edits to nutch
, sorry I can't be of more
help in giving a definite answer.
On Fri, Jul 15, 2011 at 8:27 PM, Chip Calhoun ccalh...@aip.org wrote:
I'm definitely changing the file in my webapp. I can tell I'm doing
that much right because it makes a noticeable change to the function
of my web app
Hi,
I'm using Nutch 1.3 to crawl a section of our website, and it doesn't seem to
crawl the entire thing. I'm probably missing something simple, so I hope
somebody can help me.
My urls/nutch file contains a single URL:
http://www.aip.org/history/ohilist/transcripts.html, which is an
of pages known etc
Julien
Sent: July 20, 2011 5:23 PM
To: user@nutch.apache.org
Subject: Re: Nutch not indexing full collection
Hi Chip,
I would try running your scripts after setting the environment variable
NUTCH_HOME to nutch/runtime/local
On Wed, Jul 20, 2011 at 4:01 PM, Chip Calhoun ccalh...@aip.org wrote:
has this been solved?
If your http.content.limit has not been increased in nutch-site.xml then you
will not be able to store this data and index with Solr.
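A minimal nutch-site.xml sketch of the override described here, assuming the stock 64 kB default is what's truncating the content:

<property>
  <name>http.content.limit</name>
  <!-- default is 65536 bytes; -1 disables truncation -->
  <value>-1</value>
</property>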
On Mon, Jul 25, 2011 at 6:18 PM, Chip Calhoun ccalh...@aip.org wrote:
I'm still having trouble. I've set a Windows environment variable
To: user@nutch.apache.org
Cc: Chip Calhoun
Subject: Re: Nutch not indexing full collection
Nutch truncates content longer than configured and Solr truncates content
exceeding max field length. Maybe check your limits.
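On the Solr side, the cap Markus mentions was the maxFieldLength setting in that era's solrconfig.xml (removed in later Solr releases). A sketch, assuming the stock default of 10,000 tokens is the limit being hit:

<!-- solrconfig.xml: tokens beyond this count are silently dropped per field -->
<maxFieldLength>2147483647</maxFieldLength>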
I'm still having trouble with this. In addition to the nutch-site.xml
posted
Hi everyone,
We'd like to use Nutch and Solr to replace an existing Verity search that's
become a bit long in the tooth. In our Verity search, we have a hack which
allows each document to have a machine-readable URL which is indexed (generally
an xml document), and a human-readable URL which
Hi Chip,
Should simply be a matter of creating a custom field with an IndexingFilter,
you can then use it in any way you want on the SOLR side
Julien
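Later messages in this thread settle on the urlmeta plugin for this, so a sketch of the nutch-site.xml wiring may help; humanurl is the custom field name used later in the archive, and the plugin.includes edit is left to whatever value is already configured:

<!-- append |urlmeta to the existing plugin.includes value, then declare the tags to carry -->
<property>
  <name>urlmeta.tags</name>
  <value>humanurl</value>
</property>

Seed lines then carry the metadata after the URL, tab-separated, e.g. (hypothetical) http://www.example.org/machine.xml followed by a tab and humanurl=http://www.example.org/human.html; the urlmeta indexing filter adds the tag to each document so it reaches Solr as a field.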
On 15 September 2011 21:50, Chip Calhoun ccalh...@aip.org wrote:
Hi everyone,
We'd like to use Nutch and Solr
the Solr admin UI
On Mon, Sep 19, 2011 at 8:09 PM, Chip Calhoun ccalh...@aip.org wrote:
Hi Julien,
Thanks, that's encouraging. I'm trying to make this work, and I'm
definitely missing something. I hope I'm not too far off the mark.
I've started with the instructions at
http
of \runtime\local\conf\
before re-compiling with ANT as this will be overwritten. Either
modify $NUTCH/conf/nutch-site.xml or recompile THEN modify.
As Lewis suggested check the logs and see if the plugin is activated etc...
J.
On 19 September 2011 21:03, Chip Calhoun ccalh...@aip.org
in particular you found misleading about the plugin example
on the wiki? I am keen to make it as clear as possible.
Thank you
Lewis
On Tue, Sep 20, 2011 at 6:00 PM, Chip Calhoun ccalh...@aip.org wrote:
Hi Julien,
Thanks for clarifying this! I've got it working now. Instead of
seeding
I thought I understood how to set my user-agent, but after asking a few sites
to add me to their robots.txt it looks like I'm missing something.
My nutch-site.xml includes:
<property>
  <name>http.agent.name</name>
  <value>PHFAWS Spider</value>
</property>
<property>
  <name>http.robots.agents</name>
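The snippet above is cut off in the archive; per the property's description in nutch-default.xml, http.robots.agents takes a comma-separated list of agent names in decreasing precedence, ending with *. A hypothetical completion:

<property>
  <name>http.robots.agents</name>
  <!-- value is an assumption; list your http.agent.name first and keep * last -->
  <value>PHFAWS Spider,*</value>
</property>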
Hi everyone,
I'm using Nutch to crawl a few friendly sites, and am having trouble with some
of them. One site in particular has created an exception for me in its
robots.txt, and yet I can't crawl any of its pages. I've tried copying the
files I want to index (3 XML documents) to my own server
Sent: Friday, September 30, 2011 6:28 PM
To: user@nutch.apache.org
Cc: Chip Calhoun
Subject: Re: What could be blocking me, if not robots.txt?
I've been able to run the ParserChecker now, but I'm not sure how to
understand the results. Here's what I got:
# bin/nutch
-1.3.
Or maybe change the robots.txt to
User-agent: PHFAWS/Nutch-1.3
Allow: /
On Monday 03 October 2011 15:31:46 Chip Calhoun wrote:
I apologize, but I haven't found much Nutch documentation that deals
with the user-agent and robots.txt. Why am I being blocked when the
user-agent I'm
Hi everyone,
I've found that I'm unable to parse very large XML files. This doesn't seem to
happen with other file formats. When I run any of the offending files through
ParserChecker, I get something along the lines of:
# bin/nutch org.apache.nutch.parse.ParserChecker
Huh. It turns out my http.content.limit was fine, but I also needed a
file.content.limit statement in nutch-site.xml to make this work. Thanks!
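For reference, the file protocol has the same knob; a sketch of the fix described here, assuming no truncation is wanted:

<property>
  <name>file.content.limit</name>
  <!-- like http.content.limit, defaults to 65536 bytes; -1 disables truncation -->
  <value>-1</value>
</property>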
-Original Message-
From: Markus Jelsma [mailto:markus.jel...@openindex.io]
Sent: Tuesday, October 04, 2011 7:41 PM
To: user@nutch.apache.org
.
</description>
</property>
-Original Message-
From: Chip Calhoun [mailto:ccalh...@aip.org]
Sent: Wednesday, October 05, 2011 9:34 AM
To: 'user@nutch.apache.org'; 'markus.jel...@openindex.io'
Subject: RE: Unable to parse large XML files.
Huh. It turns out my http.content.limit was fine, but I also
Hi everyone,
I'm having issues with truncated content on some pages, despite what I believe
to be solid content.limit settings.
One page I have an issue with:
http://www.canisius.edu/archives/ruddick.asp
When I run a search in Solr, the content I get is limited to:
<str name="content">Canisius
With ParserChecker it's similarly truncated. Could it be the fact that it's a
.asp page? The output is as follows:
# bin/nutch org.apache.nutch.parse.ParserChecker -dumpText http://www.canisius.edu/archives/ruddick.asp
-
Url
---
a lot of output. Can you try a different parser? Your settings
look fine, but are there any other exotic settings you use or custom code?
On Tuesday 18 October 2011 15:53:26 Chip Calhoun wrote:
With ParserChecker it's similarly truncated. Could it be the fact that
it's a .asp page? The output
I'm getting a fairly persistent timeout on a particular page. Other, smaller
pages in this folder do fine, but this one times out most of the time. When it
fails, my ParserChecker results look like:
# bin/nutch org.apache.nutch.parse.ParserChecker -dumpText
If I'm reading the log correctly, it's the fetch:
2011-10-19 11:18:11,405 INFO fetcher.Fetcher - fetch of
http://digital.lib.washington.edu/findingaids/view?docId=UA37_06_2932DonaldsonLauren.xml
failed with: java.net.SocketTimeoutException: Read timed out
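A read timeout like this is governed by http.timeout for protocol-http; a sketch of raising it in nutch-site.xml, assuming the 10-second default is too tight for this server:

<property>
  <name>http.timeout</name>
  <!-- network timeout in milliseconds; default is 10000 -->
  <value>60000</value>
</property>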
-Original Message-
From:
I'm using protocol-http, but I removed protocol-httpclient after you pointed
out in another thread that it's broken. Unfortunately I'm not sure which
properties are used by what, and I'm not sure how to find out. I added some
more stuff to nutch-site.xml (I'll paste it at the end), and it seems
I've noticed the recent posts about trouble with protocol-httpclient, which to
my understanding is needed for https URLs. Is there another way to handle
these? ParserChecker gives me the following when I try one of these URLs.
Thanks.
# bin/nutch org.apache.nutch.parse.ParserChecker -dumpText
, 2011 4:57 PM
To: user@nutch.apache.org
Cc: Chip Calhoun
Subject: Re: Good workaround for timeout?
Good to know! I was definitely exceeding that, so I've changed my properties.
-Original Message-
From: Markus Jelsma [mailto:markus.jel...@openindex.io]
Sent: Thursday, October 20, 2011 10:00 AM
To: user@nutch.apache.org
Cc: Chip Calhoun
Subject: Re: Good workaround for timeout
plugins.
</description>
</property>
<property>
  <name>urlmeta.tags</name>
  <value>humanurl</value>
</property>
-Original Message-
From: Chip Calhoun [mailto:ccalh...@aip.org]
Sent: Thursday, October 20, 2011 10:23 AM
To: 'markus.jel...@openindex.io'; user@nutch.apache.org
Subject: RE: Good workaround
parser.timeout setting.
On Wednesday 26 October 2011 16:45:33 Chip Calhoun wrote:
I've got a few very large (upwards of 3 MB) XML files I'm trying to
index, and I'm having trouble. Previously I'd had trouble with the
fetch; now that seems to be okay, but due to the size of the files
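The parser.timeout property referred to above, as it would look in nutch-site.xml; a sketch assuming the cap should simply be lifted for these large files:

<property>
  <name>parser.timeout</name>
  <!-- seconds per document; default is 30, -1 deactivates the timeout -->
  <value>-1</value>
</property>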
This is probably just down to my not waiting for a 1.4 tutorial, but here goes.
I've always used the following two commands to run my crawl and then index to
Solr:
# bin/nutch crawl urls -dir crawl -depth 1 -topN 50
# bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb crawl/linkdb
I'm trying to crawl pages from a number of domains, and one of these domains
has been giving me trouble. The really irritating thing is that it did work at
least once, which led me to believe that I'd solved the problem. I can't think
of anything at this point but to paste my log of a failed
I just compared this against a similar crawl of a completely different domain
which I know works, and you're right on both counts. The parser doesn't parse a
file, and nothing is sent to the solrindexer. I tried a crawl with more
documents and found that while I can get documents from mit.edu,
Sent: December 20, 2011 2:15 PM
To: user@nutch.apache.org
Subject: Re: Can't crawl a domain; can't figure out why.
It seems that robots.txt in libraries.mit.edu has a lot of restrictions.
Alex.
-Original Message-
From: Chip Calhoun ccalh...@aip.org
To: user user@nutch.apache.org
conf folder, but since my Solr instance didn't already have a schema.xml file,
I'm not convinced it's being read. How do I set up my Solr to take these new
fields?
Chip
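Assuming the humanurl field from earlier in this archive, the Solr side is a one-line schema.xml addition (name and attributes hypothetical); Solr of that era rejected documents carrying undeclared fields unless a dynamicField matched:

<!-- schema.xml: declare each custom field Nutch will send -->
<field name="humanurl" type="string" stored="true" indexed="true"/>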
From: Chip Calhoun [ccalh...@aip.org]
Sent: Friday, February 03, 2017 11:45 AM
To: user
epth plugin. I'm new to adding plugins.
The instructions at https://wiki.apache.org/nutch/AboutPlugins give a sample
command, but I don't know what the official PluginRepository for this plugin is
and the sample link for the HtmlParser plugin is dead.
I'd appreciate any help. Thank you!
https://github.com/apache/nutch/blob/master/src/bin/crawl#L117]
HTH
Julien
On 31 January 2017 at 16:49, Chip Calhoun <ccalh...@aip.org> wrote:
> I'm upgrading from Nutch 1.4 to Nutch 1.12. I limit this crawl to my
> seeds, so my 1.4 command was:
> bin/nutch crawl phfaws -dir crawl -depth 1 -to
Thanks to everyone who's replied to my questions the past few
weeks. I don't want to clog the listserv with a lot of short "thank you" posts,
but I do appreciate it.
Chip Calhoun
Digital Archivist
Niels Bohr Library & Archives
American Institute of Physics
One Physics Ellipse
College Park, MD 20740
that I'm supposed to do anything more
with that.
Chip Calhoun
Digital Archivist
Niels Bohr Library & Archives
American Institute of Physics
One Physics Ellipse
College Park, MD 20740
301-209-3180
https://www.aip.org/history-programs/niels-bohr-library
I'm upgrading to Nutch 1.12, and I have an extremely basic problem. I can't
find a build.xml in apache-nutch-1.12-bin.zip, and therefore can't run ant.
What am I missing?
Chip Calhoun
Digital Archivist
Niels Bohr Library & Archives
American Institute of Physics
One Physics Ellipse
College
me if you have solved the problem
- Original Message -
From: "Chip Calhoun" <ccalh...@aip.org>
To: user@nutch.apache.org
Sent: Thursday, May 11, 2017 16:30:34
Subject: [MASSMAIL]Nutch not indexing all seed URLs
I'm using Nutch 1.12 to index a local site. T
to -1. What
would cause my URLs to be skipped?
Chip Calhoun
Digital Archivist
Niels Bohr Library & Archives
American Institute of Physics
One Physics Ellipse
College Park, MD 20740-3840 USA
Tel: +1 301-209-3180
Email: ccalh...@aip.org
https://www.aip.org/history-programs/niels-bohr-library
a time limit, in case a single server responds too slowly.
Best,
Sebastian
On 04/30/2018 09:04 PM, Chip Calhoun wrote:
> Hi Sebastian,
>
> Thank you! Increasing my fetcher.threads.per.queue both fixed my crawl and
> saved me a lot of time.
>
> I'm still bewildered by the original
short of
crawling every URL in my list, though it crawled a few I hadn't included.
Are these 3-hour loops standard for large crawls?
-Original Message-
From: Chip Calhoun [mailto:ccalh...@aip.org]
Sent: Tuesday, April 17, 2018 3:27 PM
To: user@nutch.apache.org
Subject: RE: Nutch
ven fetch in parallel from your host, see
fetcher.threads.per.queue
Best,
Sebastian
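The property Sebastian names, sketched as a nutch-site.xml override; the default of 1 means one request at a time per host queue, and raising it trades politeness for speed on your own servers:

<property>
  <name>fetcher.threads.per.queue</name>
  <!-- default 1; higher values fetch from a single host in parallel -->
  <value>5</value>
</property>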
On 04/30/2018 04:44 PM, Chip Calhoun wrote:
> I'm still experimenting with this. I had been crawling with a depth of 1
> because I don't need anything outside my URLs list, but I tried with a depth
> of 1
http://history.aip.org >> dropping!
I've seen that 3 hours is the default in some Nutch installations, but I've got
my fetcher.timelimit.mins set to -1. I'm sure I'm missing something obvious.
Any thoughts would be greatly appreciated. Thank you.
Chip Calhoun
Digital Archivist
Niels Bohr Library & Archives
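One hedged note for archive readers: the stock bin/crawl script (linked earlier in this digest) sets timeLimitFetch=180 and passes it to each fetch as -D fetcher.timelimit.mins=180, and a command-line -D overrides nutch-site.xml, so a -1 in the config file never takes effect; changing the script variable is the usual fix. The property itself:

<property>
  <name>fetcher.timelimit.mins</name>
  <!-- -1 = unlimited, but bin/crawl's -D on the command line wins over this file -->
  <value>-1</value>
</property>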
On Tue, Apr 17, 2018 at 7:45 AM, <user-digest-h...@nutch.apache.org> wrote:
> From: Chip Calhoun <ccalh...@aip.org>
> To: "user@nutch.apache.org" <user@nutch.apache.org>
> Cc:
> Bcc:
> Date: Tue, 17 Apr 2018 14:45:01 +
> Subject: Nutch fetching
s even 12 hours with little to no tweaking
necessary from the nutch-default. Something else is causing it. Is it always
the same URL that it fails at?
-Original Message-
From: Chip Calhoun [mailto:ccalh...@aip.org]
Sent: April-17-18 10:45 AM
To: user@nutch.apache.org
Subject: Nutch fetching times out at 3 hours, not sure