PM, Christopher Gross wrote:
On Thu, Jul 24, 2014 at 5:00 PM, Sebastian Nagel
wastl.na...@googlemail.com
wrote:
Hi Chris,
I started off the crawler, using the runbot.sh script
Which Nutch version and what script is used?
Nutch 1.6
Sorry, it's the newer crawl script (I just
Is there any documentation about the limits of a single Nutch crawler,
running with just the built-in Hadoop?
I started off the crawler, using the runbot.sh script, and set the topN to
1000, and let it fly. I set up a cron job so that it kicks off every few
hours. It went pretty well for a few
command.
Alex.
-Original Message-
From: Christopher Gross cogr...@gmail.com
To: user user@nutch.apache.org
Sent: Tue, May 28, 2013 5:20 am
Subject: Re: error crawling
Local mode.
Script:
#!/bin/bash
#
# Licensed to the Apache Software Foundation (ASF) under one or more
]
then exit $?
fi
done
exit 0
-- Chris
On Fri, May 24, 2013 at 2:51 PM, alx...@aim.com wrote:
Can you send the script? Also are you running it in deploy or local mode?
-Original Message-
From: Christopher Gross cogr...@gmail.com
To: user user@nutch.apache.org
Sent: Fri, May 24
Right. runbot is the old one. They don't package something like that with Nutch
anymore. Through digging on the web I found something.
I took this script.
http://svn.apache.org/repos/asf/nutch/branches/2.x/src/bin/crawl
I made small changes -- rather than passing in args I hard coded them (to
?
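To be concrete, the hard coded bits just replace the crawl script's positional
arguments; roughly something like this near the top of the script -- an untested
sketch, and the actual variable names in the 2.x crawl script may differ a little:

  # stand-ins for the $1..$4 arguments the script normally expects
  SEEDDIR=/proj/nutch/urls            # seed URL directory (placeholder path)
  CRAWL_ID=crawl                      # crawlId used for the storage tables
  SOLRURL=http://localhost/nutchsolr  # placeholder Solr URL
  LIMIT=2                             # number of generate/fetch/parse/update rounds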
-- Chris
On Mon, May 20, 2013 at 1:55 PM, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
Hi Chris,
On Mon, May 20, 2013 at 10:21 AM, Christopher Gross cogr...@gmail.com
wrote:
Lewis --
Is the DEBUG something set in the conf/log4j.properties file? I have the
rootLogger set to INFO
I'm trying to crawl. I'm just running the script that I pulled from the
nutch site, so I assumed that it would be good to go, like the old
runbot.sh script. I could try removing that part, but I still get the error
farther down in the main body of the loop.
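For what it's worth, DEBUG output usually comes from a per-package override in
conf/log4j.properties rather than from the rootLogger; a rough sketch -- the
appender name here is the one from the stock file and may differ in your copy:

  # leave everything else at INFO
  log4j.rootLogger=INFO,DRFA
  # but log the Nutch classes at DEBUG
  log4j.logger.org.apache.nutch=DEBUG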
-- Christopher Gross
Sent from my nexus
inject urls/ -crawlId ./crawl/
Try this:
$ ./bin/nutch inject urls/ -crawlId crawl
On Fri, May 17, 2013 at 12:47 PM, alx...@aim.com wrote:
What if you do bin/nutch inject urls/ ?
-Original Message-
From: Christopher Gross cogr...@gmail.com
To: user user@nutch.apache.org
I'm attempting to get a crawl working using scripts, but I've been getting
a "Skipping url; different batch id (null)" error and then nothing new in
Solr. So I've reverted back to trying out the nutch crawl command:
./nutch crawl ../urls/ -solr http://localhost/nutchsolr -threads 5 -depth
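In case it helps, the batch id error usually means fetch/parse ran with a
different (or missing) batch id than the one generate stamped on the rows. Run
step by step it looks roughly like this -- a sketch from memory, and the exact
flags vary a bit between 2.x releases:

  bin/nutch inject urls/ -crawlId crawl
  bin/nutch generate -topN 1000 -crawlId crawl   # note the batch id it prints
  bin/nutch fetch <batchId> -crawlId crawl -threads 5
  bin/nutch parse <batchId> -crawlId crawl
  bin/nutch updatedb -crawlId crawl
  bin/nutch solrindex http://localhost/nutchsolr <batchId> -crawlId crawl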
, 2013 at 11:56 AM, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
Please search the mailing list for the HBase logging. There was a
conversation on this reasonably recently.
Please see my other response for the rest.
hth
Lewis
On Monday, May 20, 2013, Christopher Gross cogr
I'm having trouble getting my nutch working. I had it on another server
and it was working fine. I migrated it to a new server, and I've been
getting nothing but problems. My old script wasn't working right (getting
a lot of skipping on the parser saying that the crawl id was null [a
separate
No, I never had any luck with it; after trying for a few days I gave up and
moved on to other things. Even tried using Nutch 2.x, but still wasn't
able to get to a cert-protected site.
I'm going to look into Apache Droids (http://incubator.apache.org/droids/)
and see if their crawler can crawl
From: Christopher Gross cogr...@gmail.com
To: user user@nutch.apache.org
Sent: Mon, Oct 1, 2012 1:22 pm
Subject: Re: Building Nutch 2.0
I have my 1.3 set up in a /proj/nutch/ directory that has the bin,
conf, lib, logs, ..etc.., with NUTCH_HOME pointing there. I don't
quite see what
I know older versions of Nutch didn't support it, but does the 2.x
line support crawling with certificates?
-- Chris
What fields are available to go in the solrindex-mapping.xml file for
Nutch 2.1? Is there a list somewhere?
In my 1.3 setup, I had url -- I don't think I added anything in like
a plugin to get that.
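For comparison, the 1.x mapping file just pairs index fields with Solr fields;
the stock one looks roughly like this (trimmed -- which fields actually show up
depends on the indexing plugins you have enabled):

  <mapping>
    <fields>
      <field dest="content" source="content"/>
      <field dest="title" source="title"/>
      <field dest="host" source="host"/>
      <field dest="boost" source="boost"/>
      <field dest="digest" source="digest"/>
      <field dest="url" source="url"/>
    </fields>
    <uniqueKey>id</uniqueKey>
  </mapping>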
-- Chris
you out. This is also applicable to any Hadoop, Solr,
HBase, Cassandra, Accumulo, SQL, etc. configurations you may be using.
On Tue, Oct 2, 2012 at 3:19 PM, Christopher Gross cogr...@gmail.com wrote:
I have tried running nutch and having it dump the items found into a solr
index:
./bin/nutch
to date a while back and more recently
when writing some trivial plugin tests; however, please shout about
anything which is not correct and we can edit accordingly.
hth
Lewis
[0] http://wiki.apache.org/nutch/IndexStructure
On Tue, Oct 2, 2012 at 7:32 PM, Christopher Gross cogr...@gmail.com
I just downloaded the tarball from the nutch.apache.org site for Nutch
2.0, unzipped untarred it, and tried to build it. Here's what I
get:
$ ant runtime
Buildfile: build.xml
[taskdef] Could not load definitions from resource
org/sonar/ant/antlib.xml. It could not be found.
ivy-probe-antlib:
...@gmail.com wrote:
Hi Chris,
On Mon, Oct 1, 2012 at 3:27 PM, Christopher Gross cogr...@gmail.com wrote:
unzipped untarred it,
I don't think you need to do both!
BUILD FAILED
/tmp/nutch-2.0/build.xml:72: Specify at least one source--a file or
resource collection.
Mmmm... can you even
if you're really pushing for 2.1 to be out soon,
then that's what I'll work with.
-- Chris
On Mon, Oct 1, 2012 at 11:31 AM, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
Hi Chris,
On Mon, Oct 1, 2012 at 4:17 PM, Christopher Gross cogr...@gmail.com wrote:
I moved it to a different
I get that same error for 2.1 as well, FYI.
-- Chris
On Mon, Oct 1, 2012 at 12:10 PM, Christopher Gross cogr...@gmail.com wrote:
OS: Red Hat Enterprise Linux Server release 5.8 (Tikanga)
java version 1.6.0_30
Apache Ant version 1.7.0 compiled on December 13 2006
OK, I'm not going
configuration needed), but 2.1 seems to be a completely different
beast.
-- Chris
your webdb or
hostdb crawl data.
Any questions, please get back on list.
hth
Lewis
On Mon, Oct 1, 2012 at 6:02 PM, Christopher Gross cogr...@gmail.com wrote:
I was able to get it built on my windows box, then I moved that to my
linux box. Now I'm trying to run it, but getting other errors
lewis.mcgibb...@gmail.com wrote:
Hi Chris,
On Mon, Oct 1, 2012 at 7:09 PM, Christopher Gross cogr...@gmail.com wrote:
We have ports blocked on our box, so that may be causing issues with
Ivy (which is why I prefer just standard ant and having all the
required jars sitting in a lib directory).
Well
this laid out? Should I be running out of the
'runtime' dir, or is it fine that I've pulled all those files out and
into a /proj/nutch-2.1/ directory (so there's a bin, conf, lib,
..etc.. in there, with NUTCH_HOME pointing to that dir).
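For reference, the tree I copied those files out of (after ant runtime) looks
roughly like this -- abridged, and assuming local rather than deploy mode:

  runtime/
    local/
      bin/nutch
      conf/
      lib/
      plugins/
    deploy/
      apache-nutch-2.1.job

So the alternative would be to leave it in place and point NUTCH_HOME at
runtime/local instead of /proj/nutch-2.1/.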
-- Chris
On Mon, Oct 1, 2012 at 2:53 PM, Christopher Gross
?
-- Chris
On Mon, Oct 1, 2012 at 3:59 PM, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
Hi Chris,
On Mon, Oct 1, 2012 at 8:52 PM, Christopher Gross cogr...@gmail.com wrote:
OK, I added the port being used by hbase to iptables, and now I'm farther.
I'm getting:
12/10/01 19:44:17 ERROR
Hi all.
I'm running Nutch 1.4 with Java 1.6.0_30. I'm trying to have it
crawl a directory with test files and I'm getting an error on ppt and
pptx files. It can get pdf, doc/docx, xls/xlsx, but for whatever
reason it flips out on PowerPoint. I can attach the document if need
be. Below is a
http.content.limit accommodate
this?
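In case it helps, that limit lives in conf/nutch-site.xml; something like the
below raises it (or -1 turns the cutoff off entirely -- file.content.limit is
the analogous property for file:// crawls):

  <property>
    <name>http.content.limit</name>
    <!-- bytes; the stock default is 65536, -1 means no limit -->
    <value>-1</value>
  </property>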
Also you ARE getting back that the content metadata connection appears to
be closed! Maybe there are some other credentials to be supplied for
crawling certificate-authenticated sites... I really don't know.
On Wed, Mar 7, 2012 at 9:28 PM, Christopher Gross cogr
ideas on where to look for that? The Authentication
Schemes page (http://wiki.apache.org/nutch/HttpAuthenticationSchemes)
fails to mention certs.
-- Chris
On Thu, Mar 8, 2012 at 8:16 AM, Christopher Gross cogr...@gmail.com wrote:
The page text is pretty small -- I just made a few quick pages
Is there any good documentation for setting up Nutch to crawl HTTPS
sites using a certificate? I've poked around on the wiki and tried
some google searches without much luck.
I'm using Nutch 1.4.
Thanks!
-- Chris
On Wednesday, March 7, 2012, Christopher Gross cogr...@gmail.com wrote:
. ParserChecker, debug-level log info, ...
BTW, which authentication scheme is required by your site? NTLMv2, for one, is
poorly supported
Remi
On Wednesday, March 7, 2012, Christopher Gross cogr...@gmail.com wrote:
I have protocol-httpclient set.
I can't see how I'm supposed to do the certs. I can't
Content Metadata: Connection=close Content-Type=text/html
Parse Metadata: CharEncodingForConversion=windows-1252
OriginalCharEncoding=windows-1252
-
ParseText
-
-- Chris
On Wed, Mar 7, 2012 at 4:22 PM, Christopher Gross cogr...@gmail.com wrote:
Well, NTLM is a windows thing with a username
I have my Solr set up on a secure port -- and I think that is causing
a problem for nutch (nothing else changed.) I don't see anything in
the documentation regarding this.
My nutch version is 1.2, Solr is 3.4. Here's the line from my runbot.sh script:
$NUTCH_HOME/bin/nutch solrindex
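For anyone following along, a 1.2-era solrindex call generally has this shape
(the URL, port and paths below are placeholders rather than my real values):

  $NUTCH_HOME/bin/nutch solrindex https://localhost:8443/solr \
      crawl/crawldb crawl/linkdb crawl/segments/*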
Meant to include this...the output from the runbot.sh script. Not
that it really says a whole lot...
- Index (Step 5 of 8) -
SolrIndexer: starting at 2012-02-23 18:18:20
java.io.IOException: Job failed!
-- Chris
On Thu, Feb 23, 2012 at 1:26 PM, Christopher Gross cogr...@gmail.com
PM, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
Yeah I can confirm it was 1.4
On Thu, Feb 23, 2012 at 7:05 PM, Christopher Gross cogr...@gmail.com wrote:
I tried using 1.4, but I couldn't get that to work at all.
What is wrong with your configuration, if this is all that is preventing
I don't think it's a redirect, unless SharePoint made it one. Any
idea how to check for that?
-- Chris
On Mon, Dec 19, 2011 at 5:15 PM, Markus Jelsma
markus.jel...@openindex.io wrote:
Half-way, it's clear in the log. Is your document a redirect? I've not yet
seen such a log line before.
I created a script based on:
http://wiki.apache.org/nutch/NutchTutorial#A3.2_Using_Individual_Commands_for_Whole-Web_Crawling
The script is similar to what happens in the old runbot.sh script, but
it isn't working for me. The part that does the s1 barely gets
anything, but then the s2 fails
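The tutorial-style loop I based it on goes roughly like this (abridged; topN
and paths are placeholders, and the separate parse step is only needed when
fetcher.parse is false):

  bin/nutch inject crawl/crawldb urls
  bin/nutch generate crawl/crawldb crawl/segments -topN 1000
  s1=`ls -d crawl/segments/2* | tail -1`
  bin/nutch fetch $s1
  bin/nutch parse $s1
  bin/nutch updatedb crawl/crawldb $s1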
On Mon, Dec 19, 2011 at 2:17 PM, Markus Jelsma
markus.jel...@openindex.io wrote:
What do you mean by skipping over? You don't want ppt, pptx and things? In all
cases you need to set up URL filters specific to your scenario and wishes.
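For example, the stock skip list in conf/regex-urlfilter.txt (and in
suffix-urlfilter.txt, if that plugin is enabled) drops ppt and xls outright, so
a line like the one below needs editing before office documents get through --
illustrative only, your defaults may differ:

  # skip image and other suffixes we can't yet parse
  -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp)$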
I want to index all the office type documents, they're
Hmm, the status db_gone prevents it from being indexed, of course. It is
perfectly possible for the checkers to pass but for the fetcher to fail.
There may have been an error, and I remember you using a proxy earlier; that's
likely the problem here too. The checkers don't use proxy
The URLFilterChecker tool doesn't have a page yet... what are the syntax and
parameters for it?
-- Chris
On Mon, Dec 19, 2011 at 2:33 PM, Markus Jelsma
markus.jel...@openindex.io wrote:
Does it normally take a long time to run? It's been going about 5 minutes...
-- Chris
On Mon, Dec 19, 2011 at 2:43 PM, Markus Jelsma
markus.jel...@openindex.io wrote:
bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined
./nutch org.apache.nutch.net.URLFilterChecker -allCombined in.txt
Checking combination of all URLFilters available
+http://urlAlpha.docx
So it looks like it is a valid one, right? Any other testing tools to try?
-- Chris
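Another checker worth trying is parsechecker, which fetches and parses a single
URL end to end and prints the content metadata, parse metadata and text; using
the same test document as a stand-in:

  bin/nutch parsechecker http://urlAlpha.docx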
On Mon, Dec 19, 2011 at 2:52 PM, Markus Jelsma
I'm a little confused -- should I set up a whole other instance of
nutch, crawldb, etc?
Set the log to trace, I think this helps to tell why.
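Concretely that's another per-package line in conf/log4j.properties, scoped to
whatever you're chasing, e.g. (a sketch):

  log4j.logger.org.apache.nutch.crawl=TRACE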
2011-12-19 20:14:10,716 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
Not sure where fetching starts...
2011-12-19 20:13:53,223 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2011-12-19 20:13:53,223 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
2011-12-19 20:13:53,223 INFO
to be filtered out. Not sure how to make it right...
On Friday, December 16, 2011, Christopher Gross cogr...@gmail.com wrote:
http://wiki.apache.org/nutch/Crawl
This script no longer works. See:
echo - Index (Step 5 of $steps) -
$NUTCH_HOME/bin/nutch index crawl/NEWindexes crawl/crawldb
I'm able to start crawling a SharePoint site, but then I get this for
the body of ALL the pages it finds:
You may be trying to access this site from a secured browser on the
server. Please enable scripts and reload this page. Turn on more
accessible mode Turn off more accessible mode Skip Ribbon