Re: Limits of a single crawler

2014-07-29 Thread Christopher Gross
PM, Christopher Gross wrote: On Thu, Jul 24, 2014 at 5:00 PM, Sebastian Nagel wastl.na...@googlemail.com wrote: Hi Chris, I started off the crawler, using the runbot.sh script Which Nutch version and what script is used? Nutch 1.6 Sorry, it's the newer crawl script (I just

Limits of a single crawler

2014-07-24 Thread Christopher Gross
Is there any documentation about the limits of a single Nutch crawler, running with just the built-in Hadoop? I started off the crawler, using the runbot.sh script, and set the topN to 1000, and let it fly. I set up a cron job so that it kicks off every few hours. It went pretty well for a few

Re: error crawling

2013-05-29 Thread Christopher Gross
command. Alex. -Original Message- From: Christopher Gross cogr...@gmail.com To: user user@nutch.apache.org Sent: Tue, May 28, 2013 5:20 am Subject: Re: error crawling Local mode. Script: #!/bin/bash # # Licensed to the Apache Software Foundation (ASF) under one or more

Re: error crawling

2013-05-28 Thread Christopher Gross
] then exit $? fi done exit 0 -- Chris On Fri, May 24, 2013 at 2:51 PM, alx...@aim.com wrote: Can you send the script? Also are you running it in deploy or local mode? -Original Message- From: Christopher Gross cogr...@gmail.com To: user user@nutch.apache.org Sent: Fri, May 24

Re: error crawling

2013-05-24 Thread Christopher Gross
Right. runbot is the old one. They don't package something with nutch anymore like that. Through digging on the web I found something. I took this script. http://svn.apache.org/repos/asf/nutch/branches/2.x/src/bin/crawl I made small changes -- rather than passing in args I hard coded them (to

Re: error crawling

2013-05-22 Thread Christopher Gross
? -- Chris On Mon, May 20, 2013 at 1:55 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi Chris, On Mon, May 20, 2013 at 10:21 AM, Christopher Gross cogr...@gmail.com wrote: Lewis -- Is the DEBUG something set in the conf/log4j.properties file? I have the rootLogger set to INFO

Re: error crawling

2013-05-22 Thread Christopher Gross
I'm trying to crawl. I'm just running the script that I pulled from the nutch site, so I assumed that it would be good to go, like the old runbot.sh script. I could try removing that part, but I still get the error farther down in the main body of the loop. -- Christopher Gross Sent from my nexus

Re: error crawling

2013-05-20 Thread Christopher Gross
inject urls/ -crawlId ./crawl/ Try this: $ ./bin/nutch inject urls/ -crawlId crawl On Fri, May 17, 2013 at 12:47 PM, alx...@aim.com wrote: What if you do bin/nutch inject urls/ ? -Original Message- From: Christopher Gross cogr...@gmail.com To: user user

nutch crawl

2013-05-20 Thread Christopher Gross
I'm attempting to get a crawl working using scripts, but I've been getting a "Skipping url; different batch id (null)" error and then nothing new in Solr. So I've reverted back to trying out the crawl for the nutch script: ./nutch crawl ../urls/ -solr http://localhost/nutchsolr -threads 5 -depth

Re: error crawling

2013-05-20 Thread Christopher Gross
, 2013 at 11:56 AM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Please search the mailing list for the HBase logging. There was a conversation on this reasonably recently. Please see my other response for the rest. hth Lewis On Monday, May 20, 2013, Christopher Gross cogr

error crawling

2013-05-17 Thread Christopher Gross
I'm having trouble getting my nutch working. I had it on another server and it was working fine. I migrated it to a new server, and I've been getting nothing but problems. My old script wasn't working right (getting a lot of skipping on the parser saying that the crawl id was null [a separate

Re: Crawling with Certs

2013-02-04 Thread Christopher Gross
No, I never had any luck with it. After trying for a few days, I gave up and moved on to other things. Even tried using Nutch 2.x, but still wasn't able to get to a cert protected site. I'm going to look into Apache Droids (http://incubator.apache.org/droids/) and see if their crawler can crawl

Re: Building Nutch 2.0

2012-10-02 Thread Christopher Gross
: Christopher Gross cogr...@gmail.com To: user user@nutch.apache.org Sent: Mon, Oct 1, 2012 1:22 pm Subject: Re: Building Nutch 2.0 I have my 1.3 set up in a /proj/nutch/ directory that has the bin, conf, lib, logs, ..etc.., with NUTCH_HOME pointing there. I don't quite see what

Crawl with Certificates

2012-10-02 Thread Christopher Gross
I know older versions of Nutch didn't support it, but does the 2.x line support crawling with certificates? -- Chris

Nutch 2.1 fields

2012-10-02 Thread Christopher Gross
What fields are available to go in the solrindex-mapping.xml file for Nutch 2.1? Is there a list somewhere? In my 1.3 setup, I had url -- I don't think I added anything in like a plugin to get that. -- Chris

Re: NullPointerException

2012-10-02 Thread Christopher Gross
you out. This is also applicable to any Hadoop, Solr. HBase, Cassandra, Accumulo, Sql, etc. configurations you may be using. On Tue, Oct 2, 2012 at 3:19 PM, Christopher Gross cogr...@gmail.com wrote: I have tried running nutch and having it dump the items found into a solr index: ./bin/nutch

Re: Nutch 2.1 fields

2012-10-02 Thread Christopher Gross
to date a while back and more recently when writing some trivial plugin tests however please shout about anything which is not correct and we can edit accordingly. hth Lewis [0] http://wiki.apache.org/nutch/IndexStructure On Tue, Oct 2, 2012 at 7:32 PM, Christopher Gross cogr...@gmail.com

Building Nutch 2.0

2012-10-01 Thread Christopher Gross
I just downloaded the tarball from the nutch.apache.org site for Nutch 2.0, unzipped and untarred it, and tried to build it. Here's what I get: $ ant runtime Buildfile: build.xml [taskdef] Could not load definitions from resource org/sonar/ant/antlib.xml. It could not be found. ivy-probe-antlib:

Re: Building Nutch 2.0

2012-10-01 Thread Christopher Gross
...@gmail.com wrote: Hi Chris, On Mon, Oct 1, 2012 at 3:27 PM, Christopher Gross cogr...@gmail.com wrote: unzipped and untarred it, I don't think you need to do both! BUILD FAILED /tmp/nutch-2.0/build.xml:72: Specify at least one source--a file or resource collection. Mmmm... can you even

Re: Building Nutch 2.0

2012-10-01 Thread Christopher Gross
if you're really pushing for 2.1 to be out soon, then that's what I'll work with. -- Chris On Mon, Oct 1, 2012 at 11:31 AM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi Chris, On Mon, Oct 1, 2012 at 4:17 PM, Christopher Gross cogr...@gmail.com wrote: I moved it to a different

Re: Building Nutch 2.0

2012-10-01 Thread Christopher Gross
I get that same error for 2.1 as well, FYI. -- Chris On Mon, Oct 1, 2012 at 12:10 PM, Christopher Gross cogr...@gmail.com wrote: OS: Red Hat Enterprise Linux Server release 5.8 (Tikanga) java version 1.6.0_30 Apache Ant version 1.7.0 compiled on December 13 2006 OK, I'm not going

Re: Building Nutch 2.0

2012-10-01 Thread Christopher Gross
configuration needed), but 2.1 seems to be a completely different beast. -- Chris On Mon, Oct 1, 2012 at 12:12 PM, Christopher Gross cogr...@gmail.com wrote: I get that same error for 2.1 as well, FYI. -- Chris On Mon, Oct 1, 2012 at 12:10 PM, Christopher Gross cogr...@gmail.com wrote: OS

Re: Building Nutch 2.0

2012-10-01 Thread Christopher Gross
your webdb or hostdb crawl data. Any questions, please get back on list. hth Lewis On Mon, Oct 1, 2012 at 6:02 PM, Christopher Gross cogr...@gmail.com wrote: I was able to get it built on my windows box, then I moved that to my linux box. Now I'm trying to run it, but getting other errors

Re: Building Nutch 2.0

2012-10-01 Thread Christopher Gross
lewis.mcgibb...@gmail.com wrote: Hi Chris, On Mon, Oct 1, 2012 at 7:09 PM, Christopher Gross cogr...@gmail.com wrote: We have ports blocked on our box, so that may be causing issues with Ivy (which is why I prefer just standard ant and having all the required jars sitting in a lib directory). Well

Re: Building Nutch 2.0

2012-10-01 Thread Christopher Gross
this laid out? Should I be running out of the 'runtime' dir, or is it fine that I've pulled all those files out and into a /proj/nutch-2.1/ directory (so there's a bin, conf, lib, ..etc.. in there, with NUTCH_HOME pointing to that dir). -- Chris On Mon, Oct 1, 2012 at 2:53 PM, Christopher Gross

Re: Building Nutch 2.0

2012-10-01 Thread Christopher Gross
? -- Chris On Mon, Oct 1, 2012 at 3:59 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi Chris, On Mon, Oct 1, 2012 at 8:52 PM, Christopher Gross cogr...@gmail.com wrote: OK, I added the port being used by hbase to iptables, and now I'm farther. I'm getting: 12/10/01 19:44:17 ERROR

Error with ppt/pptx files

2012-04-16 Thread Christopher Gross
Hi all. I'm running Nutch 1.4 with Java 1.6.0_30. I'm trying to have it crawl a directory with test files and I'm getting an error on ppt and pptx files. It can get pdf, doc/docx, xls/xlsx, but for whatever reason it flips out on PowerPoint. I can attach the document if need be. Below is a

Re: Crawling with Certs

2012-03-08 Thread Christopher Gross
http.content.limit accommodate this? Also you ARE getting back that the content metadata connection appears to be closed! Maybe there are some other credentials to be supplied for crawling certificate authenticated sites... I really don't know. On Wed, Mar 7, 2012 at 9:28 PM, Christopher Gross cogr

Re: Crawling with Certs

2012-03-08 Thread Christopher Gross
ideas on where to look for that? The Authentication Schemes page (http://wiki.apache.org/nutch/HttpAuthenticationSchemes) fails to mention certs. -- Chris On Thu, Mar 8, 2012 at 8:16 AM, Christopher Gross cogr...@gmail.com wrote: The page text is pretty small -- I just made a few quick pages

Crawling with Certs

2012-03-07 Thread Christopher Gross
Is there any good documentation for setting up Nutch to crawl HTTPS sites using a certificate? I've poked around on the wiki and tried some google searches without much luck. I'm using Nutch 1.4. Thanks! -- Chris

Re: Crawling with Certs

2012-03-07 Thread Christopher Gross
On Wednesday, March 7, 2012, Christopher Gross cogr...@gmail.com wrote: Is there any good documentation for setting up Nutch to crawl HTTPS sites using a certificate?  I've poked around on the wiki and tried some google searches without much luck. I'm using Nutch 1.4. Thanks! -- Chris

Re: Crawling with Certs

2012-03-07 Thread Christopher Gross
. ParserChecker, debug-level log info, ... BTW, which authentication scheme is required by your site? For NTLMv2 is poorly supported Remi On Wednesday, March 7, 2012, Christopher Gross cogr...@gmail.com wrote: I have protocol-httpclient set. I can't see how I'm supposed to do the certs.  I can't

Re: Crawling with Certs

2012-03-07 Thread Christopher Gross
: Connection=close Content-Type=text/html Parse Metadata: CharEncodingForConversion=windows-1252 OriginalCharEncoding=windows-1252 - ParseText - -- Chris On Wed, Mar 7, 2012 at 4:22 PM, Christopher Gross cogr...@gmail.com wrote: Well, NTLM is a windows thing with a username

Nutch data to Solr on HTTPS

2012-02-23 Thread Christopher Gross
I have my Solr set up on a secure port -- and I think that is causing a problem for nutch (nothing else changed.) I don't see anything in the documentation regarding this. My nutch version is 1.2, Solr is 3.4. Here's the line from my runbot.sh script: $NUTCH_HOME/bin/nutch solrindex

Re: Nutch data to Solr on HTTPS

2012-02-23 Thread Christopher Gross
Meant to include this...the output from the runbot.sh script. Not that it really says a whole lot... - Index (Step 5 of 8) - SolrIndexer: starting at 2012-02-23 18:18:20 java.io.IOException: Job failed! -- Chris On Thu, Feb 23, 2012 at 1:26 PM, Christopher Gross cogr...@gmail.com

Re: Nutch data to Solr on HTTPS

2012-02-23 Thread Christopher Gross
PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Yeah I can confirm it was 1.4 On Thu, Feb 23, 2012 at 7:05 PM, Christopher Gross cogr...@gmail.com wrote: I tried using 1.4, but I couldn't get that to work at all. What is wrong with your configuration, if this is all that is preventing

Re: Missing document

2011-12-20 Thread Christopher Gross
I don't think it's a redirect, unless SharePoint made it one. Any idea how to check for that? -- Chris On Mon, Dec 19, 2011 at 5:15 PM, Markus Jelsma markus.jel...@openindex.io wrote: Half-way, it's clear in the log. Is your document a redirect? I've not yet seen such a log line before. *

problem with tutorial

2011-12-19 Thread Christopher Gross
I created a script based on: http://wiki.apache.org/nutch/NutchTutorial#A3.2_Using_Individual_Commands_for_Whole-Web_Crawling The script is similar to what happens in the old runbot.sh script, but it isn't working for me. The part that does the s1 barely gets anything, but then the s2 fails

Re: problem with tutorial

2011-12-19 Thread Christopher Gross
On Mon, Dec 19, 2011 at 2:17 PM, Markus Jelsma markus.jel...@openindex.io wrote: What do you mean by skipping over? You don't want ppt pptx and things? In all cases you need to set up URL filters specific for your scenario and wishes. I want to index all the office type documents, they're

Re: Missing document

2011-12-19 Thread Christopher Gross
Hmm, the status db_gone prevents it from being indexed, of course. It is perfectly possible for the checkers to pass but for the fetcher to fail. There may have been an error, and I remember you using a proxy earlier; that's likely the problem here too. The checkers don't use proxy

Re: problem with tutorial

2011-12-19 Thread Christopher Gross
The URLFilterChecker tool doesn't have a page yet...what are the syntax and parameters for it? -- Chris On Mon, Dec 19, 2011 at 2:33 PM, Markus Jelsma markus.jel...@openindex.io wrote: On Mon, Dec 19, 2011 at 2:17 PM, Markus Jelsma markus.jel...@openindex.io wrote: What do you mean by

Re: problem with tutorial

2011-12-19 Thread Christopher Gross
Does it normally take a long time to run? It's been going about 5 minutes... -- Chris On Mon, Dec 19, 2011 at 2:43 PM, Markus Jelsma markus.jel...@openindex.io wrote: bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined

Re: problem with tutorial

2011-12-19 Thread Christopher Gross
./nutch org.apache.nutch.net.URLFilterChecker -allCombined in.txt Checking combination of all URLFilters available +http://urlAlpha.docx So it looks like it is a valid one, right? Any other testing tools to try? -- Chris On Mon, Dec 19, 2011 at 2:52 PM, Markus Jelsma
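In the checker output quoted above, the leading `+` marks a URL that passed the combined filters; a rejected URL is printed with a leading `-` instead. When checking a longer list, a small helper can tally the two cases. The following is only a sketch: the function name `summarize_filter_output` and the second sample URL are made up for illustration, and it assumes every result line begins with `+` or `-` while banner lines such as "Checking combination of all URLFilters available" begin with neither.

```shell
#!/bin/sh
# Summarize URLFilterChecker-style output read from stdin.
# Assumption: result lines start with '+' (accepted) or '-' (rejected);
# any other line (for example the banner line) is ignored.
summarize_filter_output() {
    awk '
        /^\+/ { accepted++ }
        /^-/  { rejected++ }
        END   { printf "accepted=%d rejected=%d\n", accepted, rejected }
    '
}

# Example with sample lines; the second URL is hypothetical.
printf '%s\n' \
    'Checking combination of all URLFilters available' \
    '+http://urlAlpha.docx' \
    '-http://urlBeta.exe' | summarize_filter_output
```

In practice the checker's real output would be piped into the function in place of the sample `printf` lines.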

Re: Missing document

2011-12-19 Thread Christopher Gross
I'm a little confused -- should I set up a whole other instance of nutch, crawldb, etc? Set the log to trace, I think this helps to tell why. 2011-12-19 20:14:10,716 INFO  crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule 2011-12-19

Re: Missing document

2011-12-19 Thread Christopher Gross
Not sure where fetching starts... 2011-12-19 20:13:53,223 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule 2011-12-19 20:13:53,223 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000 2011-12-19 20:13:53,223 INFO
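For readers of the log excerpt above: `defaultInterval=2592000` is easier to interpret once converted to days. The sketch below assumes the value is expressed in seconds, as it is for Nutch's `db.fetch.interval.default` property; the helper name `seconds_to_days` is made up for illustration.

```shell
#!/bin/sh
# Convert an interval given in seconds to whole days (integer division).
# Assumption: the logged defaultInterval value is in seconds.
seconds_to_days() {
    echo $(( $1 / 86400 ))
}

seconds_to_days 2592000   # prints 30
```

Under that assumption, the logged default means a page becomes due for re-fetching roughly 30 days after its last fetch.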

Re: updates to runbot.sh

2011-12-16 Thread Christopher Gross
to be filtered out. Not sure how to make it right... On Friday, December 16, 2011, Christopher Gross cogr...@gmail.com wrote: http://wiki.apache.org/nutch/Crawl This script no longer works. See: echo - Index (Step 5 of $steps) - $NUTCH_HOME/bin/nutch index crawl/NEWindexes crawl/crawldb

Crawling Sharepoint

2011-12-15 Thread Christopher Gross
I'm able to start crawling a SharePoint site, but then I get this for the body of ALL the pages it finds: You may be trying to access this site from a secured browser on the server. Please enable scripts and reload this page. Turn on more accessible mode Turn off more accessible mode Skip Ribbon