Re: file:/// URLs with spaces in path

2013-08-08 Thread Bai Shen
them over is time trivial... also the workaround is annoying knowing that we have a protocol-file plugin. Thanks for the help Lewis On Wednesday, August 7, 2013, Bai Shen baishen.li...@gmail.com wrote: Is it possible to run a web server and connect to them that way? That was what I

Re: Incorrect fetch time

2013-08-08 Thread Bai Shen
it was the problem. However, the behavior occurs with both it and the default scheduler. Did you then start from scratch again? Otherwise the next fetch time is still far in the future and the fetch interval stays too large. Sebastian On 08/07/2013 03:30 PM, Bai Shen wrote: Sorry

Re: file:/// URLs with spaces in path

2013-08-07 Thread Bai Shen
Is it possible to run a web server and connect to them that way? That was what I ended up doing. On Tue, Aug 6, 2013 at 4:58 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi, Struggling with this one. And yes I acknowledge that it is not really a Nutch based question but

Re: Incorrect fetch time

2013-08-07 Thread Bai Shen
db.fetch.schedule.adaptive.max_interval db.fetch.schedule.adaptive.sync_delta Sebastian On 07/17/2013 06:58 PM, Bai Shen wrote: I'm using Nutch 2.x HEAD with the default scheduler. I have the max fetch interval set to one week and the fetch interval set to one day. Everything seems to work
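
A minimal nutch-site.xml sketch of the adaptive-schedule properties named above; the one-week max interval mirrors this thread's setup and is an assumption, not a recommended value:

    <property>
      <name>db.fetch.schedule.class</name>
      <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
    </property>
    <property>
      <name>db.fetch.schedule.adaptive.max_interval</name>
      <!-- seconds; 604800 = one week, chosen to match the thread -->
      <value>604800</value>
    </property>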

Incorrect fetch time

2013-07-17 Thread Bai Shen
I'm using Nutch 2.x HEAD with the default scheduler. I have the max fetch interval set to one week and the fetch interval set to one day. Everything seems to work correctly for a while. Pages show up as fetched with a fetch time of the next day. However, after a couple of days generate

Re: Batch id and Fetch list

2013-07-15 Thread Bai Shen
.com So when fetch has fetched say page1, page2, page3 from url1 and page4,page5,page6 from url2, after the crawl, how do I tell that page4 is from url2.com and page1 is from url1.com? On Thu, Jul 11, 2013 at 10:54 AM, Bai Shen baishen.li...@gmail.com wrote: Yes, generate marks the urls

Re: Unfetched urls not being generated for fetching.

2013-07-15 Thread Bai Shen
in HBase, most probably it is either 404 or status other than 200. Hope this helps. On Fri, May 24, 2013 at 8:13 AM, Bai Shen baishen.li...@gmail.com wrote: I'm running Nutch 2.1 using HBase. When I run readdb -stats I show that there are 15k unfetched urls. However, when I run generate

Re: Using Batch Id

2013-07-11 Thread Bai Shen
The crawl script doesn't accept Batch ID. So in order to use Batch ID you would run the commands separately, which would not involve depth. Depth is just the number of times to run the generate, fetch, parse, update cycle. Any unfetched pages will not have a Batch ID. The Batch ID only applies
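
A sketch of that separate-commands cycle with an explicit batch id, using the 2.x-era flags discussed in these threads; the id value and topN are made up:

    BATCH=batch-001
    bin/nutch generate -topN 1000 -batchId $BATCH
    bin/nutch fetch $BATCH
    bin/nutch parse $BATCH
    bin/nutch updatedb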

Re: nutch redirection issue

2013-07-11 Thread Bai Shen
Have you done another crawl? By default, Nutch puts the redirect into the database as a new url to be crawled. So you will find the content under the location of the redirect. If I remember correctly, there used to be a setting that would have Nutch follow the redirect instead of storing it as
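
The half-remembered setting is most likely http.redirect.max; a nutch-site.xml sketch, with the value chosen purely for illustration:

    <property>
      <name>http.redirect.max</name>
      <!-- 0 (the default) records redirect targets as new urls to crawl;
           a positive value makes the fetcher follow them immediately -->
      <value>3</value>
    </property>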

Re: Batch id and Fetch list

2013-07-11 Thread Bai Shen
This isn't what Batch ID is for. If you're crawling on only the one server and only want that specific section, use the regex-urlfilter to accept only the specific pages you want. On Tue, Jul 9, 2013 at 3:36 PM, h b hb6...@gmail.com wrote: Hi Use case: * Scrape a given url. e.g.
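
A regex-urlfilter.txt sketch of that accept-one-section approach; host and path are hypothetical:

    # accept only the wanted section of the one server
    +^http://server\.example\.com/wanted/section/
    # reject everything else
    -.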

Re: Batch id and Fetch list

2013-07-11 Thread Bai Shen
, Jul 11, 2013 at 4:25 AM, Bai Shen baishen.li...@gmail.com wrote: This isn't what Batch ID is for. If you're crawling on only the one server and only want that specific section, use the regex-urlfilter to accept only the specific pages you want. On Tue, Jul 9, 2013 at 3:36 PM, h b hb6

Suffix URLFilter not working

2013-06-12 Thread Bai Shen
I'm dealing with a lot of file types that I don't want to index. I was originally using the regex filter to exclude them but it was getting out of hand. I changed my plugin includes from urlfilter-regex to urlfilter-(regex|suffix). I've tried using both the default urlfilter-suffix.txt file
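
For reference, a sketch of the plugin.includes change described above; the value is abbreviated, and the other plugins in the list will vary by setup:

    <property>
      <name>plugin.includes</name>
      <value>protocol-http|urlfilter-(regex|suffix)|parse-(html|tika)|index-basic|scoring-opic</value>
    </property>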

Re: Suffix URLFilter not working

2013-06-12 Thread Bai Shen
Sorry. I forgot to mention that I'm running a 2.x release taken from a few weeks ago. On Wed, Jun 12, 2013 at 8:31 AM, Bai Shen baishen.li...@gmail.com wrote: I'm dealing with a lot of file types that I don't want to index. I was originally using the regex filter to exclude them

Re: Suffix URLFilter not working

2013-06-12 Thread Bai Shen
I figured as much, which is why I'm not sure why it's not working for me. I ran bin/nutch org.apache.nutch.net.URLFilterChecker http://myserver/myurl and it's been thirty minutes with no results. Is there something I should run before running that? Thanks. On Wed, Jun 12, 2013 at 8:34 AM,
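
A plausible explanation for the thirty-minute silence: URLFilterChecker reads urls from standard input, so it sits waiting if the url is only passed as an argument. A usage sketch, with the -allCombined flag (run every configured filter) recalled from that era's usage string rather than verified:

    echo "http://myserver/myurl" | bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined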

Re: Suffix URLFilter not working

2013-06-12 Thread Bai Shen
Doh! I really should just read the code of things before posting. I ran the URLFilterChecker and passed it a url that the SuffixFilter should flag, and it still passed it. However, if I change the url to end in a format that is in the default config file, it rejects the url. So it looks like

Re: Suffix URLFilter not working

2013-06-12 Thread Bai Shen
Turns out it was because I had a copy of the default file sitting in the directory I was calling nutch from. Once I removed that it correctly found my copy in the conf directory. On Wed, Jun 12, 2013 at 9:29 AM, Bai Shen baishen.li...@gmail.com wrote: Doh! I really should just read the code

Re: Extremely long fetch time

2013-06-05 Thread Bai Shen
4, 2013 at 4:31 PM, Bai Shen baishen.li...@gmail.com wrote: I dropped the f family from HBase and readded it. Nutch filled in the columns and now I have sane fetch times. However, my fetchInterval is not being populated and every time I run a crawl I get the same urls. Here is my metadata

Re: Extremely long fetch time

2013-06-04 Thread Bai Shen
attempted to be fetched but that failed and so their retry interval was incremented to a larger value. Can't say for sure though. Can you share the crawl datum ? The status and meta fields can give some clue. On Mon, Jun 3, 2013 at 8:30 AM, Bai Shen baishen.li...@gmail.com wrote: The time

Re: Extremely long fetch time

2013-06-04 Thread Bai Shen
: 1370388764041 prevFetchTime: 0 fetchInterval: 0 retriesSinceFetch: 0 modifiedTime: 0 prevModifiedTime: 0 protocolStatus: SUCCESS parseStatus: success/ok Any ideas why? On Tue, Jun 4, 2013 at 8:16 AM, Bai Shen baishen.li...@gmail.com wrote: I'm looking at my base url (the root of the internal site that I

Extremely long fetch time

2013-06-03 Thread Bai Shen
I'm using the 2.x head and even with adding 30 days I'm not getting any refetches. I did a readdb on my injected url and it says that the fetch time is in 2027. Any idea why this would occur? Will db.fetch.interval.max kick in and cause it to be fetched earlier? Or will I have to manually

Re: Extremely long fetch time

2013-06-03 Thread Bai Shen
3, 2013 at 8:57 PM, Bai Shen baishen.li...@gmail.com wrote: I'm using the 2.x head and even with adding 30 days I'm not getting any refetches. I did a readdb on my injected url and it says that the fetch time is in 2027. Can you share the crawl datum for that url? Any idea
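
A sketch of dumping the stored entry for a single url with the 2.x readdb, presumably how the crawl datum quoted elsewhere in this thread was produced; the url is hypothetical:

    bin/nutch readdb -url http://myserver/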

Re: [REQUEST] (NUTCH-1569) Upgrade 2.x to Gora 0.3

2013-06-03 Thread Bai Shen
The issue with CDH4 is that it uses a newer version of HBase. If Gora 0.3 now supports versions of HBase newer than 0.90 it should fix the CDH4 issues. However, from what I've read, Gora won't support newer HBase versions until 0.4. On Mon, Jun 3, 2013 at 10:24 AM, Tejas Patil

Re: Unfetched urls not being generated for fetching.

2013-05-28 Thread Bai Shen
that unfetched status is not updated in Nutch. I also faced a similar problem [1]. Please open a jira and report any findings. [1] http://find.searchhub.org/document/6e4464919811d20f#c2a5de6e93942ada On Fri, May 24, 2013 at 10:03 AM, Bai Shen baishen.li...@gmail.com wrote: I'm trying

Unfetched urls not being generated for fetching.

2013-05-24 Thread Bai Shen
I'm running Nutch 2.1 using HBase. When I run readdb -stats I show that there are 15k unfetched urls. However, when I run generate -topN 1000 I get no urls to be fetched. Up until now it's been pulling a full thousand urls for each cycle. Any ideas? I'm not sure what to check. Thanks.

Re: Unfetched urls not being generated for fetching.

2013-05-24 Thread Bai Shen
other than 200. Hope this helps. On Fri, May 24, 2013 at 8:13 AM, Bai Shen baishen.li...@gmail.com wrote: I'm running Nutch 2.1 using HBase. When I run readdb -stats I show that there are 15k unfetched urls. However, when I run generate -topN 1000 I get no urls to be fetched. Up

Re: Example crawl script Nutch 2.1

2013-05-17 Thread Bai Shen
I just tested the GeneratorJob portion and it works fine. I have two comments, though. 1. I added braces around the -batchId arg if statement. I don't like ifs without them. 2. BatchIds never get cleared. So if you use the same batchId for multiple crawl cycles your urls per batch will

HBase dependency removed from HEAD?

2013-05-13 Thread Bai Shen
I'm trying to set up nutch using HEAD instead of 2.1. I went to change ivy.xml to uncomment the HBase dependency before calling ant and it's not there. Has this been removed? Is there a new way to set up HBase integration? Thanks.

Re: HBase dependency removed from HEAD?

2013-05-13 Thread Bai Shen
NM, I grabbed trunk instead of 2.x. On Mon, May 13, 2013 at 7:25 AM, Bai Shen baishen.li...@gmail.com wrote: I'm trying to set up nutch using HEAD instead of 2.1. I went to change ivy.xml to uncomment the HBase dependency before calling ant and it's not there. Has this been removed

Re: Solrindex -all not working correctly

2013-05-13 Thread Bai Shen
you to upgrade to 2.x HEAD. On Wed, May 1, 2013 at 4:32 AM, Bai Shen baishen.li...@gmail.com wrote: My crawl loop consists of the following. generate -topN fetch -all parse -all updatedb solrindex -all With the fetch and parse the -all only pulls the batch that was generated

Re: Solrindex -all not working correctly

2013-05-02 Thread Bai Shen
? On Wed, May 1, 2013 at 7:32 AM, Bai Shen baishen.li...@gmail.com wrote: My crawl loop consists of the following. generate -topN fetch -all parse -all updatedb solrindex -all With the fetch and parse the -all only pulls the batch that was generated, skipping all of the other

Re: Solrindex -all not working correctly

2013-05-02 Thread Bai Shen
I'm using 2.1. Are there any other notable changes for using the HEAD instead? On Wed, May 1, 2013 at 2:13 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: What version are you using? If you can I would advise you to upgrade to 2.x HEAD. On Wed, May 1, 2013 at 4:32 AM, Bai Shen

Gora not finding HBase master

2013-05-02 Thread Bai Shen
Apologies for the cross posting, but I'm not sure where the configuration I'm missing lies. I have Nutch 2.1 connecting to HBase 0.90.6 with everything running locally on the same machine. Now I'm trying to move them to separate machines. I added hbase.zookeeper.quorum in the Nutch

Re: Gora not finding HBase master

2013-05-02 Thread Bai Shen
It turned out to be an /etc/hosts issue. I needed to remove the hbase host name from 127.0.0.1 and add a separate line with the machine's IP and host name. Then I had to duplicate that on the Nutch machine. On Thu, May 2, 2013 at 9:29 AM, Bai Shen baishen.li...@gmail.com wrote: Apologies
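
A sketch of the resulting /etc/hosts, with names and address invented for illustration:

    127.0.0.1     localhost
    192.168.1.20  hbase-host
    # the 192.168.1.20 line is duplicated in /etc/hosts on the Nutch machine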

Re: Solrindex adding documents in small chunks

2013-05-01 Thread Bai Shen
: It does not explicitly trigger a server side commit. </description> </property> On Thu, Apr 25, 2013 at 4:35 PM, Bai Shen baishen.li...@gmail.com wrote: I'm having two problems with the solrindex job in Nutch 2.1 When I run it with -all, it indexes every single parsed document, not just
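
The description quoted above appears to belong to solr.commit.size, which controls the small-chunk behavior complained about in this thread; a nutch-site.xml sketch (1000 is, as far as I recall, the default):

    <property>
      <name>solr.commit.size</name>
      <!-- documents sent to Solr per update batch -->
      <value>1000</value>
    </property>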

Re: Nutch 2 hanging after aborting hung threads

2013-05-01 Thread Bai Shen
relaxing my other settings to see if the errors and hung threads come back yet. On Tue, Apr 30, 2013 at 12:50 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: That would be very much appreciated. Lewis On Tue, Apr 30, 2013 at 5:00 AM, Bai Shen baishen.li...@gmail.com wrote: I'll

Re: Nutch 2 hanging after aborting hung threads

2013-04-30 Thread Bai Shen
I'll let you know if I figure out any good defaults. Thanks. On Sat, Apr 27, 2013 at 5:30 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi Bai, On Thu, Apr 25, 2013 at 4:33 AM, Bai Shen baishen.li...@gmail.com wrote: Well, I still ended up having to set a content limit

Remove fetched files from HBase after parse

2013-04-30 Thread Bai Shen
Is there a way to remove the fetched files from HBase after they've been parsed? I'm running things locally and don't have the storage space to store all of the fetched files. Thanks.

Re: Nutch 2 hanging after aborting hung threads

2013-04-25 Thread Bai Shen
a question for the Gora experts. I was finally able to get the fetch to complete by setting the Nutch heap to 4GB and the HBase heap to 4GB. A heap size 4 times the document size doesn't seem that much ;-) On 04/24/2013 01:34 PM, Bai Shen wrote: It doesn't take that long on my local machine

Solrindex adding documents in small chunks

2013-04-25 Thread Bai Shen
I'm having two problems with the solrindex job in Nutch 2.1 When I run it with -all, it indexes every single parsed document, not just the newly generated ones, as fetch and parse do. Secondly, it's adding my documents in small chunks. I was fetching in 100 document cycles and when I run

Re: Nutch 2 hanging after aborting hung threads

2013-04-24 Thread Bai Shen
these values, but a single fetch should never take 5min. Sebastian On 04/23/2013 06:17 PM, Bai Shen wrote: Anything larger than the default http.content.limit. I'm crawling an internal server and we have some large files. That's why I had increased the heap size to 8G. When I run
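
A nutch-site.xml sketch of the property under discussion; -1 disables truncation entirely, at the cost of the heap pressure described above:

    <property>
      <name>http.content.limit</name>
      <!-- bytes; the stock default is 65536 -->
      <value>-1</value>
    </property>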

Re: Nutch 2 hanging after aborting hung threads

2013-04-23 Thread Bai Shen
of the files which were truncated? thank you Lewis On Tuesday, April 23, 2013, Bai Shen baishen.li...@gmail.com wrote: I just set http.content.limit back to the default and my fetch completed successfully on the server. However, it truncated several of my files. Also, my server is running

Nutch 2 hanging after aborting hung threads

2013-04-22 Thread Bai Shen
I'm crawling a local server. I have Nutch 2 working on a local machine with the default 1G heap size. I got several OOM errors, but the fetch eventually finishes. In order to get rid of the OOM errors, I moved everything to a machine with more memory and increased the heap size to 8G. However,

Re: Nutch 2 hanging after aborting hung threads

2013-04-22 Thread Bai Shen
://stackoverflow.com/questions/10331440/nutch-fetcher-aborting-with-n-hung-threads Also keep track of: https://issues.apache.org/jira/browse/NUTCH-1182 Sebastian On 04/22/2013 08:18 PM, Bai Shen wrote: I'm crawling a local server. I have Nutch 2 working on a local machine with the default 1G

Re: Nutch 2 hanging after aborting hung threads

2013-04-22 Thread Bai Shen
keep track of: https://issues.apache.org/jira/browse/NUTCH-1182 Sebastian On 04/22/2013 08:18 PM, Bai Shen wrote: I'm crawling a local server. I have Nutch 2 working on a local machine with the default 1G heap size. I got several OOM errors, but the fetch eventually finishes. In order

Re: Root slash being stripped from file path

2013-03-28 Thread Bai Shen
at 12:26 PM, Bai Shen baishen.li...@gmail.com wrote: I'm trying to crawl a local file system. I've made the changes to not ignore file urls and added protocol-file to the plugins list. I've included file:///data/mydir in my url file. However, when I run the fetch, Nutch tries to connect

Re: Root slash being stripped from file path

2013-03-28 Thread Bai Shen
. I don't know about the progress on it. There are most certainly open/resolved tickets for it on Jira; please look there. Thank you Lewis On Wed, Mar 27, 2013 at 12:26 PM, Bai Shen baishen.li...@gmail.com wrote: I'm trying to crawl a local file system. I've made the changes to not ignore

Root slash being stripped from file path

2013-03-27 Thread Bai Shen
I'm trying to crawl a local file system. I've made the changes to not ignore file urls and added protocol-file to the plugins list. I've included file:///data/mydir in my url file. However, when I run the fetch, Nutch tries to connect to file://data/mydir and therefore returns a 404 error. I
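
The "changes to not ignore file urls" usually mean editing the stock reject rule in conf/regex-urlfilter.txt (alongside adding protocol-file to plugin.includes); a sketch, with the accept line specific to this thread's directory:

    # stock rule: -^(file|ftp|mailto):
    # keep ftp and mailto excluded but let file: urls through
    -^(ftp|mailto):
    +^file:///data/mydir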

Re: Changing Nutch heap size when using LocalJobRunner

2012-10-17 Thread Bai Shen
NM. I didn't realize the distributed launcher ignores NUTCH_HEAP_SIZE. You have to use HADOOP_HEAP_SIZE. On Wed, Oct 17, 2012 at 3:33 PM, Bai Shen baishen.li...@gmail.com wrote: I'm running Nutch in a distributed deployment using LocalJobRunner. I'm trying to increase the heap size

Re: NutchDocument API change in Nutch 2

2012-10-12 Thread Bai Shen
Doh! I misread the code when I was looking at it. Thanks. On Fri, Oct 12, 2012 at 10:04 AM, Julien Nioche lists.digitalpeb...@gmail.com wrote: Just look at the existing plugins e.g. anchor : call add() as many times as you have values On 12 October 2012 14:43, Bai Shen baishen.li

Re: Referencing files in job from plugin

2012-10-11 Thread Bai Shen
are big, or change a lot. On Thu, Oct 11, 2012 at 1:20 PM, Bai Shen baishen.li...@gmail.com wrote: I need to reference a file from my plugin. However, when I try to call it using File(blah.txt), it looks for the file at the location where I run nutch from, not in the job file. What

Re: Referencing files in job from plugin

2012-10-11 Thread Bai Shen
. On Thu, Oct 11, 2012 at 2:52 PM, Bai Shen baishen.li...@gmail.com wrote: I'm trying the first method now. However, some of my code requires a file object that is a directory. So I can't use the first method with it. If I put something in the /classes, what path do I reference it with? Do

Re: Referencing files in job from plugin

2012-10-11 Thread Bai Shen
, 2012 at 3:41 PM, Bai Shen baishen.li...@gmail.com wrote: I'll give that a try. I know when I was doing new File(dir/blah.txt) it was looking in the directory on the client machine that I was running the job from. On Thu, Oct 11, 2012 at 9:30 AM, Ferdy Galema ferdy.gal...@kalooga.com

Re: Referencing files in job from plugin

2012-10-11 Thread Bai Shen
is expanded into the current working directory of a running task. Last but not least you are able to use shared filesystem, for example the HDFS or the mapreduce DistributedCache. This is useful if the files are big, or change a lot. On Thu, Oct 11, 2012 at 1:20 PM, Bai Shen baishen.li

Re: [PING] [VOTE] Apache Nutch 2.1 Release Candidate Available

2012-10-03 Thread Bai Shen
I just tried to run it and I'm getting the following bug on CDH4. https://issues.apache.org/jira/browse/NUTCH-1447 On Mon, Oct 1, 2012 at 8:17 AM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi All, Anyone else for this VOTE? Sorry to be a pest! Thanks Lewis On Fri, Sep

Re: [PING] [VOTE] Apache Nutch 2.1 Release Candidate Available

2012-10-03 Thread Bai Shen
it to work on other distributions then the better it is but this can't be considered a bug or a blocker for the release On 3 October 2012 14:10, Bai Shen baishen.li...@gmail.com wrote: I just tried to run it and I'm getting the following bug on CDH4. https://issues.apache.org/jira/browse

Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected

2012-10-01 Thread Bai Shen
I'm getting the following error trying to run Nutch 2 on CDH4. A lot of people seem to be running into this problem, and there's even a JIRA bug opened, but I haven't seen any comments on it. Anybody have any suggestions to work around it? https://issues.apache.org/jira/browse/NUTCH-1447

Re: Fix for binary operator expected error

2012-09-28 Thread Bai Shen
. It is the intention to integrate this into 2.x once it has been tested enough. The glitch you highlight is exactly the type of stuff we need to find. Thanks Lewis [0] https://issues.apache.org/jira/browse/NUTCH-1087 On Fri, Sep 28, 2012 at 2:52 PM, Bai Shen baishen.li...@gmail.com wrote: When

Re: Nutch 2.1 Advice, thoughts and comments on crawl performance, indexing and deployment?

2012-09-27 Thread Bai Shen
Two things to note. db.ignore.external.links doesn't quite work the way you think it should. If you have a url inside the domain that resolves to a url outside the domain, nutch will end up indexing that domain as well. The way to get around this is to use the whitelist instead of

Nutch 2 on Hadoop

2012-09-27 Thread Bai Shen
I'm attempting to get Nutch 2 running on a CDH4 cluster. I'm having the following problem. [root@node4-0 nutch]# bin/nutch readdb -stats bin/nutch: line 98: [: /opt/nutch/apache-nutch-2.1-SNAPSHOT.job: binary operator expected Error: Could not find or load main class

Re: Nutch 2 solrindex fails with no error

2012-09-17 Thread Bai Shen
lewis.mcgibb...@gmail.com wrote: Solr logs? On Fri, Sep 14, 2012 at 9:33 PM, Bai Shen baishen.li...@gmail.com wrote: I have a nutch 2 setup that I got working with solr about a month ago. I had to shelve it for a little while and I've recently come back to it. Everything seems to be working

Re: Nutch 2 solrindex fails with no error

2012-09-17 Thread Bai Shen
The problem appears to be that Nutch is not sending anything to solr. But I can't seem to find a reason in nutch as to why this is. On Sat, Sep 15, 2012 at 7:36 AM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Solr logs? On Fri, Sep 14, 2012 at 9:33 PM, Bai Shen baishen.li

Re: MoreIndexingFilter plugin failing with NPE

2012-08-13 Thread Bai Shen
Sorry, I keep forgetting. I'm using the Nutch 2.x branch as of last week. However, there hasn't been a change to the filter in a month or so. It was parsed correctly as far as I can tell. I'm seeing the same content in solr as what I see in the browser. On Mon, Aug 13, 2012 at 9:19 AM, Markus

Re: java.lang.OutOfMemoryError: GC overhead limit exceeded

2012-08-09 Thread Bai Shen
with these parameters: <property> <name>mapred.child.java.opts</name> <value>-Xmx1600m -XX:-UseGCOverheadLimit -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/tmp/</value> </property> On Wed, Aug 8, 2012 at 9:32 PM, Bai Shen baishen.li...@gmail.com wrote: Is this something other people

java.lang.OutOfMemoryError: GC overhead limit exceeded

2012-08-08 Thread Bai Shen
Is this something other people are seeing? I was parsing 10k urls when I got this exception. I'm running Nutch 2 head as of Aug 6 with the default memory settings (1 GB). Just wondering if anybody else has experienced this on Nutch 2. Thanks.

Re: Nutch 2 plugins

2012-08-07 Thread Bai Shen
error. So to fix it: Apply patch in NUTCH-1444 (https://issues.apache.org/jira/browse/NUTCH-1444) or update to Nutch2x head. Ferdy On Mon, Aug 6, 2012 at 9:21 PM, Bai Shen baishen.li...@gmail.com wrote: I'm working on writing a Nutch 2 plugin. Whenever something is configured wrong, I

Re: Nutch 2 plugins

2012-08-07 Thread Bai Shen
in NUTCH-1444 (https://issues.apache.org/jira/browse/NUTCH-1444) or update to Nutch2x head. Ferdy On Mon, Aug 6, 2012 at 9:21 PM, Bai Shen baishen.li...@gmail.com wrote: I'm working on writing a Nutch 2 plugin. Whenever something is configured wrong, I don't get any valid logging info

Re: Different batch id

2012-08-02 Thread Bai Shen
I just tried running this with the actual batch Id instead of using -all, and I'm still getting similar results. On Mon, Jul 30, 2012 at 1:12 PM, Bai Shen baishen.li...@gmail.com wrote: I set up Nutch 2.x with a new instance of HBase. I ran the following commands. bin/nutch inject urls bin

Nutch 2 solrindex

2012-08-01 Thread Bai Shen
I'm trying to crawl using Nutch 2. However, I can't seem to get it to index to solr without adding -reindex to the command. And at that point it indexes everything I've crawled. I've tried both -all and the batch id, but neither one results in anything being indexed to solr. Any suggestions of

Re: Different batch id

2012-07-31 Thread Bai Shen
Is there a specific place it's located? I turned on debugging, but I'm not seeing a batch id. On Mon, Jul 30, 2012 at 1:14 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Can you stick on debug logging and see what the batch ID's actually are? On Mon, Jul 30, 2012 at 6:12 PM, Bai

Re: Different batch id

2012-07-31 Thread Bai Shen
updatedb. So, each generate command assigned different batchIds to its own set of urls. Alex. -Original Message- From: Bai Shen baishen.li...@gmail.com To: user user@nutch.apache.org Sent: Tue, Jul 31, 2012 10:26 am Subject: Re: Different batch id Is there a specific place it's

Re: Nutch 1.5 - Error: Java heap space during MAP step of CrawlDb update

2012-06-27 Thread Bai Shen
It's your topN. You're pulling too many urls at one time. At the end of each fetch map task, Nutch pulls all of the downloaded data into memory in order to do the merge and sort. If that data is more than your allocated memory, then you'll get a java heap exception. There are only two ways to

Slow url filtering

2012-06-27 Thread Bai Shen
Currently, I have roughly 10M records in my crawldb. I added some regexes to remove some urls from my crawldb. Nothing complicated. However, when I run with filtering turned on, the updatedb job took 118 hours. Looking in the regex-urlfilter.txt file, I noticed some of the other regexes are

Re: URL filtering and normalization

2012-06-11 Thread Bai Shen
be necessary any more. On Fri, Jun 8, 2012 at 3:53 PM, Bai Shen baishen.li...@gmail.com wrote: I'm attempting to filter during the generating. I removed the noFilter and noNorm flags from my generate job. I have around 10M records in my crawldb. The generate job has been running

Re: URL filtering and normalization

2012-06-08 Thread Bai Shen
I'm attempting to filter during the generating. I removed the noFilter and noNorm flags from my generate job. I have around 10M records in my crawldb. The generate job has been running for several days now. Is there a way to check its progress and/or make sure it's not hung? Also, is there a

URL filtering and normalization

2012-05-22 Thread Bai Shen
Somehow my crawler started fetching youtube. I'm not really sure why as I have db.ignore.external.links set to true. I've since added the following line to my regex-urlfilter.txt file. -^http://www\.youtube\.com/ However, I'm still seeing youtube urls in the fetch logs. I'm using the

Re: URL filtering and normalization

2012-05-22 Thread Bai Shen
On Tue, May 22, 2012 at 2:00 PM, Markus Jelsma markus.jel...@openindex.io wrote: -Original message- From: Bai Shen baishen.li...@gmail.com Sent: Tue 22-May-2012 19:40 To: user@nutch.apache.org Subject: URL filtering and normalization Somehow my crawler started fetching

Re: HTML documents with TXT extension

2012-05-11 Thread Bai Shen
parse an HTML file with a .txt extension just as a normal HTML file, at least, here it does. What does your parserchecker say? In any case you must strip potential left-over HTML in your Solr analyzer, if left like this it's a bad XSS vulnerability. Cheers On Tue, 8 May 2012 08:34:58 -0400, Bai

HTML documents with TXT extension

2012-05-08 Thread Bai Shen
Nutch ended up crawling some HTML files that had a TXT extension. Because of this (I assume), it didn't strip out the HTML. So now I have weird formatting on my results page. Is there a way to fix this on the Nutch side so it doesn't happen again?

Language Identification

2012-04-20 Thread Bai Shen
I'm working on using Shuyo's work to improve the language identification of our search. Apparently, it's been moved from Nutch to Solr. Is there a reason for this? http://code.google.com/p/language-detection/issues/detail?id=34 I would prefer to have the processing done in Nutch as that has

Re: Java out of memory error

2012-02-09 Thread Bai Shen
I ended up using a shell script instead of doing it in Java, so I never spent any more time investigating it. On Wed, Feb 8, 2012 at 2:10 PM, webdev1977 webdev1...@gmail.com wrote: I am curious as to what ever came of this? I am having the exact same issue with nutch 1.3 -- View this

Re: Fetching large files

2012-01-13 Thread Bai Shen
segments, parse them, then merge incrementally rather than attempting to merge several larger segments at once? Are you getting any IO problems when parsing the segments? If so this may be an early warning light to attack the problem from another angle. On Thu, Jan 12, 2012 at 4:41 PM, Bai Shen

Fetching large files

2012-01-12 Thread Bai Shen
I'm using nutch in distributed mode. I'm crawling large files (a bunch of videos), and when the fetcher map job goes to merge the spill files in order to send to the reduce I'm getting an OOM exception. It appears to be because the merge is attempting to merge the data from all of the fetched

Re: Continuous Crawling

2012-01-03 Thread Bai Shen
with bash and fetch and parse. On Thursday 29 December 2011 21:29:02 Bai Shen wrote: Currently, I'm using a shell script to run my nutch crawl. It seems to work okay, but it only generates one segment at a time. Does anybody have any suggestions for how to improve it, make it work
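
A minimal sketch of such a bash loop, using the 1.x-era commands quoted throughout these threads; paths and topN are hypothetical:

    #!/bin/bash
    bin/nutch inject crawl/crawldb urls
    while true; do
      bin/nutch generate crawl/crawldb crawl/segments -topN 1000
      SEGMENT=$(ls -d crawl/segments/* | tail -1)   # newest segment
      bin/nutch fetch "$SEGMENT"
      bin/nutch parse "$SEGMENT"
      bin/nutch updatedb crawl/crawldb "$SEGMENT"
    done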

Re: Java out of memory error

2011-12-23 Thread Bai Shen
script instead since that doesn't have any of the problems. On Fri, Dec 23, 2011 at 5:08 AM, Markus Jelsma markus.jel...@openindex.io wrote: On Thursday 22 December 2011 19:36:29 Bai Shen wrote: How does the whole multiple segments work? Use the generator to create multiple segments in one go

Re: Multiple values encountered for non multivalued field

2011-12-23 Thread Bai Shen
schema field for title multiValued and deal with it appropriately in your search front-end. Cheers On Wednesday 02 November 2011 15:02:11 Bai Shen wrote: Found it right after I asked. :) BTW, the command is wrong on the wiki. I need to get around to making

Re: Java out of memory error

2011-12-22 Thread Bai Shen
both just infinite loops that call the various nutch parts in order. On Mon, Dec 19, 2011 at 10:08 AM, Markus Jelsma markus.jel...@openindex.io wrote: On Monday 19 December 2011 15:57:02 Bai Shen wrote: AFAIK, mapred.map.child.java.opts is not set, but I'll double check. When you say

Fetch Retries

2011-12-22 Thread Bai Shen
I'm using the default db.fetch.retry.max value of 3, but I'm seeing retry counts as high as 14 in the crawldb stats output. Any ideas why this is and how to change it?

Re: Nutch Hadoop Optimization

2011-12-16 Thread Bai Shen
the parsing of these segments that is taking time... no? On Thu, Dec 15, 2011 at 9:57 PM, Bai Shen baishen.li...@gmail.com wrote: On Thu, Dec 15, 2011 at 12:47 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: This is overwhelmingly weighted towards Hadoop configuration

Java out of memory error

2011-12-16 Thread Bai Shen
I've tried running Nutch in local, pseudo, and full distributed mode, and I keep getting OutOfMemoryErrors. I'm running Nutch using a slightly modified version of the Crawler code that's included. Basically, I've modified it to continuously crawl instead of stopping after a set number of cycles.

Re: Java out of memory error

2011-12-16 Thread Bai Shen
where to find it in Cloudera. On Fri, Dec 16, 2011 at 11:38 AM, Markus Jelsma markus.jel...@openindex.io wrote: What jobs exit with OOM? What is your heap size for the mapper and reducer? On Friday 16 December 2011 17:13:45 Bai Shen wrote: I've tried running Nutch in local, pseudo, and full

Nutch Hadoop Optimization

2011-12-15 Thread Bai Shen
So I have Nutch running on a hadoop cluster with three data nodes. The machines are all pretty beefy, but Nutch isn't performing any faster than when I was running in pseudo mode on one machine. How do I set up Nutch to take full advantage of the cluster? Thanks.

Re: Nutch Hadoop Optimization

2011-12-15 Thread Bai Shen
On Thu, Dec 15, 2011 at 12:47 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: This is overwhelmingly weighted towards Hadoop configuration. There are some guidance notes on the Nutch wiki for performance issues so you may wish to give them a try first. -- Lewis I'm assuming

Re: new nutch tool

2011-12-06 Thread Bai Shen
I'm definitely interested in such a tool. We've been considering moving to ElasticSearch, so if you need a tester, let me know. On Mon, Dec 5, 2011 at 6:38 PM, Tim Pease tim.pe...@gmail.com wrote: I am in the process of writing a new Nutch tool that will index documents into the ElasticSearch

Re: Continuous crawling

2011-11-30 Thread Bai Shen
a ClassNotFoundException. On Mon, Nov 28, 2011 at 3:38 PM, Bai Shen baishen.li...@gmail.com wrote: I've changed nutch to use the pseudo-distributed mode, but it keeps erroring out that no agent is listed in the http.agent.name property. I copied over my conf directory from local, but that didn't fix

Re: Continuous crawling

2011-11-28 Thread Bai Shen
On 28 November 2011 14:09, Bai Shen baishen.li...@gmail.com wrote: We looked at the hadoop reporter and aren't sure how to access it with nutch. Is there a certain way it works? Can you give me an example? Thanks. On Mon, Nov 21, 2011 at 3:11 PM, Markus Jelsma markus.jel

Re: Continuous crawling

2011-11-21 Thread Bai Shen
On Thu, Nov 10, 2011 at 3:32 PM, Markus Jelsma markus.jel...@openindex.io wrote: Interesting. How do you tell if the segments have been fetched, etc? After a job the shell script waits for its completion and return code. If it returns 0 all is fine and we move it to another queue. If != 0

Re: Fetch log error

2011-11-10 Thread Bai Shen
through the net. On Mon, Oct 31, 2011 at 4:47 PM, Bai Shen baishen.li...@gmail.com wrote: We just did an ant clean and rebuilt nutch and we're still seeing the same error in the logs. On Fri, Oct 28, 2011 at 11:12 AM, Markus Jelsma markus.jel

Re: Removing urls from crawl db

2011-11-10 Thread Bai Shen
Can you give me an example of how I would set my URL filter to do this? Right now I'm just using the default. On Mon, Oct 31, 2011 at 3:47 PM, Markus Jelsma markus.jel...@openindex.io wrote: Hi Write a regex URL filter and use it the next time you update the db; it will disappear. Be sure

Re: Removing urls from crawl db

2011-11-10 Thread Bai Shen
. The URLs will then be deleted. On Thursday 10 November 2011 16:36:24 Bai Shen wrote: Can you give me an example of how I would set my URL filter to do this? Right now I'm just using the default. On Mon, Oct 31, 2011 at 3:47 PM, Markus Jelsma markus.jel...@openindex.io wrote: Hi
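
A sketch of that update-with-filtering step; the -filter flag applies the configured url filters to existing crawldb entries as well, dropping the rejected ones (segment path hypothetical):

    bin/nutch updatedb crawl/crawldb crawl/segments/20111110120000 -filter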
