Thanks for your help, MilleBii!
I will definitely try the square-root option, but is that only valid for
outlinks, or does it also affect pages linking to the page?
Did you try implementing automatic regex generation? I'm doing focused
crawling, but I'm also thinking about scaling it in the future.
Also I
I just use the Java built-in regex features... and therefore just supplied
the string, which I design for my case using RegexBuddy (a really great tool,
by the way).
Pay attention, though, to static creation, in order to avoid recompiling the
regex at each plug-in load and the resulting run-time hit.
Didn't find a way to
Oops: why it shouldn't work for others.
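The static-creation advice above can be sketched like this. A minimal sketch, assuming a hypothetical filter class and an illustrative pattern; this is not the actual plug-in code:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of a URL filter that compiles its regex once, at class-load time,
// instead of on every call. Pattern.compile() is relatively expensive, so
// recompiling per invocation (or per plug-in load) is a needless run-time hit.
public class StaticRegexUrlFilter {

    // Compiled exactly once, when the class is loaded.
    private static final Pattern URL_PATTERN =
            Pattern.compile("^https?://([a-z0-9.-]+\\.)?example\\.com/.*$");

    // Returns the URL if it matches, or null to filter it out
    // (the convention Nutch URL filters follow).
    public static String filter(String url) {
        Matcher m = URL_PATTERN.matcher(url);
        return m.matches() ? url : null;
    }
}
```

The key point is the `static final Pattern`: only the cheap `matcher()` call happens per URL.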
2009/11/28 MilleBii mille...@gmail.com
Hi MilleBii,
I think you misinterpreted what I meant.
1. Regarding regexes: I know I can build a regex beforehand to identify URLs,
but I would have to create one manually for each domain I'm crawling, which is
not scalable. I'm looking for a way to build regexes automatically, using
automatic machine
Hi all,
I'm trying to run a whole-web crawl on my notebook,
with:
- 40 GB free for the crawl directory
- and I set “hadoop.tmp.dir” to an external (USB) hard disk with 90 GB
free.
I used the crawl script from the wiki:
http://wiki.apache.org/nutch/Crawl with the following configuration:
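For reference, pointing `hadoop.tmp.dir` at an external disk is an override along these lines in conf/nutch-site.xml; the mount path below is just a placeholder, not the poster's actual path:

```xml
<!-- conf/nutch-site.xml: override where Hadoop keeps its temporary data.
     /media/usbdisk/hadoop-tmp is a placeholder for the actual mount point. -->
<property>
  <name>hadoop.tmp.dir</name>
  <value>/media/usbdisk/hadoop-tmp</value>
  <description>Base for Hadoop temporary directories; placed on the
  external disk so the crawl does not fill the notebook drive.</description>
</property>
```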
Oh!
1. It didn't work, but if you find something, I'm interested.
2. The inlinks by definition point to the page you are considering, so I
don't understand what you mean. Boosting those inlinks actually means giving
more weight to the page, which then gets distributed to the outlinks.
But probably what
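To make the inlink/outlink point concrete, here is a toy sketch of the idea, not Nutch's actual OPIC scoring code: a page's score is shared out over its outlinks, and the "square-root option" mentioned earlier is assumed here to mean damping by the square root of the outlink count rather than dividing by it directly.

```java
// Toy illustration of OPIC-style score distribution (not Nutch's real code).
// A page's score is split among its outlinks; boosting an inlink therefore
// means boosting the contribution its source page passes along.
public class ScoreDistribution {

    // Plain split: each outlink receives score / n.
    public static float linearShare(float pageScore, int outlinkCount) {
        return pageScore / outlinkCount;
    }

    // Hypothetical "square-root" variant: divide by sqrt(n) instead of n,
    // so pages with many outlinks are penalized less aggressively.
    public static float sqrtShare(float pageScore, int outlinkCount) {
        return pageScore / (float) Math.sqrt(outlinkCount);
    }
}
```

With a page score of 1.0 and 4 outlinks, the linear split gives each outlink 0.25, while the square-root variant gives 0.5.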
nutch-721 is a different issue; 719 has no patch, but it describes the solution
to the problem you encountered.
If you get errors with 770, it would be helpful to comment on the JIRA.
2009/11/27 MilleBii mille...@gmail.com
Already applied that patch, which is actually 721; I was part of that
You're right, I forgot to apply 719 manually when I moved to my Linux box. Thanks,
Julien.
We really ought to have a patch for this one, and probably also in Nutch 1.1.
I will comment on the JIRA for 770; bear with me, I've never done that
before.
Now to the bandwidth issue: I found a way to greatly
Although I have applied https://issues.apache.org/jira/browse/NUTCH-719 (769),
my fetcher job hangs at the end:
...
-finishing thread FetcherThread, activeThreads=2
-finishing thread FetcherThread, activeThreads=2
-finishing thread FetcherThread, activeThreads=2
-finishing thread
Oops, a bit too quick on this one: it actually ended while I was
making this post.
2009/11/28 MilleBii mille...@gmail.com
My nutch crawl just stopped. The process is still there, and doesn't
respond to a kill -TERM or a kill -HUP, but it hasn't written
anything to the log file in the last 40 minutes. The last thing it
logged was some calls to my custom url filter. Nothing has been
written in the hadoop directory
Paul Tomblin wrote:
On Sat, Nov 28, 2009 at 4:45 PM, Andrzej Bialecki a...@getopt.org wrote:
Paul Tomblin wrote:
How can I tell what's going on and why it's stopped?
Try to generate a thread dump to see what code is being executed.
I didn't do any sort of distributed mode because I've only got one
core. I had
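The thread-dump suggestion above is usually done with jstack (as shown further down in the thread) or kill -QUIT on the JVM's pid, but a dump of all thread stacks can also be produced from inside the JVM. A generic sketch, not anything Nutch ships:

```java
import java.util.Map;

// Minimal in-process thread dump: prints the name, state, and stack trace of
// every live thread, similar in spirit to what jstack or kill -QUIT shows.
public class ThreadDump {

    public static String dumpAllThreads() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> e
                : Thread.getAllStackTraces().entrySet()) {
            sb.append('"').append(e.getKey().getName()).append('"')
              .append(" state=").append(e.getKey().getState()).append('\n');
            for (StackTraceElement frame : e.getValue()) {
                sb.append("    at ").append(frame).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(dumpAllThreads());
    }
}
```

If a fetcher thread is stuck, its stack frames show exactly which call it is blocked in.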
Hello,
can someone help me with this:
I am using nutch-0.9 with Hadoop and want to use the bw-filter from patch nutch-249.
After running ant I get some errors about import problems.
As I read somewhere, Nutch bundles Hadoop itself, so it's not necessary to
install it separately.
I don't really understand
There is no separate hadoop folder within nutch.
--
Varish
Myname To wrote:
I saw that before, Varish, but do you have some advice for using the nutch-249 patch:
https://issues.apache.org/jira/browse/NUTCH-249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
If there is no org.apache.hadoop folder, how can I use this patch?
Any help, advice, or pointers are
Paul Tomblin wrote:
On Sat, Nov 28, 2009 at 5:48 PM, Andrzej Bialecki a...@getopt.org wrote:
Paul Tomblin wrote:
-bash-3.2$ jstack -F 32507
Attaching to process ID 32507, please wait...
Hm, I can't see anything obviously wrong with that thread dump. What's the
CPU and swap usage, and
On Sat, Nov 28, 2009 at 8:25 PM, Andrzej Bialecki a...@getopt.org wrote:
Hm, the curious thing here is that the Java process is sleeping, and 99% of
the CPU is in system time... usually this would indicate swapping, but since
there is no swap in your setup, I'm stumped. Still, this may be related