These two patches could fix the problem. The first patch is a Hadoop
patch and the other is a Nutch patch. I don't know whether I should
create an issue in the Nutch Jira or the Hadoop Jira.
Anyway... here are the two patches.
Index: src/java/org/apache/hadoop/ipc/Server.java
Sounds like Nutch cannot find your plugins. A stack trace from your
exception would help.
Please verify the plugin.folders property in your nutch-default.xml:
<property>
  <name>plugin.folders</name>
  <value>plugin</value>
  <description>Directories where nutch plugins are located. Each
  element may be a relative or absolute path.</description>
</property>
On 10.03.2006 at 05:58, Vertical Search wrote:
Okay, I have noticed that for URLs containing ? and = I cannot crawl.
I have tried all combinations of modifying crawl-urlfilter.txt and
# skip URLs containing certain characters as probable queries, etc.
Do you crawl an intranet or do you crawl the web? If you crawl the
web then you must edit regex-urlfilter.txt and not crawl-urlfilter.txt.
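The "skip URLs containing certain characters" rule mentioned above is the one that blocks query URLs. A sketch of the relevant lines, based on the stock filter file (the exact wording may differ between Nutch versions):

```
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
```

Commenting out or removing the -[?*!@=] line lets URLs containing ? and = through the filter.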
In your first mail you said you get an exception like
org.apache.nutch.net.URLFilter not found. Does the exception still
occur?
Marko
On 14.03.2006 at 23:20, ArentJan Banck wrote:
java.lang.NullPointerException
        at org.apache.nutch.indexer.Indexer$OutputFormat$1.write(Indexer.java:109)
Which index plugins do you have configured in your nutch-default.xml
or nutch-site.xml? Be sure that the index-basic plugin is enabled.
On 16.03.2006 at 06:43, Ilia S. Yatsenko wrote:
And got next error: file not found index/segment
Do you have the searcher.dir property configured in the
nutch-default.xml or nutch-site.xml of your Nutch webapp?
<property>
  <name>searcher.dir</name>
  <value>crawl</value>
  <description>
  Path to root of the crawl.
  </description>
</property>
On 16.03.2006 at 06:43, Ilia S. Yatsenko wrote:
And got next error: file not found index/segment
Your folder structure should be:
YOUR_SEARCH_FOLDER/crawldb
YOUR_SEARCH_FOLDER/linkdb
YOUR_SEARCH_FOLDER/segments/2006...
YOUR_SEARCH_FOLDER/indexes/part-
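A quick way to check that your crawl directory matches this layout (the paths below are examples; "crawl" stands in for YOUR_SEARCH_FOLDER):

```shell
# list the expected subdirectories of the search folder;
# ls fails with an error for any directory that is missing
ls crawl/crawldb crawl/linkdb crawl/segments crawl/indexes
```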
On 16.03.2006 at 12:50, Grégory Debord wrote:
Hi all,
I would like to implement a distributed crawl which would be
something like this:
The hadoop project is used for working with a dfs. In hadoop exists
one master (namenode, jobtracker) and n slaves (datanodes and
tasktrackers).
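As a sketch, the master and the slave hosts are listed in Hadoop's conf files; the hostnames below are placeholders, adjust them to your cluster:

```shell
# conf/masters names the master host, conf/slaves lists one
# slave host per line (hostnames here are placeholders)
mkdir -p conf
echo "master-host" > conf/masters
printf "slave1\nslave2\n" > conf/slaves
```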
On 17.03.2006 at 00:28, MagRaj wrote:
Is it possible to create a new segment (containing all the pages of
that url) for each url?
You can use the regex-urlfilter.txt to accept only the urls you want.
But for every new segment you have to change the regex-urlfilter.txt.
A better way is to
On 17.03.2006 at 17:20, Dennis Kubes wrote:
Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:310)
        at org.apache.nutch.crawl.Injector.inject(Injector.java:114)
        at
On 17.03.2006 at 20:22, MagRaj wrote:
Thanks Marko for your suggestion.
But here is my problem: find below the config files with the sample
data I have:
urls.txt has got 5 urls (just as an example)
On 20.03.2006 at 04:30, Berlin Brown wrote:
Is there a way to specify where the plugins are located?
<property>
  <name>plugin.folders</name>
  <value>YOUR PLUGIN PATH</value>
  <description>Directories where nutch plugins are located. Each
  element may be a relative or absolute path. If absolute, it is
  used as is.</description>
</property>
On 23.03.2006 at 08:42, Raghavendra Prabhu wrote:
Hi
Is there a separate mailing list for Hadoop right now?
http://lucene.apache.org/hadoop/mailing_lists.html
Marko
On 22.05.2006 at 13:50, Murat Ali Bayir wrote:
Hi, I have a problem when I am using black/white list URL filtering.
I have two directories for filtering called NegativeURLS and
PositiveURLS.
This is a bug in the query-basic plugin. The boosting values in the
nutch-default.xml are not used.
We should open a bug in Jira. This simple patch should work.
Index: src/plugin/query-basic/src/java/org/apache/nutch/searcher/basic/BasicQueryFilter.java
On 25.05.2006 at 13:21, Stefan Neufeind wrote:
Hi,
I did use index-basic and index-more. I see lastModified in the
RSS output. Now I want to sort=lastModified, but it does not work.
Try sort=date.
Regards,
Marko
Hmm, that works. But why, since I think the field is named
lastModified?
lastModified is only used if lastModified is available via the HTML
meta tags. If that is true, lastModified is stored but not indexed.
However, the date field is always indexed. If lastModified is
available as
On 26.05.2006 at 01:57, Stefan Neufeind wrote:
Modified. If not, date=FetchTime.
Hi Marko,
Hi Stefan,
That hint really helped. Can you maybe also help me out with
sort=title?
See also:
http://issues.apache.org/jira/browse/NUTCH-287
The problem is that it works on some searches - but
# De-duplicate indexes
# bogus argument is ignored but needed due to
# a bug in the number of args expected
bin/nutch dedup crawl/segments bogus
The dedup command works only on one or more indexes, not on one or
more segments. The directory structure of an index looks like:
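Assuming the layout shown earlier in the thread (YOUR_SEARCH_FOLDER/indexes), a dedup invocation pointed at indexes rather than at segments would be something like this sketch; the path is an example:

```shell
# run dedup against the index directories, not the segments
bin/nutch dedup crawl/indexes
```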
On 05.07.2006 at 17:24, aicha BEN wrote:
hello,
Hi,
fetch of file:///C:/doc/test.pdf failed with: java.lang.Exception:
org.apache.nutch.protocol.file.FileError: File Error: 404
Does the PDF file exist? Error code 404 sounds like 'File Not Found'.
Marko
Hi,
try to export JAVA_HOME in your $HOME/.bashrc and
$HOME/.bash_profile. You must also add $JAVA_HOME/bin to your
$PATH variable,
e.g. export PATH=$JAVA_HOME/bin:$PATH.
It is important that you put $JAVA_HOME/bin before the rest of the
other $PATH entries. The first java found on the PATH is the one
that gets used.
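A minimal sketch of the lines to add to $HOME/.bashrc; the JDK path is an assumption, point it at your own installation:

```shell
# JDK location is an assumption; adjust to your installed JDK
export JAVA_HOME=/usr/lib/jvm/java-6-sun
# prepend, so this JDK's java is found before any other on the PATH
export PATH="$JAVA_HOME/bin:$PATH"
```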
Hi,
if you delete segments then be sure that you don't have an index
built from those segments.
The segment contains the parsed content, and the index is built from
this content. If you delete the segment and then do a search on this
index, an NPE occurs because no summary (parsed content) is available.
On 03.08.2006 at 18:52, Lourival Júnior wrote:
My questions:
Why does it occur? How can I know which segments can be deleted?
You must know which segments are indexed. You cannot index all
segments and afterwards delete those segments.
The Indexer indexes the name of the segment that the
On 04.08.2006 at 12:33, Rocio Chongtay wrote:
Hi,
Hi
How can I check whether my indexing has gone well if so far I cannot
search?
I have followed the step-by-step guide all the way to indexing and
setting up the GUI in Tomcat.
In my indexes/part-0 folder I can see files like:
On Aug 18, 2009, at 7:04 AM, fa...@butterflycluster.net wrote:
hi,
Hi Fadzi,
I have a requirement to build a simple UI for starting and stopping
the crawler, and also a scheduling mechanism (Quartz).
Has anyone attempted this before?
We have started to implement the upgrade of the
Do you start the GUI from Eclipse or from the binary package?
marko
On Aug 18, 2009, at 9:57 AM, fa...@butterflycluster.net wrote:
Tried that; still no joy.
Are there any specifics I need to put in nutch-site.xml?
Because mine is blank at the moment.
Quoting Marko Bauhardt m...@101tec.com
on 127.0.0.1:50060/general.
but the GUI is currently only available in German :(.
In the next few days we will translate it via i18n.
marko
Quoting Marko Bauhardt m...@101tec.com:
On Aug 18, 2009, at 9:36 AM, fa...@butterflycluster.net wrote:
Hi Marko,
Hi
I am trying to run the AdminApp
On Aug 19, 2009, at 8:42 PM, alx...@aim.com wrote:
hi
Thanks. What if the URLs in my seed file do not have outlinks, say
.pdf files? Should I still specify the topN variable? All I need is
to index all URLs in my seed file, and there are about 1M of them.
topN means that your generated
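If the goal is simply to fetch every URL in the seed list, topN can be left off entirely; a sketch, assuming the usual crawl layout (the paths and the "urls" seed directory are examples):

```shell
# inject the seed URLs into the crawldb, then generate a fetchlist
# without -topN so every due URL ends up in the new segment
bin/nutch inject crawl/crawldb urls
bin/nutch generate crawl/crawldb crawl/segments
```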
Hi.
You said that you open and close the NutchBean at every request.
First: this is very expensive. Create the NutchBean only once, save
it in the application context, and read it from there when needed.
Second: I am not sure, but maybe it is possible that the
PluginRepository has the memory
On Aug 20, 2009, at 5:42 PM, Mark Round wrote:
not sure, but maybe it is possible that the PluginRepository has the
memory leak. I think the cache (the WeakHashMap) is growing and
growing.
Is this the same issue as reported here :
https://issues.apache.org/jira/browse/NUTCH-356 ?
Oops. Yes.
Hi list,
we have pushed the second Nutch GUI release, version 0.2.
You can download the binary or the sources at
http://github.com/101tec/nutch/downloads
Two main features are implemented in this version:
+ Security. You can start the admin GUI with a login feature;
usernames and passwords can
On Sep 30, 2009, at 3:47 PM, Bartosz Gadzimski wrote:
Hello,
Hi Bartosz
First - great job, it looks and works very nice.
:) Thanks!
I have a question about urlfilters. Is it possible to get a
regex-urlfilter per instance (different for each instance)?
Good idea. I think you
Hi David.
sorry, I don't understand your question. You can find documentation
about the Nutch GUI here:
http://wiki.github.com/101tec/nutch
marko
On Sep 30, 2009, at 4:02 PM, David Jashi wrote:
Any documentation on how to add this GUI to existing NUtch instance?
Respectfully,
David Jashi
Which version have you patched?
You can try to make a diff against release-1.0 to create a patch
file. After that you can check out or download the GUI and try to
apply your patch.
Maybe this could work.
marko
Respectfully,
David Jashi
On Wed, Sep 30, 2009 at 18:19, Marko Bauhardt m
On Oct 2, 2009, at 3:38 PM, Haris Papadopoulos wrote:
Hi,
Hi Haris,
maybe you can use some code snippets from the Nutch GUI v0.2
(http://github.com/101tec/nutch). This version has an API to reload
the searcher (only NutchBeans are supported).
For example:
SearcherFactory
hi,
is there a way to use HTTP keep-alive with Nutch?
Does protocol-http or protocol-httpclient support keep-alive?
I can't find any use of HTTP keep-alive in the code or in the
configuration files.
Thanks
marko