quality it
would be nice to have that wrapped as a plugin in Nutch.
--
Sami Siren
Hannu,
Do you use the same set of QueryFilters both in the webapp and when running
from the shell?
Perhaps your filter is not executed when running from the CLI? You can
verify how your query is transformed by running bin/nutch
org.apache.nutch.searcher.Query and entering some queries.
--
Sami Siren
The schema.xml file there is usable only when using Solr as the search
server. Are you using Solr?
--
Sami Siren
Pedro Bezunartea López wrote:
Hi,
I've developed a web application in Lucene that searches web pages using a
Nutch-generated index. I'd like to highlight the query searched
Andrzej Bialecki wrote:
Sami Siren wrote:
Lots of good thoughts and ideas, easy to agree with.
Something for the ease of use category:
-allow running on top of plain vanilla hadoop
What does plain vanilla mean here? Do you mean the current DB
implementation? That's the idea, we should
efficient and understandable if the
foundation (e.g. data structures, extensibility) were in
better shape. Also, if written nicely, other projects could use them too!
--
Sami Siren
Andrzej Bialecki wrote:
Hi all,
The ApacheCon is over, our release 1.0 has been out already for some
time
connections.
2. Your machine has IPv6 enabled. This I noticed more recently when I was
wondering about the relatively slow fetching speed on a box. After disabling IPv6
entirely I was able to fetch 2-4 times faster without any other config
changes.
--
Sami Siren
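As a rough sketch of an alternative to disabling IPv6 at the OS level (this assumes the standard JVM switch java.net.preferIPv4Stack is acceptable for your setup; it is not something from the original message), you can pass the flag to the Hadoop child JVMs:
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1000m -Djava.net.preferIPv4Stack=true</value>
</property>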
information on Apache Nutch, visit the project home page:
http://lucene.apache.org/nutch
-- Sami Siren (on behalf of the Apache Nutch community)
. That is why
it does not end up in the index.
--
Sami Siren
linkdb).
--
Sami Siren
snippets.
--
Sami Siren
dealmaker wrote:
I have similar problem with nightly build #741 (Mar 3, 2009 4:01:53 AM).
What's wrong?
There was a change in Hadoop that caused this problem to appear. It has
now been fixed in build #743.
--
Sami Siren
Just an FYI,
there is also (unofficial) git repos for many apache projects -
including nutch here:
http://jukka.zitting.name/git/
--
Sami Siren
Dingding Ye wrote:
similar.
1. git-svn clone nutch-trunk
Then create a git project which is my working project. After that, clone
the nutch-git
Hi,
and thanks for being persistent. Can you specify which version of
Nutch you are running: is it a nightly build (if yes, which one?)
or did you check out the svn trunk? And just to be sure: you are running
with the default configuration?
--
Sami Siren
ahammad wrote:
I checked
I can see this error too; not sure yet what's going wrong...
--
Sami Siren
Justin Yao wrote:
log4j configuration:
log4j.logger.org.apache.nutch.indexer.Indexer=TRACE,cmdstdout
log4j.logger.org.apache.nutch=TRACE
log4j.logger.org.apache.hadoop=TRACE
Output:
2009-03-02 17:53:21,987 DEBUG
Sami Siren wrote:
I can see this error too; not sure yet what's going wrong...
It's NUTCH-703 (the Hadoop upgrade) that broke the indexing. Any ideas what
changed in Hadoop that might have caused this?
--
Sami Siren
--
Sami Siren
Justin Yao wrote:
log4j configuration
)
at org.apache.solr.client.solrj.request.UpdateRequest.process(UpdateRequest.java:217)
at
Hi, I would check the Solr log to see why it is failing; probably Nutch
is providing content for a field not present in the Solr schema.
--
Sami Siren
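As an illustration only (the field name and type here are made up; check the Solr log for the actual missing field), the fix is usually to declare the field in Solr's schema.xml, for example:
<field name="segment" type="string" stored="true" indexed="false"/>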
in testing
the current nightly builds and providing documentation patches or wiki
updates is appreciated.
--
Sami Siren
nightly build? thanks
On Fri, Feb 20, 2009 at 6:31 PM, Kham Vo k...@mac.com wrote:
Hello Nutch 1.0 designers,
I successfully installed and set up Nutch 1.0 (build # 722
(searcher.dir)
- execute (from command line) bin/nutch
org.apache.nutch.searcher.NutchBean query
--
Sami Siren
Thanks
Sam
Hi,
I just dropped the Nutch web app into Tomcat 6.0.18 and it worked
fine; perhaps you should upgrade your Tomcat?
--
Sami Siren
samuel.gre...@mesaaz.gov wrote
(?)
There is an open issue for this
https://issues.apache.org/jira/browse/NUTCH-699. Please contribute your
findings there.
--
Sami Siren
Hi,
I just dropped the Nutch web app into Tomcat 6.0.18 and it worked
fine; perhaps you should upgrade your Tomcat?
--
Sami Siren
samuel.gre...@mesaaz.gov wrote:
Hi,
I am following the tutorial here:
http://nutch.sourceforge.net/docs/en/tutorial.html
Crawling works fine, as does
for version = 1.0, priority = blocker.
thanks.
--
Sami Siren
as the indexing back end; the integration is in the nightly version of
Nutch. I am not sure if the procedure is documented anywhere.
--
Sami Siren
All scripts suggest restarting Nutch, but that means searching is
unavailable for a few minutes.
May I call an API or something?
but not for Fetcher2. If you add such a line for Fetcher2 it should start
outputting logging to stdout.
--
Sami Siren
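For example, assuming the class lives at org.apache.nutch.fetcher.Fetcher2 as in the trunk of that time, the conf/log4j.properties line would look like:
log4j.logger.org.apache.nutch.fetcher.Fetcher2=INFO,cmdstdout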
Thanks in advance.
Kind regards,
Martina
Doğacan Güney wrote:
I think I have found the bug here, but I am in a hurry now, I will
create a JIRA issue
and post (what is hopefully) the fix later today.
Great! thanks.
--
Sami Siren
On Tue, Feb 17, 2009 at 21:39, Doğacan Güney doga...@gmail.com wrote:
2009/2/17 Sami Siren ssi
). If your setup is similar and you ensure that the filesystem can
survive single-node failures, your data should be safe.
--
Sami Siren
that people use daily on Windows? Which can maximize performance?
Well, it can be anything; the important thing is to set up a small system
with similar hardware and see how it performs. That way you can get
quite accurate estimates for larger-scale systems running on similar
hardware.
--
Sami Siren
Do we have a JIRA issue for this? It seems like a blocker for 1.0 to me if
it is reproducible.
--
Sami Siren
Doğacan Güney wrote:
Thanks for the detailed analysis. I will take a look and get back to you.
On Mon, Feb 16, 2009 at 13:41, Koch Martina k...@huberverlag.de wrote:
Hi,
sorry
the directory, did you? It might
be working because the webapp still has references to all the files it
needs. Restart Tomcat and it will stop working.
--
Sami Siren
, updatedb, generate...
--
Sami Siren
Dennis Kubes wrote:
JDK 1.5 or better; I am currently on Sun JDK 1.6. For the webapp we use
Tomcat, but it should run on any JSP/servlet container, WebSphere included.
I think you need 1.6 now (for trunk) since we use Hadoop 0.19.
--
Sami Siren
buddha1021 wrote:
Sami Siren-2 wrote:
Dennis Kubes wrote:
JDK 1.5 or better; I am currently on Sun JDK 1.6. For the webapp we use
Tomcat, but it should run on any JSP/servlet container, WebSphere included.
I think you need 1.6 now (for trunk) since we use Hadoop 0.19.
--
Sami
is to enable the language-identifier plugin and
execute the class through the plugin command:
bin/nutch plugin language-identifier
org.apache.nutch.analysis.lang.NGramProfile -create te sample_te.txt utf-8
--
Sami Siren
directions (where did
all that time go?). I think that we need a simple-to-maintain UI that is
easy to customize (both of the current UIs fail to satisfy those
requirements IMO).
What thoughts do others have?
--
Sami Siren
michos101 wrote:
Hi,
I am trying to enable the web2 plugins but I
and tutorials (maybe
even a book :)). So up to this point I have created MapReduce jobs
that use Spring for dependency injection, and it is simple and works
well. The above is the direction I would like to head down but I
would also like to see what everyone else is thinking.
Dennis
--
Sami
of interesting Lucene/Solr/Hadoop-related stuff
there to attend to.
--
Sami Siren
. Is there any way to use
Nutch without them? Thank you for answers to any or all of these questions.
The Hadoop jar (hadoop-version-core.jar) should be available under
lib/. Nutch cannot be compiled or run without it.
--
Sami Siren
karthik085 wrote:
Hi,
I got nutch from svn tags - release0.9 - but can't get rid of this problem.
I did
ant compile
ant jar
ant war
All of them build successfully with different versions of ant - 1.6.5 and
1.7.0
do ant job
--
Sami Siren
.
--
Sami Siren
Sergio Morales wrote:
Hi Sami,
Thanks for the info.
Is there any other way to share this?
Create a JIRA issue and attach it there?
--
Sami Siren
HTML document to your webserver/filesystem.
There was no HTML document attached; the mailing list software removes them.
--
Sami Siren
VM processes with Hadoop conf like:
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1000m</value>
</property>
--
Sami Siren
showed did not have it registered)
--
Sami Siren
java.lang.RuntimeException: No scoring plugins - at least one scoring plugin
is required!
at org.apache.nutch.scoring.ScoringFilters.init(ScoringFilters.java:87)
at
org.apache.nutch.crawl.Injector$InjectMapper.configure(Injector.java
hand I think that
things are already too complicated for novice users, IMO)
2) Make it work in distributed setups (i.e. with more than one index
server). Sami Siren also makes a note of this, but I don't believe
that a simple hash-the-url approach is appropriate for Nutch. It would
be nice
like a thing that can manage large online indexes; perhaps it would be
most useful if it was not tied to Nutch.
--
Sami Siren
simplicity in mind; another
motivation was doing it without touching the Nutch source code.
--
Sami Siren
org.apache.nutch.webapp.common does not exist
Could you help me figure out where the problem is?
It seems you can just ignore step #5, because they get compiled in #7.
--
Sami Siren
.
--
Sami Siren
or
configure the crawl to use regex-urlfilter.xml via crawl-tool.xml.
--
Sami Siren
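A sketch of what that override could look like in crawl-tool.xml, assuming the property name urlfilter.regex.file from nutch-default.xml and the file name mentioned above:
<property>
  <name>urlfilter.regex.file</name>
  <value>regex-urlfilter.xml</value>
</property>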
would require source code changes)
--
Sami Siren
Andrzej Bialecki wrote:
Sami Siren wrote:
Emmanuel JOKE wrote:
...
those files. I tried to look at the code and I think the plugin doesn't
correctly handle dynamic URLs with ? and parameters after the
file extension.
Yes, your observation is correct; the filter compares only
Siddharth Jonathan wrote:
Hi,
After a couple of days of being up, my Nutch app begins to
freeze/hang, and basically
indexing and searching can no longer happen.
During this time (the couple of days), is it just sitting idle or serving
requests?
--
Sami Siren
the functionality so it meets your
requirement.
--
Sami Siren
several crawlers running concurrently. We
You should perhaps call the classes directly and take control of
managing the Configuration object; this way PermGen space is not wasted
by loading the same classes over and over again.
--
Sami Siren
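A minimal sketch of that idea; the class and method names assume the 1.0-era Tool-based API, so treat this as an approximation rather than a recipe:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ToolRunner;
import org.apache.nutch.crawl.Generator;
import org.apache.nutch.crawl.Injector;
import org.apache.nutch.util.NutchConfiguration;

public class CrawlDriver {
  public static void main(String[] args) throws Exception {
    // Create the Configuration once and reuse it for every step,
    // instead of launching a fresh bin/nutch JVM (and plugin classloader) per step.
    Configuration conf = NutchConfiguration.create();
    ToolRunner.run(conf, new Injector(), new String[] { "crawl/crawldb", "urls" });
    ToolRunner.run(conf, new Generator(), new String[] { "crawl/crawldb", "crawl/segments" });
    // ...fetch, parse, updatedb and index the same way, reusing conf.
  }
}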
://issues.apache.org/jira/secure/BrowseProject.jspa?id=10680&subset=3
where most of the changes are listed.
--
Sami Siren
project called Apache Tika [1], which
has the goal of putting together a generally usable parsing/extraction
framework. It hasn't gotten off the ground yet, so there is a good chance
to get your voice heard.
[1] http://incubator.apache.org/tika/
--
Sami Siren
to cut down your temp size requirements (apart from
compression; I think it's possible to compress the temp data?) is to do
your work in smaller slices.
--
Sami Siren
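If the compression route is what you are after, one assumed knob (standard Hadoop rather than anything Nutch-specific, and it only covers intermediate map output) is:
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>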
- Original Message
From: qi wu [EMAIL PROTECTED]
To: nutch-user@lucene.apache.org
Sent: Wednesday, April 11, 2007 10:41:35
Fetcher2 can fetch them fast unless you make it non-polite.
--
Sami Siren
Tomi N/A wrote:
2007/3/31, Sami Siren [EMAIL PROTECTED]:
You could also let your reverse proxy do the rewriting, using something
like http://apache.webthing.com/mod_proxy_html/. I have been using
something like that for rewriting massive amounts of HTML in real time for
AA purposes, to hammer web applications into a different URL space.
--
Sami Siren
/Projects/DummyNutch/Nutch/linkdb/parse_data in local is
invalid.
thanks in advance for help
LinkDb treats the parameter 'invertlinks' as the path to the linkdb (the 1st
parameter); remove it and the command should succeed.
--
Sami Siren
PROTECTED]
--
Sami Siren
Nicolás Lichtmaier wrote:
I've backported revision 450799 to the 0.8.x branch for supporting
-noAdditions. Perhaps you could consider committing it there... (I
haven't tested it yet though).
Can you please create a JIRA issue for this and attach the patch there?
--
Sami Siren
to find the images. You would
also need to change the indexer to index just the content you are interested
in (images) and skip the rest.
--
Sami Siren
PruneIndexTool).
$ ant
--
Sami Siren
time up until that point.
There's some more about that issue and how it affected a random
segment here: http://blog.foofactory.fi/2007/01/sorted-out.html
--
Sami Siren
if it is active.
http://lucene.apache.org/nutch/apidocs-0.8.x/org/apache/nutch/analysis/lang/LanguageIndexingFilter.html
--
Sami Siren
Owner can be reached at [EMAIL PROTECTED]
What kind of error are you experiencing (if any)?
--
Sami Siren
James Phillips wrote:
Can somebody tell me how to contact the owner of this list? I have tried
on COUNTLESS occasions to remove myself using
[EMAIL PROTECTED] but still keep
if
that suits your use case.
--
Sami Siren
- Original Message -
From: Sami Siren [EMAIL PROTECTED]
To: nutch-user@lucene.apache.org
Sent: Sunday, January 07, 2007 5:47 PM
Subject: Re: Nutch .81: the process to add a new analyzer ?
Chee Wu wrote:
Hi,
I am trying to add a new
right results are not that good.
identification method that would be helpful. Otherwise, I'd be happy to
contribute my pseudo-NB hack and maybe even implement the correct version.
Go ahead and attach it to JIRA. I am sure there are plenty of people
interested in such a thing.
--
Sami Siren
Are you looking for something like the Google KeyMatch described in [1],
which was then more or less mimicked in the Nutch web2 module [2],
and has since also been released, at least as a lookalike, on Google Code [3]?
--
Sami Siren
[1] http://www.google.com/enterprise/mini/end_user_features.html
[2]
http
anything wrong.
If you did exactly those steps, then what happens is that the
subcollections.xml is read from inside the .job file. You need to
rebuild the .job to put the new file inside it.
Simply run ant and rerun indexing, and it should work as expected.
--
Sami Siren
(i.e. add a site to a newly created subcollection) I don't want to
recrawl it again. I hope it can be done by simply using the existing crawled
data.
No need to recrawl; unfortunately you still need to reindex.
--
Sami Siren
.
--
Sami Siren
contents of HDFS also.
One could also write a protocol-hdfs plugin to do the job.
--
Sami Siren
[1]http://issues.apache.org/jira/browse/HADOOP-4
- HttpBase.getProtocolOutput(194) |
Skipping: http://www.lequipe.fr/ exceeds fetcher.max.crawl.delay, max=30,
Crawl-Delay=120
and I can't find this property in nutch-site.xml.
You need to add it there:
<property>
  <name>fetcher.max.crawl.delay</name>
  <value>your value here</value>
</property>
--
Sami Siren
Gavino Marras wrote:
Does Nutch work with sessions and cookies over the HTTPS protocol?
No, Nutch does not support cookies or sessions.
--
Sami Siren
Andrzej Bialecki wrote:
Sami Siren wrote:
Gavino Marras wrote:
Does Nutch work with sessions and cookies over the HTTPS protocol?
No, Nutch does not support cookies or sessions.
This is not strictly speaking true ... if you use protocol-httpclient
then https, cookies and sessions
://issues.apache.org/jira/browse/NUTCH-395
[2]http://www.mail-archive.com/nutch-dev@lucene.apache.org/msg04344.html
--
Sami Siren
are even worse.
My numbers are with the local job runner.
I can't imagine how long it took to crawl, let's say, 10 million pages.
I'll let you know when mine is finished; I just started the 3rd segment of
size 1 million to test the trunk version (running with the local job runner).
--
Sami Siren
[1]http
What do you mean by static exporter?
--
Sami Siren
Are you saying that the generator generates 200k URLs but the fetcher fetches
around 100k, or are you saying that you generate (-topN 20) 200k URLs
and the fetcher fetches only around 100k?
If the latter, and you are running with LocalJobRunner, you need to generate
with -numFetchers 1.
--
Sami Siren
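For example (the paths and -topN value here are illustrative only), the generate step would then be run as:
bin/nutch generate crawl/crawldb crawl/segments -topN 200000 -numFetchers 1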
Forgot one important one:
set generate.max.per.host to something reasonable so you won't end up
fetching URLs from only a low number of hosts, which by default is very slow.
--
Sami Siren
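A sketch of that property in nutch-site.xml; the value is just an example, tune it to your crawl:
<property>
  <name>generate.max.per.host</name>
  <value>100</value>
</property>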
Sami Siren wrote:
Some simple rules for generally speeding things up
1. Crawl only the content you
You are using DistributedSearch? And the local filesystem to store the index and
related data?
--
Sami Siren
Håvard W. Kongsgård wrote:
I have Nutch 0.8.1 running on 3 servers (AMD X2 3800 with 4 000 memory);
searching with queries like 'China Nuclear Forces' takes 20-25 s.
My config
from this proposal:
http://mail-archives.apache.org/mod_mbox/lucene-general/200610.mbox/[EMAIL PROTECTED]
--
Sami Siren
Håvard W. Kongsgård wrote:
DistributedSearch
2x datanodes, 2x Task Trackers
Sami Siren wrote:
You are using DistributedSearch? and local filesystem to store index
(to compile and to create nutch-x.x.x.job)
then:
bin/nutch ...
--
Sami Siren
:/// and to generate a file list
to be crawled. This file list is fairly big, ~200,000 entries, and with the
current 0.8.1 release of Nutch the fetcher just freezes right at the end of
a crawl.
What exactly happens when your fetcher freezes? 200,000 entries is not a
big list to be fetched.
--
Sami Siren
application.
I agree also. Different query parsers could perhaps be made pluggable, or
at least configurable. The current (or similar) implementation could be the
default one offered, and by configuration one could switch it to
intranet mode.
Contributions, anyone?
--
Sami Siren
'
-shutdown 127.0.0.1
--
Sami Siren
Alvaro Cabrerizo wrote:
2006/9/27, Sami Siren [EMAIL PROTECTED] mailto:[EMAIL PROTECTED]:
Alvaro Cabrerizo wrote:
How could I stop an index server (started with bin/nutch server
port
index) knowing the port?
Thanks in advance
What you need to do is modify the Query.
--
Sami Siren
Thanks,
Alvaro Cabrerizo wrote:
How could I stop an index server (started with bin/nutch server port
index) knowing the port?
Thanks in advance.
It does not support such a feature. Can you describe in a little more detail
what you are trying to accomplish? Something similar to Tomcat's SHUTDOWN?
--
Sami
branch and fixes many
serious bugs discovered in the previous release. For a list of changes see
http://www.apache.org/dist/lucene/nutch/CHANGES-0.8.1.txt
A big thanks to everybody who participated and made this release possible.
--
Sami Siren
defaults to 2)
--
Sami Siren
Frank Kempf wrote:
Hello,
I got stuck with generating.
Injecting 3200 URLs into the database and generating afterwards always leads
to the same result of having 1632 URLs in crawl_generate
(I checked the db and it actually has 3200 entries).
No matter if I try -topN
Your observations are correct; 0.8 has some serious problems, and we'll be
putting 0.8.1 out pretty soon to also fix the performance problem you
describe.
--
Sami Siren
2006/9/18, carmmello [EMAIL PROTECTED]:
I have been trying Nutch, since its version 0.3, sometimes with some
problems. Now I
Is your environment Windows or Linux?
You are saying that most are not logged; can you please give an example
of what is logged (and where) and also what is not.
Logging in general can be configured by editing conf/log4j.properties.
--
Sami Siren
2006/9/1, AJ Chen [EMAIL PROTECTED]:
When
could also be successfully used for efficient crawling of SMB,
FTP and WebDAV resources,
--
Sami Siren
2006/8/27, Sandy Polanski [EMAIL PROTECTED]:
This may be more of a straight Lucene task, but I thought I'd ask
anyway. Rather than using Nutch as a crawler, I'd rather just send the
Nutch parser
text/vnd.wap.wml
text/xml
text/x-setext
I would guess that handling of the text/xhtml+xml mimetype should be done with
the html parser anyway.
--
Sami Siren
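If you want to force that mapping, here is a sketch of the conf/parse-plugins.xml entry (assuming the plugin id parse-html and that text/xhtml+xml is the mime type your documents actually report):
<mimeType name="text/xhtml+xml">
  <plugin id="parse-html" />
</mimeType>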
2006/8/25, Michael Wechner [EMAIL PROTECTED]:
I think the problem is as follows with XHTML files:
2006-08-25 16:06:11,925 WARN
The job should terminate on its own, but not as soon as all pages are
found; only after -depth iterations.
Are you saying it won't honor the -depth parameter?
--
Sami Siren
Sandy Polanski wrote:
Sami, in 0.7.2 my intranet crawling job did terminate on its own. The issue
that I described
of mime types Nutch really can handle. Then again,
those two text types of documents you picked are quite rare and not
mainstream, and enabling/disabling them probably doesn't make any real
difference in search results.
--
Sami Siren
There's no such feature present in Nutch currently. Feel free to open an issue
(of type 'New Feature') in the Nutch JIRA and provide a patch, or wait until
someone else gets to it.
--
Sami Siren
2006/8/27, Sandy Polanski [EMAIL PROTECTED]:
On my intranet, I have 8100 documents. The nutch crawler