Doğacan Güney-3 wrote:
On Mon, Jul 13, 2009 at 10:10, Beats tarun_agrawal...@yahoo.com wrote:
Hi,
I want to delete index documents whose url field contains a certain character.
Are you using Solr or Lucene for your indexes?
I am using Solr.
Any help would be appreciated.
Thanks in advance.
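One way to do this with Solr is a delete-by-query against the update handler. This is only a sketch: the core URL, the field name url and the pattern below are assumptions about your setup, and a query with a leading wildcard may not work on your Solr version without extra schema work.

# Hypothetical example: remove every document whose url field matches the pattern,
# then commit so the deletes become visible.
curl 'http://localhost:8983/solr/update' -H 'Content-Type: text/xml' \
     --data-binary '<delete><query>url:*sessionid*</query></delete>'
curl 'http://localhost:8983/solr/update' -H 'Content-Type: text/xml' \
     --data-binary '<commit/>'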
Hi,
I want to use a URL normalizer in my project, as per its requirements.
Can anyone guide me on how to take it from Nutch and add it?
Thanks and regards,
Neeti
--
View this message in context:
http://www.nabble.com/url-normalizer-tp24474519p24474519.html
Sent from the Nutch - User mailing list
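If you mean Nutch's regex URL normalizer, it lives in the urlnormalizer-regex plugin and is driven by conf/regex-normalize.xml; it is applied during the crawl steps as long as the plugin is listed in plugin.includes. A minimal sketch of the file (the rule shown is only an illustration, roughly what the stock file ships with):

<!-- conf/regex-normalize.xml: each rule is a pattern plus a substitution. -->
<regex-normalize>
  <!-- Example rule: strip in-page anchors from URLs. -->
  <regex>
    <pattern>#.*?(\?|&amp;|$)</pattern>
    <substitution>$1</substitution>
  </regex>
</regex-normalize>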
But are there any rules by which we can decide when to crawl a website, so that we get its updated content as soon as possible?
Otis Gospodnetic-2 wrote:
Neeti,
I don't think there is a way to know when a regular web site has been
updated. You can issue GET or HEAD requests and look at the response headers (Last-Modified, ETag).
You have to detect changes in the web content yourself.
http://zipclue.com
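A quick way to see whether a server gives you anything to work with (a sketch; the URL is just a placeholder, and many dynamic sites send no useful Last-Modified or ETag at all, in which case you are back to comparing the fetched content itself):

# A HEAD request returns only the response headers; compare Last-Modified / ETag
# against the values you saw on the previous fetch.
curl -I http://www.example.com/some/page.html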
--- On Tue, 7/14/09, Neeti Gupta neeti_gupt...@yahoo.com wrote:
From: Neeti Gupta neeti_gupt...@yahoo.com
Subject: Re: recrawling
To: nutch-user@lucene.apache.org
Date: Tuesday, July 14, 2009, 6:50 AM
But are there any rules
Are you using Eclipse to generate the Javadoc, or the command line?
schroedi wrote:
How can I generate the Javadoc in HTML format for all the Nutch and
Hadoop classes and everything else?
--
Mario Schröder | http://www.finanz-checks.de
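From the command line, assuming you have the Nutch (or Hadoop) source tree and Ant installed, the standard build file has a javadoc target; this is only a sketch and the output path may differ between releases:

# Run from the top of the source tree.
ant javadoc
# The generated HTML normally ends up under build/docs/api/.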
Hi all,
I am trying to make the Nutch crawler ignore the robots.txt file.
I have tried changing Fetcher.java and RobotRulesParser.java,
but a NullPointerException error is reported.
Can anybody give the correct changes that need to be made?
With regards,
Beats
be...@yahoo.com
but I get a number of messages in crawl.log, like:
Error parsing: http://lucene.apache.org/skin/getMenu.js:
org.apache.nutch.parse.ParseException: parser not found for
contentType=application/javascript
url=http://lucene.apache.org/skin/getMenu.js
at
Hi Jim,
I think your error message says that Nutch couldn't find a plugin for parsing that
particular content type.
Go to parse-plugins.xml in the conf directory;
there you will find the plugin ids defined for the different content types.
Add the particular plugin id to the plugin.includes property in your nutch-site.xml file.
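For application/javascript specifically, the plugin id would be parse-js. A sketch of the two places to touch; the plugin.includes value below is only an example (start from the default in nutch-default.xml and just add parse-js), and the mimeType entry is only needed if your parse-plugins.xml does not already map that type:

<!-- conf/parse-plugins.xml: map the content type to the parse-js plugin. -->
<mimeType name="application/javascript">
  <plugin id="parse-js" />
</mimeType>

<!-- conf/nutch-site.xml: make sure parse-js is loaded at all. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>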
Hi Jim,
I got the second error too. It's because the previous crawl failed
abnormally.
There should be the following sub-directories in /segments/20090713171413:
content crawl_fetch crawl_generate crawl_parse parse_data parse_text
My solution is to delete the corrupted directory and re-run the crawl.
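For reference, a sketch of what that looks like on disk; the segment name is just the one from this thread, so adjust the path to wherever your crawl directory actually lives:

# A healthy segment has all six sub-directories; if some are missing, the job that
# produced it probably died part-way.
ls crawl/segments/20090713171413
# content  crawl_fetch  crawl_generate  crawl_parse  parse_data  parse_text

# Drop the broken segment and re-run the fetch/parse that should have produced it.
rm -r crawl/segments/20090713171413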
I run Nutch to convert ARC files to segments. It works well for 1 million
pages, but when I increase the page count to 500 million, it fails with
the error messages below. Can anyone help me?
java.io.IOException: Task process exit with nonzero status of 255.
at
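Exit status 255 only says that the child JVM died; the real cause should be in the task logs (usually under the task tracker's logs/userlogs directory). If it turns out to be heap exhaustion, which is just an assumption at this point but common at that scale, giving the map/reduce children more memory sometimes helps:

<!-- hadoop-site.xml (or mapred-site.xml, depending on your Hadoop version);
     the 1024m value is only an example. -->
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1024m</value>
</property>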
I did attach it.
Jake Jacobson
http://www.linkedin.com/in/jakejacobson
http://www.facebook.com/jakecjacobson
http://twitter.com/jakejacobson
Our greatest fear should not be of failure,
but of succeeding at something that doesn't really matter.
-- ANONYMOUS
On Mon, Jul 13, 2009 at 9:04 PM,
2009/7/14 Jake Jacobson jakecjacob...@gmail.com:
I did attach it.
I am afraid that I can't see anything either. Can you perhaps upload it
somewhere and link to it?
I'd like to say thank you for your effort. We could do with more
tutorials which look at it in different ways.
Alex
Posted it to my blog,
http://jakecjacobson.blogspot.com/2009/07/nutch10installationguide.html
Jake Jacobson
http://www.linkedin.com/in/jakejacobson
http://www.facebook.com/jakecjacobson
http://twitter.com/jakejacobson
Our greatest fear should not be of failure,
but of succeeding at something that doesn't really matter.
Here are a few questions I had about crawl-urlfilter.txt.
- Does Nutch obey crawl-urlfilter.txt properly? By default, it is set
not to download CSS, but when I do the crawl, I still see parse.ParseUtil
exceptions in my hadoop.log (org.apache.nutch.parse.ParseException: parser not
found
Hi,
What I actually want is to crawl a web page, say 'page A', and all of its
outlinks.
I want to index all the content gathered by crawling the outlinks, but not
'page A' itself.
Is there any way to do this in a single run?
With regards
Beats
be...@yahoo.com
SunGod wrote:
1. create work dir test
Hi everyone,
I want to crawl the search results displayed by Solr.
The source code of the search page doesn't give the content type,
so an 'Invalid XML' error is shown.
Does anyone know how to crawl this kind of page?
I am using the 'feed' plugin and have also tried parse-rss.
Page source code:
Alex (et al),
There was/is plenty of space on the drive (3GB).
I was trying the command line from the tutorial:
bin/nutch crawl urls -dir crawl.test -depth 3 >& crawl.log
I'm re-running it again to see what happens. If I get that error again, I'll
delete the dirs, as you and Xiao Yang suggested.
Here are a few questions I had about crawl-urlfilter.txt.
Some very quick responses - others will know better.
- Does Nutch obey crawl-urlfilter.txt properly? By default,
it is set not to download CSS, but when I do the crawl, I still see
parse.ParseUtil exceptions in my hadoop.log
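For reference, the suffix rule in the stock conf/crawl-urlfilter.txt looks roughly like the line below (check your own copy). Two things to note: it only stops Nutch from fetching matching URLs, and .js is not in the default suffix list, which is why JavaScript files can still be fetched and then trigger "parser not found" messages in hadoop.log.

# skip image and other irrelevant suffixes (css is in the list, js is not)
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$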
Hi,
I'm still following the Tutorial, per earlier post, and I think that I've
gotten past the earlier errors with the intranet crawl (it's still running), so
I wanted to try to get the web app running.
I had Tomcat installed and working, so I deployed the nutch WAR file (I put it
in
Hi,
I noticed that in tomcat/webapps/nutch-1.0/WEB-INF/classes, there was a
crawl-urlfilters.txt, which still had MY.DOMAIN in it, so I tried changing the
parameter to *apache.org to match what was in the same file in /opt/nutch-1.0.
But, even after that, and bouncing Tomcat, I get just the
Hi,
I think that there must've been something messed up.
I tried running a new crawl:
bin/nutch crawl urls -dir crawl3.test -depth 2 >& crawl3.log
and I modified the nutch-site.xml file to point to the crawl3.test directory.
Then, after I bounce Tomcat, I can search successfully.
However, I then
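For anyone following along, the property the search web app reads is searcher.dir, in whichever nutch-site.xml the deployed webapp actually loads (e.g. under WEB-INF/classes). A sketch, with the path just the example from this thread:

<?xml version="1.0"?>
<configuration>
  <!-- Point the search webapp at the crawl it should serve. -->
  <property>
    <name>searcher.dir</name>
    <value>/opt/nutch-1.0/crawl3.test</value>
  </property>
</configuration>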
Hi All,
I'm getting totally frustrated with this nutch web app :(.
I re-installed Nutch 1.0 completely.
I created the urls file in /opt/nutch-1.0.
I added an http.agent.name of test1 and modified http.robots.agent in
nutch-default.xml.
I modified /opt/nutch-1.0/conf/nutch-site.xml to:
<?xml
On Tue, Jul 14, 2009 at 21:17, oh...@cox.net wrote:
Hi All,
I'm getting totally frustrated with this nutch web app :(.
I re-installed Nutch 1.0 completely.
I created the urls file in /opt/nutch-1.0
I added http.agent.name of test1 and modified http.robots.agent in
nutch-default.xml.
I
Doğacan Güney doga...@gmail.com wrote:
On Tue, Jul 14, 2009 at 21:17, oh...@cox.net wrote:
Hi All,
I'm getting totally frustrated with this nutch web app :(.
I re-installed Nutch 1.0 completely.
I created the urls file in /opt/nutch-1.0
I added http.agent.name of test1
Any ideas?
On Mon, Jul 13, 2009 at 11:58 AM, Kenan Azam azam.ke...@gmail.com wrote:
Hi there, I am using Nutch 0.8.1. Is there a log or mechanism to see
what users are searching for? Something like a search history or top
searches.
Thanks much.
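There is nothing built into Nutch 0.8.x for this as far as I know, but the search terms do show up in Tomcat's access log as the query parameter of search.jsp, so the access log can serve as a crude search history. A sketch, assuming the AccessLogValve is enabled and that the parameter name and log file pattern match your setup; adjust both as needed:

# Count the most frequent "query" parameters seen by the search page.
grep -o 'query=[^& ]*' logs/localhost_access_log.*.txt | sort | uniq -c | sort -rn | head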
Can anyone help me?
On Tue, Jul 14, 2009 at 7:05 PM, lei wang nutchmaill...@gmail.com wrote:
I run Nutch to convert ARC files to segments. It works well for 1
million pages, but when I increase the page count to 500 million, it
fails with the error messages below. Can anyone help me?