Hello,
I am looking for a way to search for multiple indexes from one webapp
and found some code. I can always make one webapp = one website, but
what if it grows?
Is it possible to make this code work:
in search.jsp
/*
Comment out this original line of code and use the code below.
Hello,
You should try to copy your data to a local machine and test it there. A VPS
imposes a lot of limits depending on the technology used. Anyway, Nutch is
disk-bound; a slow disk will give you very slow results.
VPSs always run on commodity hardware; I am almost sure there is a
standard SATA drive and
Hi,
I want to set up site search for a few of my (and my friends') websites, but
without access to the database data. So after crawling with Nutch I have
2 ways:
1. index the data into Solr
2. leave it with the Nutch index
I need help finding the advantages/disadvantages of Solr vs. Nutch
searching because I
Hello,
First - great job, it looks and works very nice.
I have a question about urlfilters. Is it possible to have a
regex-urlfilter per instance (a different one for each instance)?
Also, what is the nutch-gui/conf/regex-urlfilter.txt file for?
Feature request - option to merge segments or maybe
Hello,
Can you look at the size of the merged segments?
If I remember correctly, when I had segment1 = 1 GB and segment2 = 1 GB, the new
merged segment was around 5 GB, but I haven't had time to look into it.
Thanks,
Bartosz
czerwionka paul writes:
hi justin,
i am running hadoop in distributed mode and
As Arkadi said, your HDD is too slow for 2 x quad-core processors. I have
the same problem and am now thinking of using more boxes or very fast
drives (SAS 15k).
Raymond Balmès writes:
Well, I suspect the sort function is single-threaded, as they usually are, so
only one core is used; 25% is the max you
Andrzej Bialecki writes:
Bartosz Gadzimski wrote:
As Arkadi said, your HDD is too slow for 2 x quad-core processors. I
have the same problem and am now thinking of using more boxes or very
fast drives (SAS 15k).
Raymond Balmès writes:
Well, I suspect the sort function is single-threaded, as usually
has a space when Tomcat is run as a service under Windows
Vista.
Any suggestions on how to fix that ?
-Raymond-
2009/6/1 Raymond Balmès raymond.bal...@gmail.com
hum where do you change that ?
2009/6/1 Bartosz Gadzimski bartek...@o2.pl
I think it is the login of the user running Nutch.
It wants a single token after running whoami, but it gets:
autorite nt\system
So change the user to a single word like autorite and it should help.
Thanks,
Bartosz
Raymond Balmès writes:
Hi guys,
My webapp suddenly stopped working... I get the
Hello,
The problem is partially solved, but I am still writing about it :)
Using the bin/nutch commands (inject, generate, fetch, etc.) works.
Only bin/nutch crawl is not
--
I have successfully setup hadoop cluster on 6 nodes (1
Hello,
First - congratulations to the new PMC member.
Second - I still have a problem with the new scoring framework.
After
bin/nutch org.apache.nutch.scoring.webgraph.WebGraph -segment
crawl/segments/20090306093949/ -webgraphdb crawl/webgraphdb
it seems that after dumping:
bin/nutch
Chris Muktar writes:
I'm trying to start a Nutch server on localhost without installing
Tomcat. Can I do this:
bin/nutch server 8080 crawl
It seems to execute just fine and opens the port, although it doesn't serve
any pages. I'm a newbie to this, so I might have missed something critical!
Any
Hi,
During a test crawl (with the crawl command) of a big 1-million-URL website,
HDD space ran out.
So I have:
a crawldb with 1,112,000 URLs (112,000 URLs were tested before)
segments with 40 GB of data
an index with partial data
/tmp/hadoop-root with 173 GB of temporary Hadoop data
After looking at mailing
Hello,
It's hard for me to get the big picture of why to use Solr for indexing and
searching.
Could someone try to describe this a little bit?
I understand that Nutch does the crawling and Solr just the indexing and
searching?
Any help would be great.
Thanks,
Bartosz
Hi,
Which version of nutch are you using?
There is a wiki tutorial on running Nutch in Eclipse (it's important to
add the conf dir to the classpath and move it to the top of the loaded libs):
http://wiki.apache.org/nutch/RunNutchInEclipse0.9
I've installed the Nutch RC in Eclipse on Windows just 2 hours ago and went
through that tutorial.. Must have missed
something..
-Original Message-
From: Bartosz Gadzimski [mailto:bartek...@o2.pl]
Sent: Tuesday, March 10, 2009 8:02 AM
To: nutch-user@lucene.apache.org
Subject: Re: Hadopp Config Exception in Nutch
Hi,
Which version of nutch are you using?
You
Hi,
You can use the Adaptive class, and in theory your site will be very fresh.
Change org.apache.nutch.crawl.DefaultFetchSchedule to
org.apache.nutch.crawl.AdaptiveFetchSchedule
In nutch-default.xml you have a bunch of options for this class:
<property>
<name>db.fetch.schedule.class</name>
Oh, I forgot - I didn't test that one, so I can't tell you how it works.
I know that many people run generate, fetch, etc. loops very
often to keep sites fresh
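For reference, one iteration of the generate/fetch/update loop people run looks roughly like this. This is only a sketch: the crawl directory name, topN value, and 0.9-era command syntax are assumptions, so adjust them to your setup.

```sh
# Sketch of one iteration of a manual Nutch crawl loop (paths are examples)
CRAWL=crawl
bin/nutch generate $CRAWL/crawldb $CRAWL/segments -topN 1000
SEGMENT=`ls -d $CRAWL/segments/* | tail -1`   # pick the newest segment
bin/nutch fetch $SEGMENT
bin/nutch updatedb $CRAWL/crawldb $SEGMENT
```

Repeating this loop (and re-indexing afterwards) is what keeps the search data fresh.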
John Martyniak writes:
Justin,
thanks for the info this very helpful.
This value would apply to all pages though. I was thinking
Yves Yu writes:
Hi, all,
My Nutch can display the cached page correctly for most pages, but some pages cannot.
It always says something like the following:
Display of this content was administratively prohibited by the webmaster.
You may visit the original page instead:
http://forum.laopdr.gov.la/forums/list.page.
Any
million urls), is Nutch-trunk stable
enough to use for the fetching and indexing?
-John
On Mar 3, 2009, at 11:11 AM, Bartosz Gadzimski wrote:
Oh, I forgot - I didn't test that one, so I can't tell you how it works.
I know that many people run generate, fetch, etc. loops very
often to make
alx...@aim.com writes:
Hello,
I use nutch-0.9 and am trying to index URLs containing ? and other symbols. I have
commented out the line -[?*!@=] in the conf/crawl-urlfilter.txt, conf/automaton-urlfilter and
conf/regex-urlfilter.txt files.
However, Nutch still ignores these URLs.
Does anyone know how this can be
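For context, the filter line being discussed looks like this in the stock conf/regex-urlfilter.txt; commenting it out (as below) should let query-string URLs through, provided the same change is made in every filter file the configured urlfilter plugins actually load:

```
# skip URLs containing certain characters as probable queries, etc.
# -[?*!@=]
```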
Hi,
The admin command is not valid in 0.9 or in trunk versions of Nutch.
You can use this tutorial:
http://peterpuwang.googlepages.com/NutchGuideForDummies.htm
and many more on nutch wiki:
http://wiki.apache.org/nutch/
nutchu...@sycona.com writes:
Hi Alexander
I downloaded the Nutch tutorial and
Hello,
I am testing crawling (with the bin/nutch crawl command) on www.webhostingtalk.pl,
and it looks like the crawler fetches many disallowed URLs,
for example (there are many more):
robots.txt Disallow: /index.php?showuser
but it fetched and indexed:
http://www.webhostingtalk.pl/index.php?showuser=6470
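Until the robots.txt handling itself is sorted out, one possible workaround (my suggestion, not from this thread) is to exclude the disallowed paths explicitly in conf/regex-urlfilter.txt:

```
# manually exclude a path that the site's robots.txt disallows (workaround sketch)
-^http://www\.webhostingtalk\.pl/index\.php\?showuser
```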
Hello,
It might sound stupid, but try adding a few spaces and a few new lines to
your myURLS.txt (it happened a few times on different computers, both Linux
and Windows).
Thanks,
Bartosz
Lukas, Ray writes:
Thanks for your help.. I have neither of these properties defined in the
nutch-site.xml file,
man for helping me out on this..
ray
-Original Message-
From: Bartosz Gadzimski [mailto:bartek...@o2.pl]
Sent: Thursday, February 26, 2009 8:06 AM
To: nutch-user@lucene.apache.org
Subject: Re: Does not locate my urls or filter problem.
Hello,
It might sound stupid but try to add
NutchDeveloper writes:
I use this script to crawl and recrawl web:
http://wiki.apache.org/nutch/Crawl
I noticed that the database grows very slowly (depth=2, topn=1000, adddays=30)
because it fetches the same URLs several times in different recrawl loops.
What should I do to force Nutch to fetch ONLY
manavr writes:
Hi,
I have a set of 100,000 URLs that I am trying to crawl and index. I have
the heap memory size for child tasktrackers set to 512 MB. I have disabled PDF
and DOC parsing currently. I am running this on Nutch-0.8 with 1 RHEL node
with depth set to 1.
I get this
Nicolas MARTIN writes:
Hi,
Given that Nutch runs on the Hadoop platform, I would like
to know if I have to follow the Hadoop Quick Start guide
(http://hadoop.apache.org/core/docs/r0.19.0/quickstart.html)
before configuring Nutch.
TY
Hello,
If you want to run it on a single
Nicolas MARTIN writes:
Hi,
I run Nutch under Windows using Cygwin. When I try a basic command like:
bin/nutch crawl urls -dir crawl -depth 3 -topN 50, I get the error in the
title of this message.
However, I have already defined a JAVA_HOME environment variable in
Windows...
Still I have to
Hi,
Your searcher.dir property should be set in:
{tomcat-webapps}/nutch_folder/WEB-INF/classes/nutch-site.xml
and its value should be the absolute path to your crawl directory, e.g.
/usr/local/nutch/crawl
Regards,
Bartosz
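A minimal sketch of that webapp nutch-site.xml (the crawl path is only an example; use your own):

```xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>searcher.dir</name>
    <!-- absolute path to the directory produced by the crawl (example path) -->
    <value>/usr/local/nutch/crawl</value>
  </property>
</configuration>
```

Tomcat must be restarted (or the webapp reloaded) for the change to take effect.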
samuel.gre...@mesaaz.gov writes:
I have tried Tomcat 6.0, and after escaping some
crawls, so the crawldb is also up to date.
Be careful with DMOZ; there is a lot of spam out there.
The loop is also useful for invertlinks etc., whenever it is important to have single segments and not the whole directory.
-Original Message-
From: Bartosz Gadzimski [mailto:bartek
some performance
problems to merge and index them quickly.
-Original Message-
From: Bartosz Gadzimski [mailto:bartek...@o2.pl]
Sent: Thursday, 19 February 2009 15:38
To: nutch-user@lucene.apache.org
Subject: Re: AW: AW: How to index while fetcher works
Dear Nadine,
So when