Multiple index from webapp

2009-11-05 Thread Bartosz Gadzimski
Hello, I am looking for a way to search multiple indexes from one webapp and found some code. I can always make one webapp = one website, but what if it grows? Is it possible to make this code work: in search.jsp /* Comment this original line of code and use code below.

Re: reduce heap space error + DiskChecker$DiskErrorException

2009-11-04 Thread Bartosz Gadzimski
Hello, You should try copying your data to a local machine and testing it there. A VPS imposes a lot of limits depending on the technology used. Anyway, nutch is disk bound; a slow disk will get you very slow results. VPSs are always on commodity hardware, I am almost sure that there's a standard SATA drive and

Nutch/Solr question

2009-11-04 Thread Bartosz Gadzimski
Hi, I want to build site search for a few of my (and friends') websites, but without access to the database data. So I will use nutch for crawling, and then I have 2 options: 1. index the data into solr 2. leave it in the nutch index. I need help in finding advantages/disadvantages of solr vs nutch searching because I

Re: graphical user interface v0.2 for nutch

2009-09-30 Thread Bartosz Gadzimski
Hello, First - great job, it looks and works very nicely. I have a question about urlfilters. Is it possible to get a regex-urlfilter per instance (different for each instance)? Also, what is the nutch-gui/conf/regex-urlfilter.txt file for? Feature request - option to merge segments or maybe

Re: Merge taking forever

2009-06-15 Thread Bartosz Gadzimski
Hello, Can you check the size of the merged segments? If I remember correctly, when I had segment1 = 1GB and segment2 = 1GB, the new merged segment was like 5GB, but I haven't had time to look into it. Thanks, Bartosz czerwionka paul wrote: hi justin, i am running hadoop in distributed mode and

Re: Merge taking forever

2009-06-04 Thread Bartosz Gadzimski
As Arkadi said, your hdd is too slow for a 2 x quad core processor. I have the same problem and am now thinking of using more boxes or very fast drives (sas 15k). Raymond Balmès wrote: Well I suspect the sort function is mono-threaded as usually they are so only one core is used 25% is the max you

Re: Merge taking forever

2009-06-04 Thread Bartosz Gadzimski
Andrzej Bialecki wrote: Bartosz Gadzimski wrote: As Arkadi said, your hdd is too slow for a 2 x quad core processor. I have the same problem and am now thinking of using more boxes or very fast drives (sas 15k). Raymond Balmès wrote: Well I suspect the sort function is mono-threaded as usually

Re: Problem opening the index

2009-06-03 Thread Bartosz Gadzimski
has a space when Tomcat is run as a service under windows vista. Any suggestions on how to fix that? -Raymond- 2009/6/1 Raymond Balmès raymond.bal...@gmail.com hum, where do you change that? 2009/6/1 Bartosz Gadzimski bartek...@o2.pl I think it is the login of the user running nutch

Re: Problem opening the index

2009-06-01 Thread Bartosz Gadzimski
I think it is the login of the user running nutch. It wants one token from whoami but it gets: autorite nt\system. So change the user to a one-word name like autorite and it should help. Thanks, Bartosz Raymond Balmès wrote: Hi guys, My webapp does not work anymore suddenly... I get the
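The failure described in this message can be illustrated with a small sketch. It assumes, as the message suggests, that the user name is read as a single whitespace-separated token of the whoami output; the function name whoami_tokens is made up for this illustration and is not part of nutch or hadoop.

```python
def whoami_tokens(whoami_output: str) -> list[str]:
    """Split whoami output on whitespace, as a naive one-token parser would."""
    return whoami_output.split()


# A one-word login yields exactly one token.
print(whoami_tokens("autorite"))
# A login containing a space yields two tokens, so a parser expecting one breaks.
print(whoami_tokens(r"autorite nt\system"))
```

Running the service under an account whose name contains no space avoids the second case.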

Job not finished on nutch and hadoop

2009-05-14 Thread Bartosz Gadzimski
Hello, The problem is partially solved but I am still writing it up :) Using bin/nutch commands (inject, generate, fetch etc.) is working. Only bin/nutch crawl is not -- I have successfully set up a hadoop cluster on 6 nodes (1

Re: LinkRank job in webgraph scoring fails

2009-03-25 Thread Bartosz Gadzimski
Hello, First - congratulations to the new PMC member. Second - I still have a problem with the new scoring framework. After bin/nutch org.apache.nutch.scoring.webgraph.WebGraph -segment crawl/segments/20090306093949/ -webgraphdb crawl/webgraphdb it seems that after dumping: bin/nutch

Re: Standalone nutch server

2009-03-25 Thread Bartosz Gadzimski
Chris Muktar wrote: I'm trying to initiate a nutch server on localhost without installing tomcat. Can I do this: bin/nutch server 8080 crawl it seems to execute just fine and opens the port, although it doesn't serve any pages. I'm a newb to this so I might have missed something critical! Any

Cleaning after job failed

2009-03-18 Thread Bartosz Gadzimski
Hi, During crawl tests (with the crawl command) on a big website of 1 million pages, the HDD ran out of space. So I have: crawldb with 1 112 000 urls (112 000 urls were tested before); segments with 40GB of data; index with partial data; /tmp/hadoop-root with 173GB of temporary hadoop data. After looking at mailing

nutch - solr integration advantages

2009-03-17 Thread Bartosz Gadzimski
Hello, It's hard for me to get the big picture of why to use solr for indexing and searching. Could someone try to describe this a little bit? I understand that nutch does the crawling and solr just the indexing and searching? Any help would be great. Thanks, Bartosz

Re: Hadopp Config Exception in Nutch

2009-03-10 Thread Bartosz Gadzimski
Hi, Which version of nutch are you using? There is a wiki tutorial on running nutch in eclipse (it's important to add the conf dir to the classpath and move it to the top of the library load order) http://wiki.apache.org/nutch/RunNutchInEclipse0.9 I've installed nutch rc in eclipse on windows just 2 hours ago and

Re: Hadopp Config Exception in Nutch

2009-03-10 Thread Bartosz Gadzimski
through that tutorial.. Must have missed something.. -Original Message- From: Bartosz Gadzimski [mailto:bartek...@o2.pl] Sent: Tuesday, March 10, 2009 8:02 AM To: nutch-user@lucene.apache.org Subject: Re: Hadopp Config Exception in Nutch Hi, Which version of nutch are you using? You

Re: Keeping content fresh

2009-03-03 Thread Bartosz Gadzimski
Hi, You can use the Adaptive class and in theory your site will be very fresh. Change org.apache.nutch.crawl.DefaultFetchSchedule to org.apache.nutch.crawl.AdaptiveFetchSchedule. In nutch-default.xml you have a bunch of options for this class: <property> <name>db.fetch.schedule.class</name>
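As a rough sketch of the change described in this message, the snippet below builds the property entry that selects the adaptive schedule. The property name and class name come from the message; placing the override in conf/nutch-site.xml rather than editing nutch-default.xml is an assumption, and make_property is a helper invented for this example.

```python
import xml.etree.ElementTree as ET


def make_property(name: str, value: str) -> ET.Element:
    """Build a <property> element in the name/value layout of nutch config files."""
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = name
    ET.SubElement(prop, "value").text = value
    return prop


# Assumption: this entry is pasted inside <configuration> in conf/nutch-site.xml.
override = make_property(
    "db.fetch.schedule.class",
    "org.apache.nutch.crawl.AdaptiveFetchSchedule",
)
print(ET.tostring(override, encoding="unicode"))
```

The tuning options mentioned in the message are not reproduced here because the snippet is cut off before listing them.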

Re: Keeping content fresh

2009-03-03 Thread Bartosz Gadzimski
Oh, I forgot. I didn't test that one so I can't tell you how it works. I know that many people run generate, fetch, etc. loops very often to keep sites fresh. John Martyniak wrote: Justin, thanks for the info, this is very helpful. This value would apply to all pages though. I was thinking

Re: why a forum cannot be viewed cache correctly

2009-03-03 Thread Bartosz Gadzimski
Yves Yu wrote: Hi, all, My nutch can show the cached view correctly for most pages, but not for some. It always says the following: Display of this content was administratively prohibited by the webmaster. You may visit the original page instead: http://forum.laopdr.gov.la/forums/list.page. Any

Re: Keeping content fresh

2009-03-03 Thread Bartosz Gadzimski
million urls), is Nutch-trunk stable enough to use to do the fetching and indexing? -John On Mar 3, 2009, at 11:11 AM, Bartosz Gadzimski wrote: Oh, I forgot. I didn't test that one so I can't tell you how it works. I know that many people run generate, fetch, etc. loops very often to make

Re: urls with ? and symbols

2009-03-01 Thread Bartosz Gadzimski
alx...@aim.com wrote: Hello, I use nutch-0.9 and try to index urls with ? and symbols. I have commented out this line: -[...@=] in the conf/crawl-urlfilter.txt, conf/automaton-urlfilter and conf/regex-urlfilter.txt files. However nutch still ignores these urls. Does anyone know how this can be

Re: Could not find the main class: admin.

2009-03-01 Thread Bartosz Gadzimski
Hi, The admin command is not valid in 0.9 or in trunk versions of nutch. You can use this tutorial: http://peterpuwang.googlepages.com/NutchGuideForDummies.htm and many more on the nutch wiki: http://wiki.apache.org/nutch/ nutchu...@sycona.com wrote: Hi Alexander i downloaded the nutch tutorial and

Is nutch obey robots.txt properly?

2009-02-26 Thread Bartosz Gadzimski
Hello, I am testing crawling (with the bin/crawl command) on www.webhostingtalk.pl and it looks like the crawler fetches many disallowed urls. For example (there are many more), robots.txt has: Disallow: /index.php?showuser but it fetched and indexed: http://www.webhostingtalk.pl/index.php?showuser=6470
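The expectation in this report can be checked with a minimal sketch of robots.txt prefix matching: a URL is disallowed when its path plus query string starts with a Disallow value. This is a simplified matcher written for illustration, not the code nutch uses; the function name is_disallowed and the second test URL are made up.

```python
from urllib.parse import urlsplit


def is_disallowed(url: str, disallow_rules: list[str]) -> bool:
    """Return True if the URL's path (plus query) starts with any Disallow prefix."""
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    # An empty Disallow value means "allow everything", so it never matches.
    return any(bool(rule) and target.startswith(rule) for rule in disallow_rules)


rules = ["/index.php?showuser"]
# URL from the report: matches the Disallow prefix.
print(is_disallowed("http://www.webhostingtalk.pl/index.php?showuser=6470", rules))
# Hypothetical URL on the same site: does not match the prefix.
print(is_disallowed("http://www.webhostingtalk.pl/index.php?showtopic=1", rules))
```

Under plain prefix matching the reported URL falls under the rule, which is why fetching it looks like a robots.txt violation.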

Re: Does not locate my urls or filter problem.

2009-02-26 Thread Bartosz Gadzimski
Hello, It might sound stupid, but try adding a few spaces and a few new lines to your myURLS.txt (it has happened a few times on different computers, both linux and windows). Thanks, Bartosz Lukas, Ray wrote: Thanks for your help.. I have neither of these properties defined in the nutch-site.xml file,

Re: Does not locate my urls or filter problem.

2009-02-26 Thread Bartosz Gadzimski
man for helping me out on this.. ray -Original Message- From: Bartosz Gadzimski [mailto:bartek...@o2.pl] Sent: Thursday, February 26, 2009 8:06 AM To: nutch-user@lucene.apache.org Subject: Re: Does not locate my urls or filter problem. Hello, It might sound stupid but try to add

Re: nutch fetches already fetched urls again and again

2009-02-26 Thread Bartosz Gadzimski
NutchDeveloper wrote: I use this script to crawl and recrawl the web: http://wiki.apache.org/nutch/Crawl I noticed that the database grows very slowly (depth=2, topn=1000, adddays=30) because it fetches the same urls several times in different recrawl loops. What should I do to force Nutch to fetch ONLY

Re: OutOfMemory Exception in parsing

2009-02-24 Thread Bartosz Gadzimski
manavr wrote: Hi, I have a set of 1,00,000 urls that I am trying to crawl and index. I have heap memory size for child tasktrackers set to 512MB. I have disabled pdf and doc parsing currently. I am running this on Nutch-0.8 with 1 RHEL node with depth set to 1. I get this

Re: configuring hadoop with nutch

2009-02-24 Thread Bartosz Gadzimski
Nicolas MARTIN wrote: Hi, Assuming the fact that Nutch is running under Hadoop platform, i would like to know if i have to follow the Hadoop Quick Start guide (http://hadoop.apache.org/core/docs/r0.19.0/quickstart.html) before configuring Nutch. TY Hello, If you want to run it on single

Re: JAVA_HOME is not set

2009-02-24 Thread Bartosz Gadzimski
Nicolas MARTIN wrote: Hi, I run nutch under Windows using cygwin. When i try a basic command like : bin/nutch crawl urls -dir crawl -depth 3 -topN 50 i have the error in the title of this message. However i have already defined an environment variable JAVA_HOME in windows... Still I have to

Re: HTTP Status 500 - No Context configured to process this request

2009-02-21 Thread Bartosz Gadzimski
Hi, Your searcher.dir property should be set in: {tomcat-webapps}/nutch_folder/WEB-INF/classes/nutch-site.xml and its value should be the absolute path to your crawl dir, e.g. /usr/local/nutch/crawl Regards, Bartosz samuel.gre...@mesaaz.gov wrote: I have tried tomcat 6.0 and after escaping some
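The advice in this message can be turned into a quick sanity check. The sketch below parses the text of a nutch-site.xml file and reports whether searcher.dir is set to an absolute path. The property name and the example path come from the message; the function name searcher_dir_is_absolute and the sample file content are assumptions made for illustration.

```python
import posixpath
import xml.etree.ElementTree as ET


def searcher_dir_is_absolute(site_xml: str) -> bool:
    """Return True if searcher.dir is present and its value is an absolute POSIX path."""
    root = ET.fromstring(site_xml)
    for prop in root.iter("property"):
        if prop.findtext("name", "").strip() == "searcher.dir":
            return posixpath.isabs(prop.findtext("value", "").strip())
    return False  # property missing


# Assumed file content; only searcher.dir and its example path are from the message.
SITE_XML = """
<configuration>
  <property>
    <name>searcher.dir</name>
    <value>/usr/local/nutch/crawl</value>
  </property>
</configuration>
"""
print(searcher_dir_is_absolute(SITE_XML))
```

A relative value such as crawl would fail this check, which matches the message's point that the webapp needs the full path to the crawl directory.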

Re: AW: AW: How to index while fetcher works

2009-02-19 Thread Bartosz Gadzimski
crawls, so the crawldb is also up to date. Be careful with Dmoz, there is a lot of Spam out there. The loop is also useful for invertinglinks etc. whenever it is important to have single segments and not the whole directory. -Original Message- From: Bartosz Gadzimski [mailto:bartek

Re: AW: AW: AW: How to index while fetcher works

2009-02-19 Thread Bartosz Gadzimski
some performance problems to merge and index them quickly. -Original Message- From: Bartosz Gadzimski [mailto:bartek...@o2.pl] Sent: Thursday, February 19, 2009 15:38 To: nutch-user@lucene.apache.org Subject: Re: AW: AW: How to index while fetcher works Dear Nadine, So when