Re: Does hadoop not reclaim blocks when files are deleted?

2006-04-11 Thread Andrzej Bialecki
Shawn Gervais wrote: Andrzej Bialecki wrote: Shawn Gervais wrote: Greetings list, This is my DFS report: Total raw bytes: 709344133120 (660.62 Gb) Used raw bytes: 302794461922 (281.99 Gb) % used: 42.68% Total effective bytes: 11826067632 (11.01 Gb) Effective replication multiplier:

Re: [Nutch-general] Add new content on the fly!

2006-04-11 Thread Kelvin Tan
Dave, you could think about running a separate crawler to handle these ad-hoc requests, perform the crawl, generate the index, then merge with the live index. This will result in a shorter turn-around time for the paying customers anyhow.. kelvin On Sat, 8 Apr 2006 16:32:30 -0400,

Re: Small dev question

2006-04-11 Thread Andrzej Bialecki
Gal Nitzan wrote: Hi Andrzej, I have two questions in regards to ParseOutputFormat.java: 1. On line 102 a String[] is used. Do you think it might be better to use a ListArray? It will save a few cycles down the road -- it shall save you to use validCount and will save you the if on line 121. I

Enabling different file types

2006-04-11 Thread bob knob
Hi, it's me again, If I'm going to use Nutch, I need xls, ppt, doc file types to be searchable if at all possible. The wiki says most file types are disabled by default, but they can be turned on by changing conf/nutch-site.xml. Unfortunately there is no documentation that I can find for this

Re: Enabling different file types

2006-04-11 Thread Jérôme Charron
types to be searchable if at all possible. The wiki says most file types are disabled by default, but they can be turned on by changing conf/nutch-site.xml. Unfortunately there is no documentation that I can find for this file... any ideas how to do it, or sample xml that somebody could send

Adaptive fetch patch

2006-04-11 Thread Raghavendra Prabhu
Hi Andrzej, Is the adaptive fetch patch in sync with the main code? As I mentioned, it will be useful to have this feature; it will help avoid unnecessary recrawls of static HTML pages and the bandwidth they waste. Rgds Prabhu

Re: Enabling different file types

2006-04-11 Thread Rajesh Munavalli
Follow these steps for nutch-0.7.2: (1) Modify the nutch-default.xml for the following property. For ex: if you want to include the doc file type, change the value node to parse-(text|html|doc) as shown below. <property> <name>plugin.includes</name>

Re: Enabling different file types

2006-04-11 Thread bob knob
Okay but it sounds like I need parser plugins for word, excel and powerpoint - plugins only has a parser-msword directory. Has anyone created plugins for excel powerpoint? --- Jérôme Charron [EMAIL PROTECTED] wrote: types to be searchable if at all possible. The wiki says most file

Re: Enabling different file types

2006-04-11 Thread Rajesh Munavalli
Have a look at http://jakarta.apache.org/poi/ On 4/11/06, bob knob [EMAIL PROTECTED] wrote: Okay but it sounds like I need parser plugins for word, excel and powerpoint - plugins only has a parser-msword directory. Has anyone created plugins for excel powerpoint? --- Jérôme Charron [EMAIL

Re: Enabling different file types

2006-04-11 Thread Jérôme Charron
Okay but it sounds like I need parser plugins for word, excel and powerpoint - plugins only has a parser-msword directory. Has anyone created plugins for excel powerpoint? They are available in the trunk version, not in the 0.7.x Jérôme -- http://motrech.free.fr/ http://www.frutch.org/

Nutch administration web interface?

2006-04-11 Thread Robert Douglass
Hi, has anyone done any work on a web interface for administering Nutch? How would one go about doing this? In Java, I imagine you'd use the Java classes directly (the command line tool is just a wrapper for the Java, after all), but in other languages (I'm thinking PHP), would it be most

Re: Nutch administration web interface?

2006-04-11 Thread Rida Benjelloun
Hi Robert, You can see this page http://wiki.apache.org/nutch/NutchAdministrationUserInterface. But I don't have any idea about the advancement of this project. Best regards. On 4/10/06, Robert Douglass [EMAIL PROTECTED] wrote: Hi, has anyone done any work on a web interface for administering

Re: Nutch administration web interface?

2006-04-11 Thread Stefan Groschupf
... a beta will be available soon. Am 11.04.2006 um 22:22 schrieb Rida Benjelloun: Hi Robert, You can see this page http://wiki.apache.org/nutch/NutchAdministrationUserInterface. But I don't have any idea about the advancement of this project. Best regards. On 4/10/06, Robert Douglass

Re: Nutch administration web interface?

2006-04-11 Thread carmmello
Will this interface also cope with Nutch 0.7 or just the new 0.8? - Original Message - From: Stefan Groschupf [EMAIL PROTECTED] To: nutch-user@lucene.apache.org Sent: Tuesday, April 11, 2006 5:53 PM Subject: Re: Nutch administration web interface? ... a beta will be available soon.

Re: Nutch administration web interface?

2006-04-11 Thread Stefan Groschupf
just 0.8. Am 11.04.2006 um 23:08 schrieb carmmello: Will this interface also cope with Nutch 0.7 or just the new 0.8? - Original Message - From: Stefan Groschupf [EMAIL PROTECTED] style.com To: nutch-user@lucene.apache.org Sent: Tuesday, April 11, 2006 5:53 PM Subject: Re: Nutch

Re: Saving Metadata to Mysql

2006-04-11 Thread John Reidy
I am looking at something similar. I would guess the place to put it is the indexer. As I understand it the parser runs for just about everything fetched, however the indexer is only run for pages you want to index. I am also looking at having static objects (Eg a connection) that is

Re: Saving Metadata to Mysql

2006-04-11 Thread sudhendra seshachala
Sorry for just jumping in. We have a doc id associated when we index. We could store the doc id in a mysql table. We could use the doc id to query the nutch database. When parsing, capture the things needed as part of the metadata. Index the metadata. The doc id associated is stored in mysql. Does that give

RE: Nutch 500 Error

2006-04-11 Thread Paul Stewart
Thanks, I was doing the java command wrong... Back to my original problem - I re-ran through the entire tutorial to ensure I was doing it right and it seems proper. How do I tell Nutch where to look specifically in the code for the segments and indexes in case it is in the wrong place?

RE: Nutch 500 Error

2006-04-11 Thread sudhendra seshachala
Check the nutch-default.xml; there should be a property searcher.dir. Provide the path for the index folder. Better still, copy the property node, paste it in nutch-site.xml, and provide the path for the index folder there. For ex: If the index folder is stored as home/nutch/crawl - crawldb -

Same Error (Version 0.8)

2006-04-11 Thread mikeyc
Hey Chris, Any idea why I would get the same error message even though I updated my nutch-site.xml and parse-plugins.xml files? 060411 230237 ParserFactory: Plugin: org.apache.nutch.xxx.xxx.xxx mapped to contentType text/html via parse-plugins.xml, but not enabled via plugin.includes in

Re: Same Error (Version 0.8)

2006-04-11 Thread Chris Mattmann
Hi Mike, Could you post the snippet from your nutch-site.xml where you enable plugin: org.apache.nutch.xxx.xxx.xxx. Could you also be more specific and post the entire name of the plugin that it printed in your log file? This warning message basically means that there was an entry in the

Re: Same Error (Version 0.8)

2006-04-11 Thread mikeyc
Sure, no problem. Log message: 060411 235725 ParserFactory: Plugin: org.apache.nutch.microformats.hreview.HReviewParser mapped to contentType text/html via parse-plugins.xml, but not enabled via plugin.includes in nutch-default.xml. parse-plugins.xml: mimeType name=application/xhtml+xml

RE: Auto-crawling re-crawling the web site

2006-04-11 Thread Cherian Thomas
Hi, On linux OS and using tomcat 5.0 we could get new pages without server restart. On windows this problem persists because tomcat puts a lock on the directory where indexes are stored. -Cherian Thomas -Original Message- From: bob knob [mailto:[EMAIL PROTECTED] Sent: Tuesday, April 11,

RE: Enabling different file types

2006-04-11 Thread Cherian Thomas
Hi, Enter the following in the nutch-site.xml: <nutch-conf> <property> <name>plugin.includes</name> <value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html|js|pdf|msword|zip|mspowerpoint|msexcel)|index-basic|query-(basic|site|url)</value> <description>Regular expression
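
Editor's sketch (not part of the original thread): a quick way to sanity-check a plugin.includes value before pasting it into nutch-site.xml is to test it against the plugin ids you expect to be enabled. The snippet below uses the expression quoted in the message above, with its two line-wrap breaks ("h tml", "sit e") joined. It assumes that a plugin is enabled only when its whole id matches the expression; the helper name `is_enabled` and the candidate ids are hypothetical and only illustrate the check.

```python
import re

# plugin.includes value from the message above, with the line-wrap
# breaks in "html" and "site" joined back together.
PLUGIN_INCLUDES = (
    "nutch-extensionpoints|protocol-http|urlfilter-regex|"
    "parse-(text|html|js|pdf|msword|zip|mspowerpoint|msexcel)|"
    "index-basic|query-(basic|site|url)"
)


def is_enabled(plugin_id, includes=PLUGIN_INCLUDES):
    """Return True if the whole plugin id matches the expression.

    Assumption: ids are compared with a full match, not a substring search.
    """
    return re.fullmatch(includes, plugin_id) is not None


# Hypothetical ids to check; the first three are alternatives named
# in the expression, the last two are not.
candidates = [
    "parse-msword",
    "parse-msexcel",
    "parse-mspowerpoint",
    "parse-rtf",
    "protocol-ftp",
]
enabled = [p for p in candidates if is_enabled(p)]
print(enabled)
```

An id that is missing from the printed list would need to be added to the value, for example as another alternative inside the parse-(...) group.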