date:20060411

Re: Does hadoop not reclaim blocks when files are deleted?

2006-04-11 Thread Andrzej Bialecki

Shawn Gervais wrote: Andrzej Bialecki wrote: Shawn Gervais wrote: Greetings list, This is my DFS report: Total raw bytes: 709344133120 (660.62 Gb) Used raw bytes: 302794461922 (281.99 Gb) % used: 42.68% Total effective bytes: 11826067632 (11.01 Gb) Effective replication multiplier:

Re: [Nutch-general] Add new content on the fly!

2006-04-11 Thread Kelvin Tan

Dave, you could think about running a separate crawler to handle these ad-hoc requests, perform the crawl, generate the index, then merge with the live index. This will result in a shorter turn-around time for the paying customers anyhow.. kelvin On Sat, 8 Apr 2006 16:32:30 -0400,

Re: Small dev question

2006-04-11 Thread Andrzej Bialecki

Gal Nitzan wrote: Hi Andrzej, I have two questions in regards to ParseOutputFormat.java: 1. On line 102 a String[] is used. Do you think it might be better to use a ListArray? It will save a few cycles down the road -- it would save you from having to use validCount and would spare you the if on line 121. I

Enabling different file types

2006-04-11 Thread bob knob

Hi, it's me again, If I'm going to use Nutch, I need xls, ppt, doc file types to be searchable if at all possible. The wiki says most file types are disabled by default, but they can be turned on by changing conf/nutch-site.xml. Unfortunately there is no documentation that I can find for this

Re: Enabling different file types

2006-04-11 Thread Jérôme Charron

types to be searchable if at all possible. The wiki says most file types are disabled by default, but they can be turned on by changing conf/nutch-site.xml. Unfortunately there is no documentation that I can find for this file... any ideas how to do it, or sample xml that somebody could send

Adaptive fetch patch

2006-04-11 Thread Raghavendra Prabhu

Hi Andrzej, Is the adaptive fetch patch in sync with the main code? As I mentioned, it would be useful to have this feature; it would help avoid unnecessary recrawls of static HTML pages and the bandwidth they waste. Rgds, Prabhu

Re: Enabling different file types

2006-04-11 Thread Rajesh Munavalli

Follow these steps for nutch-0.7.2: (1) Modify nutch-default.xml for the following property. For example, if you want to include the doc file type, replace the value node with parse-(text|html|doc) as shown below. <property> <name>plugin.includes</name>

Re: Enabling different file types

2006-04-11 Thread bob knob

Okay but it sounds like I need parser plugins for word, excel and powerpoint - plugins only has a parser-msword directory. Has anyone created plugins for excel and powerpoint? --- Jérôme Charron [EMAIL PROTECTED] wrote: types to be searchable if at all possible. The wiki says most file

Re: Enabling different file types

2006-04-11 Thread Rajesh Munavalli

Have a look at http://jakarta.apache.org/poi/ On 4/11/06, bob knob [EMAIL PROTECTED] wrote: Okay but it sounds like I need parser plugins for word, excel and powerpoint - plugins only has a parser-msword directory. Has anyone created plugins for excel and powerpoint? --- Jérôme Charron [EMAIL

Re: Enabling different file types

2006-04-11 Thread Jérôme Charron

Okay but it sounds like I need parser plugins for word, excel and powerpoint - plugins only has a parser-msword directory. Has anyone created plugins for excel and powerpoint? They are available in the trunk version, not in 0.7.x. Jérôme -- http://motrech.free.fr/ http://www.frutch.org/

Nutch administration web interface?

2006-04-11 Thread Robert Douglass

Hi, has anyone done any work on a web interface for administering Nutch? How would one go about doing this? In Java, I imagine you'd use the Java classes directly (the command line tool is just a wrapper for the Java, after all), but in other languages (I'm thinking PHP), would it be most

Re: Nutch administration web interface?

2006-04-11 Thread Rida Benjelloun

Hi Robert, You can see this page http://wiki.apache.org/nutch/NutchAdministrationUserInterface. But I don't have any idea about the advancement of this project. Best regards. On 4/10/06, Robert Douglass [EMAIL PROTECTED] wrote: Hi, has anyone done any work on a web interface for administering

Re: Nutch administration web interface?

2006-04-11 Thread Stefan Groschupf

... a beta will be available soon. Am 11.04.2006 um 22:22 schrieb Rida Benjelloun: Hi Robert, You can see this page http://wiki.apache.org/nutch/NutchAdministrationUserInterface. But I don't have any idea about the advancement of this project. Best regards. On 4/10/06, Robert Douglass

Re: Nutch administration web interface?

2006-04-11 Thread carmmello

Will this interface also cope with Nutch 0.7 or just the new 0.8? - Original Message - From: Stefan Groschupf [EMAIL PROTECTED] To: nutch-user@lucene.apache.org Sent: Tuesday, April 11, 2006 5:53 PM Subject: Re: Nutch administration web interface? ... a beta will be available soon.

Re: Nutch administration web interface?

2006-04-11 Thread Stefan Groschupf

just 0.8. Am 11.04.2006 um 23:08 schrieb carmmello: Will this interface also cope with Nutch 0.7 or just the new 0.8? - Original Message - From: Stefan Groschupf [EMAIL PROTECTED] To: nutch-user@lucene.apache.org Sent: Tuesday, April 11, 2006 5:53 PM Subject: Re: Nutch

Re: Saving Metadata to Mysql

2006-04-11 Thread John Reidy

I am looking at something similar. I would guess the place to put it is the indexer. As I understand it the parser runs for just about everything fetched, however the indexer is only run for pages you want to index. I am also looking at having static objects (Eg a connection) that is

Re: Saving Metadata to Mysql

2006-04-11 Thread sudhendra seshachala

Sorry for just jumping in. We have a doc id associated when we index. We could store the doc id in a MySQL table. We could use the doc id to query the Nutch database. When parsing, capture the things needed as part of the metadata. Index the metadata. The associated doc id is stored in MySQL. Does that give

RE: Nutch 500 Error

2006-04-11 Thread Paul Stewart

Thanks, I was doing the java command wrong... Back to my original problem - I re-ran through the entire tutorial to ensure I was doing it right, and it seems proper. How do I tell Nutch where to look specifically in the code for the segments and indexes in case it is in the wrong place?

RE: Nutch 500 Error

2006-04-11 Thread sudhendra seshachala

Check nutch-default.xml; there should be a property searcher.dir. Provide the path for the index folder. Better still, copy the property node, paste it into nutch-site.xml, and provide the path for the index folder there. For example, if the index folder is stored as home/nutch/crawl - crawldb -
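The advice above relies on a property set in nutch-site.xml taking precedence over the same property in nutch-default.xml. A minimal Python sketch of that lookup follows. It is not Nutch code: the helper names (`read_properties`, `effective_value`), the two sample documents, and the default value `.` are hypothetical and only illustrate the `<property><name>…</name><value>…</value></property>` layout.

```python
import xml.etree.ElementTree as ET


def read_properties(xml_text: str) -> dict:
    """Collect name/value pairs from every <property> element.

    The root element name is not checked, so the sketch does not depend on
    which root element a particular Nutch version uses.
    """
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            props[name.strip()] = (value or "").strip()
    return props


def effective_value(name: str, default_xml: str, site_xml: str):
    """Return the site value if present, otherwise the default (assumed precedence)."""
    merged = read_properties(default_xml)
    merged.update(read_properties(site_xml))
    return merged.get(name)


# Hypothetical file contents, for illustration only.
DEFAULT_XML = """<nutch-conf>
  <property><name>searcher.dir</name><value>.</value></property>
</nutch-conf>"""

SITE_XML = """<nutch-conf>
  <property><name>searcher.dir</name><value>/home/nutch/crawl</value></property>
</nutch-conf>"""

if __name__ == "__main__":
    print(effective_value("searcher.dir", DEFAULT_XML, SITE_XML))
```

Running a check like this against the real files is a quick way to see which searcher.dir value the web application would pick up, assuming it reads the same configuration directory.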

Same Error (Version 0.8)

2006-04-11 Thread mikeyc

Hey Chris, Any idea why I would get the same error message even though I updated my nutch-site.xml and parse-plugins.xml files? 060411 230237 ParserFactory: Plugin: org.apache.nutch.xxx.xxx.xxx mapped to contentType text/html via parse-plugins.xml, but not enabled via plugin.includes in

Re: Same Error (Version 0.8)

2006-04-11 Thread Chris Mattmann

Hi Mike, Could you post the snippet from your nutch-site.xml where you enable the plugin org.apache.nutch.xxx.xxx.xxx? Could you also be more specific and post the entire name of the plugin that it printed in your log file? This warning message basically means that there was an entry in the

Re: Same Error (Version 0.8)

2006-04-11 Thread mikeyc

Sure, no problem. Log message: 060411 235725 ParserFactory: Plugin: org.apache.nutch.microformats.hreview.HReviewParser mapped to contentType text/html via parse-plugins.xml, but not enabled via plugin.includes in nutch-default.xml. parse-plugins.xml: <mimeType name="application/xhtml+xml"

RE: Auto-crawling re-crawling the web site

2006-04-11 Thread Cherian Thomas

Hi, On linux OS and using tomcat 5.0 we could get new pages without server restart. On windows this problem persists because tomcat puts a lock on the directory where indexes are stored. -Cherian Thomas -Original Message- From: bob knob [mailto:[EMAIL PROTECTED] Sent: Tuesday, April 11,

RE: Enabling different file types

2006-04-11 Thread Cherian Thomas

Hi, Enter the following in nutch-site.xml: <nutch-conf> <property> <name>plugin.includes</name> <value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html|js|pdf|msword|zip|mspowerpoint|msexcel)|index-basic|query-(basic|site|url)</value> <description>Regular expression
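Since plugin.includes is a regular expression over plugin names, it can help to test a candidate value before restarting a crawl. The sketch below uses Python's `re` module as a stand-in for the matching Nutch performs, and it assumes the whole plugin id has to match the expression. The function name `is_enabled` and the list of sample plugin ids are illustrative, not part of Nutch.

```python
import re

# Value quoted in the message above, with the line-wrap breaks
# ("h tml", "sit e") joined back together.
PLUGIN_INCLUDES = (
    "nutch-extensionpoints|protocol-http|urlfilter-regex|"
    "parse-(text|html|js|pdf|msword|zip|mspowerpoint|msexcel)|"
    "index-basic|query-(basic|site|url)"
)


def is_enabled(plugin_id: str, includes: str = PLUGIN_INCLUDES) -> bool:
    """Return True if the entire plugin id matches the includes expression.

    Assumption: a full match is required, so "parse-" alone does not count.
    """
    return re.fullmatch(includes, plugin_id) is not None


if __name__ == "__main__":
    # Sample ids; "parse-rtf" is deliberately absent from the expression.
    for plugin_id in ["parse-msword", "parse-msexcel", "parse-mspowerpoint", "parse-rtf"]:
        print(plugin_id, is_enabled(plugin_id))
```

A plugin that is mapped to a content type but whose id fails a check like this would be reported as not enabled via plugin.includes, which is the situation described in the "Same Error (Version 0.8)" thread.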

24 matches
