Shawn Gervais wrote:
Andrzej Bialecki wrote:
Shawn Gervais wrote:
Greetings list,
This is my DFS report:
Total raw bytes: 709344133120 (660.62 Gb)
Used raw bytes: 302794461922 (281.99 Gb)
% used: 42.68%
Total effective bytes: 11826067632 (11.01 Gb)
Effective replication multiplier:
Dave, you could think about running a separate crawler to handle these ad-hoc
requests, perform the crawl, generate the index, then merge with the live
index. This will result in a shorter turn-around time for the paying customers
anyway.
kelvin
On Sat, 8 Apr 2006 16:32:30 -0400,
Gal Nitzan wrote:
Hi Andrzej,
I have two questions regarding ParseOutputFormat.java:
1. On line 102 a String[] is used. Do you think it might be better to use an
ArrayList? It would save a few cycles down the road: you would no longer need
validCount, and it would save the if on line 121. I
Hi, it's me again,
If I'm going to use Nutch, I need xls, ppt, doc file
types to be searchable if at all possible. The wiki
says most file types are disabled by default, but they
can be turned on by changing conf/nutch-site.xml.
Unfortunately there is no documentation that I can
find for this file... any ideas how to do it, or
sample xml that somebody could send
Hi Andrzej
Is the adaptive fetch patch in sync with the main code?
As I mentioned, this feature would be useful: it would avoid unnecessary
recrawls of static HTML pages and the bandwidth they waste.
Rgds
Prabhu
Follow these steps for nutch-0.7.2:
(1) In nutch-default.xml, modify the following property.
For example, if you want to include the doc file type, change the value node to
parse-(text|html|doc) as shown below.
<property>
<name>plugin.includes</name>
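A minimal Java sketch of how a plugin.includes value can be sanity-checked before restarting a crawl. It assumes, and this is not confirmed anywhere in the thread, that Nutch enables a plugin only when the plugin's whole id matches the regular expression. The class name PluginIncludesCheck, the helper isIncluded, and the plugin ids listed in main are illustrative and not part of Nutch.

```java
import java.util.regex.Pattern;

class PluginIncludesCheck {

    // Assumption: a plugin counts as enabled only when its entire id
    // matches the plugin.includes regular expression (full match, not a
    // substring search).
    static boolean isIncluded(String pluginIncludes, String pluginId) {
        return Pattern.compile(pluginIncludes).matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        // Hypothetical plugin.includes value, modelled on the ones quoted in
        // this thread.
        String includes = "nutch-extensionpoints|protocol-http|urlfilter-regex"
                + "|parse-(text|html|msword)|index-basic|query-(basic|site|url)";

        // Illustrative plugin ids; check the directory names under plugins/
        // in your own installation.
        String[] ids = {"parse-html", "parse-msword", "parse-doc", "parse-pdf"};
        for (String id : ids) {
            System.out.println(id + " enabled: " + isIncluded(includes, id));
        }
    }
}
```

Under that assumption, a value such as parse-(text|html|doc) would only enable a plugin whose id is literally parse-doc, so the alternation should name each plugin the way its directory is named under plugins/ (for example parse-msword).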
Okay but it sounds like I need parser plugins for
word, excel and powerpoint - plugins only has a
parser-msword directory. Has anyone created plugins
for excel powerpoint?
--- Jérôme Charron [EMAIL PROTECTED]
wrote:
types to be searchable if at all possible. The wiki
says most file
Have a look at http://jakarta.apache.org/poi/
On 4/11/06, bob knob [EMAIL PROTECTED] wrote:
Okay but it sounds like I need parser plugins for
word, excel and powerpoint - plugins only has a
parser-msword directory. Has anyone created plugins
for excel powerpoint?
--- Jérôme Charron [EMAIL
Okay but it sounds like I need parser plugins for
word, excel and powerpoint - plugins only has a
parser-msword directory. Has anyone created plugins
for excel powerpoint?
They are available in the trunk version, not in 0.7.x.
Jérôme
--
http://motrech.free.fr/
http://www.frutch.org/
Hi,
has anyone done any work on a web interface for administering Nutch?
How would one go about doing this? In Java, I imagine you'd use the Java
classes directly (the command line tool is just a wrapper for the Java,
after all), but in other languages (I'm thinking PHP), would it be most
Hi Robert,
You can see this page
http://wiki.apache.org/nutch/NutchAdministrationUserInterface. But I don't
have any idea about the advancement of this project.
Best regards.
On 4/10/06, Robert Douglass [EMAIL PROTECTED] wrote:
Hi,
has anyone done any work on a web interface for administering
... a beta will be available soon.
On 11.04.2006 at 22:22, Rida Benjelloun wrote:
Hi Robert,
You can see this page
http://wiki.apache.org/nutch/NutchAdministrationUserInterface. But
I don't
have any idea about the advancement of this project.
Best regards.
On 4/10/06, Robert Douglass
Will this interface also cope with Nutch 0.7 or just the new 0.8?
- Original Message -
From: Stefan Groschupf [EMAIL PROTECTED]
To: nutch-user@lucene.apache.org
Sent: Tuesday, April 11, 2006 5:53 PM
Subject: Re: Nutch administration web interface?
... a beta will be available soon.
just 0.8.
On 11.04.2006 at 23:08, carmmello wrote:
Will this interface also cope with Nutch 0.7 or just the new 0.8?
- Original Message -
From: Stefan Groschupf [EMAIL PROTECTED]
To: nutch-user@lucene.apache.org
Sent: Tuesday, April 11, 2006 5:53 PM
Subject: Re: Nutch
I am looking at something similar.
I would guess the place to put it is the indexer. As I understand it, the
parser runs for just about everything fetched, whereas the indexer only
runs for pages you want to index.
I am also looking at having static objects (Eg a connection) that is
Sorry to just jump in.
We have a doc id associated with each document when we index. We could store
the doc id in a MySQL table and use it to query the Nutch database.
When parsing, capture the things needed as part of the metadata.
Index the metadata; the associated doc id is stored in MySQL.
Does that give
Thanks I was doing the java command wrong...
Back to my original problem - I re-ran through the entire tutorial to
ensure I was doing it right, and it seems correct. How do I tell Nutch
where to look specifically in the code for the segments and indexes, in
case it is looking in the wrong place?
Check nutch-default.xml;
there should be a property named searcher.dir.
Set it to the path of the index folder.
Better still, copy the property node into nutch-site.xml
and set the path of the index folder there.
For ex:
If the index folder is stored as
home/nutch/crawl
- crawldb
-
Hey Chris,
Any idea why I would get the same error message even though I updated my
nutch-site.xml and parse-plugins.xml files?
060411 230237 ParserFactory: Plugin: org.apache.nutch.xxx.xxx.xxx mapped to
contentType text/html via parse-plugins.xml, but not enabled via
plugin.includes in
Hi Mike,
Could you post the snippet from your nutch-site.xml where you enable
plugin: org.apache.nutch.xxx.xxx.xxx. Could you also be more specific and
post the entire name of the plugin that it printed in your log file? This
warning message basically means that there was an entry in the
Sure, no problem.
Log message:
060411 235725 ParserFactory: Plugin:
org.apache.nutch.microformats.hreview.HReviewParser mapped to contentType
text/html via parse-plugins.xml, but not enabled via plugin.includes in
nutch-default.xml
parse-plugins.xml:
<mimeType name="application/xhtml+xml">
Hi,
On Linux with Tomcat 5.0, we could pick up new pages without restarting the
server.
On Windows the problem persists because Tomcat puts a lock on the
directory where the indexes are stored.
-Cherian Thomas
-Original Message-
From: bob knob [mailto:[EMAIL PROTECTED]
Sent: Tuesday, April 11,
Hi,
Enter the following in nutch-site.xml.
<nutch-conf>
<property>
<name>plugin.includes</name>
<value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html|js|pdf|msword|zip|mspowerpoint|msexcel)|index-basic|query-(basic|site|url)</value>
<description>Regular expression