I just started using Nutch to index HTML, Text, Microsoft Documents, PDF.
Our application is Struts Based and we are querying Nutch straight from our
application. The query I have going right now is basically searching the
whole site. I am trying to figure out two things:
1. How to create sear
Hi all
I'm interested in playing with term frequency values in a Nutch index, both
per document and index-wide.
For example, something similar to this Lucene FAQ entry:
http://tinyurl.com/ra3ys
So what is the 'correct' way to inspect the Nutch index for these values?
Particularly against t
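To make the two scopes in the question concrete, here is a minimal sketch in plain Python. It does not touch a Nutch or Lucene index at all: the toy documents, the whitespace tokenizer, and the function names are assumptions made only for illustration. It shows the difference between a per-document term count, an index-wide term count, and the number of documents containing a term.

```python
from collections import Counter


def term_frequencies(docs):
    """Return (per_doc, index_wide) term counts for a dict of doc_id -> text.

    Tokenization is a naive lowercase whitespace split (an assumption);
    a real index would use its own analyzer.
    """
    per_doc = {doc_id: Counter(text.lower().split())
               for doc_id, text in docs.items()}
    index_wide = Counter()
    for counts in per_doc.values():
        index_wide.update(counts)  # adds the per-document counts together
    return per_doc, index_wide


def document_frequency(per_doc, term):
    """Number of documents in which the term occurs at least once."""
    return sum(1 for counts in per_doc.values() if term in counts)


# Hypothetical toy corpus, not taken from any real crawl.
docs = {
    "doc1": "nutch indexes the web",
    "doc2": "the index stores the terms",
}
per_doc, index_wide = term_frequencies(docs)
```

Here per_doc["doc2"]["the"] is the per-document value, index_wide["the"] is the index-wide total, and document_frequency(per_doc, "the") counts the documents that contain the term.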
Gianni-
Here's the recrawl script that Jacob mentioned:
http://wiki.apache.org/nutch/IntranetRecrawl
[Note: There are 0.7.x and 0.8 versions]
Jacob-
I noticed that the 0.8 script also had an issue after merging.
After it merges the segments, it fails to remove all the segments that
it used to
Ok. If you're crawling with these settings you don't need to reindex your
segments again. And how about the plugins that you are using? Are you using
the language-identifier plugin? If not, try it.
Regards,
Note: I speak Portuguese :)
On 9/25/06, carmmello <[EMAIL PROTECTED]> wrote:
This issue h
This issue happens even when I start a new crawl. So, I'm not reindexing
the segments. The indexing is done by nutch itself, using the intranet
method.
Do you mean that after this is done, I have to reindex the segments once
again? But if so, why are the English common terms recognized f
Have you reindexed your segments? It's important, because it makes Nutch
recognize your common terms. I've tried it, and the only thing I noticed was
that the index is larger than the original (before using the common
terms).
On 9/25/06, carmmello <[EMAIL PROTECTED]> wrote:
I'm using Nutch
I'm using Nutch 0.7.2 and have added some common terms in Portuguese to the
common-terms.utf8 file in the conf folder (and also under the classes folder,
inside the ROOT folder on Tomcat), one per line, like:
content:da
contente:de
contente:eu
..
However, when I
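One quick thing to rule out with a hand-edited term file is a misspelled field prefix. The sketch below is only an illustration, not part of Nutch: the function name, the field:term line format, the default set of known fields, and the treatment of lines starting with # as comments are all assumptions. It reports every line whose field prefix is not in the expected set.

```python
def check_common_terms(lines, known_fields=("content",)):
    """Return (line_number, line, reason) tuples for suspicious entries.

    Assumes one "field:term" entry per line; blank lines and lines
    starting with "#" are skipped (an assumption about the file format).
    """
    problems = []
    for number, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        field, sep, term = line.partition(":")
        if not sep or not term:
            problems.append((number, line, "expected field:term"))
        elif field not in known_fields:
            problems.append((number, line, "unknown field " + repr(field)))
    return problems


# The three sample entries as they appear in the message above.
sample = ["content:da", "contente:de", "contente:eu"]
problems = check_common_terms(sample)
```

If the field is meant to be content throughout, this flags the second and third sample lines. Pass a different known_fields tuple if other field names are valid in your setup.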
I had the same exact problem and discovered the source of the Exceptions. I
now have Nutch crawling on Windows. The problem is with cygwin and Windows
path names that contain spaces (or possibly even other characters). Since
I couldn't figure out how to make cygwin paths behave, I copied my Nu
In fact, the site I want to index is on the web.
When I run the crawl, many other sites are indexed as well, because they are
referenced in pages of the site I want to index.
Up to this point that seems fine.
But the problem is that for all these other sites, I have many pages in the
index
and for my sp
You can get it working on Windows if you're willing to work for it. To use
Nutch OOTB, you have to install Cygwin since the provided Nutch launcher is
written in Bash.
Members of the community have provided alternatives, such as this Python
launcher: http://wiki.apache.org/nutch/CrossPlatformNu
Hi,
I did some searching in the Nutch and Hadoop lists and there are some
modifications mentioned
(like UTF8 to Text) and somebody mentioned that with some adjusting it
works.
Can anybody confirm that using Nutch 0.8/0.9 with Hadoop 0.6.x is robust
enough and, if the answer
is yes, can list some mod
Can anyone give me an opinion on how adding an ontology to their search
engine helps to improve search results?
Thanks in advance,
Chad
Florian Fricker wrote:
Yes, after a closer look it seems to be a problem with your Xerces
library.
To solve this, one needs to update tomcat's xerces librar
On Mon, Sep 25, 2006 at 04:03:45PM +0200, Kursun, Mahmut wrote:
> Hi,
>
> I want to install Nutch for testing purposes and would like to know
> which OS, Filesystem and sort of Harddiscs other Nutch users prefer.
>
> What I am going to use is Fedora Core 6 Test 3 with ext3 on a 40-80 GB
> IDE Har
I think the file being indexed contains something like
javascript:something
This makes Nutch think javascript is a protocol, so it throws a malformed
URL exception.
Try "javascript: something"
or go into the code and ignore the MalformedURLException
at org.apache.nutch.net.BasicUrlNorm
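The idea of skipping links whose "protocol" is not really fetchable can be sketched outside Nutch as follows. This is plain Python using the standard urllib.parse module, not Nutch code: the function name and the set of schemes treated as fetchable are assumptions chosen for illustration.

```python
from urllib.parse import urlsplit

# Assumed set of schemes a crawler would actually fetch (hypothetical).
FETCHABLE_SCHEMES = {"http", "https", "ftp", "file"}


def fetchable_outlinks(candidates):
    """Keep only links whose scheme is in FETCHABLE_SCHEMES.

    Links such as "javascript:something" or "mailto:..." parse with a
    scheme the crawler cannot fetch, so they are dropped instead of
    raising an error later on.
    """
    kept = []
    for link in candidates:
        cleaned = link.strip()
        scheme = urlsplit(cleaned).scheme.lower()
        if scheme in FETCHABLE_SCHEMES:
            kept.append(cleaned)
    return kept


links = fetchable_outlinks([
    "http://example.com/a",
    "javascript:something",
    "mailto:someone@example.com",
    " https://example.com/b ",
])
```

Filtering before the URL object is built avoids the exception entirely, which is usually less invasive than catching and ignoring it deeper in the code.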
Hi,
Works perfectly with Mac OS X 10.4.7 as the operating system and HFS-Plus
as filesystem.
Regards
Kursun, Mahmut wrote:
Hi,
I want to install Nutch for testing purposes and would like to know
which OS, Filesystem and sort of Harddiscs other Nutch users prefer.
What I am going to use is F
> Redaktion com! - Das Computer-Magazin
> Neue Mediengesellschaft mbH
> Bayerstr. 26
> 80335 München
> Telefon: +49 / 89 / 74 117-641
> Telefax: +49 / 89 / 74 117-132
> [EMAIL PROTECTED]
> http://www.com-magazin.de
> __ NOD32 1.1773 (20060925) Information ___
Hi,
I want to install Nutch for testing purposes and would like to know
which OS, Filesystem and sort of Harddiscs other Nutch users prefer.
What I am going to use is Fedora Core 6 Test 3 with ext3 on a 40-80 GB
IDE Harddisc. That is also to test Fedora Core 6 Test 3. But I am free
to install any
Hi,
While I'm running the WordCount example on distributed machines, I get the
following error. Please help me.
My hadoop-site.xml is:
fs.default.name localhost:9000
mapred.job.tracker localhost:9001
dfs.replication 1
06/09/25 15:56:21 INFO conf.Configuration: parsing
jar:
Hi,
I'm sorry, but I still haven't succeeded in indexing all the content of my web site.
In the log I have some errors:
2006-09-25 15:35:42,859 ERROR parse.OutlinkExtractor - getOutlinks
java.net.MalformedURLException: unknown protocol: javascript
at java.net.URL.<init>(URL.java:574)
at java.net.URL.<init>(UR
20 matches