I came across an interesting overview of NTLM authentication
possibilities at http://www.oaklandsoftware.com/papers/ntlm.html
I thought I'd just mention it here in case anyone who knows how nutch
authentication works under the hood has anything to say about the
listed options.
2006/11/16, [EMAIL PROTECTED] [EMAIL PROTECTED]:
I have added depth limitation for version 0.7.2. If it is
interesting to someone, I can contribute it.
I am using depth limitation in 0.8.1, but I'm looking at 0.7.2 as the
next version I'll work with, so I'm very interested.
t.n.a.
2006/11/13, carmmello [EMAIL PROTECTED]:
Hi,
Nutch, from version 0.8 on, is really very, very slow at processing data
on a single machine after the crawling. Compared with Nutch 0.7.2 I would say,
...
this series. I don't believe that there are many Nutch users, in the real
world of
2006/11/3, Josef Novak [EMAIL PROTECTED]:
Hi,
Very short question (hopefully). Is it possible to get bin/nutch
fetch to print a log of the pages being downloaded to the command
terminal? I have been using 0.7.2 up until now; in that version the
fetch command outputs errors and the names of
2006/10/29, Cristina Belderrain [EMAIL PROTECTED]:
Hi Tomi,
please take a look at the following tutorial:
http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html
Apparently, Nutch's search application already shows hit summaries...
Anyway, you can always retrieve each
Is there a way to have nutch return some hit context (a la google) to
better identify the hit?
For example, if I search for nutch, a link pointing to
http://lucene.apache.org/nutch/ would be followed by the following
context:
This is the first *Nutch* release as an Apache Lucene sub-project.
...
2006/10/23, Andrzej Bialecki [EMAIL PROTECTED]:
Tomi NA wrote:
2006/10/18, [EMAIL PROTECTED] [EMAIL PROTECTED]:
Btw, we have some virtual local hosts; how does
db.ignore.external.links
deal with that?
Update:
setting db.ignore.external.links to true in nutch-site (and later also
2006/10/18, [EMAIL PROTECTED] [EMAIL PROTECTED]:
Btw, we have some virtual local hosts; how does db.ignore.external.links
deal with that?
Update:
setting db.ignore.external.links to true in nutch-site (and later also
in nutch-default as a sanity check) *doesn't work*: I feed the crawl
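For reference, here is the standard shape of the property the two messages
above are wrestling with, as it would go in conf/nutch-site.xml (a sketch;
the description paraphrases the property's documentation):
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
  <description>If true, outlinks that point to a different host
  than the page they were found on are ignored.</description>
</property>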
2006/10/14, Tomi NA [EMAIL PROTECTED]:
2006/10/14, Toufeeq Hussain [EMAIL PROTECTED]:
From internal tests with ntlmaps + Nutch, the conclusion we came to was
that though it kinda works, it puts a huge load on the Nutch server,
as ntlmaps is a major memory hog, and the mixture of the two leads
2006/10/18, Frederic Goudal [EMAIL PROTECTED]:
Hello,
I'm beginning to play with nutch to index our own web site.
I have done a first crawl and I have tried the recrawl script.
While fetching I have lines like this:
fetching http://www.yourdictionary.com/grammars.html
fetching
2006/10/14, Toufeeq Hussain [EMAIL PROTECTED]:
From internal tests with ntlmaps + Nutch, the conclusion we came to was
that though it kinda works, it puts a huge load on the Nutch server,
as ntlmaps is a major memory hog, and the mixture of the two leads to
performance issues. For a PoC this will
2006/10/13, Guruprasad Iyer [EMAIL PROTECTED]:
Hi Tomi,
using a ntlmaps proxy
How do I get this proxy?
You tell nutch to use the proxy and you provide the proxy with adequate
access privileges.
How do I do this? Can you elaborate?
I am a new Nutch user and am very much in the learning phase.
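For the record, pointing nutch at an HTTP proxy such as ntlmaps comes down
to two standard properties in conf/nutch-site.xml. A sketch follows; the
host and port values are only examples and must match wherever your
ntlmaps instance actually listens:
<property>
  <name>http.proxy.host</name>
  <value>localhost</value>
</property>
<property>
  <name>http.proxy.port</name>
  <value>5865</value>
</property>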
2006/10/10, Cristina Belderrain [EMAIL PROTECTED]:
On 10/9/06, Tomi NA [EMAIL PROTECTED] wrote:
This is *exactly* what I was thinking. Like Stefan, I believe the
nutch analyzer is a good foundation and should therefore be extended
to support the OR operator, and possibly additional
2006/10/8, Stefan Neufeind [EMAIL PROTECTED]:
Even if it's not the full feature set, maybe most people could live with it.
But basic boolean queries, I think, were the root of this topic. Is there
an easier way to allow this in Nutch as well, instead of throwing quite
a bit away and using the
On 9/27/06, Sami Siren [EMAIL PROTECTED] wrote:
The Nutch Project is pleased to announce the availability of the 0.8.1 release
of Nutch, the open source web-search software based on Lucene and Hadoop.
The release is immediately available for download from:
http://lucene.apache.org/nutch/release/
On 9/26/06, Jim Wilson [EMAIL PROTECTED] wrote:
I'd do it, but I'm too busy being consumed with worries about the lack of
support for HTTP/NTLM credentials and SMB fileshare indexing.
Arrrgg - tis another sad day in the life of this pirate.
We seem to share the same problems...they haven't
On 9/25/06, Jim Wilson [EMAIL PROTECTED] wrote:
flamebait
You can get it working on Windows if you're willing to work for it. To use
Nutch OOTB, you have to install Cygwin since the provided Nutch launcher is
written in Bash.
Members of the community have provided alternatives, such as this
On 9/22/06, Trym B. Asserson [EMAIL PROTECTED] wrote:
Any other suggestions? Tomi, you said you'd had difficulties too with
certain MS documents, did you manage to find a work-around or did you
just have to ignore these documents? So far we've only concentrated on
using the plugins in Nutch 0.8
On 9/22/06, Andrzej Bialecki [EMAIL PROTECTED] wrote:
You are not the first one to consider using OO.org for Word conversion.
However, this solution brings with it a large dependency (ca 250MB
installed), which requires proper installation; and also the UNO
interface is reported to be
On 9/21/06, Andrzej Bialecki [EMAIL PROTECTED] wrote:
Benjamin Higgins wrote:
How can I instruct Nutch to refetch specific files and then update the
index
entries for those files?
I am indexing files on a fileserver and I am able to produce a report of
changed files about every 30 minutes.
On 9/21/06, Jim Wilson [EMAIL PROTECTED] wrote:
I haven't had this particular problem, but here's something to consider:
After you remove the TextBox objects you have to re-save the document. Is
the new document the same version as the previous one? By this I mean, the
same Word version (97,
On 9/21/06, Jacob Brunson [EMAIL PROTECTED] wrote:
On 9/21/06, Gianni Parini [EMAIL PROTECTED] wrote:
- Is it possible to have automatic recrawling? Do I have to write
my own application myself? I need an application running in the
background that re-crawls my intranet site 2-3 times
On 9/20/06, Benjamin Higgins [EMAIL PROTECTED] wrote:
In Nutch 0.7, I wanted to change Nutch's behavior such that when I inject a
file it will add the page, even if it is already present.
I did this because I can prepare a list of changed files that I have on my
intranet and want Nutch to
On 9/20/06, Tomi NA [EMAIL PROTECTED] wrote:
On 9/20/06, Benjamin Higgins [EMAIL PROTECTED] wrote:
In Nutch 0.7, I wanted to change Nutch's behavior such that when I inject a
file it will add the page, even if it is already present.
I did this because I can prepare a list of changed files
On 9/18/06, NG-Marketing, M.Schneider [EMAIL PROTECTED] wrote:
I figured it out. I used the following config property in my
nutch-site.xml:
<name>searcher.max.hits</name>
<value>2048</value>
If I change the value to nothing, it all works fine. It took me a couple
of hours to figure it
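For reference, the complete entry takes this shape in nutch-site.xml
(a sketch; per the message above, the 2048 cap triggered the error, and
clearing the value, i.e. an empty <value></value>, made searches work):
<property>
  <name>searcher.max.hits</name>
  <value>2048</value>
</property>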
On 9/16/06, Tomi NA [EMAIL PROTECTED] wrote:
On 9/15/06, Tomi NA [EMAIL PROTECTED] wrote:
On 9/14/06, Zaheed Haque [EMAIL PROTECTED] wrote:
That's the way I set it up at first.
This time, I started with a blank slate, unpacked nutch and tomcat,
unpacked nutch-0.8.war into the webapps
On 9/18/06, Zaheed Haque [EMAIL PROTECTED] wrote:
Hi:
I have just checked your flash movie. A quick observation: you are
running tomcat 4.1.31, and there is nothing you are doing that seems
wrong. Anyway, after starting the servers can you search using the
following command
following command
bin/nutch
On 9/17/06, NG-Marketing, Matthias Schneider [EMAIL PROTECTED] wrote:
Hello List,
I installed nutch 0.8 and I can fetch and index documents, but I cannot
search them. I get the following error:
StandardWrapperValve[jsp]: Servlet.service() for servlet jsp threw exception
On 9/5/06, Zaheed Haque [EMAIL PROTECTED] wrote:
Hi:
I have a problem or two with the described procedure...
Assuming you have
index 1 at /data/crawl1
index 2 at /data/crawl2
Used ./bin/nutch crawl urls -dir /home/myhome/crawls/mycrawldir to
generate an index: luke says the index is valid
On 9/14/06, Zaheed Haque [EMAIL PROTECTED] wrote:
On 9/14/06, Tomi NA [EMAIL PROTECTED] wrote:
On 9/5/06, Zaheed Haque [EMAIL PROTECTED] wrote:
Hi:
I have a problem or two with the described procedure...
Assuming you have
index 1 at /data/crawl1
index 2 at /data/crawl2
Used ./bin
On 9/14/06, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:
Everyone, thanks for the help with this. I hope to return the
assistance, once I am more familiar with 0.8. I am using tail -f now to
monitor my test crawls. It also looks like you can use
conf/hadoop-env.sh to redirect log file output to
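For reference, the hadoop-env.sh approach mentioned above usually amounts
to a single line in conf/hadoop-env.sh (a sketch; HADOOP_LOG_DIR is the
standard variable, and the path shown is just an example):
export HADOOP_LOG_DIR=/var/log/nutch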
On 9/13/06, wmelo [EMAIL PROTECTED] wrote:
I have the same original question. I know that the log shows this
information, but how do I see things happening in real time, like in
nutch 0.7.2 when you use the crawl command in the terminal?
try something like this (assuming you know what's good for
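The reply above is cut off; going by the tail -f mentioned in the previous
message, watching a 0.8 crawl in real time is typically something like the
following, assuming the default 0.8 log location:
tail -f logs/hadoop.log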
On 9/8/06, Jim Wilson [EMAIL PROTECTED] wrote:
Dear Nutch User List,
I am desperately trying to index an Intranet with the following
characteristics
1) Some sites require no authentication - these already work great!
2) Some sites require basic HTTP Authentication.
3) Some sites require NTLM
On 9/9/06, victor_emailbox [EMAIL PROTECTED] wrote:
Hi all,
I spent a lot of time figuring out why Nutch didn't respond to my
configuration in nutch-site.xml. I set db.ignore.external.links to true.
It didn't work. Then I realized that nutch-default.xml also has the same
On 9/7/06, David Wallace [EMAIL PROTECTED] wrote:
Just guessing, but could this be caused by session ids in the URL? Or
some other unimportant piece of data? If this is the case, then every
page would be added to the index when it's crawled, regardless of
whether it's already in there, with a
On 9/8/06, Andrzej Bialecki [EMAIL PROTECTED] wrote:
(moved to nutch-user)
Tomi NA wrote:
On 9/7/06, Andrzej Bialecki [EMAIL PROTECTED] wrote:
Tomi NA wrote:
On 9/7/06, Nick Burch [EMAIL PROTECTED] wrote:
On Thu, 7 Sep 2006, Tomi NA wrote:
On 9/7/06, Venkateshprasanna [EMAIL PROTECTED
On 9/8/06, Andrzej Bialecki [EMAIL PROTECTED] wrote:
Tomi NA wrote:
On 9/7/06, David Wallace [EMAIL PROTECTED] wrote:
Just guessing, but could this be caused by session ids in the URL? Or
some other unimportant piece of data? If this is the case, then every
page would be added to the index
I'd like the user to be able to find my three dogs.jpg if he
searches for three dogs, even though nutch doesn't have a .jpg
parser. What's more, I'd like the user to be able to search against any
other extrinsic file attribute: date, file size, even mime type, all
without reading a single bit of
On 9/6/06, Andrei Hajdukewycz [EMAIL PROTECTED] wrote:
Another problem I've noticed is that it seems the db grows *rapidly* with each
successive recrawl. Mine started at 379MB, and it seems to increase by roughly
350MB every time I run a recrawl, despite there not being anywhere near that
On 9/7/06, heack [EMAIL PROTECTED] wrote:
I have the same problem as you. I wish there were a way to store a
description for .mp3, .wmv or .avi files, one that could be searched.
I believe the problem can't be solved by adding a new parse plugin to
parse all other (binary) filetypes: this
On 9/5/06, Zaheed Haque [EMAIL PROTECTED] wrote:
Hi:
In the text file you will have the following
hostname1 portnumber
hostname2 portnumber
example
localhost 1234
localhost 5678
Does this work with nutch 0.7.2 or is it specific to the 0.8 release?
t.n.a.
On 9/6/06, Zaheed Haque [EMAIL PROTECTED] wrote:
On 9/6/06, Tomi NA [EMAIL PROTECTED] wrote:
On 9/5/06, Zaheed Haque [EMAIL PROTECTED] wrote:
Hi:
In the text file you will have the following
hostname1 portnumber
hostname2 portnumber
example
localhost 1234
localhost 5678
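For context, this host/port list is the distributed-search configuration:
it goes in a file (commonly search-servers.txt) inside the directory that
the searcher.dir property points to, and each listed server is started on
its node with something like the following (a sketch reusing the
/data/crawl1 and /data/crawl2 index locations quoted earlier; the ports
must match the file):
bin/nutch server 1234 /data/crawl1
bin/nutch server 5678 /data/crawl2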
The task
---
I have less than 100GB of diverse documents (.doc, .pdf, .ppt, .txt,
.xls, etc.) to index. Dozens, or even hundreds or thousands, of
documents can change their content, be created or deleted every day.
The crawler will run on an HP DL380 G4 server - I don't know the exact
specs
On 9/3/06, Sidney [EMAIL PROTECTED] wrote:
Does nutch index images? If not, or even if so, how can I go about creating a
separate search category for searching for images, like the major search
engines have? If anyone can give any information on this I would be very
grateful.
You could go format
On 9/1/06, Frank Huang [EMAIL PROTECTED] wrote:
But when I execute ./nutch crawl, some messages show up like: fetch okay,
but can't parse http://(omit...).pdf, reason: failed (omit...) content
truncated at 70709 bytes. Parser can't handle incomplete pdf file.
Haven't had time to go through the
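A "content truncated at N bytes" message like the one above usually traces
back to the fetcher's download size cap; a sketch of raising it in
nutch-site.xml (http.content.limit defaults to 65536 in nutch-default.xml,
and a negative value disables truncation entirely):
<property>
  <name>http.content.limit</name>
  <value>-1</value>
</property>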
On 8/30/06, Chris Mattmann [EMAIL PROTECTED] wrote:
Hi there Tomi,
On 8/30/06 12:25 PM, Tomi NA [EMAIL PROTECTED] wrote:
I'm attempting to crawl a single samba mounted share. During testing,
I'm crawling like this:
./bin/nutch crawl urls -dir crawldir4 -depth 2 -topN 20
I'm using luke
I'm attempting to crawl a single samba mounted share. During testing,
I'm crawling like this:
./bin/nutch crawl urls -dir crawldir4 -depth 2 -topN 20
I'm using luke 0.6 to query and analyze the index.
PROBLEMS
1.) search by file type doesn't work
I expected that a search for type:pdf would
I'm interested in crawling multiple shared folders (among other
things) on a corporate LAN.
It is a LAN of MS clients with Active Directory managed accounts.
The users routinely access the files based on ntfs-level (and
sharing?) permissions.
Ideally, I'd like to set up a central server
On 8/8/06, Björn Wilmsmann [EMAIL PROTECTED] wrote:
Hey,
I have run into the same problem, too. Sometimes nutch won't return
results for queries although there clearly are pages containing the
search term. I agree that this must have something to
On 7/29/06, Tomi NA [EMAIL PROTECTED] wrote:
On 7/29/06, Sami Siren [EMAIL PROTECTED] wrote:
I'm no expert in this area, but perhaps you need to upgrade the lucene .jar
files that are used by luke?
I believe I was a little bit hasty with the message I sent. I took a
second look and it just might
I am trying to crawl/index a shared folder in the office LAN: that
means a lot of .zip files, a lot of big .pdfs (5 MB) etc.
I sacrificed performance for memory effectiveness where I found the
tradeoff (indexer.mergeFactor = 5, indexer.minMergeDocs = 5), but
the crawl process breaks if I set
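For reference, that tradeoff is expressed as two properties in
nutch-site.xml; a sketch using the values from the message above (lower
values keep memory use down during indexing at the cost of more frequent,
smaller segment merges):
<property>
  <name>indexer.mergeFactor</name>
  <value>5</value>
</property>
<property>
  <name>indexer.minMergeDocs</name>
  <value>5</value>
</property>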
I successfully used luke with indexes created with nutch 0.7.2.
I tried the same with nutch 0.8, but luke sees it as a corrupt index.
Should this be happening?
I know this isn't the luke mailing list, but the information will
still be useful to people using nutch.
Thanks,
t.n.a.
On 7/29/06, Sami Siren [EMAIL PROTECTED] wrote:
I'm no expert in this area, but perhaps you need to upgrade the lucene .jar
files that are used by luke?
I believe I was a little bit hasty with the message I sent. I took a
second look and it just might be that luke was right and the index is
invalid -
see what I come up with using 0.8 as I need the .xls and .zip
support, anyway.
t.n.a.
On 7/20/06, Teruhiko Kurosaka [EMAIL PROTECTED] wrote:
You'd have to enable index-more and query-more plugins, I believe.
These kinds of queries return no results:
date:19980101-20061231
type:pdf
type:application/pdf
From the release changes documents (0.7-0.7.2), I assumed these would work.
Upon index inspection (using the luke tool), I see there are no fields
marked date or type (although I gather this is
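For anyone chasing the same missing fields: the index-more/query-more
advice above translates to extending plugin.includes in nutch-site.xml.
A sketch follows; the regex is an assumption modeled on the default plugin
list with the "more" plugins added, not a verified copy of any release's
default:
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html|pdf)|index-(basic|more)|query-(basic|site|url|more)</value>
</property>
The date and type fields only appear in indexes built after the change, so
existing indexes need a re-index.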