ntlm - options overview

2006-11-25 Thread Tomi NA
I came across an interesting overview of NTLM authentication possibilities at http://www.oaklandsoftware.com/papers/ntlm.html. I thought I'd just mention it here in case anyone who knows how Nutch authentication works under the hood has anything to say about the listed options. The solution

Re: depth limitation

2006-11-16 Thread Tomi NA
2006/11/16, [EMAIL PROTECTED] [EMAIL PROTECTED]: I have added depth limitation for version 0.7.2. If it is interesting to someone, I can contribute it. I am using depth limitation in 0.8.1, but am looking to 0.7.2 as the next version I work with, so I'm very interested. t.n.a.

Re: Strategic Direction of Nutch

2006-11-13 Thread Tomi NA
2006/11/13, carmmello [EMAIL PROTECTED]: Hi, Nutch, from version 0.8 is, really, very, very slow, using a single machine, to process data, after the crawling. Compared with Nutch 0.7.2 I would say, ... this series. I don't believe that there are many Nutch users, in the real world of

Re: .7x - .8x

2006-11-03 Thread Tomi NA
2006/11/3, Josef Novak [EMAIL PROTECTED]: Hi, Very short question (hopefully). Is it possible to get bin/nutch fetch to print a log of the pages being downloaded to the command terminal? I have been using 0.7.2 up until now; in that version the fetch command outputs errors and the names of

Re: returning a description of a returned document

2006-10-29 Thread Tomi NA
2006/10/29, Cristina Belderrain [EMAIL PROTECTED]: Hi Tomi, please take a look at the following tutorial: http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html Apparently, Nutch's search application already shows hit summaries... Anyway, you can always retrieve each

returning a description of a returned document

2006-10-28 Thread Tomi NA
Is there a way to have nutch return some hit context (a la google) to better identify the hit? For example, if I search for nutch, a link pointing to http://lucene.apache.org/nutch/ would be followed by the following context: This is the first *Nutch* release as an Apache Lucene sub-project. ...

Re: Fetching outside the domain ?

2006-10-25 Thread Tomi NA
2006/10/23, Andrzej Bialecki [EMAIL PROTECTED]: Tomi NA wrote: 2006/10/18, [EMAIL PROTECTED] [EMAIL PROTECTED]: Btw we have some virtual local hosts, how does the db.ignore.external.links deal with that? Update: setting db.ignore.external.links to true in nutch-site (and later also

Re: Fetching outside the domain ?

2006-10-23 Thread Tomi NA
2006/10/18, [EMAIL PROTECTED] [EMAIL PROTECTED]: Btw we have some virtual local hosts, how does the db.ignore.external.links deal with that? Update: setting db.ignore.external.links to true in nutch-site (and later also in nutch-default as a sanity check) *doesn't work*: I feed the crawl
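For readers trying to reproduce this, the override under discussion goes in conf/nutch-site.xml. A minimal fragment (property name and semantics as documented in nutch-default.xml) would look like this:

```xml
<!-- conf/nutch-site.xml: ignore outlinks that lead to a different host.
     Note: as reported in this thread, virtual local hosts may defeat it. -->
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
</property>
```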

Re: crawling sites which require authentication

2006-10-23 Thread Tomi NA
2006/10/14, Tomi NA [EMAIL PROTECTED]: 2006/10/14, Toufeeq Hussain [EMAIL PROTECTED]: From internal tests with ntlmaps + Nutch the conclusion we came to was that though it kinda-works it puts a huge load on the Nutch server as ntlmaps is a major memory-hog and the mixture of the two leads

Re: Fetching outside the domain ?

2006-10-18 Thread Tomi NA
2006/10/18, Frederic Goudal [EMAIL PROTECTED]: Hello, I'm beginning to play with nutch to index our own web site. I have done a first crawl and I have tried the recrawl script. While fetching I have lines like that: fetching http://www.yourdictionary.com/grammars.html fetching

Re: crawling sites which require authentication

2006-10-14 Thread Tomi NA
2006/10/14, Toufeeq Hussain [EMAIL PROTECTED]: From internal tests with ntlmaps + Nutch the conclusion we came to was that though it kinda-works it puts a huge load on the Nutch server as ntlmaps is a major memory-hog and the mixture of the two leads to performance issues. For a PoC this will

Re: crawling sites which require authentication

2006-10-13 Thread Tomi NA
2006/10/13, Guruprasad Iyer [EMAIL PROTECTED]: Hi Tomi, using an ntlmaps proxy How do I get this proxy? You tell nutch to use the proxy and you provide the proxy with adequate access privileges. How do I do this? Can you elaborate? I am a new Nutch user and am very much in the learning phase.
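Telling Nutch to use the proxy means setting the standard proxy properties in conf/nutch-site.xml. A sketch, assuming ntlmaps is running locally on its default port (5865 — adjust host and port to match your ntlmaps configuration):

```xml
<!-- conf/nutch-site.xml: route HTTP fetches through a local ntlmaps proxy.
     Host and port here are assumptions; match them to your ntlmaps setup. -->
<property>
  <name>http.proxy.host</name>
  <value>localhost</value>
</property>
<property>
  <name>http.proxy.port</name>
  <value>5865</value>
</property>
```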

Re: Lucene query support in Nutch

2006-10-10 Thread Tomi NA
2006/10/10, Cristina Belderrain [EMAIL PROTECTED]: On 10/9/06, Tomi NA [EMAIL PROTECTED] wrote: This is *exactly* what I was thinking. Like Stefan, I believe the nutch analyzer is a good foundation and should therefore be extended to support the or operator, and possibly additional

Re: Lucene query support in Nutch

2006-10-09 Thread Tomi NA
2006/10/8, Stefan Neufeind [EMAIL PROTECTED]: if it's not the full feature-set, maybe most people could live with it. But basic boolean queries I think were the root for this topic. Is there an easier way to allow this in Nutch as well instead of throwing quite a bit away and using the

Re: [ANNOUNCE] Nutch 0.8.1 available

2006-09-27 Thread Tomi NA
On 9/27/06, Sami Siren [EMAIL PROTECTED] wrote: Nutch Project is pleased to announce the availability of 0.8.1 release of Nutch - the open source web-search software based on lucene and hadoop. The release is immediately available for download from: http://lucene.apache.org/nutch/release/

Re: Which Operating-System do you use for Nutch

2006-09-27 Thread Tomi NA
On 9/26/06, Jim Wilson [EMAIL PROTECTED] wrote: I'd do it, but I'm too busy being consumed with worries about the lack of support for HTTP/NTLM credentials and SMB fileshare indexing. Arrrgg - tis another sad day in the life of this pirate. We seem to share the same problems...they haven't

Re: Which Operating-System do you use for Nutch

2006-09-26 Thread Tomi NA
On 9/25/06, Jim Wilson [EMAIL PROTECTED] wrote: flamebait You can get it working on Windows if you're willing to work for it. To use Nutch OOTB, you have to install Cygwin since the provided Nutch launcher is written in Bash. Members of the community have provided alternatives, such as this

Re: Nutch 0.8 - MS Word document parse failure : Can't be handled as micrsosoft document. java.util.NoSuchElementException

2006-09-22 Thread Tomi NA
On 9/22/06, Trym B. Asserson [EMAIL PROTECTED] wrote: Any other suggestions? Tomi, you said you'd had difficulties too with certain MS documents, did you manage to find a work-around or did you just have to ignore these documents? So far we've only concentrated on using the plugins in Nutch 0.8

Re: Nutch 0.8 - MS Word document parse failure : Can't be handled as micrsosoft document. java.util.NoSuchElementException

2006-09-22 Thread Tomi NA
On 9/22/06, Andrzej Bialecki [EMAIL PROTECTED] wrote: You are not the first one to consider using OO.org for Word conversion. However, this solution brings with it a large dependency (ca 250MB installed), which requires proper installation; and also the UNO interface is reported to be

Re: Forcing refetch and index of specified files

2006-09-22 Thread Tomi NA
On 9/21/06, Andrzej Bialecki [EMAIL PROTECTED] wrote: Benjamin Higgins wrote: How can I instruct Nutch to refetch specific files and then update the index entries for those files? I am indexing files on a fileserver and I am able to produce a report of changed files about every 30 minutes.

Re: Nutch 0.8 - MS Word document parse failure : Can't be handled as micrsosoft document. java.util.NoSuchElementException

2006-09-21 Thread Tomi NA
On 9/21/06, Jim Wilson [EMAIL PROTECTED] wrote: I haven't had this particular problem, but here's something to consider: After you remove the TextBox objects you have to re-save the document. Is the new document the same version as the previous one? By this I mean, the same Word version (97,

Re: Automatic crawling

2006-09-21 Thread Tomi NA
On 9/21/06, Jacob Brunson [EMAIL PROTECTED] wrote: On 9/21/06, Gianni Parini [EMAIL PROTECTED] wrote: - Is it possible to have automatic recrawling? Do I have to write my own application myself? I need an application running in the background that re-crawls my intranet site 2-3 times
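On a Unix host the usual answer to automatic recrawling is cron rather than a custom application. A hedged sketch — /opt/nutch and recrawl.sh are placeholders for your install path and whichever recrawl script you use (e.g. one of the community recrawl scripts for 0.8):

```shell
# Illustrative crontab entry: recrawl three times a day at 06:00, 12:00, 18:00.
# Paths and script name are assumptions, not part of the Nutch distribution.
0 6,12,18 * * * cd /opt/nutch && ./recrawl.sh /opt/nutch/crawl urls >> /var/log/nutch-recrawl.log 2>&1
```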

Re: Changing page injection behavior in Nutch 0.8

2006-09-20 Thread Tomi NA
On 9/20/06, Benjamin Higgins [EMAIL PROTECTED] wrote: In Nutch 0.7, I wanted to change Nutch's behavior such that when I inject a file it will add the page, even if it is already present. I did this because I can prepare a list of changed files that I have on my intranet and want Nutch to

Re: Changing page injection behavior in Nutch 0.8

2006-09-20 Thread Tomi NA
On 9/20/06, Tomi NA [EMAIL PROTECTED] wrote: On 9/20/06, Benjamin Higgins [EMAIL PROTECTED] wrote: In Nutch 0.7, I wanted to change Nutch's behavior such that when I inject a file it will add the page, even if it is already present. I did this because I can prepare a list of changed files

Re: java.lang.NullPointerException

2006-09-18 Thread Tomi NA
On 9/18/06, NG-Marketing, M.Schneider [EMAIL PROTECTED] wrote: I figured it out. I used in my nutch-site.xml the following config property: <name>searcher.max.hits</name> <value>2048</value>. If I change the value to nothing it all works fine. It took me a couple of hours to figure it
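For context, the property in question (with its XML restored — the archive stripped the tags) is an override of nutch-default.xml, where the shipped default is -1, i.e. no cap on hits. The poster's problematic override looked like this:

```xml
<!-- conf/nutch-site.xml: a positive searcher.max.hits caps hits per search.
     The 2048 here is the poster's value; removing the override (or leaving
     the value empty, as the poster did) restores the default of -1 (no cap). -->
<property>
  <name>searcher.max.hits</name>
  <value>2048</value>
</property>
```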

Re: how to combine two run's result for search

2006-09-18 Thread Tomi NA
On 9/16/06, Tomi NA [EMAIL PROTECTED] wrote: On 9/15/06, Tomi NA [EMAIL PROTECTED] wrote: On 9/14/06, Zaheed Haque [EMAIL PROTECTED] wrote: Thats the way I set it up at first. This time, I started with a blank slate, unpacked nutch and tomcat, unpacked nutch-0.8.war into the webapps

Re: how to combine two run's result for search

2006-09-18 Thread Tomi NA
On 9/18/06, Zaheed Haque [EMAIL PROTECTED] wrote: Hi: I have just checked your flash movie.. quick observation you are running tomcat 4.1.31 and there is nothing you are doing that seems wrong. Anyway after starting the servers can you search using the following command bin/nutch

Re: java.lang.NullPointerException

2006-09-17 Thread Tomi NA
On 9/17/06, NG-Marketing, Matthias Schneider [EMAIL PROTECTED] wrote: Hello List, I installed nutch 0.8 and I can fetch and index documents, but I can not search them. I get the following error: StandardWrapperValve[jsp]: Servlet.service() for servlet jsp threw exception

Re: how to combine two run's result for search

2006-09-14 Thread Tomi NA
On 9/5/06, Zaheed Haque [EMAIL PROTECTED] wrote: Hi: I have a problem or two with the described procedure... Assuming you have index 1 at /data/crawl1 index 2 at /data/crawl2 Used ./bin/nutch crawl urls -dir /home/myhome/crawls/mycrawldir to generate an index: luke says the index is valid

Re: how to combine two run's result for search

2006-09-14 Thread Tomi NA
On 9/14/06, Zaheed Haque [EMAIL PROTECTED] wrote: On 9/14/06, Tomi NA [EMAIL PROTECTED] wrote: On 9/5/06, Zaheed Haque [EMAIL PROTECTED] wrote: Hi: I have a problem or two with the described procedure... Assuming you have index 1 at /data/crawl1 index 2 at /data/crawl2 Used ./bin

Re: 0.8 Intranet Crawl Output/Logging?

2006-09-14 Thread Tomi NA
On 9/14/06, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote: Everyone, thanks for the help with this. I hope to return the assistance, once I am more familiar with 0.8. I am using tail -f now to monitor my test crawls. It also looks like you can use conf/hadoop-env.sh to redirect log file output to

Re: 0.8 Intranet Crawl Output/Logging?

2006-09-13 Thread Tomi NA
On 9/13/06, wmelo [EMAIL PROTECTED] wrote: I have the same original doubt. I know that the log shows information, but how do you see things happening in real time, like in nutch 0.7.2 when you use the crawl command in the terminal? try something like this (assuming you know what's good for
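The 0.8 equivalent of 0.7's console output is following the crawl log as it is written. A runnable sketch of the idea — in a real 0.8 install the file to follow is logs/hadoop.log (path assumed from a default layout) using `tail -f`; demo.log stands in for it here:

```shell
# Simulate a growing crawl log; in practice you would run:
#   tail -f logs/hadoop.log
# in a second terminal while the crawl runs, to watch fetches in real time.
printf 'fetching http://intranet/a.html\nfetching http://intranet/b.html\n' > demo.log
tail -n 1 demo.log
```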

Re: Fetching past Authentication

2006-09-09 Thread Tomi NA
On 9/8/06, Jim Wilson [EMAIL PROTECTED] wrote: Dear Nutch User List, I am desperately trying to index an Intranet with the following characteristics: 1) Some sites require no authentication - these already work great! 2) Some sites require basic HTTP Authentication. 3) Some sites require NTLM

Re: Nutch-site.xml vs Nutch-default.xml

2006-09-09 Thread Tomi NA
On 9/9/06, victor_emailbox [EMAIL PROTECTED] wrote: Hi all, I spent a lot of time to figure out why Nutch didn't respond to my configuration in nutch-site.xml. I set db.ignore.external.links to true. It didn't work. Then I realized that Nutch-default.xml also has same

Re: Recrawling (Tomi NA)

2006-09-08 Thread Tomi NA
On 9/7/06, David Wallace [EMAIL PROTECTED] wrote: Just guessing, but could this be caused by session ids in the URL? Or some other unimportant piece of data? If this is the case, then every page would be added to the index when it's crawled, regardless of whether it's already in there, with a

Re: Indexing MS Powerpoint files with Lucene

2006-09-08 Thread Tomi NA
On 9/8/06, Andrzej Bialecki [EMAIL PROTECTED] wrote: (moved to nutch-user) Tomi NA wrote: On 9/7/06, Andrzej Bialecki [EMAIL PROTECTED] wrote: Tomi NA wrote: On 9/7/06, Nick Burch [EMAIL PROTECTED] wrote: On Thu, 7 Sep 2006, Tomi NA wrote: On 9/7/06, Venkateshprasanna [EMAIL PROTECTED

Re: Recrawling (Tomi NA)

2006-09-08 Thread Tomi NA
On 9/8/06, Andrzej Bialecki [EMAIL PROTECTED] wrote: Tomi NA wrote: On 9/7/06, David Wallace [EMAIL PROTECTED] wrote: Just guessing, but could this be caused by session ids in the URL? Or some other unimportant piece of data? If this is the case, then every page would be added to the index

parse url and file attributes only - no content

2006-09-07 Thread Tomi NA
I'd like the user to be able to find my "three dogs.jpg" if he searches for three dogs, even though nutch doesn't have a .jpg parser. What's more, I'd like the user to be able to search against any other extrinsic file attribute: date, file size, even mime type, all without reading a single bit of

Re: Recrawling

2006-09-07 Thread Tomi NA
On 9/6/06, Andrei Hajdukewycz [EMAIL PROTECTED] wrote: Another problem I've noticed is that it seems the db grows *rapidly* with each successive recrawl. Mine started at 379MB, and it seems to increase by roughly 350MB every time I run a recrawl, despite there not being anywhere near that

Re: parse url and file attributes only - no content

2006-09-07 Thread Tomi NA
On 9/7/06, heack [EMAIL PROTECTED] wrote: I have the same problem as you. I think there should be a way to store a description for .mp3, .wmv or .avi files so that it could be searched. I believe the problem can't be solved by adding a new parse plugin to parse all other (binary) filetypes: this

Re: how to combine two run's result for search

2006-09-06 Thread Tomi NA
On 9/5/06, Zaheed Haque [EMAIL PROTECTED] wrote: Hi: In the text file you will have the following hostname1 portnumber hostname2 portnumber example localhost 1234 localhost 5678 Does this work with nutch 0.7.2 or is it specific to the 0.8 release? t.n.a.
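The text file being described is the distributed-search server list, conventionally named search-servers.txt and placed in the directory pointed to by the searcher.dir property. Each line names one running search server (a `bin/nutch server <port> <crawldir>` instance):

```
# search-servers.txt: one "host port" pair per line
localhost 1234
localhost 5678
```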

Re: how to combine two run's result for search

2006-09-06 Thread Tomi NA
On 9/6/06, Zaheed Haque [EMAIL PROTECTED] wrote: On 9/6/06, Tomi NA [EMAIL PROTECTED] wrote: On 9/5/06, Zaheed Haque [EMAIL PROTECTED] wrote: Hi: In the text file you will have the following hostname1 portnumber hostname2 portnumber example localhost 1234 localhost 5678

crawling frequently changing data on an intranet - how?

2006-09-05 Thread Tomi NA
The task --- I have less than 100GB of diverse documents (.doc, .pdf, .ppt, .txt, .xls, etc.) to index. Dozens, or even hundreds and thousands of documents can change their content, be created or deleted every day. The crawler will run on a HP DL380 G4 server - don't know the exact specs

Re: Does Nutch index images?

2006-09-03 Thread Tomi NA
On 9/3/06, Sidney [EMAIL PROTECTED] wrote: Does nutch index images? If not or/and if so how can I go about creating a separate search category for searching for images like the major search engines have? If anyone can give any information on this I would be very grateful. You could go format

Re: Could anyone teache me how to index the title or content of PDF?

2006-09-02 Thread Tomi NA
On 9/1/06, Frank Huang [EMAIL PROTECTED] wrote: But when I execute ./nutch crawl there are messages like: fetch okay, but can't parse http://(omit...).pdf reason: failed omit.. content truncated at 70709 bytes. Parse can't handle incomplete pdf file. Haven't had time to go through the
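The "content truncated ... incomplete pdf file" failure usually traces back to the fetcher's download cap, http.content.limit (default 65536 bytes in nutch-default.xml; file.content.limit plays the same role for file: URLs). Raising or disabling it in conf/nutch-site.xml is the usual first step:

```xml
<!-- conf/nutch-site.xml: -1 disables the length limit so large PDFs are
     fetched whole and can be parsed; any nonnegative value truncates. -->
<property>
  <name>http.content.limit</name>
  <value>-1</value>
</property>
```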

Re: intranet crawl problems: mime types; .doc-related exceptions; really, really slow crawl + possible infinite loop

2006-08-31 Thread Tomi NA
On 8/30/06, Chris Mattmann [EMAIL PROTECTED] wrote: Hi there Tomi, On 8/30/06 12:25 PM, Tomi NA [EMAIL PROTECTED] wrote: I'm attempting to crawl a single samba mounted share. During testing, I'm crawling like this: ./bin/nutch crawl urls -dir crawldir4 -depth 2 -topN 20 I'm using luke

intranet crawl problems: mime types; .doc-related exceptions; really, really slow crawl + possible infinite loop

2006-08-30 Thread Tomi NA
I'm attempting to crawl a single samba mounted share. During testing, I'm crawling like this: ./bin/nutch crawl urls -dir crawldir4 -depth 2 -topN 20 I'm using luke 0.6 to query and analyze the index. PROBLEMS: 1.) search by file type doesn't work. I expected that a file type search, type:pdf, would

file access rights/permissions considerations - the least painful way

2006-08-10 Thread Tomi NA
I'm interested in crawling multiple shared folders (among other things) on a corporate LAN. It is a LAN of MS clients with Active Directory managed accounts. The users routinely access the files based on ntfs-level (and sharing?) permissions. Ideally, I'd like to set up a central server

Re: How do I write a nutch query.

2006-08-08 Thread Tomi NA
On 8/8/06, Björn Wilmsmann [EMAIL PROTECTED] wrote: Hey, I have run into the same problem, too. Sometimes nutch won't return results for queries although there clearly are pages containing the search term. I agree that this must have something to

Re: nutch 0.8 and luke

2006-07-31 Thread Tomi NA
On 7/29/06, Tomi NA [EMAIL PROTECTED] wrote: On 7/29/06, Sami Siren [EMAIL PROTECTED] wrote: Not expert on this area but perhaps you need to upgrade lucene .jar files that are used by luke? I believe I was a little bit hasty with the message I sent. I took a second look and it just might

max file size vs. available RAM size: crawl uses up all available memory

2006-07-31 Thread Tomi NA
I am trying to crawl/index a shared folder in the office LAN: that means a lot of .zip files, a lot of big .pdfs (5 MB) etc. I sacrificed performance for memory effectiveness where I found the tradeoff (indexer.mergeFactor = 5, indexer.minMergeDocs = 5), but the crawl process breaks if I set
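The memory-for-speed trade-off described above is made in conf/nutch-site.xml; the two properties, with the values the poster chose (lower values mean more frequent, smaller merges and a smaller memory footprint at the cost of indexing speed):

```xml
<!-- conf/nutch-site.xml: favor low memory use over indexing throughput -->
<property>
  <name>indexer.mergeFactor</name>
  <value>5</value>
</property>
<property>
  <name>indexer.minMergeDocs</name>
  <value>5</value>
</property>
```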

nutch 0.8 and luke

2006-07-29 Thread Tomi NA
I successfully used luke with indexes created with nutch 0.7.2. I tried the same with nutch 0.8, but luke sees it as a corrupt index. Should this be happening? I know this isn't the luke mailing list, but the information will still be useful to people using nutch. Thanks, t.n.a.

Re: nutch 0.8 and luke

2006-07-29 Thread Tomi NA
On 7/29/06, Sami Siren [EMAIL PROTECTED] wrote: Not expert on this area but perhaps you need to upgrade lucene .jar files that are used by luke? I believe I was a little bit hasty with the message I sent. I took a second look and it just might be that luke was right and the index is invalid -

Re: missing, but declared functionality

2006-07-28 Thread Tomi NA
see what I come up with using 0.8 as I need the .xls and .zip support, anyway. t.n.a. On 7/20/06, Teruhiko Kurosaka [EMAIL PROTECTED] wrote: You'd have to enable index-more and query-more plugins, I believe. -Original Message- From: Tomi NA [mailto:[EMAIL PROTECTED] Sent: 2006-7-19 10

missing, but declared functionality

2006-07-19 Thread Tomi NA
These kinds of queries return no results: date:19980101-20061231 type:pdf type:application/pdf From the release changes documents (0.7-0.7.2), I assumed these would work. Upon index inspection (using the luke tool), I see there are no fields marked date or type (although I gather this is
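As the follow-up in this thread notes, the date and type fields only appear in the index when the index-more and query-more plugins are enabled via the plugin.includes override in conf/nutch-site.xml. A sketch — the base plugin list here mirrors a typical 0.7/0.8 default and may differ from the one in your copy of nutch-default.xml, so extend your own list rather than copying this verbatim:

```xml
<!-- conf/nutch-site.xml: add the "more" plugin pair to the default list
     so date: and type: are indexed and queryable (base list is assumed) -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html|pdf)|index-(basic|more)|query-(basic|site|url|more)</value>
</property>
```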