Re: Code to be modified

2008-03-31 Thread Martin Kuen
neetg/SPD38/share/doc/ > -. > > as I want to crawl /hm/vineetg/SPD38/libraries/ and > /hm/vineetg/SPD38/share/doc/ directories. > But it is still crawling the parent directories and generating errors too. > > Did I configure the regex-urlfilter.txt file correctly? > > Vineet
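The rule list quoted above can be checked offline. The following is only a sketch of how a first-match-wins filter in the style of regex-urlfilter.txt would treat the two directories from this thread. The "file:" URL prefix, the helper names and the "reject when nothing matches" default are assumptions for illustration, not Nutch's actual code.

```python
import re

# Rules in the style of regex-urlfilter.txt: a sign ("+" or "-") followed by a regex.
# The two directories come from the thread; the "file:" prefix is an assumption.
RULES = """
+^file:/hm/vineetg/SPD38/libraries/
+^file:/hm/vineetg/SPD38/share/doc/
-.
"""

def parse_rules(text):
    """Turn rule lines into (allow, compiled_regex) pairs, skipping blanks and comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules

def accepts(url, rules):
    """Return the sign of the first rule whose regex is found in the URL."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False  # assumed default: reject URLs that match no rule

rules = parse_rules(RULES)
for url in ("file:/hm/vineetg/SPD38/libraries/foo.jar",
            "file:/hm/vineetg/SPD38/share/doc/index.html",
            "file:/hm/vineetg/SPD38/"):
    print(url, accepts(url, rules))
```

With rules like these the parent directories fall through to the final "-." line, so they would be rejected as long as no broader "+" rule appears earlier in the file.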

Re: Code to be modified

2008-03-28 Thread Martin Kuen
Hi, I know that this advice can be found in some places on the internet. However, it's not true that you have to modify code to achieve it. See the FAQ in the Nutch wiki: Nutch crawling parent directories for file protocol -> misconfigured URLFilters

Re: Urgent help reqd.....plz

2008-02-05 Thread Martin Kuen
Hi, I assume that you are probably running this program in Eclipse or some other IDE. However, you need to include the "path-to-nutch/conf" directory in your classpath. Otherwise the configuration files are not parsed/found on start-up. "plugins.folder" is a key from "nutch-default.xml" or " nutch

Re: New Installation - Problems - Error 500

2008-01-29 Thread Martin Kuen
rom apache and try it with that dist (and jdk-1.5). I cannot give you any advice on how to configure this for this kind of package. However, running tomcat using the apache dist is rather simple. Just unzip it and set JAVA_HOME and execute "bin/startup.sh". > > I know it'

Re: New Installation - Problems - Error 500

2008-01-29 Thread Martin Kuen
Hi, if you type "java -version" in your shell, the shell will output the java version you are using. I assume the output will refer to gcj, not to the sun-jdk. You should change your environment variables or create the necessary ones. Open a shell and in your tomcat installation's root directory

Re: Newbie Questions: http.max.delays, view fetched page, view link db

2008-01-29 Thread Martin Kuen
ndex is not touched. lucene index --> the inverted index Best Regards, Martin > > Martin Kuen wrote: > > > > Hi, > > > > On Jan 29, 2008 11:11 AM, Vinci <[EMAIL PROTECTED]> wrote: > > > >> > >> Hi, > >> > >> I a

Re: Newbie Questions: http.max.delays, view fetched page, view link db

2008-01-29 Thread Martin Kuen
Hi, On Jan 29, 2008 11:11 AM, Vinci <[EMAIL PROTECTED]> wrote: > > Hi, > > I am new to nutch and I am trying to run a nutch to fetch something from > specific websites. Currently I am running 0.9. > > As I have limited resources, I don't want nutch be too aggressive, so I > want > to set some del

Re: Help: parsing pdf files

2008-01-18 Thread Martin Kuen
Hi, well, honestly I cannot give you any advice on that. The native libraries are part of the hadoop distributed filesystem. I think you should ask this question on the hadoop users list, since this is a question regarding the hadoop dfs. Things "should" work without these native dependencies as w

Re: Help: parsing pdf files

2008-01-17 Thread Martin Kuen
> The length limit for downloaded content, in bytes. > If this value is nonnegative (>=0), content longer than it will be > truncated; > otherwise, no truncation at all. > > > > 2008/1/17, Krishnamohan Meduri <[EMAIL PROTECTED]>: > > Hi Martin, > >
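The quoted description reads like the one for the `http.content.limit` property in nutch-default.xml (that identification is an assumption; the snippet does not name the property). A minimal sketch of the semantics it describes, with a hypothetical helper name:

```python
def apply_content_limit(content: bytes, limit: int) -> bytes:
    """Mimic the described rule: truncate when limit >= 0, otherwise keep everything."""
    if limit >= 0:
        return content[:limit]
    return content

pdf = b"%PDF-1.4 " + b"x" * 100_000   # stand-in for a large downloaded PDF

print(len(apply_content_limit(pdf, 65536)))  # truncated to the limit
print(len(apply_content_limit(pdf, -1)))     # negative limit: no truncation
```

A PDF cut off in the middle is usually not parseable, which would explain why a limit that is fine for HTML pages causes trouble for larger documents.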

Re: Help: parsing pdf files

2008-01-16 Thread Martin Kuen
Hi, what comes to my mind is that there is a setting for the maximum size of a downloaded file. Have a look at "nutch-default.xml" and override it in "nutch-site.xml". PDF files tend to be quite big (compared to HTML), so this is probably the source of your problem. pdf files are downloaded and ma

Re: Some erros with Log4J configuration with Nutch 0.8.1

2008-01-09 Thread Martin Kuen
Hi, I don't know this problem in particular. However, the following pointers may be helpful: 1. Check the number of "log4j.properties" files on your classpath. (e.g. path-to-tomcat/commons/classes too) Log4j will only pick one config file 2. Where on your classpath you can find a log4j.jar
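The first pointer can be checked mechanically. This is a rough sketch, assuming the classpath is available as a single string in the platform's usual separator format; the function name is made up for illustration, and wildcard classpath entries are not handled.

```python
import os
import zipfile

def find_on_classpath(classpath, name="log4j.properties"):
    """Return the classpath entries (directories or jars) that contain `name` at their root."""
    hits = []
    for entry in classpath.split(os.pathsep):
        if os.path.isdir(entry):
            if os.path.isfile(os.path.join(entry, name)):
                hits.append(entry)
        elif zipfile.is_zipfile(entry):  # False for missing files and non-archives
            with zipfile.ZipFile(entry) as jar:
                if name in jar.namelist():
                    hits.append(entry)
    return hits

if __name__ == "__main__":
    # CLASSPATH is only one possible source; a servlet container builds its own.
    for entry in find_on_classpath(os.environ.get("CLASSPATH", "")):
        print(entry)
```

More than one hit means the effective logging configuration depends on classpath order, which is worth ruling out before looking for other causes.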

Re: Help me! got a problem when running nutch in eclipse

2008-01-08 Thread Martin Kuen
hi, can you provide some additional information please? What is in the log and on the commandline? (pathtonutch/log/hadoop.log) Probably you should try to adjust the logging settings (log4j.properties in the conf directory) Which plugins are you using (and which are loaded)? Does the problem remai

Re: Crawling techniques?

2008-01-07 Thread Martin Kuen
Hi Viksit, maybe you are looking for this thread: http://www.nabble.com/Re%3A-The-ranking-is-wrong-tf4360656.html#a12436465 Cheers, Martin PS: nutch-user is the correct option. nutch-agent is primarily for site-owners who want to report misbehaving nutch bots. On Jan 7, 2008 4:52 AM, Viksit Ga

Re: form-based authentication?

2008-01-05 Thread Martin Kuen
Hi, I On Jan 5, 2008 6:50 PM, <[EMAIL PROTECTED]> wrote: > Hi, > > I'm pretty sure the answer is negative, but I've got to ask - is support for > form-based authentication available somewhere within Nutch? > I believe Nutch does not support form-based auth, so the next question to ask > is - i

Re: Running the bin/nutch crawl command with Cygwin

2007-12-28 Thread Martin Kuen
Hi, "major.minor version 49.0" indicates that the bytecode is for Java 1.5 or later. The error (should) mean the code is being loaded by a VM older than 1.5. Do you have (somewhere) a JRE on your machine which is pre-1.5? If you just type "java" on the command-line windows (and cygwin respectively) wil
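To see where a number like 49.0 comes from: a class file starts with the magic number 0xCAFEBABE, followed by the minor and then the major version, each as a 2-byte big-endian integer. The sketch below reads that header and maps the major version to the first Java release that can load it. The function names are illustrative only.

```python
import struct

def class_file_version(data: bytes):
    """Return (major, minor) from the first 8 bytes of a .class file."""
    magic, minor, major = struct.unpack(">IHH", data[:8])
    if magic != 0xCAFEBABE:
        raise ValueError("not a Java class file")
    return major, minor

def minimum_java_release(major: int) -> str:
    """Map a class-file major version to the earliest Java release that accepts it."""
    if major < 45:
        raise ValueError("unknown class file version")
    if major <= 48:
        return f"1.{major - 44}"   # 45 -> 1.1, 46 -> 1.2, 47 -> 1.3, 48 -> 1.4
    return str(major - 44)         # 49 -> 5 (i.e. 1.5), 50 -> 6, ...

header = bytes.fromhex("cafebabe00000031")   # minor 0, major 0x31 = 49
major, minor = class_file_version(header)
print(f"{major}.{minor} needs Java {minimum_java_release(major)} or later")
```

Running this against the first bytes of one of the Nutch class files, and comparing with the output of "java -version", shows directly whether the VM on the path is too old.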

Re: semantics of meta noindex

2007-12-19 Thread Martin Kuen
Hi Charlie, IMO if the maintainer doesn't want a page to be searchable at all the page should be excluded using robots.txt (my intuition). Unfortunately, I cannot tell you how Nutch finally handles such a page in its index. My two cents, Martin On Dec 19, 2007 1:04 AM, charlie w <[EMAIL PRO

Re: Proble with pdf and word indexing

2007-12-13 Thread Martin Kuen
Hi, maybe the following ideas are helpful for you: > Monica wrote > I have a system with the nutch configure.The all html pages that they are > generated dynamically with Servlets a JSP, are correctly indexing with crawl, > but I have a problem with the pdf and word files. My system save those f

Re: maybe dumb question about nutch index and segments file

2007-09-20 Thread Martin Kuen
hi, regarding hit summaries: The summaries are generated at search time. This is necessary, since different queries will generate different summaries (and different terms will be highlighted). The parsed text is stored in the various "segments/" folders. I don't know which directory it actually pi

Re: How to change logging level to see trace message?

2007-09-17 Thread Martin Kuen
hi, have a look at "conf/log4j.properties". Using this file you can change the logging level. example: "log4j.logger.org.apache.nutch.crawl.Crawl=INFO,cmdstdout" change to: "log4j.logger.org.apache.nutch.crawl.Crawl=TRACE,cmdstdout" will change the logging level for the crawl-tool to "trace". N
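The edit described above is a one-line change, but it can also be scripted. A small sketch, assuming the simple "log4j.logger.<name>=LEVEL,appender" form shown in the message; the helper name is hypothetical, and the function works on the file's text rather than touching the file itself.

```python
def set_logger_level(properties_text, logger, level):
    """Replace the level of one log4j logger line, keeping its appender list."""
    key = "log4j.logger." + logger
    out = []
    for line in properties_text.splitlines():
        name, sep, value = line.partition("=")
        if sep and name.strip() == key:
            parts = value.split(",")
            parts[0] = level          # first field is the level, the rest are appenders
            line = name + "=" + ",".join(parts)
        out.append(line)
    return "\n".join(out)

before = "log4j.logger.org.apache.nutch.crawl.Crawl=INFO,cmdstdout"
print(set_logger_level(before, "org.apache.nutch.crawl.Crawl", "TRACE"))
```

Only the named logger is changed, so the rest of the output stays at its configured level.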

Re: maybe dumb question about nutch index and segments file

2007-09-13 Thread Martin Kuen
hi, Nutch stores more data than lucene. The lucene index is a subset of what you call "nutch index". If you follow the nutch tutorial you'll find the lucene index in "crawl/indexes". That's the location you should try to open. In that directory you'll also find a file called "segments.gen". IMO th

Re: Crawler fetching weird urls

2007-09-11 Thread Martin Kuen
hi, the commands "readdb" and "readlinkdb" could be interesting for you: http://wiki.apache.org/nutch/08CommandLineOptions If you want to see the in/outlinks (readlinkdb) of a given page you must first invoke the "invertlinks" command. Unfortunately, I don't know how to remove an individual url fr

Re: Downloading file types to file system

2007-09-11 Thread Martin Kuen
hi, I don't think that nutch can be configured to store each downloaded file as a file (one file downloaded - one file on your local disk). The "byte array called content" can be directly stored I think. I think that's worth giving it a try. The fetcher uses (binary) streams to handle the download

Re: about nutch pagerank

2007-08-16 Thread Martin Kuen
hi Ting, Have a look at the "scoring-opic" plugin and the package "org.apache.nutch.scoring.*". "opic" is the algorithm used by nutch to determine a page's static importance. Basically speaking it does the same job as Google's PageRank algorithm. Some issues (probably fixed?) regarding the imp

Re: UBUNTU total hits 0

2007-08-14 Thread Martin Kuen
Hi Fabian, sorry, but I can only "reply" with a bunch of questions . . . On 8/14/07, Fabian López <[EMAIL PROTECTED]> wrote: > > Hi, > after following the tutorial of Nutch 0.8, when I try to search with > > bin/nutch org.apache.nutch.searcher.NutchBean apache > > I receive "Total Hits:0" > > I h

Re: Fetcher get slower and slower in one run of crawling

2007-08-09 Thread Martin Kuen
ith > enough pages to test. > > So with more than 12M pages of wikipedia, I guess it is almost impossible > to > crawl wikipedia on line. > How does google do this? > > > Martin Kuen wrote: > > > > hi there, > > > > the property "server.del

Re: Fetcher get slower and slower in one run of crawling

2007-08-09 Thread Martin Kuen
hi there, the property "server.delay" is the delay for one site (e.g. wikipedia). So, if you have a delay of 0.5 you'll fetch 2 pages per second. In my opinion there is something about the fetcher's code that doesn't make it obey this rule in the very beginning . . . probably at start-up 30 thr
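The arithmetic behind this can be made explicit. The sketch assumes a single host, one request at a time, and ignores download time, so it gives a lower bound on the crawl duration. The 12M page count is the figure quoted elsewhere in this thread, and the function names are illustrative.

```python
SECONDS_PER_DAY = 86_400

def pages_per_second(delay_seconds: float) -> float:
    """Upper bound on the fetch rate for one host with a fixed per-request delay."""
    return 1.0 / delay_seconds

def crawl_days(pages: int, delay_seconds: float) -> float:
    """Lower bound on the time needed to fetch `pages` pages from one host."""
    return pages * delay_seconds / SECONDS_PER_DAY

print(pages_per_second(0.5))                    # 2.0 pages per second
print(round(crawl_days(12_000_000, 0.5), 1))    # days for 12M pages at that rate
```

At 2 pages per second, 12 million pages take about 6 million seconds, roughly ten weeks, which is why a polite per-host delay makes crawling one very large site slow no matter how many fetcher threads are configured.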

Re: Tomcat without Apache

2007-07-31 Thread Martin Kuen
hi, no, you can use tomcat as a webserver. Read about the concept of "Connectors" in the tomcat documentation, if you're interested. cheers On 7/31/07, Kursun, Mahmut <[EMAIL PROTECTED]> wrote: > > I made a funny experience while trying out nutch. > > I think it was on a machine with Fedora Cor