Re: Help with bin/nutch server 8081 crawl

2006-03-07 Thread Marko Bauhardt
These two patches could fix the problem. The first is a Hadoop patch and the second is a Nutch patch. I don't know whether I should file a bug in both the Nutch JIRA and the Hadoop JIRA. Anyway, here are the two patches. Index: src/java/org/apache/hadoop/ipc/Server.java

Re: org.apache.nutch.net.URLFilter not found.

2006-03-10 Thread Marko Bauhardt
It sounds like Nutch cannot find your plugins. A stack trace from your exception would help. Please check the following property in your nutch-default.xml: <property> <name>plugin.folders</name> <value>plugin</value> <description>Directories where nutch plugins are located. Each element may be a relative

Re: URL containing ?, and =

2006-03-10 Thread Marko Bauhardt
On 10.03.2006 at 05:58, Vertical Search wrote: Okay, I have noticed that I cannot crawl URLs containing ? and =. I have tried all combinations of modifying crawl-urlfilter.txt and # skip URLs containing certain characters as probable queries, etc. [EMAIL PROTECTED] Try [EMAIL

Re: URL containing ?, and =

2006-03-10 Thread Marko Bauhardt
Are you crawling an intranet or the web? If you are crawling the web, you must edit regex-urlfilter.txt and not crawl-urlfilter.txt. In your first mail you said you get an exception like org.apache.nutch.net.URLFilter not found. Does the exception still occur? Marko
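Before changing any filter file, it can help to see which seed URLs actually contain the characters in question. The following is only a sketch in plain POSIX shell: `has_query_chars` and `list_query_urls` are hypothetical helpers, not part of Nutch, and they do not read regex-urlfilter.txt or crawl-urlfilter.txt. The seed file is assumed to hold one URL per line.

```shell
# Hypothetical helper (not part of Nutch): succeeds when the URL
# contains '?' or '=', the characters discussed in this thread.
has_query_chars() {
  case "$1" in
    *[?=]*) return 0 ;;
    *)      return 1 ;;
  esac
}

# Print every URL from a seed file (one URL per line) that contains '?' or '='.
list_query_urls() {
  while IFS= read -r url; do
    if has_query_chars "$url"; then
      printf '%s\n' "$url"
    fi
  done < "$1"
}
```

Calling `list_query_urls urls.txt` (with `urls.txt` standing in for your own seed file) lists the URLs that a rule rejecting `?` and `=` would drop.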

Re: 0.8: NullPointerException Optimizing index when crawling

2006-03-15 Thread Marko Bauhardt
On 14.03.2006 at 23:20, ArentJan Banck wrote: java.lang.NullPointerException at org.apache.nutch.indexer.Indexer$OutputFormat$1.write(Indexer.java:109) Which index plugins have you configured in your nutch-default.xml or nutch-site.xml? Be sure that the index-basic plugin

Re: newbie question about nutch 0.8

2006-03-16 Thread Marko Bauhardt
On 16.03.2006 at 06:43, Ilia S. Yatsenko wrote: And got next error: file not found index/segment Have you configured the searcher.dir property in the nutch-default.xml or nutch-site.xml of your Nutch webapp? <property> <name>searcher.dir</name> <value>crawl</value> <description> Path to

Re: newbie question about nutch 0.8

2006-03-16 Thread Marko Bauhardt
On 16.03.2006 at 06:43, Ilia S. Yatsenko wrote: And got next error: file not found index/segment Your folder structure should be: YOUR_SEARCH_FOLDER/crawldb YOUR_SEARCH_FOLDER/linkdb YOUR_SEARCH_FOLDER/segments/2006... YOUR_SEARCH_FOLDER/indexes/part-
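The layout above can be checked quickly from the command line. This is a minimal sketch, assuming only the four top-level directory names listed in the message; `check_search_dir` is a hypothetical helper, not a Nutch command, and it does not look inside segments/ or indexes/ because those names are cut off in the preview.

```shell
# Hypothetical helper (not part of Nutch): list the expected
# subdirectories that are missing below a search folder.
# Returns 0 when all four exist, 1 otherwise.
check_search_dir() {
  csd_status=0
  for csd_sub in crawldb linkdb segments indexes; do
    if [ ! -d "$1/$csd_sub" ]; then
      printf 'missing: %s/%s\n' "$1" "$csd_sub"
      csd_status=1
    fi
  done
  return "$csd_status"
}
```

Run it as `check_search_dir YOUR_SEARCH_FOLDER`, using the folder that searcher.dir points to. No output and exit status 0 mean all four directories are present.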

Re: Custom Distributed crawl - NDFS?

2006-03-16 Thread Marko Bauhardt
On 16.03.2006 at 12:50, Grégory Debord wrote: Hi all, I would like to implement a distributed crawl which would be something like this: The Hadoop project is used for working with a DFS. Hadoop has one master (namenode, jobtracker) and n slaves (datanodes and tasktrackers).

Re: Searching specific domains

2006-03-17 Thread Marko Bauhardt
On 17.03.2006 at 00:28, MagRaj wrote: Is it possible to create a new segment (containing all the pages of that URL) for each URL? You can use regex-urlfilter.txt to accept only the URLs you want. But for every new segment you have to change regex-urlfilter.txt. A better way is to

Re: Help Setting Up Nutch 0.8 Distributed

2006-03-18 Thread Marko Bauhardt
On 17.03.2006 at 17:20, Dennis Kubes wrote: Exception in thread "main" java.io.IOException: Job failed! at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:310) at org.apache.nutch.crawl.Injector.inject(Injector.java:114) at

Re: Searching specific domains

2006-03-18 Thread Marko Bauhardt
On 17.03.2006 at 20:22, MagRaj wrote: Thanks Marko for your suggestion. But here is my problem. Find below the config files with the sample data I have: urls.txt has got 5 urls (just as an example)

Re: Nutch client and move plugins

2006-03-20 Thread Marko Bauhardt
On 20.03.2006 at 04:30, Berlin Brown wrote: Is there a way to specify where the plugins are located? <property> <name>plugin.folders</name> <value>YOUR PLUGIN PATH</value> <description>Directories where nutch plugins are located. Each element may be a relative or absolute path. If absolute,
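To confirm which plugin path a given configuration file actually sets, the value can be pulled out with standard tools. The following is a rough sketch rather than a real XML parser: `get_property` is a hypothetical helper, not part of Nutch, and it assumes that `<value>` directly follows `<name>` and holds plain text. Dots in the property name are treated as regex wildcards, which is harmless for names like plugin.folders.

```shell
# Hypothetical helper (not part of Nutch): print the <value> of a named
# property from a Hadoop-style configuration file.
# $1 = configuration file, $2 = property name.
get_property() {
  # Join all lines into one so that <name> and <value> may sit on separate lines.
  { tr '\n' ' ' < "$1"; echo; } |
    sed -n "s|.*<name>$2</name>[[:space:]]*<value>\([^<]*\)</value>.*|\1|p"
}
```

For example, `get_property conf/nutch-site.xml plugin.folders` prints the configured path, or nothing if the property is not set in that file. The path `conf/nutch-site.xml` is an assumption about where your configuration lives.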

Re: is there a separate mailing list for hadoop now

2006-03-23 Thread Marko Bauhardt
On 23.03.2006 at 08:42, Raghavendra Prabhu wrote: Hi, is there a separate mailing list for Hadoop right now? http://lucene.apache.org/hadoop/mailing_lists.html Marko

Re: WhiteListBlackList

2006-05-22 Thread Marko Bauhardt
On 22.05.2006 at 13:50, Murat Ali Bayir wrote: Hi, I have a problem when I am using black-white list URL filtering. I have two directories for filtering called NegativeURLS and PositiveURLS ** *** in

Re: Setting query.host.boost etc. in nutch-site.xml does not work?

2006-05-22 Thread Marko Bauhardt
This is a bug in the query-basic plugin. The boosting values in nutch-default.xml are not used. We should open a bug in JIRA. This simple patch should work. Index: src/plugin/query-basic/src/java/org/apache/nutch/searcher/basic/BasicQueryFilter.java

Re: Sorting in nutch-webinterface - how?

2006-05-25 Thread Marko Bauhardt
On 25.05.2006 at 13:21, Stefan Neufeind wrote: Hi, I did use index-basic and index-more. I see lastModified in the RSS output. Now I want to sort=lastModified, which does not work. Try sort=date. Regards, Marko

Re: Sorting in nutch-webinterface - how?

2006-05-25 Thread Marko Bauhardt
Hmm, that works. But why? I think the field is named lastModified. lastModified is only used if it is available in the HTML meta tags. If it is, lastModified is stored but not indexed. The date field, however, is always indexed. If lastModified is available as

Re: Sorting in nutch-webinterface - how?

2006-05-26 Thread Marko Bauhardt
On 26.05.2006 at 01:57, Stefan Neufeind wrote: Modified. If not, date=FetchTime. Hi Marko, Hi Stefan, that hint really helped. Can you maybe also help me out with sort=title? See also: http://issues.apache.org/jira/browse/NUTCH-287 The problem is that it works on some searches - but

Re: deleting URL duplicates - never actually deleted?

2006-07-02 Thread Marko Bauhardt
# De-duplicate indexes # bogus argument is ignored but needed due to # a bug in the number of args expected bin/nutch dedup crawl/segments bogus The dedup command works only on indexes, not on segments. The directory structure of an index looks like:

Re: problem with fetching PDF or word format

2006-07-05 Thread Marko Bauhardt
On 05.07.2006 at 17:24, aicha BEN wrote: hello, Hi, fetch of file:///C:/doc/test.pdf failed with: java.lang.Exception: org.apache.nutch.protocol.file.FileError: File Error: 404 Does the PDF file exist? Error code 404 sounds like 'File Not Found'. Marko

Re: Nutch 0.8 java 1.4/1.5

2006-07-17 Thread Marko Bauhardt
Hi, try exporting JAVA_HOME in your $HOME/.bashrc and $HOME/.bash_profile. You must also add $JAVA_HOME/bin to your $PATH variable, e.g. export PATH=$JAVA_HOME/bin:$PATH. It is important that $JAVA_HOME/bin comes before the other $PATH entries. The first
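As a concrete illustration of the ordering described above, the lines below could go into $HOME/.bashrc. This is a sketch under one loud assumption: the JDK directory /usr/lib/jvm/java-1.5.0 is only a placeholder and must be replaced with the real install location on your machine.

```shell
# Assumption: the JDK lives in /usr/lib/jvm/java-1.5.0. This path is only a
# placeholder and is used when JAVA_HOME is not already set.
JAVA_HOME="${JAVA_HOME:-/usr/lib/jvm/java-1.5.0}"
export JAVA_HOME

# Prepend, so this JDK's bin directory is searched before any other PATH entry.
PATH="$JAVA_HOME/bin:$PATH"
export PATH
```

After opening a new shell, `command -v java` shows which binary is picked up. With the ordering above it should be the one under $JAVA_HOME/bin, provided that directory contains a java executable.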

Re: NullPointException

2006-08-03 Thread Marko Bauhardt
Hi, if you delete segments, be sure that you don't have an index built from those segments. The segment contains the parsed content and the index is the index of this content. If you delete the segment and then run a search on this index, an NPE occurs because no summary (parsed content) is

Re: NullPointException

2006-08-03 Thread Marko Bauhardt
On 03.08.2006 at 18:52, Lourival Júnior wrote: My questions: Why does it occur? How can I know which segments can be deleted? You must know which segments are indexed. You cannot index all segments and then delete them. The Indexer indexes the name of the segment that the

Re: indexing or search problem?

2006-08-05 Thread Marko Bauhardt
On 04.08.2006 at 12:33, Rocio Chongtay wrote: Hi, Hi How can I check if my indexing has gone well if so far I cannot search? I have followed the step by step guide all the way to indexing and setting the GUI in tomcat. In my indexes/part-0 folder I can see files like:

Re: scheduling

2009-08-18 Thread Marko Bauhardt
On Aug 18, 2009, at 7:04 AM, fa...@butterflycluster.net wrote: hi, Hi Fadzi, I have a requirement to build a simple UI for starting and stopping the crawler, and also a scheduling mechanism (Quartz). Has anyone attempted this before? We have started to implement the upgrade of the

Re: scheduling

2009-08-18 Thread Marko Bauhardt
Do you start the GUI from Eclipse or from the binary package? marko On Aug 18, 2009, at 9:57 AM, fa...@butterflycluster.net wrote: tried that; no joy still. are there any specifics i need to put in nutch-site.xml? because mine is blank at the moment. Quoting Marko Bauhardt m...@101tec.com

Re: scheduling

2009-08-18 Thread Marko Bauhardt
on 127.0.0.1:50060/general. But the GUI is currently only in German :(. In the next few days we will translate it via i18n. marko Quoting Marko Bauhardt m...@101tec.com: On Aug 18, 2009, at 9:36 AM, fa...@butterflycluster.net wrote: Hi Marko, Hi I am trying to run the AdminApp

Re: topN value in crawl

2009-08-20 Thread Marko Bauhardt
On Aug 19, 2009, at 8:42 PM, alx...@aim.com wrote: hi Thanks. What if urls in my seed file do not have outlinks, let say .pdf files. Should I still specify topN variable? All I need is to index all urls in my seed file. And they are about 1 M. topN means that your generated

Re: Possible memory leak in Nutch-1.0 ?

2009-08-20 Thread Marko Bauhardt
Hi. You said that you open and close the NutchBean on every request. First, this is very expensive. Create the NutchBean only once, save it in the application, and read it from the application when needed. Second, I am not sure, but it is possible that the PluginRepository has the memory

Re: Possible memory leak in Nutch-1.0 ?

2009-08-20 Thread Marko Bauhardt
On Aug 20, 2009, at 5:42 PM, Mark Round wrote: not sure but maybe it is possible that the PluginRepository has the memory leak. i think the cache (the weakhashmap) is growing and growing. Is this the same issue as reported here : https://issues.apache.org/jira/browse/NUTCH-356 ? Oops, yes.

graphical user interface v0.2 for nutch

2009-09-24 Thread Marko Bauhardt
Hi list. We have pushed the second Nutch GUI release, version 0.2. You can download the binary or the sources from http://github.com/101tec/nutch/downloads Two main features are implemented in this version: + Security. You can start the admin GUI with a login feature; usernames and passwords can

Re: graphical user interface v0.2 for nutch

2009-09-30 Thread Marko Bauhardt
On Sep 30, 2009, at 3:47 PM, Bartosz Gadzimski wrote: Hello, Hi Bartosz First - great job, it looks and works very nice. :) Thanks! I have a question about urlfilters. Is it possible to get a regex-urlfilter per instance (different for each instance)? Good idea. I think you

Re: graphical user interface v0.2 for nutch

2009-09-30 Thread Marko Bauhardt
Hi David. Sorry, I don't understand your question. You can find documentation about the Nutch GUI here: http://wiki.github.com/101tec/nutch marko On Sep 30, 2009, at 4:02 PM, David Jashi wrote: Any documentation on how to add this GUI to an existing Nutch instance? Respectfully, David Jashi

Re: graphical user interface v0.2 for nutch

2009-09-30 Thread Marko Bauhardt
version you have patched? You can try to make a diff against release-1.0 to create a patch file. After that you can check out or download the GUI and try to apply your patch. Maybe this could work. marko Respectfully, David Jashi On Wed, Sep 30, 2009 at 18:19, Marko Bauhardt m

Re: NutchBean refresh index problem

2009-10-05 Thread Marko Bauhardt
On Oct 2, 2009, at 3:38 PM, Haris Papadopoulos wrote: Hi, Hi Haris. Maybe you can use some code snippets from the Nutch GUI v0.2 (http://github.com/101tec/nutch). This version has an API to reload the searcher (only NutchBeans are supported). For example: SearcherFactory

http keep alive

2009-10-14 Thread Marko Bauhardt
Hi. Is there a way to use HTTP keep-alive with Nutch? Do protocol-http or protocol-httpclient support keep-alive? I can't find any use of HTTP keep-alive in the code or in the configuration files. Thanks, Marko