Re: Improvement of Nutch 0.7.2

2007-02-11 Thread Piotr Kosiorowski
I have created a 0.7.3 label in JIRA and I am willing to commit useful patches to this branch. I do not have time to develop new code myself (and if I did, I would rather spend it on trunk, in my opinion). So if you have anything to submit, I would be willing to commit it. Regards Piotr On 2/12/07,

Re: 0.7.3 version

2006-11-23 Thread Piotr Kosiorowski
As no objections were raised I created a 0.7.3 version in JIRA so we can start assigning current JIRA issues to it. Regards Piotr Piotr Kosiorowski wrote: Hello committers, Based on a recent discussion on nutch user list - (Strategic Direction of Nutch) I would like to prepare 0.7.3 release

Re: Strategic Direction of Nutch

2006-11-16 Thread Piotr Kosiorowski
We should use JIRA for it as Arun said. I will send separate email to committers with a bit better title to get acceptance for preparation of 0.7.3 version and create 0.7.3 version in JIRA so you can assign issues to it. Regards Piotr On 11/16/06, Arun Kaundal [EMAIL PROTECTED] wrote: Hi Nitin,

Fwd: 0.7.3 version

2006-11-16 Thread Piotr Kosiorowski
Hello, I am forwarding an email sent to committers so nutch users are also aware of this initiative. Regards, Piotr -- Forwarded message -- From: Piotr Kosiorowski [EMAIL PROTECTED] Date: Nov 16, 2006 10:09 PM Subject: 0.7.3 version To: nutch-dev@lucene.apache.org Hello

Re: Strategic Direction of Nutch

2006-11-15 Thread Piotr Kosiorowski
I agree with Andrzej. On my part, if someone takes the effort of preparing patches and testing, I as a committer (not a very active one recently) may focus on 7.2 issues and commit the patches. And in future prepare 7.3 release. Regards, Piotr On 11/15/06, Andrzej Bialecki [EMAIL PROTECTED] wrote:

Re: Strategic Direction of Nutch

2006-11-12 Thread Piotr Kosiorowski
Anthony, I do not think nutch can forget about small implementations. It was one of its strong points and I do think we will want to support them. For any issues please report them in JIRA and I am sure they would be taken care of. Regards Piotr On 11/12/06, Anthony May [EMAIL PROTECTED] wrote:

Re: details: stackoverflow error

2006-04-07 Thread Piotr Kosiorowski
from the shell, it works fine. Thanks, --Rajesh On 4/6/06, Piotr Kosiorowski [EMAIL PROTECTED] wrote: Which Java version do you use? Is it the same for all urls or only for specific one? If URL you are trying to crawl is public you can send it to me (off list if you wish) and I can check

Re: details: stackoverflow error

2006-04-06 Thread Piotr Kosiorowski
Which Java version do you use? Is it the same for all urls or only for specific one? If URL you are trying to crawl is public you can send it to me (off list if you wish) and I can check it on my machine. Regards Piotr Rajesh Munavalli wrote: I had earlier posted this message to the list but

Re: Nutch 0.7.2 release

2006-04-02 Thread Piotr Kosiorowski
of the release notes that you posted (292986), the changes for 0.7.2 are missing. Rgrds, Thomas On 4/1/06, Piotr Kosiorowski [EMAIL PROTECTED] wrote: Hello all, The 0.7.2 release of Nutch is now available. This is a bug fix release for 0.7 branch. See CHANGES.txt ( http

Re: Nutch 0.7.2 release | upgrading from 0.7.1?

2006-04-02 Thread Piotr Kosiorowski
The 0.7.2 release should work without problems with 0.7.1 data. Regards Piotr On 4/2/06, Håvard W. Kongsgård [EMAIL PROTECTED] wrote: What about upgrading from 0.7.1? Can I use my existing db and segments? Piotr Kosiorowski wrote: Hello all, The 0.7.2 release of Nutch is now available

Nutch 0.7.2 release

2006-04-01 Thread Piotr Kosiorowski
Hello all, The 0.7.2 release of Nutch is now available. This is a bug fix release for 0.7 branch. See CHANGES.txt (http://svn.apache.org/viewcvs.cgi/lucene/nutch/branches/branch-0.7/CHANGES.txt?rev=292986) for details. The release is available on http://lucene.apache.org/nutch/release/.

Re: A possible error in the tutorial

2006-03-29 Thread Piotr Kosiorowski
Thanks. Fixed in SVN. Will be deployed on the Web site with 0.7.2 release. fabrizio silvestri wrote: Hi guys.. Just a quick question about the tutorial: the line bin/nutch org.apache.nutch.crawl.DmozParser content.rdf.u8 -subset 5000 dmoz/urls shouldn't be bin/nutch

Re: Getting java.io.IOException: Couldn't rename \tmp\nutch\mapred\local\map_n68li2\part-0.out with Nutch 0.8

2006-01-06 Thread Piotr Kosiorowski
As I stated in recent email on similar subject - disable antivirus software if you have one. I have seen many cases when AV was keeping file locked on Windows. Regards Piotr On 1/6/06, Arun Kaundal [EMAIL PROTECTED] wrote: Anybody PLz rely I am waiting for it On 1/6/06, Arun Kaundal

Re: java.io.IOException: already exists

2006-01-04 Thread Piotr Kosiorowski
It looks like majority of people who get it run it on Windows - is it the same in your case? Maybe some kind of antivirus software is preventing the folder from being deleted? Regards Piotr On 1/4/06, Nguyen Ngoc Giang [EMAIL PROTECTED] wrote: Hi all, I'd like to bring back this topic,

Re: build instructions?

2005-12-19 Thread Piotr Kosiorowski
It is a known bug in 0.7.1 distribution. You can get the sources directly from svn and it builds fine. It is also fixed in preparation for 0.7.2 release and in trunk. Or you can fix it locally by creating empty src/java folder. I am not sure if it is the only one empty folder missing in

Re: try to restart aborted crawl

2005-12-07 Thread Piotr Kosiorowski
Hi, I had the same problems with JVM crashes and it was in fact hardware problem (memory). It can also be a problem with your software config (but as far as I remember you are using quite standard configuration). I doubt it has anything to do with nutch (except nutch stresses JVM/whole box) so

Re: Nutch returns irrelevant site

2005-12-07 Thread Piotr Kosiorowski
You can use explain page to find out why this page is scored the way it is. I would expect anchor text would be the main component of it. Regards Piotr Aled Jones wrote: Hi I'm currently setting up a nutch search engine that searches travel websites. It works quite well but sometimes returns

Re: Why Nutch 0.7.1 Does Not Compile???

2005-11-23 Thread Piotr Kosiorowski
I compiled it for the release - linux, jdk 1.4.2, ant 1.6.2. No problems - otherwise I would be unable to release it :). Please report your environment for such problems. Regards Piotr, Victor Lee wrote: ok, it's weird now. If I use the command ant jar, it builds successfully. If I use ant

Re: Using FetchListEntry -dumpurls

2005-11-13 Thread Piotr Kosiorowski
Hi, I think this is the reason: Exception in thread main java.lang.NoClassDefFoundError: net/nutch/pagedb/FetchListEntry In 0.7 branch all classes were moved to org.apache.nutch package structure and scripts were updated so you are probably using old script with new release. Regards Piotr

Re: Wrapping Nutch

2005-10-11 Thread Piotr Kosiorowski
Yes. You are right - first exception indicates that some required plugins are missing. And this is probably because you do not have plugins directory with all plugins in classpath P. On 10/10/05, Matt Clark [EMAIL PROTECTED] wrote: Sorry, One more clue. Here is the console output - 051010

Re: Dedup won't actually dedup

2005-10-09 Thread Piotr Kosiorowski
Hello Jon, As far as I remember dedup marks the records as deleted only without physically removing them. And first action of dedup is to clear old deletions (as it is written in log). So if you repeat it you will get the same number of deleted records each time. Regards Piotr Jon Shoberg
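The behaviour described in this reply can be illustrated with a toy model. This is only a sketch and not Nutch code: the class DedupToyModel, the Record type and its hash field are hypothetical names, and comparing records by a single hash string is an assumption made for illustration. It shows why repeating a "clear old deletions, then mark duplicates" pass reports the same number of deletions every time while the number of stored records never shrinks.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model only: hypothetical names, not the Nutch dedup implementation.
class DedupToyModel {
    static final class Record {
        final String hash;   // stand-in for whatever dedup compares (assumption)
        boolean deleted;     // deletion is a flag; the record itself stays in the list
        Record(String hash) { this.hash = hash; }
    }

    // One dedup pass: clear old deletion marks, then mark every repeat of a hash.
    static int dedup(List<Record> index) {
        for (Record r : index) {
            r.deleted = false;          // first action: clear old deletions
        }
        Set<String> seen = new HashSet<>();
        int marked = 0;
        for (Record r : index) {
            if (!seen.add(r.hash)) {    // add() returns false if the hash was already seen
                r.deleted = true;
                marked++;
            }
        }
        return marked;
    }

    public static void main(String[] args) {
        List<Record> index = new ArrayList<>();
        for (String h : new String[] {"a", "b", "a", "c", "b"}) {
            index.add(new Record(h));
        }
        System.out.println("first run marked:  " + dedup(index));
        System.out.println("second run marked: " + dedup(index));
        System.out.println("records stored:    " + index.size());
    }
}
```

Both runs mark the same records because the second run starts by undoing the marks of the first.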

Re: Fwd: problem about the fetch of dinamic page

2005-10-02 Thread Piotr Kosiorowski
You can use nutch readdb command to check if urls you are interested in were added to WebDB - if yes check the segments if they contain these urls. Please review the logs from fetch to check if there was an attempt to fetch from these urls (you might have some problem with authentication).

Re: a few questions

2005-10-01 Thread Piotr Kosiorowski
Earl Cahill wrote: Tempted to do each question as a separate email, but here you go. 1. Does nutch use pure lucene for its indexing? Does the nutch index = lucene + potentially ndfs? If I am going to run a search web service, I am just wondering what advantages nutch would serve over lucene.

Nutch 0.7.1 release

2005-10-01 Thread Piotr Kosiorowski
The 0.7.1 release of Nutch is now available. This is a bug fix release. See CHANGES.txt (http://svn.apache.org/viewcvs.cgi/lucene/nutch/branches/branch-0.7/CHANGES.txt?rev=292986) for details. The release is available here (http://lucene.apache.org/nutch/release/). Please report all problems

Re: link analysis and update segments

2005-09-26 Thread Piotr Kosiorowski
UpdateDB copies link information and score from the WebDB to segments so it is important to have score calculated before updatedb is run. One can use current standard nutch score (based on number of inlinks) or try to use analyze - I have committed a patch for it some time ago that might help

Re: Re-indexing segments to add more field information

2005-09-13 Thread Piotr Kosiorowski
Hello, I think it is enough to delete index.done file and index folder. I did it this way some time ago. Regards Piotr Mike Berrow wrote: I would like to re-build the indexes I have in existing segments using a custom index filter plug-in (adds more field information to assist with a custom
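A minimal sketch of the clean-up step mentioned in this reply, assuming index.done and index sit directly inside the segment directory. The class name ResetSegmentIndex is hypothetical, the segment path is whatever you pass on the command line, and the code uses the modern java.nio.file API rather than anything available in 2005. It deletes data, so try it on a copy first.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: removes index.done and the index folder from one segment directory.
class ResetSegmentIndex {
    static void reset(Path segmentDir) throws IOException {
        // Marker file named in the message; a missing file is not an error.
        Files.deleteIfExists(segmentDir.resolve("index.done"));

        Path index = segmentDir.resolve("index");
        if (!Files.exists(index)) {
            return;
        }
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(index)) {
            // Reverse order puts children before their parent directories.
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : paths) {
            Files.delete(p);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java ResetSegmentIndex <segment-directory>");
            return;
        }
        reset(Paths.get(args[0]));
    }
}
```

Everything else in the segment directory is left alone, so the segment can be indexed again afterwards.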

Re: Link Analysis Score..

2005-09-04 Thread Piotr Kosiorowski
There are many ways nutch can boost document in the index. But I suspect you are referring to analyze process - it uses PageRank computation for page score. For details read DistributedAnalysisTool - especially computeRound method. Regards Piotr Rozina Sorathia wrote: I wanted to know where

Re: Analyser error

2005-08-31 Thread Piotr Kosiorowski
. ' bin/nutch admin db -create' 4. I'll then updatedb db from a fetched segment, this should fill it up with links? 5. 'bin/nutch analyze db 7' And it fails here with three 'tmpsomething' directories and webdb.new -Original Message- From: Piotr Kosiorowski [mailto:[EMAIL PROTECTED] Sent

Re: PDF support? Does crawl parse p

2005-08-31 Thread Piotr Kosiorowski
Hello Diane, There is a plugin to parse pdf files. You have to enable it in nutch-site.xml (just copy entry from nutch-default.xml). You have to change plugin.includes property to include parse-pdf plugin: [...] parse-(text|html) [...] to [...] parse-(text|html|pdf) [...] Regards Piotr Diane
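The effect of the change to plugin.includes can be checked in isolation. The sketch below assumes the property value is treated as a regular expression matched against plugin names, which is what the notation in the message suggests. Only the two fragments quoted there are used, since the rest of the value is elided in the message. The class name PluginIncludesDemo is hypothetical, and java.util.regex stands in for whatever Nutch uses internally.

```java
import java.util.regex.Pattern;

// Hypothetical demo: compares the quoted plugin.includes fragments against plugin names.
class PluginIncludesDemo {
    // Fragments exactly as quoted in the message; the elided parts are not reconstructed.
    static final String BEFORE = "parse-(text|html)";
    static final String AFTER = "parse-(text|html|pdf)";

    // Assumption: a plugin is enabled when the whole name matches the expression.
    static boolean included(String includesRegex, String pluginName) {
        return Pattern.matches(includesRegex, pluginName);
    }

    public static void main(String[] args) {
        for (String name : new String[] {"parse-text", "parse-html", "parse-pdf"}) {
            System.out.println(name
                + " before=" + included(BEFORE, name)
                + " after=" + included(AFTER, name));
        }
    }
}
```

With the original fragment parse-pdf is not matched; after adding pdf to the alternation it is, while parse-text and parse-html match in both cases.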

Re: permissions error with nutch 0.7

2005-08-25 Thread Piotr Kosiorowski
It looks like it has problem creating lucene lock file - I think it is usually created in /tmp if you are running it on Unix. Can you check if you can correctly access it? Regards Piotr On 8/25/05, Jason Martens [EMAIL PROTECTED] wrote: On Wed, 2005-08-24 at 18:20 -0700, Michael Ji wrote: did
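One quick way to test the suspicion in this reply is to try creating a file in the temporary directory as the same user that runs Nutch. This is a sketch under assumptions: the message only says the lock file is usually created in /tmp, and checking the JVM's java.io.tmpdir location is a guess at where to look. The class name TempDirCheck and the probe file prefix are hypothetical.

```java
import java.io.File;
import java.io.IOException;

// Hypothetical diagnostic: can this user create and delete a file in the given directory?
class TempDirCheck {
    static boolean canCreateFileIn(File dir) {
        try {
            // createTempFile fails with IOException if the directory is missing or not writable.
            File probe = File.createTempFile("lock-probe", ".tmp", dir);
            return probe.delete();
        } catch (IOException | SecurityException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumption: the lock location follows the JVM temp dir (commonly /tmp on Unix).
        File tmp = new File(System.getProperty("java.io.tmpdir"));
        System.out.println(tmp + " writable: " + canCreateFileIn(tmp));
    }
}
```

If this prints false for the user running the crawl, a permissions problem on the temporary directory is a plausible cause.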

Re: FetchedSegments.getSummary() for a PDF

2005-08-25 Thread Piotr Kosiorowski
and then reindex? -lucas On Aug 25, 2005, at 11:28 AM, Piotr Kosiorowski wrote: As I understand if you had parse-pdf disabled you have to reparse (and then reindex) segments. There is no standard way to do it (I think it might be done with some tricks). The easiest way would be to refetch it with pdf

Re: Nutch 0.7 released

2005-08-17 Thread Piotr Kosiorowski
be I don't understanding correctly but according to README.txt file included in release pack I can't find docs/en/tutorial.html and docs/en/developers.html files. Lukas On 8/17/05, Piotr Kosiorowski [EMAIL PROTECTED] wrote: Hi, New Nutch release was prepared today. This is the first Nutch

Re: webdb - orphaned pages?

2005-08-10 Thread Piotr Kosiorowski
Hello, Pages from WebDB are not deleted automatically. Nutch does not check if page has inlinks during fetchlist generation - so orphaned page would be refetched. It will stop to refetch the page if page becomes unavailable for some number of fetch attempts. Regards Piotr On 8/10/05, Raymond

Re: Problem in Incremental crawling with 4GB segment directories

2005-08-08 Thread Piotr Kosiorowski
Hello Ayyanar , Please be more specific with your setup and problem description. Recently I fetched a segment that contains 73GB of data now so I do not think size of your segment is a problem. Regards Piotr On 8/8/05, Ayyanar Inbamohan [EMAIL PROTECTED] wrote: Hi, I have 4 GB data, which is

Re: Is it possible to have multiple search.dir in nutch-site.xml, Please reply immediately

2005-08-08 Thread Piotr Kosiorowski
No it is not possible. But you can search in both if you merge the segments or simply put them in one location. Regards Piotr On 8/8/05, Ayyanar Inbamohan [EMAIL PROTECTED] wrote: Hi all, Is it possible to have multiple search.dir in nutch-site.xml, Please reply immediately I want to

Re: regex-url filter

2005-08-08 Thread Piotr Kosiorowski
Hello, I am not sure which way is better but I would look for dot: original http://([a-z0-9]*\.)*.(com|org|net|biz|edu|biz|mil|us|info|cc)/ modified http://([a-z0-9]*\.)*(com|org|net|biz|edu|biz|mil|us|info|cc)/ In my opinion dot before com,org etc is already included in ([a-z0-9]*\.)* and
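The difference between the two expressions can be seen by running both against sample URLs. This is a sketch: java.util.regex is used as a stand-in for the regex engine Nutch applies to its filter file, the +/- rule prefixes of that file are not modeled, and the host names and the class name UrlFilterRegexDemo are made up for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo of the two expressions quoted in the message.
class UrlFilterRegexDemo {
    // Original: an unescaped "." sits between the label group and the TLD list.
    static final Pattern ORIGINAL = Pattern.compile(
        "http://([a-z0-9]*\\.)*.(com|org|net|biz|edu|biz|mil|us|info|cc)/");
    // Modified: the stray "." is removed; the group already ends every label with a dot.
    static final Pattern MODIFIED = Pattern.compile(
        "http://([a-z0-9]*\\.)*(com|org|net|biz|edu|biz|mil|us|info|cc)/");

    static boolean accepts(Pattern pattern, String url) {
        return pattern.matcher(url).find();
    }

    public static void main(String[] args) {
        String[] urls = {"http://www.example.com/", "http://example.xcom/"};
        for (String url : urls) {
            System.out.println(url
                + " original=" + accepts(ORIGINAL, url)
                + " modified=" + accepts(MODIFIED, url));
        }
    }
}
```

Because the group ([a-z0-9]*\.)* consumes the dot in front of the TLD, the extra "." in the original demands one more arbitrary character: it rejects http://www.example.com/ and accepts the odd http://example.xcom/, while the modified expression does the opposite.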

Re: distributed search

2005-08-05 Thread Piotr Kosiorowski
If you have two search servers search1.mydomain.com search2.mydomain.com Then on each of them run ./bin/nutch server 1234 /index Now go to your tomcat box. In the directory where you used to have segments dir (either tomcat startup directory or directory specified in nutch config xml). Create

Re: Preventing the fetch command from going to certain URLs

2005-07-29 Thread Piotr Kosiorowski
Hello Joe, If you are using whole web crawling you should change regex-urlfilter.txt instead of crawl-urlfilter.txt. Piotr On 7/28/05, Vacuum Joe [EMAIL PROTECTED] wrote: I have a simple question: I'm using Nutch to do some whole-web crawling (just a small dataset). Somehow Nutch has gotten

Re: [Nutch-general] number of indexed pages

2005-07-29 Thread Piotr Kosiorowski
Hello, First one will give you number of pages in WebDB and not all of them are indexed. Regards, Piotr On 7/29/05, Erik Hatcher [EMAIL PROTECTED] wrote: Two options: bin/nutch readdb crawl/db -stats or use Luke (Google for luke lucene) to open the Lucene index. Erik On

Re: Problem Starting Nutch (Tutorial like)

2005-07-28 Thread Piotr Kosiorowski
Hello, I would rather suspect some misconfiguration of networking. According to Javadoc: InetAddress.getLocalHost() throws UnknownHostException - if no IP address for the host could be found. Regards Piotr On 7/28/05, blackwater dev [EMAIL PROTECTED] wrote: Are you sure your urls file doesn't
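The call mentioned in this reply can be run on its own to see whether the machine's host name resolves, independently of Nutch. A minimal sketch, with the class name LocalHostCheck chosen for illustration:

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical diagnostic: reports whether the local host name resolves to an address.
class LocalHostCheck {
    static String describeLocalHost() {
        try {
            InetAddress local = InetAddress.getLocalHost();
            return "OK: " + local.getHostName() + " -> " + local.getHostAddress();
        } catch (UnknownHostException e) {
            // Thrown when no IP address for the local host name could be found.
            return "FAILED: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(describeLocalHost());
    }
}
```

A FAILED line here would point at host name resolution on the machine (for example the hosts file or DNS setup) rather than at the crawl configuration.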

Re: total pages

2005-07-27 Thread Piotr Kosiorowski
Hello, I assume second counts are printed by some tool accessing WebDB. Right? If so - 2 250 000 is the number of pages generated to be fetched (so all fetched pages, fetch attempts with error) - simply total number of pages in segments. The second number is amount of Pages/Links in WebDB -

Re: Merge Crawl results

2005-07-27 Thread Piotr Kosiorowski
Hello, You can merge segments for these two crawls using nutch mergesegs, in fact you can simply copy all segment directories to one place. But it would not be a full merge of crawls as right now there is no way to merge WebDB for these two crawls. You can deduplicate it using nutch dedup

Re: Searching by content type

2005-07-27 Thread Piotr Kosiorowski
Hello, Please have a look at index-more and query-more plugins for content-type handling. Regards Piotr Vacuum Joe wrote: I have been looking through the API docs and I can't figure this out. Here is my question: Is there a way to search based on meta-information, such as content type, or even

Re: prioritizing newly injected urls for fetching

2005-07-27 Thread Piotr Kosiorowski
Hello Kamil, Do you want to generate a fetchlist with urls that are present in WebDB but were not fetched till now? I am not sure what you are trying to achieve, but you can generate any fetchlist you want using latest tool by Andrzej Bialecki (http://issues.apache.org/jira/browse/NUTCH-68)

Re: Skipping the final indexing step?

2005-07-21 Thread Piotr Kosiorowski
Hello Otis, If you are only reading ParseData and FetcherOutput from nutch segment you do not need lucene index at all. So you can safely skip -i switch. Regards Piotr On 7/21/05, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote: Hello, I'm using SegmentMergeTool to merge some large segments, and I

Re: How to view the URLs stored in a segment

2005-07-19 Thread Piotr Kosiorowski
Hello, Bryan Woliner wrote: A couple more (basic) questions: When FetchListEntry is called with the -dumpurls option, where does the fetchlist get dumped, in what format, and how do I access it? The list of urls is dumped to stdout (System.out). Format of single line: Recno record_number:

Re: ndfs stuff

2005-07-07 Thread Piotr Kosiorowski
Hello Ferenc, Some documentation on running ndfs can be found on wiki: http://wiki.apache.org/nutch/NutchDistributedFileSystem Regards, Piotr [EMAIL PROTECTED] wrote: Have any location the ndfs usage documentation? Regards, Ferenc