I have created a 0.7.3 label in JIRA and I am willing to commit useful
patches to this branch. I do not have time to develop new code myself
(and if I did, I would rather spend it on trunk). So if
you have anything to submit, I would be willing to commit it.
Regards
Piotr
On 2/12/07,
As no objections were raised I created a 0.7.3 version in JIRA so we can
start assigning current JIRA issues to it.
Regards
Piotr
Piotr Kosiorowski wrote:
Hello committers,
Based on a recent discussion on the nutch-user list (Strategic Direction
of Nutch), I would like to prepare a 0.7.3 release.
We should use JIRA for it, as Arun said. I will send a separate email to
committers with a better title to get acceptance for preparing
the 0.7.3 version, and create a 0.7.3 version in JIRA so you can assign
issues to it.
Regards
Piotr
On 11/16/06, Arun Kaundal [EMAIL PROTECTED] wrote:
Hi Nitin,
Hello,
I am forwarding an email sent to committers so nutch users are also
aware of this initiative.
Regards,
Piotr
-- Forwarded message --
From: Piotr Kosiorowski [EMAIL PROTECTED]
Date: Nov 16, 2006 10:09 PM
Subject: 0.7.3 version
To: nutch-dev@lucene.apache.org
Hello
I agree with Andrzej. For my part, if someone takes the effort of
preparing and testing patches, I as a committer (not a very active one
recently) can focus on 0.7.2 issues and commit the patches, and in
future prepare the 0.7.3 release.
Regards,
Piotr
On 11/15/06, Andrzej Bialecki [EMAIL PROTECTED] wrote:
Anthony,
I do not think nutch can forget about small implementations. It was
one of its strong points
and I do think we will want to support them. For any issues, please
report them in JIRA and I am sure they will be taken care of.
Regards
Piotr
On 11/12/06, Anthony May [EMAIL PROTECTED] wrote:
from the shell, it works fine.
Thanks,
--Rajesh
On 4/6/06, Piotr Kosiorowski [EMAIL PROTECTED] wrote:
Which Java version do you use?
Is it the same for all URLs or only for a specific one?
If the URL you are trying to crawl is public, you can send it to me (off-list
if you wish) and I can check it on my machine.
Regards
Piotr
Rajesh Munavalli wrote:
I had earlier posted this message to the list but
of the release notes that you posted
(292986),
the changes for 0.7.2 are missing.
Rgrds, Thomas
On 4/1/06, Piotr Kosiorowski [EMAIL PROTECTED] wrote:
Hello all,
The 0.7.2 release of Nutch is now available. This is a bug fix release
for 0.7 branch. See CHANGES.txt
(http://svn.apache.org/viewcvs.cgi/lucene/nutch/branches/branch-0.7/CHANGES.txt?rev=292986)
The 0.7.2 release should work without problems with 0.7.1 data.
Regards
Piotr
On 4/2/06, Håvard W. Kongsgård [EMAIL PROTECTED] wrote:
What about upgrading from 0.7.1? Can I use my existing db and segments?
Piotr Kosiorowski wrote:
Hello all,
The 0.7.2 release of Nutch is now available
Hello all,
The 0.7.2 release of Nutch is now available. This is a bug fix release
for 0.7 branch. See CHANGES.txt
(http://svn.apache.org/viewcvs.cgi/lucene/nutch/branches/branch-0.7/CHANGES.txt?rev=292986)
for details. The release is available on
http://lucene.apache.org/nutch/release/.
Thanks. Fixed in SVN. Will be deployed on the Web site with 0.7.2 release.
fabrizio silvestri wrote:
Hi guys..
Just a quick question about the tutorial:
the line
bin/nutch org.apache.nutch.crawl.DmozParser content.rdf.u8 -subset
5000 dmoz/urls
shouldn't be
bin/nutch
As I stated in a recent email on a similar subject: disable antivirus software
if you have one. I have seen many cases where AV kept a file locked on
Windows.
Regards
Piotr
On 1/6/06, Arun Kaundal [EMAIL PROTECTED] wrote:
Anybody, please reply. I am waiting for it.
On 1/6/06, Arun Kaundal
It looks like the majority of people who get it are running it on Windows; is it the
same in your case?
Maybe some kind of antivirus software is preventing the folder from being
deleted?
Regards
Piotr
On 1/4/06, Nguyen Ngoc Giang [EMAIL PROTECTED] wrote:
Hi all,
I'd like to bring back this topic,
It is a known bug in the 0.7.1 distribution. You can get the sources
directly from svn and they build fine. It is also fixed in preparation for
the 0.7.2 release and in trunk. Or you can fix it locally by creating an empty
src/java folder; I am not sure if it is the only empty folder missing
in
Hi,
I had the same problems with JVM crashes and it was in fact a hardware
problem (memory). It can also be a problem with your software config
(but as far as I remember you are using a quite standard configuration).
I doubt it has anything to do with nutch (except that nutch stresses the
JVM/whole box), so
You can use the explain page to find out why this page is scored the way it
is. I would expect anchor text to be the main component of it.
Regards
Piotr
Aled Jones wrote:
Hi
I'm currently setting up a nutch search engine that searches travel
websites. It works quite well but sometimes returns
I compiled it for the release: linux, jdk 1.4.2, ant 1.6.2. No
problems; otherwise I would be unable to release it :).
Please report your environment for such problems.
Regards
Piotr
Victor Lee wrote:
ok, it's weird now. If I use the command ant jar, it builds successfully. If I use
ant
Hi,
I think this is the reason:
Exception in thread main java.lang.NoClassDefFoundError:
net/nutch/pagedb/FetchListEntry
In the 0.7 branch all classes were moved to the org.apache.nutch package
structure and the scripts were updated, so you are probably using an old script
with the new release.
Regards
Piotr
Yes, you are right: the first exception indicates that some required plugins
are missing. And this is probably because you do not have the plugins directory
with all plugins on the classpath.
P.
On 10/10/05, Matt Clark [EMAIL PROTECTED] wrote:
Sorry,
One more clue. Here is the console output -
051010
Hello Jon,
As far as I remember, dedup only marks the records as deleted, without
physically removing them.
And the first action of dedup is to clear old deletions (as is written in the
log). So if you repeat it, you will get the same number of deleted
records each time.
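The mark-without-remove behaviour can be sketched with a toy model (an illustration of the behaviour described above, not Nutch code; names and data are made up):

```python
# Toy model: records are only *marked* deleted, and each run first clears
# old deletion marks, so repeated runs report the same number of records.

def dedup(records):
    # First action: clear old deletion marks (as the log says).
    for r in records:
        r["deleted"] = False
    # Mark every record after the first with a given checksum as deleted.
    seen = set()
    deleted = 0
    for r in records:
        if r["checksum"] in seen:
            r["deleted"] = True
            deleted += 1
        else:
            seen.add(r["checksum"])
    return deleted

records = [{"checksum": c, "deleted": False} for c in ("a", "b", "a", "c", "b")]
print(dedup(records))  # 2 duplicates marked
print(dedup(records))  # running again reports the same count: 2
```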
Regards
Piotr
Jon Shoberg
You can use the nutch readdb command to check whether the urls you are interested
in were added to the WebDB; if yes, check whether the segments contain
these urls. Please review the fetch logs to check whether there was an
attempt to fetch these urls (you might have some problem with
authentication).
Earl Cahill wrote:
Tempted to do each question as a separate email, but
here you go.
1. Does nutch use pure lucene for its indexing? Does
the nutch index = lucene + potentially ndfs? If I am
going to run a search web service, I am just wondering
what advantages nutch would serve over lucene.
The 0.7.1 release of Nutch is now available. This is a bug fix release.
See CHANGES.txt
(http://svn.apache.org/viewcvs.cgi/lucene/nutch/branches/branch-0.7/CHANGES.txt?rev=292986)
for details. The release is available here
(http://lucene.apache.org/nutch/release/).
Please report all problems
UpdateDB copies link information and score from the WebDB to segments, so
it is important to have the score calculated before updatedb is run. One can
use the current standard nutch score (based on the number of inlinks) or try to
use analyze; I committed a patch for it some time ago that might
help
Hello,
I think it is enough to delete the index.done file and the index folder. I did
it this way some time ago.
Regards
Piotr
Mike Berrow wrote:
I would like to re-build the indexes I have in existing segments using
a custom index filter plug-in (adds more field information to assist with
a custom
There are many ways nutch can boost a document in the index. But I suspect
you are referring to the analyze process; it uses a PageRank computation for
the page score. For details read DistributedAnalysisTool, especially the
computeRound method.
Regards
Piotr
Rozina Sorathia wrote:
I wanted to know where
. ' bin/nutch admin db -create'
4. I'll then updatedb db from a fetched segment, this should fill it up with
links?
5. 'bin/nutch analyze db 7'
And it fails here with three 'tmpsomething' directories and webdb.new
-Original Message-
From: Piotr Kosiorowski [mailto:[EMAIL PROTECTED]
Sent
Hello Diane,
There is a plugin to parse pdf files. You have to enable it in
nutch-site.xml (just copy the entry from nutch-default.xml).
You have to change the plugin.includes property to include the parse-pdf plugin:
[...] parse-(text|html) [...] to [...] parse-(text|html|pdf) [...]
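For illustration, the override in nutch-site.xml looks roughly like this (the surrounding plugin list below is only a typical example; copy the actual value from your release's nutch-default.xml and add |pdf inside the parse-(...) group):

```xml
<!-- nutch-site.xml: override plugin.includes so the parse-pdf plugin loads.
     The other plugin names here are illustrative; start from your own
     nutch-default.xml value. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html|pdf)|index-basic|query-(basic|site|url)</value>
</property>
```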
Regards
Piotr
Diane
It looks like it has a problem creating the Lucene lock file; I think it is
usually created in /tmp if you are running on Unix. Can you check
whether you can access it correctly?
Regards
Piotr
On 8/25/05, Jason Martens [EMAIL PROTECTED] wrote:
On Wed, 2005-08-24 at 18:20 -0700, Michael Ji wrote:
did
and then reindex?
-lucas
On Aug 25, 2005, at 11:28 AM, Piotr Kosiorowski wrote:
As I understand it, if you had parse-pdf disabled you have to reparse (and
then reindex) the segments. There is no standard way to do it (I think it
might be done with some tricks). The easiest way would be to refetch
it with pdf
Maybe I am not understanding correctly, but according to the README.txt
file included in the release pack I can't find the docs/en/tutorial.html and
docs/en/developers.html files.
Lukas
On 8/17/05, Piotr Kosiorowski [EMAIL PROTECTED] wrote:
Hi,
New Nutch release was prepared today. This is the first Nutch
Hello,
Pages from the WebDB are not deleted automatically. Nutch does not check
whether a page has inlinks during fetchlist generation, so an orphaned page
will be refetched. It will stop refetching the page if the page becomes
unavailable for some number of fetch attempts.
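A toy sketch of that retry behaviour (the threshold value and names are illustrative assumptions, not Nutch's actual settings):

```python
# Toy model: an orphaned-but-unavailable page keeps being generated for
# fetching until it has failed some number of attempts. The threshold of 3
# is made up for illustration.
MAX_RETRIES = 3

def should_generate(page):
    # The page stays in the fetchlist until it exceeds the failure limit.
    return page["failures"] < MAX_RETRIES

page = {"url": "http://gone.example.com/", "failures": 0}
for _ in range(5):
    if should_generate(page):
        page["failures"] += 1  # every fetch attempt fails in this sketch
print(page["failures"])  # 3 - after that the page is no longer generated
```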
Regards
Piotr
On 8/10/05, Raymond
Hello Ayyanar,
Please be more specific with your setup and problem description.
Recently I fetched a segment that contains 73GB of data, so I do
not think the size of your segment is a problem.
Regards
Piotr
On 8/8/05, Ayyanar Inbamohan [EMAIL PROTECTED] wrote:
Hi,
I have 4 GB data, which is
No, it is not possible. But you can search both if you merge the
segments or simply put them in one location.
Regards
Piotr
On 8/8/05, Ayyanar Inbamohan [EMAIL PROTECTED] wrote:
Hi all,
Is it possible to have multiple search.dir in
nutch-site.xml? Please reply immediately.
I want to
Hello,
I am not sure which way is better, but I would look at the dot:
original: http://([a-z0-9]*\.)*.(com|org|net|biz|edu|biz|mil|us|info|cc)/
modified: http://([a-z0-9]*\.)*(com|org|net|biz|edu|biz|mil|us|info|cc)/
In my opinion the dot before com, org etc. is already included in
([a-z0-9]*\.)* and
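The difference between the two patterns can be checked with a quick script (a sketch for illustration; Python's re semantics match the Java regex behaviour for these constructs):

```python
import re

# The unescaped "." in the original pattern must consume the character
# right before "com", so the original actually rejects ordinary host names;
# the modified pattern behaves as intended.
original = re.compile(r"http://([a-z0-9]*\.)*.(com|org|net|biz|edu|biz|mil|us|info|cc)/")
modified = re.compile(r"http://([a-z0-9]*\.)*(com|org|net|biz|edu|biz|mil|us|info|cc)/")

url = "http://www.example.com/"
print(bool(original.match(url)))  # False - the stray "." eats the "c" of "com"
print(bool(modified.match(url)))  # True
```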
If you have two search servers,
search1.mydomain.com
search2.mydomain.com
then on each of them run
./bin/nutch server 1234 /index
Now go to your tomcat box. In the directory where you used to have the
segments dir
(either the tomcat startup directory or the directory specified in the nutch config xml).
Create
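The truncated step presumably refers to listing the servers in a file; a sketch assuming the 0.7-era distributed-search convention of a search-servers.txt file with one "host port" pair per line (the file name and port here are assumptions, so check the docs for your release):

```text
search1.mydomain.com 1234
search2.mydomain.com 1234
```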
Hello Joe,
If you are using whole-web crawling, you should change regex-urlfilter.txt
instead of crawl-urlfilter.txt.
Piotr
On 7/28/05, Vacuum Joe [EMAIL PROTECTED] wrote:
I have a simple question: I'm using Nutch to do some
whole-web crawling (just a small dataset). Somehow
Nutch has gotten
Hello,
The first one will give you the number of pages in the WebDB, and not all of them
are indexed.
Regards,
Piotr
On 7/29/05, Erik Hatcher [EMAIL PROTECTED] wrote:
Two options:
bin/nutch readdb crawl/db -stats
or use Luke (Google for luke lucene) to open the Lucene index.
Erik
On
Hello,
I would rather suspect some misconfiguration of networking.
According to the JavaDoc:
InetAddress.getLocalHost() throws
UnknownHostException - if no IP address for the host could be found.
Regards
Piotr
On 7/28/05, blackwater dev [EMAIL PROTECTED] wrote:
Are you sure your urls file doesn't
Hello,
I assume the second counts are printed by some tool accessing the WebDB, right?
If so, 2 250 000 is the number of pages generated to be fetched (so all
fetched pages plus fetch attempts with errors), simply the total number of pages
in the segments. The second number is the number of Pages/Links in the WebDB -
Hello,
You can merge the segments for these two crawls using nutch mergesegs; in
fact you can simply copy all segment directories to one place. But it
would not be a full merge of the crawls, as right now there is no way to
merge the WebDBs for these two crawls. You can deduplicate it using nutch
dedup
Hello,
Please have a look at index-more and query-more plugins for content-type
handling.
Regards
Piotr
Vacuum Joe wrote:
I have been looking through the API docs and I can't
figure this out. Here is my question:
Is there a way to search based on meta-information,
such as content type, or even
Hello Kamil,
Do you want to generate a fetchlist with urls that are present in the WebDB
but were not fetched until now?
I am not sure what you are trying to achieve, but you can generate any
fetchlist you want using the latest tool by Andrzej Bialecki
(http://issues.apache.org/jira/browse/NUTCH-68)
Hello Otis,
If you are only reading ParseData and FetcherOutput from a nutch segment,
you do not need the Lucene index at all. So you can safely skip the -i switch.
Regards
Piotr
On 7/21/05, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:
Hello,
I'm using SegmentMergeTool to merge some large segments, and I
Hello,
Bryan Woliner wrote:
A couple more (basic) questions:
When FetchListEntry is called with the -dumpurls option, where does the
fetchlist get dumped, in what format, and how do I access it?
The list of urls is dumped to stdout (System.out). The format of a single line:
Recno record_number:
Hello Ferenc,
Some documentation on running ndfs can be found on wiki:
http://wiki.apache.org/nutch/NutchDistributedFileSystem
Regards,
Piotr
[EMAIL PROTECTED] wrote:
Is there any location with ndfs usage documentation?
Regards,
Ferenc