and predefined settings for ngramsizes etc ) categorizer from
factory and tell it to do its job when needed.
--
Sami Siren
didn't check out your version yet, but I have also written
a version which is read/write capable; should we combine our efforts here?
--
Sami Siren
help lots of people to set up a site search.
I am also available for this if help is required.
--
Sami Siren
John X wrote:
On Thu, Jan 26, 2006 at 12:19:38PM -0800, Doug Cutting wrote:
John X wrote:
Please count me in.
Thanks, John.
My pleasure.
I forgot to mention that I'd prefer
();
+impl.setConf(conf);
+ } catch (Exception e) {
should there be a
conf.setObject(clazz,impl);
inside that try ?
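
The caching idea raised in the question above can be sketched as follows. This is only an illustration under stated assumptions: SimpleConf, Configurable, ExampleImpl and getOrCreate are hypothetical stand-ins written for this example, not the Nutch classes, and the real patch may differ.

```java
import java.util.HashMap;
import java.util.Map;

class CachingFactory {
    // Returns the cached instance for clazz, creating and caching it on first use.
    static <T extends Configurable> T getOrCreate(SimpleConf conf, Class<T> clazz) {
        String key = clazz.getName();
        Object cached = conf.getObject(key);
        if (cached != null) {
            return clazz.cast(cached);
        }
        try {
            T impl = clazz.getDeclaredConstructor().newInstance();
            impl.setConf(conf);
            // Cache inside the try, so a failed construction is never stored.
            conf.setObject(key, impl);
            return impl;
        } catch (Exception e) {
            throw new RuntimeException("Cannot instantiate " + key, e);
        }
    }

    public static void main(String[] args) {
        SimpleConf conf = new SimpleConf();
        ExampleImpl first = getOrCreate(conf, ExampleImpl.class);
        ExampleImpl second = getOrCreate(conf, ExampleImpl.class);
        System.out.println(first == second); // same cached instance both times
    }
}

// Hypothetical stand-in for a configuration that can hold arbitrary objects.
class SimpleConf {
    private final Map<String, Object> objects = new HashMap<>();
    Object getObject(String key) { return objects.get(key); }
    void setObject(String key, Object value) { objects.put(key, value); }
}

// Hypothetical stand-in for an interface whose implementations receive a configuration.
interface Configurable {
    void setConf(SimpleConf conf);
}

// Hypothetical implementation used only to exercise the factory.
class ExampleImpl implements Configurable {
    SimpleConf conf;
    public void setConf(SimpleConf conf) { this.conf = conf; }
}
```

Storing the instance only after setConf() has succeeded keeps half-initialized objects out of the cache, which is the point of placing the call inside the try.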
--
Sami Siren
the next 12 hours.
I guess I am a bit late with this...
The goal of these changes is very good but I don't like the idea of
duplicating identical code (implementing the interface NutchConfigure)
instead of inheritance (extending NutchConfigured) in so many places.
--
Sami Siren
I apologize for stomping on your work a bit!
No problem at all - all this means is that we're going forward :)
--
Sami Siren
to do before it will reach
a form of a patch).
--
Sami Siren
much effort into this if it will soon be
obsolete. But if a small effort will give folks "did you mean"
support in the interim, that's not a bad thing. Of course, folks can
always apply this patch themselves...
Agreed, perhaps I meant to say that I will not apply it ;)
--
Sami Siren
place. If we
subsequently add a more general API then we could re-implement the
toHtml() method using that API, but I think a generic toHtml() method
will be useful for quite a while yet.
+1
--
Sami Siren
Jérôme Charron wrote:
(but if the nutch-site.xml overrides the plugin.include property and doesn't
include it, it will not be activated, like any other plugin)
yes, that's what I meant, I guess that's the default case for people
hacking plugins.
--
Sami Siren
[EMAIL PROTECTED] wrote:
Spotted a reference to NutchReferehPolicy(); :
EntryRefreshPolicy policy=new NutchReferehPolicy();
Typo?
Yes it was, thanks for keeping your eyes open.
--
Sami Siren
Otis
It emulates a feature with the same name from the Google appliance.
http://www.google.com/enterprise/mini/end_user_features.html
--
Sami Siren
[EMAIL PROTECTED] wrote:
Hi,
What exactly does this plugin do? I haven't seen it mentioned and the
README.txt doesn't really describe it.
Thanks,
Otis
hmm... didn't think about that, are there more opinions about this?
--
Sami Siren
Are you sure there is no trademark infringement here? Perhaps we
should call it something else, just to avoid any potential legal
unpleasantries ...
Piotr,
is there a reason why this (among other) documentation (for all relevant
versions) could not be maintained in trunk?
--
Sami Siren
Piotr Kosiorowski wrote:
Andrzej Bialecki wrote:
+1, yes it would be really confusing. Since there are more and more
people trying 0.8, could we
and that would
give us more feedback about the overall quality.
If there is a consensus about this I can volunteer to be the RM.
--
Sami Siren
There seem to be two log4j.properties files in the generated war; is this
intentional? However, it works just fine.
jar -tf nutch-0.8-dev.war |grep log4j.properties
WEB-INF/classes/log4j.properties
WEB-INF/classes/log4j.properties
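
The same check as the jar/grep listing above can be sketched in Java. This is a minimal illustration, not part of the Nutch build: the class and method names are made up for this example, and the archive is read sequentially with ZipInputStream so that repeated entry names are reported rather than collapsed.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

class DuplicateEntries {
    // Returns each name that occurs more than once, in the order repeats are first detected.
    static List<String> findDuplicates(List<String> names) {
        Set<String> seen = new HashSet<>();
        Set<String> duplicates = new LinkedHashSet<>();
        for (String name : names) {
            if (!seen.add(name)) {
                duplicates.add(name);
            }
        }
        return new ArrayList<>(duplicates);
    }

    // Reads entry names one after another from a zip/jar/war stream.
    static List<String> entryNames(InputStream in) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
                names.add(e.getName());
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: DuplicateEntries <archive>");
            return;
        }
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            for (String name : findDuplicates(entryNames(in))) {
                System.out.println("duplicate: " + name);
            }
        }
    }
}
```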
--
Sami Siren
Jerome Charron (JIRA) wrote:
[ http
That's good news.
Sami, I have not made changes to web2. Do you want me to switch web2 to
Commons Logging?
I am already working with it and unfortunately facing some classloading
issues.
Hopefully the solution will come up sooner than later.
--
Sami Siren
using some kind of
component container then there.
--
Sami Siren
too small for larger jobs)
--
Sami Siren
://nutch.cvs.sourceforge.net/nutch/nutch/src/plugin/parse-rtf/lib/
--
Sami Siren
inject on a single machine(linux)
configuration, local fs without problems ).
--
Sami Siren
Gal Nitzan wrote:
To get the same behavior, just try to inject to a new crawldb that doesn't
exist.
The reason many don't get it is that crawldb already exists in their
environment.
true, I was injecting to existing crawldb.
--
Sami Siren
InputFormat.areValidInputDirectories(). The former is probably
easier. I've attached a patch. Does this fix things for folks?
Patch works for me.
--
Sami Siren
= ((CharacterData)servlet).getData().trim();
String urlPattern = ((CharacterData)pattern).getData().trim();
What is the compilation error you are seeing and in what environment
(os, jvm)?
--
Sami Siren
0.8 has the subcollection plugin. It can add a subcollection id for a set of urls
and then you can limit searching to subcollections. Is that what you're
after?
--
Sami Siren
Stefan Neufeind (JIRA) wrote:
[ http://issues.apache.org/jira/browse/NUTCH-271?page=comments#action_1246
look at it :)
don't be polite, just as polite as it's required
I'm ok with the original logic.
--
Sami Siren
I will start working on the 0.8 release next Monday or Tuesday. If
there are no surprises I will post another message when the package is
available for testing.
--
Sami Siren
Doug Cutting wrote:
+1
Piotr Kosiorowski wrote:
+1.
P.
Andrzej Bialecki wrote:
Sami Siren wrote:
How would folks
?
--
Sami Siren
.
--
Sami Siren
Andrzej Bialecki wrote:
Sami Siren wrote:
There is a package available for testing in
http://people.apache.org/~siren/nutch-0.8/
please give it some testing and post in your opinion - is it good
enough to be a public release?
I have some doubts because of NUTCH-266, but so far only 3
fetcher.
+1
--
Sami Siren
most probably you have run out of space in the tmp (local) filesystem
use properties like
<property>
<name>mapred.system.dir</name>
<value><!-- path to fs that contains lots of space --></value>
</property>
<property>
<name>mapred.local.dir</name>
<value><!-- path to fs that contains lots of space --></value>
Uroš Gruber wrote:
Andrzej Bialecki wrote:
Sami Siren (JIRA) wrote:
I am not sure what you refer to by this 3-4 sec, but yes, I agree
there are more aspects to optimize in fetcher; what I was firstly
concerned about was the fetching IO speed, which was getting ridiculously
low (not quite sure
</value>
<description>This number is the maximum number of threads that
should be allowed to access a host at one time.</description>
</property>
--
Sami Siren
Jérôme Charron wrote:
What you probably mean is something equivalent to Unix strings(1). I
have a plugin that implements this, which I could contribute if there's
interest.
+1
hmm.. strings on a couple of randomly selected pdfs gives me content I
wouldn't wanna search against.
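
For readers unfamiliar with strings(1), the idea can be sketched roughly like this. It is an assumption-laden illustration only: the class name, the 4-character minimum and the definition of "printable" are choices made for this example, and it is not the plugin mentioned above.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class PrintableStrings {
    // Collects runs of at least minLength printable ASCII bytes (0x20..0x7E, plus tab).
    static List<String> extract(byte[] data, int minLength) {
        List<String> result = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        for (byte b : data) {
            boolean printable = (b >= 0x20 && b <= 0x7E) || b == '\t';
            if (printable) {
                run.append((char) b);
            } else {
                // A non-printable byte ends the current run; keep it only if long enough.
                if (run.length() >= minLength) {
                    result.add(run.toString());
                }
                run.setLength(0);
            }
        }
        if (run.length() >= minLength) {
            result.add(run.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        byte[] sample = "%PDF\u0000\u0001Hello world\u0002ab\u0003"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(extract(sample, 4)); // prints [%PDF, Hello world]
    }
}
```

On compressed PDF streams such a scan returns mostly structural keywords and noise, which matches the objection above about not wanting to search against that content.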
--
Sami
small part of the html content even for every request.
It might be a good idea to extend current functionality with some kind of
tagging of redundant (by content) urls in webdb to prevent them from
being fetched again.
--
Sami Siren
) classpath settings hell will be over which could help naive
users (like me) a lot.
(and silently I hope that what I have said now holds for M2 too)
Just for clarification when I was talking about maven I meant maven 2,
maven 1 is a dead end.
--
Sami Siren
of code
monolithically packaged that we have identified are of three kinds:
IMO to solve the main problem one does not need to set up another
project, just refactor and repackage.
--
Sami Siren
with
and ' in their name (currently I check for ' and change the href
quotes). Same problem for file://
There could perhaps be a different crawler implementation to crawl local
filesystem and these shared windows resources (and perhaps webdav too)
efficiently.
--
Sami Siren
to me. Can you submit a patch?
--
Sami Siren
doing this, but I don't think I have
the permissions to do so.
I am not able to do it either, or then I just don't know how, can Doug
help us here?
--
Sami Siren
, but if you can submit a patch
(perhaps with testcase that demonstrates this) then it will be easier to
act on.
--
Sami Siren
for having multiple values for single key?
--
Sami Siren
Andrzej Bialecki wrote:
Sami Siren wrote:
Andrzej Bialecki wrote:
Jukka Zitting wrote:
The Parser interface is also bound to the ideas of fetching content
from the network and indexing it using a standard content model
through the Content and Parse dependencies. For the Tika project I'd
like
classCastException.
Why do I get this exception? I looked at old sources but didn't find
distinctions in algorithm. What do I miss?
Nutch is not compatible with latest hadoop from svn.
--
Sami Siren
FYI
I'll roll out nutch 0.8.1 later this week to release a fix for a couple of
severe problems in 0.8.
--
Sami Siren
Andrzej Bialecki wrote:
Sami Siren wrote:
FYI
I'll roll out nutch 0.8.1 later this week to release a fix for a couple of
severe problems in 0.8.
There are a couple issues that have to make it into this release,
related to serious bugs in scoring - I plan to commit them by the end of
the week
- not saying we should keep up but every now and then there
exist bugs that need to get fixed.
--
Sami Siren
I just wrapped up 0.8.1 release, sneak preview is temporarily available
at http://people.apache.org/~siren/nutch-0.8.1/
I'll update the website and announce it after it has hit the mirrors and
nothing serious is found in it in the following 48 hrs.
--
Sami Siren
Andrzej Bialecki wrote
for 0.9 in
near (let's say - a few months) future?
There are no concrete plans on 0.9.0 yet. IMO committing fixes to the 0.8
branch is worth the effort as long as we do not know better; with new
features to 0.8 I would think twice.
What do others think?
--
Sami Siren
Example: we should upgrade branch 0.8 to use hadoop-0.5, but we will
Wouldn't this change the requirement for java version also from 1.4 to 5?
--
Sami Siren
was also logged on jira issue
http://issues.apache.org/jira/browse/NUTCH-360
--
Sami Siren
I quickly screened through the massive patch, didn't pick anything
special except the stuff from clustering-carrot2, formatting changes?
+1
--
Sami Siren
Andrzej Bialecki wrote:
Andrzej Bialecki (JIRA) wrote:
[ http://issues.apache.org/jira/browse/NUTCH-383?page=all ]
Andrzej
Sami Siren (JIRA) wrote:
[[ Old comment, sent by email on Sun, 06 Aug 2006 08:06:13 +0300 ]]
looks like somebody just enabled email-to-jira-comments-feature. I was
just wondering would it be good to use this feature more widely.
This could be achieved by removing the replyto header
without them?
--
Sami Siren
the rules are loaded from. Thanks
This is done for all plugins implementing Configurable interface in
org.apache.nutch.plugin.Extension at line #162.
--
Sami Siren
What kind of hardware are you running on? Your pages per sec ratio seems
very low to me.
How big was your crawldb when you started and how big was it at end?
What kind of filters and normalizers are you using?
--
Sami Siren
AJ Chen wrote:
I checked out the code from trunk after Sami
could use something like:
SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
Text url = new Text();            // key
Content content = new Content();  // value
try {
  while (reader.next(url, content)) {
    // now just use url and content the way you like
  }
} finally {
  reader.close();  // release the underlying file when done
}
--
Sami Siren
protocol
field keys may need spell checking services.
If there's a real need for spell checking on other keys one can just add
more classes to the array no big deal.
--
Sami Siren
plan to include them it would be nice to include also version for
windows, anyone capable of building those?
--
Sami Siren
[1] http://blog.foofactory.fi/
Could you please create a JIRA issue and attach this patch there so it
won't get lost. It also helps to keep the CHANGES file up to date as you
can just copy-paste from there when you do a commit.
--
Sami Siren
Brian Whitman wrote:
The parse-mp3 plugin seems to be saving a state of the previous
available.
--
Sami Siren
and it _should_ fix that problem.
--
Sami Siren
Scott Green wrote:
Thanks Dennis! Your method should work.
And I really hope there is a direct method, say getPluginRootDir(),
in the plugin implementation.
I'd recommend taking path shown by Andrzej because IMO it's bad design
to depend on plugin system from a plugin.
--
Sami Siren
NUTCH-87 Efficient site-specific crawling for a large number of sites
Are there any opinions about issues that should go in before the next
release (Answering yes means that you are willing to provide a patch for
it).
--
Sami Siren
on that! Why not target it to
trunk version of Nutch?
- a web server to serve plugin jsp's
Why not make it a regular war? also please consider making a clean
separation of view/logic when you implement the web ui.
--
Sami Siren
intrusive to fix just before the release - and needs
additional discussion.
+1
NUTCH-68 A tool to generate arbitrary fetchlists
Easy to port this to 0.9.0 - I can do this.
cool.
I'll start working on the headers and stuff to get the blocking issue away.
--
Sami Siren
in
SequenceFile.Sorter.MergeQueue
However, I cannot find in the hadoop change logs which change it is
that is causing nutch these problems.
--
Sami Siren
However, I cannot find in the hadoop change logs which change it is
that is causing nutch these problems.
It's HADOOP-331, so i guess at least the changes/additions in map() is
required.
--
Sami Siren
functionality but different code base.
--
Sami Siren
to get going. If people think this is the right
direction and it goes beyond talk then perhaps after that we could start
talking about separate project.
--
Sami Siren
Brian Whitman wrote:
On Jan 21, 2007, at 6:47 AM, Sami Siren wrote:
However, I cannot find in the hadoop change logs which change it is
that is causing nutch these problems.
It's HADOOP-331, so i guess at least the changes/additions in map() is
required.
Hi, just following up
Gal Nitzan wrote:
Got it. I used latest trunk for a few hours and it seems that it changed the
version of Crawldatum to ver. 5 :(
yes, version is updated on write
) segment(s).
--
Sami Siren
Gal Nitzan wrote:
Thanks Sami,
By redo do you mean re-parse or re-fetch + re-parse
generate - fetch - parse
--
Sami Siren
on a page whose text was from the page. Product
search too, perhaps.
These are excellent points I am totally +1 for the api change, it opens
doors for a lot of new possible applications.
--
Sami Siren
experiences from running it on
reasonable sized crawls, so my suggestion is: don't decide this on
paper.
--
Sami Siren
it as is and release 0.9.0.
--
Sami Siren
the expression, now it uses a shell built-in - I'm
not sure if these two follow the same evaluation rules on all supported
platforms ... Please revert it to the earlier syntax.
revert it so it doesn't work on linux, are you sure?
--
Sami Siren
something?)
--
Sami Siren
have been discussed first.
I reverted it and reopened NUTCH-432.
--
Sami Siren
!
--
Sami Siren
Andrzej Bialecki wrote:
Sami Siren wrote:
How did the code end up in this place on Linux? The $cygwin condition
should have prevented that, because it evaluates to true only on Cygwin,
where this utility is required to translate the paths.
You also changed the if syntax - before it was using
I changed the if to
if [ test $cygwin -a X${JAVA_LIBRARY_PATH} != X ]; then
JAVA_LIBRARY_PATH=`cygpath -p -w $JAVA_LIBRARY_PATH`
fi
it works, ok for me.
eh, forget that part :)
--
Sami Siren
.
--
Sami Siren
2007/3/21, Andrzej Bialecki [EMAIL PROTECTED]:
Any other stuff we need to fix before the release?
I am satisfied except the broken bin/nutch.
Fixed now - tested both under Cygwin and Fedora.
Thanks, I can confirm that it works now :)
--
Sami Siren
for me it works:
...
BUILD SUCCESSFUL
Total time: 4 minutes 3 seconds
--
Sami Siren
2007/3/21, Andrzej Bialecki [EMAIL PROTECTED]:
Dennis Kubes wrote:
I am good to go as well.
Hmm ... Test suite fails for me, with a cryptic message (cryptic because
the plugin test itself succeeds
2007/3/21, Andrzej Bialecki [EMAIL PROTECTED]:
Sami Siren wrote:
for me it works:
...
BUILD SUCCESSFUL
Total time: 4 minutes 3 seconds
I did a fresh checkout to an empty dir, rebuilt and it's still failing -
perhaps you have some uncommitted changes in your working copy ... ?
no, I
things are working fine. Just not indexing.
Can you please check the log files for more specific error message(s).
Indexing works ok for me but I have only tried it with small segments so
far.
--
Sami Siren
Steve Severance wrote:
-Original Message-
From: Sami Siren [mailto:[EMAIL PROTECTED]
Sent: Thursday, March 22, 2007 4:27 PM
To: nutch-dev@lucene.apache.org
Subject: Re: indexing with current trunk
Are you running on localrunner or distributed mode? If distributed then
check
benching to indexing and searching too.
[1] http://blog.foofactory.fi/2007/03/twice-speed-half-size.html
--
Sami Siren
(if it is
out by then ;)
--
Sami Siren
-archive.com/dev@jackrabbit.apache.org/msg04641.html)
--
Sami Siren
rest of lucene sub projects. To create
it in similar format as the rest of lucene one could use
md5sum file > file.md5
We should probably adopt the same convention, or wdot?
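
For reference, the line format that md5sum writes (hex digest, two spaces, file name) can be reproduced along these lines. This is a sketch only: the class and method names are invented for this example, and whether the other Lucene sub-projects use exactly this layout is an assumption taken from the message above.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class Md5Line {
    // Hex-encodes the MD5 digest of data as 32 lowercase characters.
    static String md5Hex(byte[] data) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(data);
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Formats a line the way md5sum prints it in text mode: digest, two spaces, file name.
    static String md5sumLine(byte[] data, String fileName) {
        return md5Hex(data) + "  " + fileName;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.out.println(md5sumLine("abc".getBytes(StandardCharsets.US_ASCII), "example.txt"));
            return;
        }
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(md5sumLine(data, args[0]));
    }
}
```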
--
Sami Siren
the sum itself is obviously the same :) The point in this is to use the same
conventions in the Lucene family, not strictly required, but still IMO it just
looks better.
--
Sami Siren
three binding +1 votes are cast.
[X] +1 Release the packages as Apache Nutch 0.9
--
Sami Siren
promoted).
What is the benefit of using a branch before a release?
--
Sami Siren
2007/3/29, Andrzej Bialecki [EMAIL PROTECTED]:
Sami Siren wrote:
IMO we should have had a 0.9-rc1 tag, apply patch to trunk, have
0.9-rc2 tag and so on until we are satisfied.
Then when we're actually satisfied create tag for 0.9 (copy from rc
that got promoted).
What is the benefit
2007/3/29, Andrzej Bialecki [EMAIL PROTECTED]:
Sami Siren wrote:
2007/3/29, Andrzej Bialecki [EMAIL PROTECTED]:
Sami Siren wrote:
IMO we should have had a 0.9-rc1 tag, apply patch to trunk, have
0.9-rc2 tag and so on until we are satisfied.
Then when we're actually satisfied create
rewriting the how to release
page in wiki.
--
Sami Siren