+1
On Jan 18, 2008 5:22 AM, Sami Siren [EMAIL PROTECTED] wrote:
Andrzej Bialecki wrote:
Hi all,
My opinion is that we should mark it EOL, and close all JIRA issues that
are relevant only to 0.7.x, with the status Won't Fix.
+1
--
Sami Siren
--
Jérôme Charron
Directeur
These guards were all introduced by a patch some time ago. I complained
at the time and it was promised that this would be repaired, but it has
not yet been.
Yes, sorry Doug, that's my own fault.
I really don't have time to fix this :-(
Best regards
Jérôme
On 2/13/07 11:17 AM, Jérôme Charron [EMAIL PROTECTED] wrote:
These guards were all introduced by a patch some time ago. I
complained
at the time and it was promised that this would be repaired, but it has
not yet been.
Yes, sorry Doug, that's my own fault.
I really don't have time to fix
I used an existing ThaiAnalyzer which was in the Lucene package.
OK - I renamed lucene.analysis.th.* to nutch.analysis.th.*, compiled, and
placed all class files in a jar - analysis-th.jar (do I need to bundle the
ngp file in the jar as well?)
1. You don't have to refactor the lucene analyzer.
OK. I was able to enable the language identifier plugin by adding the value
to the plugin.includes property in nutch-site.xml - but I'm not sure that
just by doing that I can have Thai text recognized and tokenized properly.
What else do I have to do? Please help me.
1. You must create a Thai NGP
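For reference, a minimal sketch of what the plugin.includes override in nutch-site.xml might look like; the plugin ids listed here are illustrative, not the actual default value:

```xml
<!-- nutch-site.xml (sketch): the plugin ids below are examples only;
     check the default plugin.includes value in nutch-default.xml. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|parse-(text|html)|language-identifier|analysis-th</value>
</property>
```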
I'm thinking about implementing the (draft) shared MIME database spec
[1] from freedesktop.org in Tika as a modern MIME magic implementation
to help automatically detect and handle the types of resources where
insufficient typing metadata is available. The specified typing
information also
What you probably mean is something equivalent to Unix strings(1). I
have a plugin that implements this, which I could contribute if there's
interest.
+1
Jérôme
--
http://motrech.free.fr/
http://www.frutch.org/
I have the same problem in a distributed environment! :-(
So I think I can confirm this is a bug.
Thanks for this feedback Stefan.
We should fix that.
What I suggest is simply to remove line 75 in the createJob method of
CrawlDb:
setInputPath(new Path(crawlDb, CrawlDatum.DB_DIR_NAME));
In
Hi,
I encountered some problems with Nutch trunk version.
In fact it seems to be related to the changes for Hadoop-0.4.0 and JDK 1.5
(more precisely since HADOOP-129 and the replacement of File by Path).
In my environment, the crawl command terminates with the following error:
2006-07-06
It seems to be a side effect of NUTCH-169 (remove static NutchConf).
Prior to this, the language identifier was a singleton.
I think we should cache its instance in the conf as we do for many other
objects
in Nutch.
Enrico, could you please create a JIRA issue?
Thanks
Jérôme
--
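A minimal sketch of the caching suggested above, with a plain Map standing in for Hadoop's Configuration; the class and method names are invented for illustration:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: cache a heavyweight object (e.g. a language identifier that loads
// n-gram profiles) in the per-job configuration, so each task reuses one
// instance instead of rebuilding it on every call. The Map stands in for
// Hadoop's Configuration; the full class name serves as a unique key.
public class ConfCache {

    // Placeholder for the expensive-to-build object.
    public static class LanguageIdentifier {
        LanguageIdentifier() { /* imagine: load n-gram profiles here */ }
    }

    public static LanguageIdentifier getIdentifier(Map<String, Object> conf) {
        String key = LanguageIdentifier.class.getName();
        LanguageIdentifier cached = (LanguageIdentifier) conf.get(key);
        if (cached == null) {
            cached = new LanguageIdentifier();   // built once per conf
            conf.put(key, cached);
        }
        return cached;
    }

    public static void main(String[] args) {
        Map<String, Object> conf = new HashMap<String, Object>();
        LanguageIdentifier a = getIdentifier(conf);
        LanguageIdentifier b = getIdentifier(conf);
        if (a != b) throw new AssertionError("expected the cached instance");
        System.out.println("same instance: " + (a == b));
    }
}
```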
As far as I can see, Nutch's HTML parser only supports the robots meta tag
(<meta name="ROBOTS" content="NOINDEX,NOFOLLOW">), but there
is an unofficial html noindex tag.
http://www.webmasterworld.com/forum10003/2703.htm
Hello Stefan,
Here is a previous discussion about this :
I don't think guards should be added everywhere.
That's right Doug.
It was a rough first pass on logging.
The next, finer pass will be done with NUTCH-310.
Rather, guards should only be added
in performance critical code, and then only for Debug-level output.
Info and Warn levels are
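The guard pattern under discussion might be sketched like this, using java.util.logging so the example is self-contained (Nutch itself uses Commons Logging, where the equivalent check is LOG.isDebugEnabled()):

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Guarded debug logging: the isLoggable() check skips the (possibly costly)
// message construction when debug output is disabled. The guard only pays
// off in performance-critical code; info/warn calls are rare and cheap.
public class LogGuard {
    private static final Logger LOG = Logger.getLogger(LogGuard.class.getName());

    static String expensiveDump() {
        // stands in for building a large debug string
        return "url=http://example.com/ status=fetched";
    }

    public static void main(String[] args) {
        // Guard only debug-level output in hot code paths...
        if (LOG.isLoggable(Level.FINE)) {
            LOG.fine("fetching: " + expensiveDump());
        }
        // ...plain calls are fine for the info/warn levels.
        LOG.info("fetch done");
    }
}
```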
I'm somewhat worried about a possible clash in the conf name-space -
usually, when we store Objects in a Configuration instance, we use their
full class name, or at least a long and most probably unique string. In
this case, we use just http, https, ftp, file and so on ...
Would it make sense if
There seem to be two log4j.properties files in the generated war; is this
intentional?
Not intentional. A side effect.
In fact, the first one is the one that comes from the conf dir (I will exclude it
from the war so that it will be clearer).
The second one (which overrides the first one) is the good one that
Hi,
I'm currently working on NUTCH-303 so that Nutch uses the Commons Logging facade
API with log4j as the default implementation. All the code is now switched to
the Commons Logging API, and I have
replaced some System.out and printStackTrace calls with Commons Logging.
To finalize
Is there an API doc or design doc that I can read to
understand where you are? Is the language plugin architecture
already in the main trunk?
The only available document is
http://wiki.apache.org/nutch/MultiLingualSupport
and sometimes I maintain this page
URL: http://svn.apache.org/viewvc?rev=411943view=rev
Log:
Updating to Hadoop release 0.3.1. Hadoop now uses Jakarta Commons
Logging, configured for log4j by default.
If log4j is now included in the core, we can remove the lib-log4j plugin.
If there is no objection, I will do it.
Jérôme
--
As far as I understand, Hadoop uses Commons Logging. Should we switch to
using Commons Logging as well?
Why not...
(but using Commons Logging doesn't preclude having a default implementation,
such as the log4j used by Hadoop).
You're right -- changing anything with the input (snippets length,
number of documents etc) will alter the clusters. This is basically how
it works. If you want clustering in your search engine then, depending
on the type of data you serve, you'll have to experiment with the
settings a bit and
Add 3. Clustering would benefit from a plain text version.
Yes Dawid, but it is already committed: the clustering now uses the plain
text version returned by the toString() method.
Dawid, I have a question about clustering.
Actually, the clustering uses the summaries as input. I assume it
(but if the nutch-site.xml overrides the plugin.includes property and doesn't
include it, it will not be activated, like any other plugin)
yes, that's what I meant; I guess that's the default case for people
hacking plugins.
Oh, yes Sami, I understand what you mean...
Sorry, I just forgot to
Bob Carpenter of alias-i had this to say when I brought up this very
idea:
http://article.gmane.org/gmane.comp.jakarta.lucene.devel/12599
Thanks for your response Marvin.
But finally my question is: shouldn't the Nutch clustering use some
fixed-size snippets instead of the configurable
This means there's no markup in the OpenSearch output?
Yes, no markup for now.
Shouldn't there be?
The restriction on the description field is: it can contain simple escaped HTML
markup, such as <b>, <i>, <a>, and <img> elements.
So, yeah, why not. We can add <b> around highlights.
What you and others
A String toString(Encoder, Formatter), like in Lucene's Highlighter, and
provide some basic implementations of Encoder and Formatter.
That sounds fine, but in the meantime, let's not reproduce the
html-specific code in lots of places. We need it in both search.jsp and
in
As far as I know, a lot of HTTP servers respond with chunked content, at
least all that return dynamically generated pages.
Should I file a bug?
Any thoughts?
In fact, the requests issued from the http plugin are HTTP 1.0, so the
servers should never return chunked content.
I think that the
Sorry I can't give more than an idea (I'm not a Java developer), but I think
the idea could prove useful.
The idea is to limit the length of sentences that get entered into the
index. So, after parsing a page, any words that don't make up what appears to
be a complete sentence get ignored.
Douglas,
I'm not so sure. When crawling Apache we had trouble with this feature.
Some HTML files that had an XML header, and that the server identified as
text/html, Nutch decided to treat as XML, not HTML.
Yes, the current version of the mime-type resolver is a crude one.
XML, HTML, RSS and all XML based
parse-oo plugin manifest is valid with plugin.dtd
Oops, I didn't catch that... Thanks!
No problem Andrzej.
It is just a cosmetic change, since the plugin.xml files are not validated at
runtime (it is on my todo list),
and the contentType and pathSuffix parameters are more or less deprecated.
Jérôme
Are you mainly concerned with charset in Content-Type?
Not specifically.
But while looking at these content-type inconsistencies, I noticed that there
are some possible
troubles with the charset in the content-type.
Currently, what happens when Content-Type exists in both the HTTP layer and in
a META tag
I'm not sure if that is the right thing.
If the site administrator did a poor job and a wrong media type is
advertised, it's the site's
problem and Nutch shouldn't be fixing it, in my opinion. Those sites would
not work properly with browsers anyway, and Nutch doesn't need to work
Is this just needed for references from javadoc? If so, then this can
be copied to build/docs, no?
Yes. Committed.
Jérôme
Hi all,
Just for fun, I have created a public nutch calendar on Google Calendar.
You can add it to your Google calendars or acces it via these URLs:
Feed URL is :
http://www.google.com/calendar/feeds/[EMAIL PROTECTED]/public/basic
ICAL URL is :
http://www.google.com/calendar/ical/[EMAIL
I would like to come back to this issue:
The Content object holds two content-types:
1. The raw content-type from the protocol layer (http header in case of
http) in the Content's metadata
2. The guessed content-type in a private field content-type.
When a ParseData object is created, it takes
Piotr, please keep oro-2.0.8 in pmd-ext
I do not agree here - we are going to make a new release next week and
releasing with two versions of oro does not look nice. oro is quite a
stable product and the changes are in fact minimal:
http://svn.apache.org/repos/asf/jakarta/oro/trunk/CHANGES
OK for
It seems there is an inconsistency in content-type handling in Nutch:
1. The protocol-level content-type header is added to the content's metadata.
2. The content-type is then checked/guessed while instantiating the Content
object and stored in a private field
(at this step, the Content object can
I found your idea very interesting, and would be interested in contributing to
the Parse Plugins Framework. I have developed a similar one using Lucene. The
project name is Lius.
Hi Rida,
Yes, I know Lius.
It seems very interesting, and I think it would be very interesting too
if we can merge our
that right now it is checking only main code (without plugins?).
Yes, that's correct -- I forgot to mention that. PMD target is hooked up
with tests and stops the build if something fails. I thought the core
code should be this strict; for plugins we can have more relaxed rules
-1
Since
My feeling was simply that the closer we are to Nutch-1.0, the more we need
some QA metrics (for us and for nutch users). No?
I absolutely agree Jérôme, really. It's just that developers usually
tend to hook up dozens of QA plugins and never look at what they output
(that's the usual
I will make it totally separate target (so test do not
depend on it).
+1
The goal is to allow other developers to play with pmd easily but at the
same time I do not want the build to be affected.
+1
I would also like to look at the possibility of generating cross-referenced HTML
code from
Do you guys have any additional insights / suggestions whether NUTCH-240
and/or NUTCH-61 should be included in this release?
NUTCH-240: I really like the idea, but for now I agree that its API is
still ugly. I would like to help in the coming weeks...
So for me it should not be included in
With code coverage... I don't know. It's
up to you guys -- you spend much more time on Nutch code than I do and
you know best what is needed and what isn't.
My feeling was simply that the closer we are to Nutch-1.0, the more we need
some QA metrics (for us and for nutch users). No?
Jérôme
PMD looks like a useful such tool:
http://pmd.sourceforge.net/ant-task.html
I would not be opposed to integrating PMD or something similar into
Nutch's build.xml. What do others think? Any volunteers?
+1 (Very configurable, very good tool!)
I'm reluctant to move the extension interface away from the parameter
and return value classes used by that interface.
I'm reluctant too... I asked, in case someone has a magic idea...
Could we instead add a
super-interface that all extension-point interfaces extend? That way
all of the
No, I don't think so. These are strongly related bundles of plugins.
When you change one chances are good you'll change the others, so it
makes sense to keep their code together rather than split it up. Folks
can still find all implementations of an interface in the javadoc, just
not always
I don't think it upside down. Plugins should not share packages with
core code, since that would permit them to use package-private APIs.
Also, re-arranging the code to make the javadoc nice is right, since the
javadoc is a primary means of describing the code.
Yes, but what I mean is that
I've implemented the spelling correction for the RSS Opensearch feed,
hopefully in keeping with the opensearch guidelines.
If this format is ok, I'll submit an optional patch alongside the
current one at http://issues.apache.org/jira/browse/NUTCH-48 .
+1
Jérôme
--
http://motrech.free.fr/
Besides that, we should maybe add a kind of timeout to the URL filter in
general,
since it can happen that a user configures a regex for his Nutch setup
that runs into the same problem as we just had.
Something like the attachment below.
Would you agree? I can create a serious patch and test it
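One possible shape for such a timeout, assuming we are willing to run each match on a worker thread; the class and method names here are invented for illustration:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.regex.Pattern;

// Sketch of a URL-filter timeout: run the regex match in a worker thread and
// give up after a deadline, treating a timed-out match as "rejected". A
// catastrophic pattern (e.g. (a+)+$ on a pathological URL) can otherwise
// stall the whole fetch.
public class TimedMatch {
    private static final ExecutorService POOL = Executors.newSingleThreadExecutor(
        r -> { Thread t = new Thread(r); t.setDaemon(true); return t; });

    public static boolean matches(Pattern p, String url, long timeoutMs) {
        Future<Boolean> f = POOL.submit(() -> p.matcher(url).matches());
        try {
            return f.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Best effort: a regex match is not really interruptible, which
            // is why the pool uses a daemon thread that won't block JVM exit.
            f.cancel(true);
            return false;        // treat a timed-out URL as filtered out
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) {
        Pattern ok = Pattern.compile("https?://.*");
        System.out.println(matches(ok, "http://example.com/", 1000)); // prints: true
    }
}
```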
If it were easy to implement all java regex features in
dk.brics.automaton.RegExp, then they probably would have. Alternately,
if they'd implemented all java regex features, it probably wouldn't be
so fast. So I worry that attempts to translate are doomed. Better to
accept the differences:
I updated to the latest SVN revision (385691) today, and I am now seeing a
NullPointerException in the AnalyzerFactory.java class.
Fixed (r385702). Thanks Chris.
NOTE: not sure if returning null is the right thing to do here, but hey,
at
least it made my crawl finish! :-)
It is the
It's not only faster, it also scales better for large and complex
expressions; it is also possible to build automata from several
expressions with AND/OR operators, which is the use case we have in the
regex urlfilter.
It seems awesome!
Does somebody plan to switch to this lib in Nutch?
Does
Thanks for volunteering, you're welcome ... ;-)
Good job Andrzej !;-)
So, that's now on my todo list: check the Perl5 compatibility issue and
provide some benchmarks to the community...
Jérôme
--
http://motrech.free.fr/
http://www.frutch.org/
I think algorithm #1 is what Google uses.
Google ignores content that does not change from page to page, as well
as content that isn't part of a block of text.
Are you sure?
Take a look at this search results:
It seems that the usage of AnalyzerFactory was removed while porting the Indexer
to map/reduce.
(AnalyzerFactory is no longer called in trunk code.)
Is this intentional?
(If not, I have a patch that I can commit, so thanks to confirm.)
Jérôme
--
http://motrech.free.fr/
http://www.frutch.org/
In fact, my first need was to be able to configure the boost for
RawFieldQueryFilter.
The idea is then to give the user better control of boost values by
simply:
* adding a setBoost(float) method to RawFieldQueryFilter.
* (adding a setLowerCase(boolean) method to RawFieldQueryFilter)
* adding some
In a distributed configuration one needs to rebuild the job jar each
time anything changes, and hence must check all plugins, etc. So I
would appreciate it if this didn't take quite so long.
Makes sense!
Here is my proposal. For each plugin:
* Define a target containing core (will be used when
Adding DOAP for Nutch. Contributed by Chris Mattmann.
Added:
lucene/nutch/trunk/site/doap.rdf
Modified:
lucene/nutch/trunk/src/java/org/apache/nutch/crawl/Crawl.java
lucene/nutch/trunk/src/java/org/apache/nutch/crawl/CrawlDb.java
On 3/3/06, Doug Cutting [EMAIL PROTECTED] wrote:
Jérôme Charron wrote:
Here is my proposal. For each plugin:
* Define a target containing core (will be used when building single
plugin)
* Define a target not containing core (will be used when building whole
code)
I will commit this as soon
This is something google does very well, and something nutch must match
to compete.
Richard, it seems you are a real pdf guru, so any code contribution to nutch
is welcome.
;-)
Regards
Jérôme
--
http://motrech.free.fr/
http://www.frutch.org/
Yes, but please do not cross-post - many of us are subscribed to both
groups, and we're getting multiple copies of your posts...
+1
I agree, this is inconsistent and should be changed. I think all places
should use -1 as a magic value, because it's obviously invalid.
+1
Richard, could you
Calling compile-core for every plugin makes builds really slow.
I was surprised that nobody complained about this... ;-)
I think it's safe to assume that the core has already been compiled before
the plugins are compiled. Don't you?
It just ensures that the last modified core version is
http://lucene.apache.org/nutch/apidocs/org/apache/nutch/parse/pdf/packa
ge-summary.html org.apache.nutch.parse.pdf (Nutch 0.7.1 API)
but I don't see it in the source of 0.7.1 that I downloaded.
I see it in CVS here:
http://cvs.sourceforge.net/viewcvs.py/nutch/nutch/src/plugin/parse-pdf/s
Putting in the well-formed version of the plugin code you provided generated
the following exception:
Is the nutch-extensionpoints plugin activated?
Sounds very good! I may have missed it - are you able to extract the
dependencies from the plugin.xml without hacking ant?
Yes, by using the xmlproperty task: it defines a property for each path
found in the xml document
( http://ant.apache.org/manual/CoreTasks/xmlproperty.html )
Jérôme
--
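For illustration, a fragment of what reading a plugin.xml with the xmlproperty task might look like; the resulting property names depend on the actual structure of the plugin.xml being read:

```xml
<!-- Sketch: load a plugin.xml so its values become ant properties.
     With collapseAttributes="true", an element such as
     <plugin id="parse-html"> yields the property ${plugin.id}. -->
<xmlproperty file="plugin.xml" collapseAttributes="true"/>
<echo message="building plugin ${plugin.id}"/>
```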
Yes, there is an easier way. Implement a custom task to which you'll
pass a path to plugin.xml and a name for a path. The task (Java code)
will create a named (id) path object which can subsequently be used in
ant with <classpath refid="xxx" />.
This requires a custom ant task, but as you
maybe you will find this interesting too:
http://maven.apache.org/using/multiproject.html
Thanks Stefan.
Maven seems to be a really good software project management tool.
But for now, I don't plan to migrate to Maven...
(I don't have enough knowledge about it and so I don't have a good overview
There are a number of duplicated libs in the plugins, namely:
Isn't it already reported in http://issues.apache.org/jira/browse/NUTCH-196?
I have still provided a patch for a log4j lib.
If there is no objection, I will commit it and go ahead for
* lib-commons-httpclient
* lib-nekohtml
Jérôme
Hi all,
I just noticed an inconsistency when there is a parsing failure:
1. The Fetcher returns an empty ParseImpl instance (it contains no metadata,
especially no SEGMENT_NAME_KEY and SIGNATURE_KEY).
2. Then, the Indexer tries to add the segment and digest fields from the
metadata keys
Hi,
I have made some experiments with the 3.0-alpha1 version of Jakarta POI
(used by parse-msword and parse-mspowerpoint).
Since this version contains the hwpf package, it enables parsing MS Word
documents too (the actual version in the lib-jakarta-poi plugin doesn't contain
this package).
The benefit
Is this happening with the latest code?
Yes.
But by looking in the svn repository... it is my fault... sorry
(NUTCH-139).
I'll fix that right now.
Thanks
Jérôme
--
http://motrech.free.fr/
http://www.frutch.org/
Hi,
It seems that the javaswf.jar lib was built using JDK 1.5:
class file has wrong version 49.0, should be 48.0
Did I miss something, or should Nutch still be compiled using JDK 1.4.x?
Please confirm, so that I can commit a new javaswf.jar built with JDK 1.4
Regards
Jérôme
--
+1
On 2/1/06, Stefan Groschupf [EMAIL PROTECTED] wrote:
+1
Am 01.02.2006 um 22:35 schrieb Andrzej Bialecki:
Hi,
I just found out that it's not possible to invoke main() methods of
plugins through the bin/nutch script. Sometimes it's useful for
testing and debugging - I can do it
Please use JIRA (http://issues.apache.org/jira/browse/NUTCH) - create a
new issue and attach the file.
Perhaps you can use this already existing issue
http://issues.apache.org/jira/browse/NUTCH-23
Jérôme
--
http://motrech.free.fr/
http://www.frutch.org/
Is it reasonable to guess language info from the target server's geographical
info?
Yes, it could be another clue for guessing the language.
But the problem is then to find out how to use all these clues.
For instance, the actual solution is the easiest one, but certainly not the
most efficient one:
For
Any plan to implement this? I mean move the LanguageIdentifier class
into the Nutch core.
As I already suggested on this list, I really would like to move the
LanguageIdentifier class (and profiles) to
an independent Lucene sub-project (and the MimeType repository too).
I don't remember why but
+1. Other local modifications which I use frequently:
* exporting a list of supported languages,
* exporting an NGramProfile of the analyzed text,
* allow processing of chunks of input (i.e.
LanguageIdentifier.identify(char[] buf, int start, int len) ) - this is
very useful if the text to
I am wondering whether the Analyzer of Nutch in svn trunk is chosen by the
languageidentifier plugin or not. (I know that in nutch 0.7.1-dev it was.)
It's not really chosen by the languageidentifier, but chosen based on the
value of the lang attribute (for now, that's right, only the
languageidentifier adds this
the following code would fail in case the meta tags are in upper case:
Node nameNode = attrs.getNamedItem("name");
Node equivNode = attrs.getNamedItem("http-equiv");
Node contentNode = attrs.getNamedItem("content");
This code works well, because Nutch HTML Parser uses Xerces
Hi Stefan,
No, in fact I have refactored the code of the protocol-http plugins, not the
html parser.
So, I don't think the log4j error comes from this code.
Regards
Jérôme
--
http://motrech.free.fr/
http://www.frutch.org/
I have the same problem too.
I don't understand what happens.
In fact, the CommandRunner returns a -1 exit code, but there is nothing in the
error output and the good string is in the standard output ("nutch rocks nutch
rocks nutch rocks").
All seems to be OK but the exit code.
Jérôme
On 1/9/06, Piotr
... in fact, not really... really unrelated !!!
I'll remove it immediately.
Thanks
On 1/9/06, Doug Cutting [EMAIL PROTECTED] wrote:
[EMAIL PROTECTED] wrote:
--- lucene/nutch/trunk/src/plugin/build.xml (original)
+++ lucene/nutch/trunk/src/plugin/build.xml Sun Jan 8 16:13:42 2006
@@ -6,13
A related issue is that these two plugins replicate a lot of code. At
some point we should try to fix that. See:
http://www.nabble.com/protocol-http-versus-protocol-httpclient-t521282.html
I have begun working on this. Nobody else? Can I go on?
Jérôme
--
http://motrech.free.fr/
Excuse me in advance, I probably missed something, but what are the use
cases for having many NutchConf instances with different values?
Running many different tasks in parallel, each using different config,
inside the same JVM.
Ok, I understand this Andrzej, but it is not really what I call
Andrzej,
How do you choose the NutchConf to use ?
Here is a short discussion I had with Doug about a kind of dynamic NutchConf
inside the same JVM:
... By looking at the mailing lists archives it seems that having some
behavior depending on the documents URL is a recurrent problem (for instance
Thanks for the fast response,
Do you know where I can find a compressed version?
Here are the nightly builds:
http://cvs.apache.org/dist/lucene/nutch/nightly/
Regards
Jérôme
--
http://motrech.free.fr/
http://www.frutch.org/
Just continue voting; I will continue with my tally sheet. :-)
Why not create a wiki page... so that you don't have to do this tedious
work.
Jérôme
What do people think if we collect a list of issues and do a voting
iteration?
+1
Hi,
I would like to remove all the hard-coded content-type checks spread over
all the parse plugins.
In fact, the content-type/plugin-id mapping is now centralized in the
parse-plugins.xml file, and there's no
more need for the parsers to check the content-type.
The basic idea was:
1. The
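For illustration, a fragment of what such a centralized mapping in parse-plugins.xml might look like; the element names here are a sketch from memory, not a spec:

```xml
<!-- Sketch of parse-plugins.xml: map each content type to the parser
     plugin(s) that should handle it, in order of preference. -->
<parse-plugins>
  <mimeType name="text/html">
    <plugin id="parse-html" />
  </mimeType>
  <mimeType name="application/pdf">
    <plugin id="parse-pdf" />
  </mimeType>
</parse-plugins>
```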
+1
A simple solution that provides a standard way to access common meta data.
Great!
--
http://motrech.free.fr/
http://www.frutch.org/
+1 for a 0.7.2 release.
Here are the issues/revisions I can merge to 0.7 branch.
These changes mainly concern the parser-factory changes (NUTCH-88)
http://issues.apache.org/jira/browse/NUTCH-112
http://issues.apache.org/jira/browse/NUTCH-135
http://svn.apache.org/viewcvs.cgi?rev=356532view=rev
The total number of hits (approx) is 2,780,000,000. BTW, I find it
curious that the last 3 or 6 digits always seem to be zeros ... there's
some clever guesstimation involved here. The fact that Google Suggest is
able to return results so quickly would support this suspicion.
For more
Suggestion:
For consistency, and ease of Nutch management, why not filter the
extensions based on the activated plugins?
By looking at the mime-types defined in the parse-plugins.xml file and at the
activated plugins, we know which content-types will be parsed.
So, by getting the file
Sounds really good (and it is requested by a lot of nutch users!).
+1
Jérôme
On 12/1/05, Doug Cutting [EMAIL PROTECTED] wrote:
Matt Kangas wrote:
#2 should be a pluggable/hookable parameter. high-scoring sounds like
a reasonable default basis for choosing recrawl intervals, but I'm sure
Right, but the URL filters run long before we know the mime type, in
order to try to keep us from fetching lots of stuff we can't process.
The mime type is not known until we've fetched it.
Yes, the fetcher can't rely on the document mime-type.
The only thing we can use for filtering is the
Are we talking about parsing RDF, or discussing storing parsed HTML
text in RDF and converting it via XSLT to pure text?
I may have misunderstood something. I really like the idea of a general RDF
parser. Back in the day I played around with jena.sf.net
Parsing yes; replacing the nutch sequence file and the
I would be disappointed by this move - language identifier is an
important component in Nutch. Now the mere fact that it's bundled with
Nutch encourages its proper maintenance. If there is enough drive in
terms of willingness and long-term commitment it would make sense to
move it to a
Yes, Lucene is the best fit for what you're after. Nutch is built on
Lucene, and adds web crawling on top. You don't need a web crawler,
so using Lucene directly is the best fit - of course you'll have to
write code to integrate Lucene.
Erik,
I was thinking about it for a while, but don't
Hmmm... so that means if we want to customize logging,
it would potentially be per plugin?
Perhaps a common logger would at least make some degree
of sense.
I really think it makes sense.
When I fixed the issue about plugin dependencies, I began to create a log4j
plugin
in order to remove
There is one potential problem that I see -- Nutch plugins require
explicit JAR references. If you want to switch between algorithms you'll
need to either put all Carrot2 JARs in the descriptor, put them in
CLASSPATH before Nutch starts or do some other trickery with class
loading.
Only
I think it would be neat to have the NutchAnalyzer also as a plugin. From my
understanding, right now if I want to analyze in a different way, I need to
hack the Nutch source code; if we are going to have different plugins for
different analyzers, that will help. Some specific applications may use
I read about the MultiLingualSupport, but I didn't see it in the
repository. I think it is cool.
The analyzer extension point is defined by the Analyzer abstract class:
http://svn.apache.org/viewcvs.cgi/lucene/nutch/trunk/src/java/org/apache/nutch/analysis/NutchAnalyzer.java
The default analyzer