Welcome to the world of post 1.3 Nutch ;)
On Thursday, February 21, 2013, Amit Sela am...@infolinks.com wrote:
I basically just built with ant and copied the contents of deploy (job file
+ nutch and crawl scripts) to the nutch folder in my hadoop-user directory
on the master.
I changed the crawl
by gora 0.2.1. I am actually using
0.90.6 for my hbase, but I don't know how to modify the ivy.xml file to accomplish that.
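A hedged sketch of what such an ivy.xml override could look like (the org and conf values here are assumptions based on standard Ivy syntax; in practice Gora usually pulls its own HBase version transitively, so this alone may not be sufficient):

```xml
<dependency org="org.apache.hbase" name="hbase" rev="0.90.6" conf="*->default" />
```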
thanks,
On 02/21/2013 10:39 AM, Lewis John Mcgibbney wrote:
http://s.apache.org/WbG
sorry for ridiculous size of font
hth
On Thu, Feb 21, 2013 at 10:31 AM, kaveh minooie ka
: cvc-complex-type.3.2.2: Attribute 'rev' is
not allowed to appear in element 'include'. in file:/source/nutch/nutch/ivy/ivy.xml
it is the same with exclude tag included as well.
On 02/21/2013 11:19 AM, Lewis John Mcgibbney wrote:
replace
<dependency org="org.apache.gora" name="gora-hbase"
http://svn.apache.org/repos/asf/nutch/tags/release-1.6/src/java/org/apache/nutch/util/NutchConfiguration.java
On Thu, Feb 21, 2013 at 12:03 PM, imehesz imeh...@gmail.com wrote:
hello,
I finally crossed all the terminal issues and I can run Nutch and Solr with
no problems from the command
)
at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:118)
this is the output of a nutch inject command.
BTW, what is snappy?
On 02/21/2013 12:17 PM, Lewis John Mcgibbney wrote:
Try this
<dependency org="org.apache.gora" name="gora-hbase" rev="0.2.1" conf="*->default" />
Hi Roland,
You say you start a fetch run, does this mean the FetcherJob or
GeneratorJob? What kind of settings do you run your Nutch server with?
On Wednesday, February 20, 2013, Roland rol...@rvh-gmbh.de wrote:
Hi list,
we're experimenting with nutch 2.1 and cassandra 1.2.1 (on ? hosts).
: batchId: 1361367698-1708119958
FetcherJob: threads: 40
FetcherJob: parsing: true
FetcherJob: resuming: false
FetcherJob : timelimit set for : -1
--Roland
On 20.02.2013 19:44, Lewis John Mcgibbney wrote:
Hi Roland,
You say you start a fetch run, does this mean the FetcherJob
Hi,
Please head over to most recent thread on dev@ for potential improvements
for the Generator* code.
Thanks for invoking this discussion, it is well overdue.
Lewis
On Wed, Feb 20, 2013 at 12:55 PM, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
Hi Alex,
On Wed, Feb 20, 2013
-
From: Lewis John Mcgibbney lewis.mcgibb...@gmail.com
To: user user@nutch.apache.org
Sent: Wed, Feb 20, 2013 12:56 pm
Subject: Re: nutch with cassandra internal network usage
Hi Alex,
On Wed, Feb 20, 2013 at 11:54 AM, alx...@aim.com wrote:
The generator also does not have filters
Hi Raja,
There are certainly issues with the 2.x branch (of which 2.1 is the most
recent release).
Dependencies are managed via Ivy, so to build 2.1, just use the ant runtime
target.
You can see the Gora artifacts here
http://search.maven.org/#search|ga|1|gora
On Wed, Feb 20, 2013 at 9:14 PM,
Hi,
NUTCH-1420 is now committed, so you can update your local copy of Nutch 2.x
if you are working from HEAD source.
So there was another issue here where the parse was only running on one
node in the cluster. Is this also the case with you?
On Tue, Feb 19, 2013 at 2:48 PM, t_gra
On Tue, Feb 19, 2013 at 3:40 PM, t_gra alexey.tiga...@gmail.com wrote:
I tried skipping pages with large content size, and it turned out that
ALL my pages have content 125981292 bytes long (and probably the same
contents).
And this is okay? I don't really understand.
BTW, what number of
wrote:
So what (stable) version of Nutch and which architecture would best fit my
cluster ?
Is there a quick (simplified) deployment if I already have a running
cluster and I don't want to change its existing data or configuration?
Thanks.
On Fri, Feb 15, 2013 at 12:42 AM, Lewis John
for solr and
zookeeper it is not affecting the slf4j?
thanks,
On 02/16/2013 09:42 AM, Lewis John Mcgibbney wrote:
A solution would be to manually prune the dependencies which are fetched
via Ivy. If old slf4j dependencies are fetched for Hadoop via Ivy then
maybe we need to make the exclusions
?
Thanks.
Alex.
-Original Message-
From: Lewis John Mcgibbney lewis.mcgibb...@gmail.com
To: user user@nutch.apache.org
Sent: Sat, Feb 16, 2013 10:58 am
Subject: Re: fields in solrindex-mapping.xml
In short, it helps with searching when you can slice your data using
NUTCH-XX remove unused db.max.inlinks from
nutch-default.xml
trunk a7a1b41 NUTCH-1521 CrawlDbFilter pass null url to urlNormalizers
kaveh@d1r2n2:/2locos/source/nutch/nutch.git$
i am using branch 2.x
On 02/15/2013 06:02 PM, Lewis John Mcgibbney wrote:
Hi Kaveh,
Two seconds please. First
to include digest, tstamp, boost and batchid fields in
solrindex?
Thanks.
Alex.
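Assuming the standard solrindex-mapping.xml format, a hedged sketch of such mappings (whether each source field is actually populated by your indexing filters is an assumption):

```xml
<field dest="digest" source="digest"/>
<field dest="tstamp" source="tstamp"/>
<field dest="boost" source="boost"/>
<field dest="batchId" source="batchId"/>
```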
-Original Message-
From: Lewis John Mcgibbney lewis.mcgibb...@gmail.com
To: user user@nutch.apache.org
Sent: Fri, Feb 15, 2013 4:21 pm
Subject: Re: fields in solrindex-mapping.xml
Hi Alex,
OK so we
presentable without any issues but i am not sure if we have any special
characters within our content. I can check and tell you more on monday
when
i go back to work.
I use Nutch-2.x with Hbase.
Kiran.
On Sat, Feb 16, 2013 at 3:01 PM, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote
Can you dump your webdb and check what the various fields are like?
Can you read these in an editor?
I think there may be some problems with the serializers in gora-cassandra
but I am not sure yet.
Lewis
On Saturday, February 16, 2013, t_gra alexey.tiga...@gmail.com wrote:
Hi All,
Experiencing
And you want to get to the bottom of the batchId = null?
You haven't actually asked a question here.
On Thursday, February 14, 2013, Dragan Menoski dragan.meno...@x3mlabs.com
wrote:
Hi,
I try to set Nutch 2.1 and Solr 4.0 with MySQL database, according to the
instruction in this link:
Hi Alex,
So we can tackle this one.
https://issues.apache.org/jira/browse/NUTCH-1532
Thanks
Lewis
On Fri, Feb 15, 2013 at 4:21 PM, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
Hi Alex,
OK so we can certainly remove segment from 2.x solr-index-mapping.xml. It
would however be nice
Hi Kaveh,
Two seconds please. First let's set something straight.
Nutch trunk is from here [0]
Nutch 2.x is from here [1]
Which one do you use?
On Fri, Feb 15, 2013 at 4:53 PM, kaveh minooie ka...@plutoz.com wrote:
but here is my problem. I tried to build Nutch using ver 1.4.3 of the
was it was storing only the text of the page and not the
full
html content of the page.
How do i store the full html content of the page also?
Hope to see the patches soon.
Thanks
lewis john mcgibbney wrote
Certainly.
I am currently reviewing the code and will hopefully have patches
Hi Amit,
On Thu, Feb 14, 2013 at 6:24 AM, Amit Sela am...@infolinks.com wrote:
I already have a running Hadoop cluster with Hadoop 1.0.3 and HBase 0.94.2,
and I saw that Nutch 2.1 with Gora supports HBase as backend.
First things first. We cannot guarantee that Gora and subsequently Nutch
Hi Alex,
Tstamp represents fetch time, used for deduplication.
Boost is for scoring-opic and link. This is required in 2.x as well.
I don't have the code right now, but you can try removing digest and
segment. To me they both look legacy.
There is a wiki page on index structure which you can
Hi All,
This year again I will be getting involved in GSoC program.
If you are interested in participating please get in touch on the relevant
dev@ list and we can initiate discussion.
See you on dev@
Best
Lewis
-- Forwarded message --
From: Carol Smith
Date: Monday, February 11,
.
On Thu, Feb 7, 2013 at 5:12 PM, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
So the problem for you is resolved?
The main (typical) problem here is in the underlying gora-sql library
and
some rather difficult to master gora-sql-mapping.xml constraints.
Hope all is resolved
and not the
full
html content of the page.
How do i store the full html content of the page also?
Hope to see the patches soon.
Thanks
lewis john mcgibbney wrote
Certainly.
I am currently reviewing the code and will hopefully have patches for
Nutch trunk cooked up for tomorrow.
I'll update
+1
This is a ridiculous size of tmp for a crawldb of minimal size.
There is clearly something wrong
On Friday, February 8, 2013, Tejas Patil tejas.patil...@gmail.com wrote:
I don't think there is any such property. Maybe it's time for you to clean up
/tmp :)
Thanks,
Tejas Patil
On Fri, Feb
Is truncating content not a possibility? By default, parsing is skipped for
truncated docs IIRC.
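Truncation is governed by the content limit properties; a minimal nutch-site.xml sketch (65536 is the usual nutch-default.xml value, shown here purely for illustration):

```xml
<property>
  <name>http.content.limit</name>
  <value>65536</value>
</property>
```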
On Fri, Feb 8, 2013 at 4:18 PM, Eyeris Rodriguez Rueda eru...@uci.cu wrote:
I have an idea of what was the problem, there is a url that contain a
repository of pdf documents and nutch delay and
It will produce more output in the fetcher part of your hadoop.log, not in
the parsechecker tool itself; that is why you are seeing nothing more.
Are you still having problems with the truncation aspect?
Lewis
On Thu, Feb 7, 2013 at 11:07 AM, Ward Loving w...@appirio.com wrote:
Lewis:
machine and 50 GB for
solr
machine. Please, some advice or explanation would be appreciated.
Thanks for your time.
- Original Message -
From: Lewis John Mcgibbney lewis.mcgibb...@gmail.com
To: user@nutch.apache.org
Sent: Thursday, February 7, 2013 13:06:11
Subject: Re: Could
. Not as simple to detect when you've
loaded data previously. Thanks for your assistance.
On Thu, Feb 7, 2013 at 3:01 PM, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
It will produce more output in the fetcher part of your hadoop.log, not in
the parsechecker tool itself
Please let us know how you get on as we can add this to the 2.x errors
section of the wiki.
Thanks and good luck with the problem.
Lewis
On Wed, Feb 6, 2013 at 4:45 PM, k4200 k4...@kazu.tv wrote:
Hi Lewis,
Thanks for your reply.
2013/2/7 Lewis John Mcgibbney lewis.mcgibb...@gmail.com:
Hi
Hi,
We are not good to go with Solr 4.1 yet. There are changes required to
schema.xml as well as the indexer package in nutch to accommodate api
changes in 4.1.
Please check our Jira for these issues. I am happy to help with the update
however it will block some other proposed changes to the
I've eventually added this to our FAQ's
http://wiki.apache.org/nutch/FAQ#Can_I_parse_during_the_fetching_process.3F
This should explain for you.
Lewis
On Wed, Feb 6, 2013 at 6:31 PM, Weilei Zhang zhan...@gmail.com wrote:
Hi
I have a performance question:
why are fetcher and parser staged in
Can you use the parsechecker tool with fetcher.verbose overridden as true
and the same settings on one of the (HTML?) documents giving you bother?
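A minimal sketch of that override in nutch-site.xml (fetcher.verbose defaults to false):

```xml
<property>
  <name>fetcher.verbose</name>
  <value>true</value>
</property>
```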
The gora-sql-0.1.1 -incubating module is becoming a real pain to be honest.
On Wed, Feb 6, 2013 at 6:44 PM, Ward Loving w...@appirio.com wrote:
with
Nutch. I replaced hbase-0.90.4.jar with hbase-0.90.6-cdh3u5.jar and
the problem was resolved.
Regards,
Kaz
2013/2/7 Lewis John Mcgibbney lewis.mcgibb...@gmail.com:
Please let us know how you get on as we can add this to the 2.x errors
section of the wiki.
Thanks and good luck with the problem
Nice, thanks for letting us know.
I take it you were using an amended schema?
On Wed, Feb 6, 2013 at 7:46 PM, alx...@aim.com wrote:
Hi,
Not sure about solrdedup, but solrindex worked for me in nutch-1.4 with
solr-4.1.0.
Alex.
-Original Message-
From: Lewis John Mcgibbney
to the discussion. Thanks for the input Ken.
Lewis
On Wed, Feb 6, 2013 at 8:21 PM, Ken Krugler kkrugler_li...@transpac.com wrote:
Hi Lewis,
On Feb 6, 2013, at 6:50pm, Lewis John Mcgibbney wrote:
I've eventually added this to our FAQ's
http://wiki.apache.org/nutch/FAQ
On Wed, Feb 6, 2013 at 9:35 AM, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
Hi,
Two observations here
1) Did you try any versions more recent than 1.9.12? I assume you are
talking about the net.sourceforge.nekohtml groupId artifact [0] as opposed
to the nekohtml groupId artifact [1]?
2
Done.
Committed @ r1442838 in 2.x HEAD
Thanks
Lewis
On Tue, Feb 5, 2013 at 12:05 AM, Ferdy Galema ferdy.gal...@kalooga.com wrote:
Absolutely. We should remove any unused property that is not in the
planning for (re)implementing.
On Tue, Feb 5, 2013 at 2:12 AM, Lewis John Mcgibbney
Hi Adriana,
Thanks for the update, I've added the solution to our wiki for others
to consult in the future
http://s.apache.org/jcs
Thank you for getting back to us on this one.
Lewis
On Mon, Feb 4, 2013 at 2:18 AM, Adriana Farina
adriana.farin...@gmail.com wrote:
I solved my issue and I want
Hi Kiran,
You are using 2.x still?
On Mon, Feb 4, 2013 at 8:57 AM, kiran chitturi
chitturikira...@gmail.com wrote:
The file clearly shows that urls with status 1 have the protocolStatus(NOT
FOUND). Those seeds are never moved to status (db_gone) that is status 3 if
I am correct.
Did
This looks like a bit of deprecation in nutch-default.xml then.
We can remove the unused property?
On Mon, Feb 4, 2013 at 8:10 AM, Ferdy Galema ferdy.gal...@kalooga.com wrote:
Hi Lewis,
The relevant property seems to be db.update.max.inlinks
On Fri, Feb 1, 2013 at 4:27 AM, Lewis John
to inherit all public methods from
NutchIndexWriter
Can you help me with that? Then i can rebuild and check if it works.
lewis john mcgibbney wrote
As you will see the code has not been amended in a year or so.
The positive side is that you only seem to be getting one issue with javac
On Tue
Can you briefly describe the problem here Sourajit?
On Thu, Jan 31, 2013 at 9:01 AM, Sourajit Basak
sourajit.ba...@gmail.com wrote:
Seems to be related to NUTCH-374 but that shows as fixed.
I have set Nutch to accept unlimited content size; this page is gzip
encoded.
On Thu, Jan 31, 2013
Hi Adriana,
On Thu, Jan 31, 2013 at 3:03 AM, Adriana Farina
adriana.farin...@gmail.com wrote:
Searching on google, I've found that it can be an issue due to /etc/hosts,
but it's correctly configured:
127.0.0.1 crawler1a localhost.localdomain localhost
where crawler1a is the
Hi All,
Is it just me, or do we actually use the following property anywhere in 2.x?
<property>
  <name>db.max.inlinks</name>
  <value>1</value>
  <description>Maximum number of Inlinks per URL to be kept in LinkDb.
  If invertlinks finds more inlinks than this number, only the first
  N inlinks will
And your regex rules?
So is the URL fetched?
On Thu, Jan 31, 2013 at 8:47 PM, Sourajit Basak
sourajit.ba...@gmail.com wrote:
Here it goes.
Try to dump the content from this url with the following settings.
They should be under the 'il' field... can you confirm if your inlinks are
under the 'ol' field please?
On Wed, Jan 30, 2013 at 10:43 AM, alx...@aim.com wrote:
I see that inlinks are saved as ol in hbase.
Alex.
-Original Message-
From: kiran chitturi chitturikira...@gmail.com
Hi Kiran,
On Wed, Jan 30, 2013 at 11:10 AM, kiran chitturi
chitturikira...@gmail.com wrote:
I have checked the database after the dbupdate job is run and I could see
only markers, signature and fetch fields.
Which Gora artifacts are you using?
We've recently fixed a bug in gora-cassandra [0]
You are not getting very many URLs!
On Tue, Jan 29, 2013 at 8:29 PM, peterbarretto peterbarrett...@gmail.com wrote:
2013-01-29 08:44:35,014 INFO crawl.CrawlDbReader - TOTAL urls: 96404
2013-01-29 08:44:35,018 INFO crawl.CrawlDbReader - status 1
(db_unfetched):
85672
Increase number of threads when fetching
Also please see nutch-default.xml for partitioning of urls, if you know your
target domains you may wish to adapt the policy.
Lewis
On Sunday, January 27, 2013, peterbarretto peterbarrett...@gmail.com
wrote:
I want to increase the number of urls fetched at
This has certainly been explained in the past, however I can't find the
archived thread.
In short currently it is not possible.
I think it would be a nice feature for the injector though.
On Tuesday, January 22, 2013, 刘兆贵 liuzhao...@126.com wrote:
Dear,
I have a question, could you kind help
Lewis John Mcgibbney lewis.mcgibb...@gmail.com:
Hi Kaz,
On Sat, Jan 12, 2013 at 1:09 AM, k4200 k4...@kazu.tv wrote:
Here are the questions:
1. How to fix this? I'm guessing changing the block size in HBase
would fix the problem, but I don't know how. gora.properties, perhaps
It should be added that currently this functionality is available only on
the 2.x branch courtesy of Ferdy.
Lewis
On Wednesday, January 16, 2013, Stanislav Orlenko orlenko.s...@gmail.com
wrote:
Hi
bin/nutch elasticindex $elasticClusterName -reindex
it is enough for me
use bin/nutch
Hi Kiran,
For this I think you are looking at diving further into the Gora API and
codebase.
As you can see around line 232 [0], the Query is set and executed based on
the key.
What you wish to do would possibly encompass setting fields via the Gora
Query API. There are some other useful methods
Hi Bayu,
Yes it will run fine on 1.6.
Lewis
On Sun, Jan 13, 2013 at 10:24 PM, Bayu Widyasanyata bwidyasany...@gmail.com
wrote:
On Mon, Jan 14, 2013 at 6:45 AM, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
Markus, implemented an extension of the AdaptiveFetchSchedule [0] which
Hi,
On Mon, Jan 14, 2013 at 3:56 AM, J. Gobel jj.go...@gmail.com wrote:
Hi Lewis,
Thanks for your mail.
My ideal goal would be to crawl the index.php several times per day, and
fetch the new urls from that page and parse them. Then I know 'for sure'
that my index is up to date.
Sounds
Hi,
On Fri, Jan 11, 2013 at 3:12 PM, Bayu Widyasanyata
bwidyasany...@gmail.com wrote:
We can see that some of parse processes were not completed successfully.
Yes I see this. I also see that you have a http.proxy.port = 8080 but no
proxy host and that the protocol-httpclient plugin is not
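For reference, both proxy settings would need to be present in nutch-site.xml; a sketch with a placeholder host value:

```xml
<property>
  <name>http.proxy.host</name>
  <value>proxy.example.com</value>
</property>
<property>
  <name>http.proxy.port</name>
  <value>8080</value>
</property>
```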
Hi Till,
Currently no. You would need to write your own implementation. You can look
at the protocol-* plugins in the link below for some guidance
http://svn.apache.org/repos/asf/nutch/trunk/src/plugin/
Hth
Lewis
Friday, January 11, 2013, Till Plumbaum till.plumb...@dai-labor.de wrote:
Hi,
Hi,
java.io.IOException: java.lang.ClassNotFoundException:
com.mysql.jdbc.Driver
If you look at ivy.xml [0] you will see that the mysql-connector-java
dependency is commented out. Please uncomment it, then build Nutch 2.x src
again.
This will download the dependency and make it available on
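Once uncommented, the ivy.xml line looks roughly like this (the rev value here is an assumption; use whatever revision your ivy.xml ships with):

```xml
<dependency org="mysql" name="mysql-connector-java" rev="5.1.18" conf="*->default" />
```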
11, 2013 at 7:01 AM, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
Hi,
java.io.IOException: java.lang.ClassNotFoundException:
com.mysql.jdbc.Driver
If you look at ivy.xml [0] you will see that the mysql-connector-java
dependency is commented out. Please uncomment
Can you include the contents of your parse-plugins.xml file please
The following two lines of logging look off to me
On Tue, Jan 8, 2013 at 10:38 PM, Arcondo Dasilva
arcondo.dasi...@gmail.com wrote:
application/xhtml+xml via parse$
parse-plugins.xml, but no$
seems that tika cannot parse
Hi Michael,
There is very little on this however IIRC it can be done using REST calls.
By default, if you initiate the Nutch server from the nutch script, it
starts a local Jetty server running Nutch from which crawls can be executed
via REST calls.
By no means is this a feature of Nutch which
Hi Michael,
So far there has been no discussion on this topic with specific focus on
adding the functionality.
I also notice that NUTCH-827 is not marked for inclusion in 2.2.
I would urge you to open another issue describing your approach and
suggested solution specifically for 2.x... if this is
Hi David,
The best resources we have for this can be found on the wiki. These explain
quite a bit about the respective Nutch tools (Injector, Generator, etc.)
and how they are implemented in 2.x.
http://wiki.apache.org/nutch/Nutch2Crawling
On Tue, Jan 8, 2013 at 4:07 AM, Michael Gang
Hi Arcondo,
On Mon, Jan 7, 2013 at 10:12 PM, Arcondo Dasilva
arcondo.dasi...@gmail.com wrote:
My question: why can't I use Tika to parse HTML instead of Neko? Is it
possible to get rid of Neko, or is it mandatory?
I would urge you to override the parsing logic in parse-plugins.xml [0]
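A hedged sketch of such an override in parse-plugins.xml, routing HTML to parse-tika instead of the Neko-based parse-html:

```xml
<mimeType name="text/html">
  <plugin id="parse-tika" />
</mimeType>
```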
Is this from a crawl command or from the bin script... or something else?
Your input arguments are not complete.
the -batch X switch will not work for anything, as such a parameter
simply does not exist.
Are you aware of how you ended up with the batchId being null?
What version of 2.x are you
Hi Michael,
On Tue, Jan 8, 2013 at 7:15 AM, Michael Gang michaelg...@gmail.com wrote:
JavaScript (for extracting links only?) (parse-js)
Yes, both in and outlinks if present.
I don't understand what this exactly means.
Let's say if i have a link
<a onclick="do_something">
or a jquery
Hi Bayu,
On Sat, Jan 5, 2013 at 7:43 AM, Bayu Widyasanyata
bwidyasany...@gmail.comwrote:
Anyone can give me a hint?
In parallel I changed to use nutch 1.6 binary and works well.
But curious to use the latest of nutch 2.1.
Please check out the latest 2.x branch here [0]. This uses Tika 1.2
Hi Arcondo,
Is this still a problem?
Lewis
On Tue, Jan 1, 2013 at 12:50 PM, Arcondo Dasilva
arcondo.dasi...@gmail.com wrote:
Hello,
I'm still getting the error even after ant clean and an entire rebuild. I
cannot parse a site and getting this error
java.util.concurrent.ExecutionException:
that. But I still don't understand the concept of
'batch id'. Besides, is it the right direction to capture 'batch' argument
in command line?
Thanks.
At 2012-12-19 22:07:23,Lewis John Mcgibbney lewis.mcgibb...@gmail.com
wrote:
Hi,
Currently the batchID is originally set by the GeneratorJob
Hi Sourajit,
You're suggesting that there is a clear case of compiled code duplication?
If this is the case I have no idea and further if this actually is the case
then we could address it... however I would be surprised if this were the
case.
Any ideas anyone?
Lewis
On Fri, Dec 28, 2012 at
:01 AM, Tejas Patil tejas.patil...@gmail.com
wrote:
Hey Lewis,
Yes. That's a good idea. There are so many properties in nutch-default.xml
and having the deprecated ones adds to the confusion.
Thanks,
Tejas Patil
On Sat, Jan 5, 2013 at 11:12 PM, Lewis John Mcgibbney
Hi Jc,
This is correct. The command line parameters differ in key tools, the
generator being one.
I think we would be best to document this on the wiki as well as attempting
to implement useful command line options to stdout for all tools in 2.x,
this would shadow the verbose and more helpful
I think it would be good to phase out some of the deprecated configuration
properties if possible. We have had several stable releases with these
props included...
Lewis
On Jan 5, 2013 6:22 PM, Tejas Patil tejas.patil...@gmail.com wrote:
The generate.max.per.host property is deprecated but is still used
Hi Rui,
The gora-sql backend is not stable so please do not be surprised if things
do not work flawlessly.
I would urge you to have a look at the gora-sql-mapping.xml file [0] and
check the respective field values for the columns you are attempting to map.
This aside, I would use the following
Hi Arcondo,
As Tejas pointed out, the jar is not on the classpath. This should be
automated by the Ant and Ivy configuration in Nutch however if it is not
then simply manually enforce it.
Lewis
On Wed, Jan 2, 2013 at 9:43 PM, Arcondo arcondo.dasi...@gmail.com wrote:
Hello,
I made an ant
...@gmail.com wrote:
Thanks for the explanation. I'm more a functional guy with no solid
background in Java.
Could you give some details on how to enforce it manually ?
Thanks in advance, Arcondo
On Thu, Jan 3, 2013 at 2:49 PM, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
the jar
This sounds most like the non-existence of a robots.txt on the webserver.
Lewis
On Wed, Dec 19, 2012 at 5:26 AM, Rajani Maski rajinima...@gmail.com wrote:
Hi Tejas,
I found out the reason for why the blog was not getting crawled :
http://rajinimaski.blogspot.in/
This is because of the proxy
Hi,
Currently the batchID is originally set by the GeneratorJob#run() method
@line 169 [0], you will see that this can also be overridden by the
generate.batch.id property in nutch-site.xml
Currently if you look at line 117 in the crawl script [1] you will see that
there is a TODO to capture the
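The generate.batch.id override mentioned above can be sketched in nutch-site.xml as follows (the value is purely illustrative):

```xml
<property>
  <name>generate.batch.id</name>
  <value>my-batch-20130101</value>
</property>
```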
Hi James,
One of the plugins in Nutch uses Tika 1.2 as a parser wrapper.
The list of Tika formats can be found below
http://tika.apache.org/1.2/formats.html
hth
Lewis
On Wed, Dec 12, 2012 at 4:02 PM, James Ford simon.fo...@gmail.com wrote:
Hello,
Which document types can nutch parse? I know
Hi,
You can take a look at around line 102 in the ParserChecker tool [0]
for details on how to find desired fields and display them.
hth
Lewis
[0]
https://svn.apache.org/viewvc/nutch/trunk/src/java/org/apache/nutch/parse/ParserChecker.java?view=markup
On Wed, Dec 12, 2012 at 2:12 AM, alw37
Hi,
I think this is a bug and should be logged however as it is a rather
specific use case (with an older version of Nutch), I wonder if you
can confirm this with trunk? It would be great to log it against 1.7
(and/or 2.2) so we can work towards a solution.
Best
Lewis
On Tue, Dec 11, 2012 at
Hi Renato,
OK here we go :0)
On Mon, Dec 10, 2012 at 3:44 PM, Renato Marroquín Mogrovejo
renatoj.marroq...@gmail.com wrote:
I did notice that these pages weren't fetched but the thing is that I
do want them to be fetched without actually having to fetch, and parse
them individually with the
Hi Eyeris,
Yeah I'll fix this, thank you for pointing this out.
For reference the link is below.
Thank you
Lewis
http://www.apache.org/dist/nutch/1.6/CHANGES_1.6.txt
On Sun, Dec 9, 2012 at 4:19 AM, Eyeris Rodriguez Rueda eru...@uci.cu wrote:
Thanks for this news Lewis, I was checking but
Hi Renato,
Firstly are you on 2.x? If so what gora- storage backend are you on?
If not what version of 1.x are you using.
After fetching have you parsed the pages?
How are you executing your crawl cycle. The one step command/script or
individually via a custom script? We advise against using
Hi Julien,
Thanks for initial review
On Wed, Nov 28, 2012 at 10:11 AM, Julien Nioche
lists.digitalpeb...@gmail.com wrote:
- CHANGES.txt contains dates in both MM/DD/YYYY and DD/MM/YYYY formats.
Shall we write the month in text form e.g. 7th July 2012 from now on?
Yes I am +1 for your
Lovely Javadoc Andrzej
On Fri, Nov 23, 2012 at 7:32 AM, Markus Jelsma
markus.jel...@openindex.io wrote:
See:
http://nutch.apache.org/apidocs-2.1/org/apache/nutch/crawl/AdaptiveFetchSchedule.html
-Original message-
From:Eyeris Rodriguez Rueda eru...@uci.cu
Sent: Fri 23-Nov-2012
Hi attabi225,
This is really a question for the user@ list so I have copied everyone in.
Firstly, I would please ask you to use 1.5.1, we found a major (but
tiny) bug in 1.5 which renders it as a release we will forget about
for the time being ;0)
On Fri, Nov 23, 2012 at 1:48 PM,
Hi Everyone,
A candidate for the Apache Nutch 1.6 RC#1 is available at:
http://people.apache.org/~lewismc/apache_nutch_1.6/
The release candidate is a src.zip, src.tar.gz, bin-zip and bin-tar.gz
archive of the sources in:
http://svn.apache.org/repos/asf/nutch/tags/release-1.6
Further, a
Hi,
Moving this thread to user@; it has nothing to do with Nutch development.
On Wed, Nov 21, 2012 at 5:08 AM, Bhagya n.bhagyalaks...@gmail.com wrote:
Hi,
Thanks for your reply.
I installed Cygwin on my Windows machine. I am getting the below error while
running ant command. Please find the
Hi Donald,
I would advise you to re-generate the nutch job archive, as it appears
that your settings are not included within the job file you are trying
to deploy on your hadoop setup/cluster.
You can do this by running ant job (after making changes to the files
in conf) from $NUTCH_HOME
hth
Hi Jorge,
On Wed, Nov 21, 2012 at 3:21 PM, Jorge Luis Betancourt Gonzalez
jlbetanco...@uci.cu wrote:
I'm just building the plugin on this machine; testing on Ubuntu GNU/Linux
12.04 works just fine.
Excellent
Would this be worth starting an issue? It seems to be just in my particular
Hi,
A more general question... is anyone using Nutch with Windows 7 successfully?
It might be nice to get a trunk and 2.x build on the Windows Jenkins
slaves just so we have an idea of this.
I've not been near Windows in years sorry.
Lewis
On Wed, Nov 21, 2012 at 9:12 PM, Prashant Ladha
Hi Joe,
On Wed, Nov 21, 2012 at 9:25 PM, Joe Zhang smartag...@gmail.com wrote:
Are you saying that as long as I crawl some page once, nutch will go and
refetch the page in 30 days by default, without me running the command
again?
No this is impossible (unless you have an automated job
Hi Erol,
What exactly did you do to get it working correctly? I am keen to
find out as I will not be able to retry with 2.x deployment until
later in the week.
Thanks
Lewis
On Wed, Nov 14, 2012 at 3:08 PM, Erol Akarsu eaka...@gmail.com wrote:
Lewis,
I finally run Nutch 2.1 and SOLR 4.0
it off
These 2 changes cleared the issue.
Erol Akarsu
On Wed, Nov 14, 2012 at 10:49 AM, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
Hi Erol,
What exactly did you do to get it working correctly? I am keen to
find out as I will not be able to retry with 2.x deployment