-httpclient plugin.
And I don't see any handling of this in the patch.
lewis john mcgibbney wrote
Hi,
Have you looked at the patch for NUTCH-1486?
This is not just schema changes.
The patch is for 2.x, but porting it to the new pluggable
indexing architecture for trunk is trivial.
I've updated 1487 with the patch. Please test and get back to us. It would
be great to upgrade to Solr 4.x prior to pushing Nutch 1.7
Thank you so much
Lewis
On Thu, May 2, 2013 at 9:09 AM, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
Hi adfel70,
It is only a patch
What version are you using?
If you can, I would advise you to upgrade to 2.x HEAD.
On Wed, May 1, 2013 at 4:32 AM, Bai Shen baishen.li...@gmail.com wrote:
My crawl loop consists of the following.
generate -topN
fetch -all
parse -all
updatedb
solrindex -all
With the fetch and parse the
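The crawl loop listed above can be sketched as a small dry-run helper. This is only an illustration, not the poster's actual script: the `bin/nutch` path, the `-topN` value and the Solr URL are assumptions, since the thread gives none of them. The subcommands and the `-all` flags are the ones named in the message.

```python
import shlex

# Assumed values: the thread does not state the launcher path,
# the -topN size, or the Solr URL.
NUTCH_BIN = "bin/nutch"
SOLR_URL = "http://localhost:8983/solr"


def crawl_cycle_commands(top_n, solr_url=SOLR_URL, nutch_bin=NUTCH_BIN):
    """Return one crawl cycle as a list of argument lists, in execution order."""
    return [
        [nutch_bin, "generate", "-topN", str(top_n)],
        [nutch_bin, "fetch", "-all"],
        [nutch_bin, "parse", "-all"],
        [nutch_bin, "updatedb"],
        [nutch_bin, "solrindex", solr_url, "-all"],
    ]


def render(commands):
    """Format the commands as shell-quoted lines for inspection (dry run)."""
    return "\n".join(shlex.join(cmd) for cmd in commands)


if __name__ == "__main__":
    # Print the cycle instead of executing it.
    print(render(crawl_cycle_commands(top_n=1000)))
```

Printing the commands rather than running them keeps the sketch safe to try; each list could be passed to `subprocess.run(cmd, check=True)` once the values match the local installation.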
In short, Gora needs to upgrade its use of the HBase API to a more recent version.
If you are able and willing to do so, we would be very very happy to have
you contribute to Gora.
https://issues.apache.org/jira/browse/GORA-201
On Wed, May 1, 2013 at 11:41 AM, AC Nutch acnu...@gmail.com wrote:
Hello
Hi James,
Please look for NUTCH-1545 capture batchid...
If you could review and use this patch it would be very very helpful.
thank you
lewis
On Tuesday, April 30, 2013, James Ford simon.fo...@gmail.com wrote:
Thanks for your answer!
I think I will create my own modified crawlscript then. But
I would most likely agree with Tejas.
Either that or you could use the delete and deleteByQuery operations for
http://gora.apache.org/docs/current/apidocs-0.2.1/index.html?org/apache/gora/hbase/store/HBaseStore.html
It depends on how you intend to use the software.
hth
On Tue, Apr 30, 2013 at
Hi,
There is a pretty difficult aspect to this problem which makes it difficult
for others/me to address.
There are a number of variables which may (depending on your task execution
between crawls) change the possibility/probability of some MARK not being
present.
The core problem here within the
I've opened NUTCH-1567 to track and address this.
https://issues.apache.org/jira/browse/NUTCH-1567
On Tue, Apr 30, 2013 at 9:39 AM, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
Hi,
There is a pretty difficult aspect to this problem which makes it
difficult for others/me to address
That would be very much appreciated.
Lewis
On Tue, Apr 30, 2013 at 5:00 AM, Bai Shen baishen.li...@gmail.com wrote:
I'll let you know if I figure out any good defaults.
Thanks.
On Sat, Apr 27, 2013 at 5:30 PM, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
Hi Bai,
On Thu
Hi,
For reference, ideally you should fetch many smaller segments. This
prevents many baddies.
This sounds brutal, but I would just kill it.
You lose one segment... hopefully.
Lewis
On Tue, Apr 30, 2013 at 4:20 PM, AC Nutch acnu...@gmail.com wrote:
Hello All,
I've been looking around for a
Hi,
@Tejas, you will remember the work undertaken on NUTCH-1284 (the patch for
which you submitted included the fix for NUTCH-1042) relates to this.
I am not sure if the situations are identical, but they are closely linked
by the looks of it.
@ianin, can you look at the commentary and provide
.
-Original Message-
From: Lewis John Mcgibbney [mailto:lewis.mcgibb...@gmail.com]
Sent: Saturday, April 27, 2013 3:30 PM
To: user@nutch.apache.org
Subject: Re: Nutch 1.6 Processing of fetcher.max.crawl.delay
Hi,
@Tejas, you will remember the work undertaken on NUTCH-1284 (the patch
Hi Bai,
On Thu, Apr 25, 2013 at 4:33 AM, Bai Shen baishen.li...@gmail.com wrote:
Well, I still ended up having to set a content limit. Which is why I'm
wondering how the Nutch Gora integration works. I didn't see a lot of
documentation on it.
So far Nutch seems to be running okay with
- inject - fetch
The second inject will leave entries in the db without fetchmarks seen by
the fetcher later.
--Roland
On Fri, Apr 26, 2013 at 12:30 AM, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
Additionally, why do we log.DEBUG that there is a different batch id ( +
mark
I just found out this was logged by Markus many moons ago
https://issues.apache.org/jira/browse/NUTCH-992
It would be nice if you could update this Jira issue with any progress you
are able to make on it.
I am not able to help right now sorry.
Lewis
On Fri, Apr 26, 2013 at 2:14 PM, brian4
Hi All,
I went ahead and added some documentation to the wiki on this topic
http://s.apache.org/Jb6
Please add to it where you see fit.
I still think that the logging is incorrect on this one.
On Fri, Apr 26, 2013 at 12:47 AM, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote
, Apr 25, 2013 at 7:31 AM, Carmine Paternoster
carmine...@gmail.com wrote:
Hi Lewis, thank you very much for your answer. I do not know how, but I
solved it. The "different batch id (null)" message no longer appears. In any
case, I'm using Nutch 2.1
Good day, Carmine
2013/4/24 Lewis John Mcgibbney
Yes
On Wed, Apr 24, 2013 at 11:41 AM, Yves S. Garret yoursurrogate...@gmail.com
wrote:
The dmoz directory, it should be located here, yes?
${APACHE_NUTCH_HOME}/runtime/local
On Tue, Apr 23, 2013 at 10:09 PM, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
The DmozParser should
, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
The DmozParser should have created a flat file similar to a bootstrap
file
which you can inject.
The flat file should be inside the dmoz directory (if you've followed
the
tutorial). Please make sure the file is present
Hi,
CC: user@nutch.apache.org
Questions like this should really go to the user@ list; you have a much
better chance of being helped there, as there are many, many eyes.
On Wed, Apr 24, 2013 at 8:57 AM, d...@e-sentry.net wrote:
I would be really grateful if you could provide some links on the
Hi Carmine,
CC: user@nutch.apache.org
On Wed, Apr 24, 2013 at 3:13 AM, Carmine Paternoster
carmine...@gmail.com wrote:
I configured Nutch and mySql following this guide (
http://nlp.solutions.asia/?p=180). everything worked fine, but at some
point in the database I find all elements with
Please reread my previous comments
On Wed, Apr 24, 2013 at 3:14 PM, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
If you are using Nutch 2.x then drop the arguments for crawl/crawldb
as
Nutch does not maintain a local crawldb in 2.x. We delegate Gora to
deal
Hi Yves,
On Wed, Apr 24, 2013 at 3:07 PM, Yves S. Garret
yoursurrogate...@gmail.com wrote:
The issue
that I'm having a hard time with at the moment is that I don't understand
how Gora would replace crawldb here (as in what the commands would
be to do this). I'm going to keep looking for how
can you please give examples of the files which were truncated?
thank you
Lewis
On Tuesday, April 23, 2013, Bai Shen baishen.li...@gmail.com wrote:
I just set http.content.limit back to the default and my fetch completed
successfully on the server. However, it truncated several of my files.
://wiki.apache.org/nutch/NutchTutorial#A3.1_Using_the_Crawl_Command
On Tue, Apr 23, 2013 at 3:30 PM, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
Just write a crawl script?
Effectively that's all the crawl script is, just chaining together
logical
tasks.
The one provided with Nutch
://maximilianomarin.com
Celular: (+56 9) 780 688 91
2013/4/22 Lewis John Mcgibbney lewis.mcgibb...@gmail.com
Can you look into the job archive and see what is wrong?
Maybe you need to rebuild the job archive
ant job from ${NUTCH.HOME}
On Mon, Apr 22, 2013 at 1:09 PM, Maximiliano Marin
conta
The DmozParser should have created a flat file similar to a bootstrap file
which you can inject.
The flat file should be inside the dmoz directory (if you've followed the
tutorial). Please make sure the file is present, and that the CLI syntax is
correct.
If you are using Nutch 2.x then drop the
, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
Hi Yves,
We advise to use this script and modify it for your own needs
http://svn.apache.org/repos/asf/nutch/trunk/src/bin/crawl
hth
Lewis
On Tue, Apr 23, 2013 at 12:52 PM, Yves S. Garret
yoursurrogate...@gmail.com
wrote
Hi Maximiliano,
This version of HBase is most likely not compatible with Gora
HBase version is: 0.94.2-cdh4.2.0
On Tue, Apr 23, 2013 at 8:08 PM, Maximiliano Marin
conta...@maximilianomarin.com wrote:
Hello:
First I want to give thanks for all the replies in my last thread.
Now I am trying to
, Virtualization
MCTS: SQL Server 2008, Implementation and Maintenance
Web: http://maximilianomarin.com
Celular: (+56 9) 780 688 91
2013/4/24 Lewis John Mcgibbney lewis.mcgibb...@gmail.com
Hi Maximiliano,
This version of HBase is most likely not compatible with Gora
HBase version is: 0.94.2
logging explicitly states that no solrUrl is set.
On Sunday, April 21, 2013, kiran chitturi chitturikira...@gmail.com wrote:
Hi Mick,
Since this is an error with Indexing, Can you check the logs from Solr
side
?
On Sun, Apr 21, 2013 at 4:15 AM, micklai lailixi...@gmail.com wrote:
HI
run your job jar from within the runtime/deploy directory.
On Monday, April 22, 2013, Maximiliano Marin conta...@maximilianomarin.com
wrote:
Hello guys:
I am trying to run nutch over Hadoop. Everything was ok. I modified files
by the tutorial that I have already read and in the moment of make
9) 780 688 91
2013/4/22 Lewis John Mcgibbney lewis.mcgibb...@gmail.com
run your job jar from within the runtime/deploy directory.
On Monday, April 22, 2013, Maximiliano Marin
conta...@maximilianomarin.com
wrote:
Hello guys:
I am trying to run nutch over Hadoop. Everything
hi Tejas,
this is a real excellent reply and very useful.
it would be really great if we could somehow have this kind of low level
information readily available on the Nutch wiki.
On Monday, April 22, 2013, Tejas Patil tejas.patil...@gmail.com wrote:
Fetcher threads try to get a fetch item (url)
22, 2013 at 8:09 PM, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
hi Tejas,
this is a real excellent reply and very useful.
it would be really great if we could somehow have this kind of low level
information readily available on the Nutch wiki.
On Monday, April 22, 2013, Tejas
Hi Senthilkumar,
In short, search recrawl from the Nutch wiki to find an external blog post
on recrawling with Nutch. If you have anything to add to the post contact
the author. If on the other hand you need clarification on anything then
ping us here
Hth
Lewis
On Thursday, April 18, 2013,
Hi Raja,
The FetchSchedule [0] defines the contract for implementations that
manipulate fetch times and re-fetch intervals. FetchScheduleFactory [1]
caches the instance in the ObjectCache.
The Interface and classes (respectively) do not automate or semi-automate
actual scheduling e.g. execute the
Hi Alexander,
Please feel free to sign up to our wiki (please provide one of the dev team
with your uid) and link to your documentation.
Best
Lewis
On Monday, April 15, 2013, Alexander Chepurnoy kusht...@yahoo.com wrote:
You can find those files under Hadoop folder. Working with Hadoop+Nutch
is
Hi Yves,
This has nothing to do with Nutch. It strictly has to do with Gora. That
was my justification for moving a similar thread (it may actually even have
been this one) over to user@gora.
As Renato explained, by the looks of it Microsoft Azure platform has a
client library which enables you to
Hi Yves,
On Tue, Apr 16, 2013 at 1:43 PM, Yves S. Garret
yoursurrogate...@gmail.com wrote:
Thanks for your reply. Forgive me for being so clueless, but there's much
that I still don't know about Apache Nutch (and Hadoop for that matter, but
I'm
learning).
Not at all, I am learning as well.
,
SolrConstants.TIMESTAMP_FIELD,
SolrConstants.DIGEST_FIELD);
On Tue, Apr 9, 2013 at 9:15 PM, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
Before we do the upgrade we need to consolidate all of these use cases.
What criteria do we want to review and accept
help.
Thanks.
On Mon, Apr 8, 2013 at 10:31 PM, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
Hi Amit,
I recently updated NUTCH-1486 [0] with a patch to work against Solr
4.2.1.
You will be able to pull stuff from this patch and push it into your
Solr 4
schema file, etc.
I
(UserGroupInformation.java:1121)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
Thanks.
On Mon, Apr 8, 2013 at 10:33 PM, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
It would probably be best to describe what you've tried here, possibly a
paste of your schema, what you've done
Hi Amit,
I recently updated NUTCH-1486 [0] with a patch to work against Solr 4.2.1.
You will be able to pull stuff from this patch and push it into your Solr 4
schema file, etc.
I will begin work on upgrading trunk to work with Solr 4 shortly... maybe
this afternoon.
If you are able to help with
It would probably be best to describe what you've tried here, possibly a
paste of your schema, what you've done (if anything) to the Nutch source to
get it working with Solr 4, etc.
The stack trace you get would also be beneficial.
Thank you
Lewis
On Mon, Apr 8, 2013 at 4:13 AM, Amit Sela
Hi Peter,
The patch attached to the issue is for trunk.
If you were able to make a patch for 2.x and upload it to the issue that
would be great. There are API differences so I can tell you that even
though the mongodb indexer classes have been applied, it will most likely
be a fruitless effort.
in an
additional
Parse plugin in order to prevent nutch from crawling the outlinks in the
article page?
At 2013-01-15 13:31:11, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote: I take it you are updating the
database with the crawl data? This will mark all links extracted
during
(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
If I revert to previous release it works fine.
Thanks.
Alex.
-Original Message-
From: Lewis John Mcgibbney lewis.mcgibb...@gmail.com
To: user user@nutch.apache.org
Sent: Fri, Mar 29, 2013 4:30 pm
Hi Kaveh,
Firstly, as logged below, Gora attempts to associate your HBase table
configuration with specified tables (from within gora-hbase-mapping.xml)
however it seems that your case satisfies the condition if
(!tableName.equals(tableNameFromMapping)) meaning that the table name is
not equal to
when I omit the -crawlId parameter ( forcing it
to use the default name webpage ), and more importantly it is new. I
haven't had this problem before, it just started happening 2 days ago
when I pulled the latest commits to 2.x branch.
On 03/29/2013 09:50 AM, Lewis John Mcgibbney wrote:
Hi
(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
If I revert to previous release it works fine.
Thanks.
Alex.
-Original Message-
From: Lewis John Mcgibbney lewis.mcgibb...@gmail.com
the patch a try and see if that fixes my issue.
On Wed, Mar 27, 2013 at 4:29 PM, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
Nutch version please?
Sebastian and others worked on this a while ago.
I don't know about the progress on it. There are most certainly
open/resolved tickets
Nutch version please?
Sebastian and others worked on this a while ago.
I don't know about the progress on it. There are most certainly
open/resolved tickets for it on Jira; please look there.
Thank you
Lewis
On Wed, Mar 27, 2013 at 12:26 PM, Bai Shen baishen.li...@gmail.com wrote:
I'm trying to
Hi Canan,
Thank you for bringing this up, I just noticed that 2.x does not have the
configurable property in nutch-default.xml
<property>
<name>http.redirect.max</name>
<value>0</value>
<description>The maximum number of redirects the fetcher will follow when
trying to fetch a page. If set to
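As a rough illustration of how such a property can be inspected, the sketch below reads `http.redirect.max` from a Hadoop-style configuration document. The embedded XML is a hypothetical stand-in for a local nutch-site.xml, and the helper name `get_property` is made up for this example; only the property name and its value of 0 come from the message above.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a local nutch-site.xml; only the property
# name and the value 0 are taken from the message above.
SAMPLE_CONF = """
<configuration>
  <property>
    <name>http.redirect.max</name>
    <value>0</value>
  </property>
</configuration>
"""


def get_property(conf_xml, name, default=None):
    """Return the value of the named property, or default if it is absent."""
    root = ET.fromstring(conf_xml)
    for prop in root.findall("property"):
        if prop.findtext("name") == name:
            return prop.findtext("value")
    return default


# Fall back to "0" when the property is missing from the file.
max_redirects = int(get_property(SAMPLE_CONF, "http.redirect.max", default="0"))
print(max_redirects)
```

A check like this is a quick way to confirm whether the property is present in a given 2.x configuration before relying on it.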
://www.apachecon.eu/
...
There is already NUTCH-1419: report redirect and do not parse.
@Lewis: I'll review the latest patch soon, so we can sort this out.
@Canan: feel free to open a new Jira to make parsechecker follow
redirects. Thanks!
Sebastian
On 03/25/2013 10:27 PM, Lewis John Mcgibbney wrote
, NUTCH-1389, NUTCH-1419, and NUTCH-1501!
On 03/25/2013 11:22 PM, Lewis John Mcgibbney wrote:
Thanks for clarification on this one Seb.
I was aware that you were clued up on this and hoped you would drop in.
On Monday, March 25, 2013, Sebastian Nagel wastl.na...@googlemail.com
wrote
Hi Alex,
We need to fix this.
Can you please open an issue in the Jira and we can address?
Thank you very much in advance.
Lewis
On Mon, Mar 25, 2013 at 4:53 PM, alx...@aim.com wrote:
Hello,
I would like to let you know that currently Nutch 2.x does not index
redirected pages, independent
Hi All,
After some discussion and drumming up of interest within the Giraph
community, I've logged a Google Summer of Code issue [0] for this topic.
We are looking for interested students to come forward and participate in
the effort.
I logged this over in Giraph as there was no GSoC effort
Can you please turn logging to DEBUG, then step through the job?
Provide any observations please.
Lewis
On Sat, Mar 23, 2013 at 5:38 PM, kamaci furkankam...@gmail.com wrote:
After crawling when I run that command:
bin/nutch solrindex http://localhost:8983/solr -index
Sometimes I get that
On the thread you pointed to Sebastian provides some clues on how to
properly DEBUG the issue.
You can try to DEBUG the issue. By this I mean actually DEBUGGING it, not
just setting logging to DEBUG and hoping for excellent results.. this will
unfortunately not happen.
Can you please confirm your
Hi Prasanna,
I would like to note for the record that I do not know of anyone running
2.X series within windows environment so I am keen to help you get this
working.
Once you build the project source, please make sure that the generated .job
file is on your classpath along with the other
Hi,
You are always encouraged to look at our Jira instance before asking
questions. It really helps both you and us solve problems efficiently.
Please check out
https://issues.apache.org/jira/browse/NUTCH-1377
And comment where you can.
When we eventually do the entire out of the box upgrade to
Hi Everyone,
On behalf of the Nutch PMC I would like to announce and welcome Feng Lu on
board as PMC and Committer on the project.
Amongst others, Feng has been an important part of the Nutch development
over the last while and we would like to welcome him.
@Feng,
Please feel free to say a bit
Nutch provides you with a pretty fine grained (common) logging mechanism.
If you check out conf/log4j.properties you can alter specific tools, or the
entire logging policy to obtain the coarseness you require.
In this instance, I would either set the logging for Injector to DEBUG, or
of course
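To illustrate the kind of change described for conf/log4j.properties, here is a minimal sketch that rewrites the level of a single logger while keeping any appenders listed after it. The helper name is made up for this example, and the logger name `org.apache.nutch.crawl.Injector` and the `cmdstdout` appender are assumptions about the local file; check the actual entries in your own copy.

```python
def set_logger_level(properties_text, logger, level):
    """Return properties_text with log4j.logger.<logger> set to level.

    Appenders listed after the level (e.g. ",cmdstdout") are kept.
    If the logger has no entry yet, one is appended.
    """
    key = "log4j.logger." + logger
    out = []
    found = False
    for line in properties_text.splitlines():
        stripped = line.strip()
        is_comment = stripped.startswith("#")
        if not is_comment and stripped.split("=", 1)[0].strip() == key:
            _, _, value = stripped.partition("=")
            parts = value.split(",", 1)
            appenders = "," + parts[1] if len(parts) == 2 else ""
            out.append(f"{key}={level}{appenders}")
            found = True
        else:
            out.append(line)
    if not found:
        out.append(f"{key}={level}")
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    # Assumed sample entry; the real file may differ.
    sample = "log4j.logger.org.apache.nutch.crawl.Injector=INFO,cmdstdout\n"
    print(set_logger_level(sample, "org.apache.nutch.crawl.Injector", "DEBUG"), end="")
```

Editing the file by hand achieves the same thing; the point is only that a single tool's logger can be raised to DEBUG without touching the root logging policy.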
Hi Amit,
I know this thread is a bit old now, however it is also something which
bugged me when I was looking into something else (InjectorJob counters).
On Tue, Mar 5, 2013 at 3:16 AM, Amit Sela am...@infolinks.com wrote:
And summing all counters does not equal the total map input...
Hi Mustafa,
1. Always tell us what version of the software you are using. It also helps
to mention whether you are using a binary version or src.
2. Please read the responses from users@, you haven't answered which
version of Nutch you're using
3. As I explained, if you check out
seedurl as one of
the metadata.
I was looking for some plugin which I could use but in this case I did not
find any suitable plugin.
Regards,
Anand.
On 13 March 2013 22:40, Lewis John Mcgibbney lewis.mcgibb...@gmail.com
wrote:
Hi Anand,
The first step is to look at the issue over on NUTCH
Hi Anand,
The first step is to look at the issue over on NUTCH-1533
If you feel like addressing anything then please do.
This particular issue has nothing to do with Gora, or Hadoop so you will
not need to look at any of the code there.
I will also be working on that issue when I get some time.
Hi Kiran
Please send me your wiki uid
On Wed, Mar 13, 2013 at 9:56 PM, kiran chitturi
chitturikira...@gmail.com wrote:
Hi!
I have noticed that there are certain sections of Nutch wiki that are
not up to date.
I am planning to update these pages with some pointers to the mailing list
Hi Kiran
On Wed, Mar 13, 2013 at 9:56 PM, kiran chitturi
chitturikira...@gmail.com wrote:
I am planning to update these pages with some pointers to the mailing list
discussion which give valuable information and also JIRA's.
Nice
Second, Does anyone have suggestions on improving/updating
There are numerous methods to do this.
* You can either assign some metadata to each URL when injecting and
bootstrapping the system
* You could embed some meta tags or other distinguishing feature in the URLs
and use the facilities (existing or available in Jira) to identify these
pages.
*You may
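The first option above (attaching metadata to each URL at inject time) can be sketched as follows. This assumes the seed-file convention of one URL per line followed by tab-separated `key=value` pairs; the helper name and the metadata key used here are hypothetical and chosen only for illustration.

```python
def seed_line(url, metadata=None):
    """Build one seed-file line: the URL, then tab-separated key=value pairs."""
    fields = [url]
    for key, value in (metadata or {}).items():
        fields.append(f"{key}={value}")
    return "\t".join(fields)


def write_seed_file(path, entries):
    """Write (url, metadata) pairs to a seed file, one URL per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for url, metadata in entries:
            handle.write(seed_line(url, metadata) + "\n")


if __name__ == "__main__":
    # "source" is a made-up metadata key used only for this example.
    print(seed_line("http://example.com/", {"source": "listA"}))
```

Pages injected this way carry the extra fields with them, which is one way to tell groups of URLs apart later in the crawl.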
Do you have an interest to work on implementing NUTCH-1533?
I would be happy to work on this as well.
Lewis
On Mon, Mar 11, 2013 at 7:39 PM, Anand Bhagwat abbhagwa...@gmail.com wrote:
Thanks for the information. I guess using the batch id is a good idea..
On 11 March 2013 21:50, Lewis John
Hi Marcel,
The WARN can be ignored. Really, it occurs when we commit a job and do the
clean up of a temporary directory. This is not a problem.
On Wed, Mar 6, 2013 at 6:56 AM, mma m...@aufwind.cc wrote:
Is there a possibility to get more information in the hadoop logfile?
Not in this
Hi All,
Over the last while we have been aware of Kiran's ongoing contribution to
the Nutch community.
It is with great pleasure that we invite Kiran to join the Nutch PMC and
also take up Committer role.
@Kiran, please feel free to say a bit about yourself and introduce what
brought you to
mergesegs operation.
I think this could be a useful feature to many Nutch users.
I can see that I won't get any more assistance here.
Thanks,
Jason
On Mar 6, 2013, at 6:18 AM, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
Hi Jason,
There is nothing I can see here which concerns
The invocation Exception means that something further down is the problem.
It looks to be the presence of your URLNormalizer. Make sure the
configuration is all fine, make sure that the resources are available.
This is not a problem with Nutch code, rather how you are using Nutch in
your own code.
Hi,
On Tue, Mar 5, 2013 at 7:22 AM, raviksingh ravisingh.air...@gmail.com wrote:
I am new to Nutch. I have already configured Nutch with MYSQL. I have a few
questions :
I would like to start by saying that this is not a great idea. If you read
this list you will see why.
1. Currently I am
Documentation - No
prior art - yes -
http://www.mail-archive.com/user@nutch.apache.org/msg06927.html
Jira issues - NUTCH-932
Please let us know how you get on. Getting some concrete documentation for
this would be excellent.
Thank you
Lewis
On Tue, Mar 5, 2013 at 7:33 AM, Anand Bhagwat
Hi Jason,
There is nothing I can see here which concerns Nutch.
Try solr lists please.
Thank you
Lewis
On Tuesday, March 5, 2013, Stubblefield Jason
mr.jason.stubblefi...@gmail.com wrote:
I have several Solr 3.6 instances that for various reasons, I don't want
to upgrade to 4.0 yet. My index
Hi,
If you look at the crawl script iirc there is no way to programmatically
obtain the generated batchId(s) from the generator.
This sounds like the source of the problem.
As Kiran said though, the Nutch crawl script is the way forward ;)
On Monday, March 4, 2013, kiran chitturi
Please don't go ahead and delete the parse directories just yet before you
hear back from others.
My suggestion would be to try and delete a subsection of the directories
and see if this is possible.
Have you changed some configuration and now want to parse out some more
content/structure?
On
for 1.x like 2.x has.
Regards,
Kiran.
On Mon, Mar 4, 2013 at 4:51 PM, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
Please don't go ahead and delete the parse directories just yet before
you
hear back from others.
My suggestion would be to try and delete a subsection
successfully with
plugins.
On Thu, Feb 28, 2013 at 4:58 PM, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
honestly, I think we should get this fixed.
Can someone please explain to me why we don't build every plugin
within
Nutch 2.x?
I think we should.
On Thu
tried
to
build 2.x with Eclipse
i) Feed
ii) parse-swf
iii) parse-ext
iv) parse-zip
v) parse-metatags ( I wrote patch for this earlier, NUTCH-1478)
The above plugins need to be ported to build 2.x successfully with
plugins.
On Thu, Feb 28, 2013 at 4:58 PM, Lewis John Mcgibbney
This shouldn't be happening but we are aware (the Jira instance reflects
this) that there are some existing compatibility issues with Nutch 2.x HEAD.
IIRC Kiran had a patch integrated which dealt with some of these issues.
What I have to ask is what JDK are you using? I use 1.6.0_25 (I really need
to be ported due to the API changes.
https://issues.apache.org/jira/browse/NUTCH-874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
On Thu, Feb 28, 2013 at 3:26 PM, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
This shouldn't be happening but we are aware (the Jira
must have an awesome cluster to run this
:)
Thanks,
Tejas Patil
On Thu, Feb 28, 2013 at 12:06 PM, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
Hi,
I pushed a real simple script which I use as a cron job to bootstrap
Apache Nutch with 1M URLs every day.
For those wanting
Have you looked at the java code?
I am curious (and confused) about this different batch id (null) logging
and want to either get rid of it... or better... make it more informative
which would address both of our concerns.
I would like not only to document this in the java code but also on the
What for? What do you want?
We are discussing (in the Gora community) making a gora-pig module so that
there is a unified mechanism for doing pig driven inference of the data you
hold in gora-* stores. Are you interested in engaging in that conversation?
In all honesty (although indirectly linked)
Hi
On Wednesday, February 27, 2013, adfel70 adfe...@gmail.com wrote:
Yes I looked at the code.
Great
I saw that the shouldProcess() check is performed on each file in the mapper.
In nutch1.* I got used to a method in which in each cycle only a set of
urls is being processed.
Is nutch2.*
Hi,
It is clear that Nutch (or more specifically Nutch 2.x) is not
interoperable with this or some Hadoop distributions, in this case it is
CDH4.
It is not an easy problem to address from a community-to-work ratio point
of view, especially with Nutch 2.x where there are multiple libraries which
we
Glad to confirm that it was something wrong with your local windows
environment Danilo and that it is now fixed. I tried to get nightly windows
7 builds running for Nutch on the Apache build infrastructure but I've been
unable to do so yet.
On Wed, Feb 27, 2013 at 9:31 AM, Danilo Fernandes
We will be working on better support (gora-pig adapter) for this
functionality in Apache Gora 0.3.
For now Kiran's suggestion is by far the best.
Thank you
Lewis
On Tue, Feb 26, 2013 at 10:17 AM, kiran chitturi
chitturikira...@gmail.com wrote:
I found apache pig [1] convenient to use with Hbase
What is the problem? There is a community here that can help... if we know
what is wrong!
On Tue, Feb 26, 2013 at 7:44 AM, Danilo Fernandes
dan...@kelsorfernandes.com.br wrote:
I tried both and neither works! :(
-Original Message-
From: kiran chitturi
shed some light.
Thanks,
Tejas
Patil
On Tue, Feb 26, 2013 at 10:32 AM, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
What is the problem? There
is a community here that can help... if we know what is wrong! On Tue,
Feb 26, 2013 at 7:44 AM, Danilo Fernandes
dan
Hi kaveh,
Size of crawl database is not an issue with regards to migration between
Nutch versions, it is a compatibility issue which you need to be concerned
about.
There are no tools currently available in Nutch (as far as I know) to read
URLs from hdfs and import/inject your crawl data into your
Hi Danilo,
You can check out the architecture changes here
http://wiki.apache.org/nutch/#Nutch_2.x
Nutch trunk (1.7-SNAPSHOT) is here
http://svn.apache.org/repos/asf/nutch/trunk/
2.x is here
http://svn.apache.org/repos/asf/nutch/branches/2.x/
On Mon, Feb 25, 2013 at 1:56 PM, Danilo Fernandes
Hi Markus,
This is very useful thank you.
Lewis
On Mon, Feb 25, 2013 at 3:08 PM, Markus Jelsma
markus.jel...@openindex.io wrote:
Something seems to be missing here. It's clear that 1.x has more features
and is a lot more stable than 2.x. Nutch 2.x can theoretically perform a
lot better if you
e.g
about a large webtable which would have to be entirely passed to
mapreduce
even if only a handful of entries are to be processed.
Makes sense?
Julien
On 21 February 2013 01:52, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
Those filters are applied only to URLs which do
records?
Thanks.
Alex.
-Original Message-
From: Lewis John Mcgibbney lewis.mcgibb...@gmail.com
To: user user@nutch.apache.org
Sent: Wed, Feb 20, 2013 12:56 pm
Subject: Re: nutch with cassandra internal network usage
Hi Alex,
On Wed, Feb 20, 2013 at 11