Hi Alex,
On Thu, Oct 9, 2014 at 9:45 AM, user-digest-h...@nutch.apache.org wrote:
I can't help but think that you have too many moving pieces here! Few of
which now appear to be 'stable' enough.
I would highly encourage you to look at
https://issues.apache.org/jira/browse/NUTCH-1843
This is
Can you also make sure that the cluster name, fully qualified address,
and port agree between the mapping and gora.properties?
Thanks
On Tuesday, September 30, 2014, Renato Marroquín Mogrovejo
renatoj.marroq...@gmail.com wrote:
Hi Kartik,
If TTL hasn't been set or if it has been set to 0, then
Hi Folks,
On Thu, Sep 25, 2014 at 10:30 PM, user-digest-h...@nutch.apache.org wrote:
I have never used the Nutch web admin. The web admin that you used is very old. Maybe
you can use our brand new web admin development (
https://issues.apache.org/jira/plugins/servlet/mobile#issue/NUTCH-841).
Now it is
Hi Folks,
I've added a document for crawling hidden services .onion sites present
within the Tor network.
The documentation is available on the Nutch wiki
https://wiki.apache.org/nutch/SetupNutchAndTor
Hope some folks find this helpful.
Thank you to Roger Dingledine from Tor for his patience and
Hi Johannes
On Tue, Sep 16, 2014 at 10:19 AM, user-digest-h...@nutch.apache.org wrote:
is it possible to have nutch as a kind of stand-alone crawl server only
spoken to via the REST API?
Yes this is possible.
We just finished a Google Summer of Code project which addresses exactly
this via
Hi Markus,
On Tue, Sep 16, 2014 at 10:19 AM, user-digest-h...@nutch.apache.org wrote:
Hi - So you are not using it for scoring right, but to inspect the graph
of the web.
Yeah, I think that this is a pretty accurate statement.
Then there's certainly no need to weed out loops using the
Hi Markus,
On Wed, Sep 10, 2014 at 10:28 PM, user-digest-h...@nutch.apache.org wrote:
Weird, i didn't see my own mail arriving on the list, i sent it via kmail
but am on webmail now, which seems to work.
sigh ;)
Anyway, for vertical search on a whole website i would rely on your
are
popular, so that means large scale.
On Wednesday 10 September 2014 07:43:34 Lewis John Mcgibbney wrote:
Hi Markus,
On Wed, Sep 10, 2014 at 2:00 AM, user-digest-h...@nutch.apache.org
wrote:
Hey Lewis,
We didn't use it in the end, but did run the LinkRank on large amounts
Hi Sachin,
On Tue, Sep 9, 2014 at 8:38 AM, user-digest-h...@nutch.apache.org wrote:
Hi all, I am trying to make Nutch 2.3 compatible with Hadoop 2, and I am now
facing some problems.
I have configured Apache Gora 0.4 and HBase 0.94 with Nutch 2.3,
so now when I inject the URLs into the database a
Hi Julien,
Apologies for the delay; this thread is old now but still important and
relevant.
On Mon, Sep 1, 2014 at 2:45 AM, user-digest-h...@nutch.apache.org wrote:
Our FAQ page [http://wiki.apache.org/nutch/FAQ] needs a bit of an update.
Some of the items on it are now irrelevant (search and
Hi cervenkovab,
This is an inherent design choice we made whilst developing the gora-cassandra
module into what it is now.
Ultimately we store all data as a Byte Array. CQLSH subsequently gets data
as it is within Cassandra. Therefore no decoding is done on the client side
before the data is presented
Hi Martin,
On Mon, Sep 1, 2014 at 2:45 AM, user-digest-h...@nutch.apache.org wrote:
Thank you for working on this project, Феодор!
I hope you enjoyed working on it and you will continue contributing to Open
Source projects.
+1
Nutch community please let us know when you are ready to
Hi Folks,
I thought I would make an announcement regarding a project which has been
ongoing over the summer and which has now successfully passed the Google Summer
of Code 2014 program.
Our student, Fjodor Vershinin, approached the Nutch community some time in
February of this year to express his
Hi Nicholas,
NOTE: Thread name has changed to reflect diversion on topic.
On Fri, Aug 29, 2014 at 6:01 AM, user-digest-h...@nutch.apache.org wrote:
will you use config management like ansible backing vagrant?
Well thanks for the links here. The GitHub repos they have indicate that
they
Hi Julien,
On Fri, Aug 29, 2014 at 6:01 AM, user-digest-h...@nutch.apache.org wrote:
Just out of interest, what sort of analytics do you do and why is it better
to do it in 2.x than 1.x?
Nowhere did I say it was better or worse than in 1.X. Let me be clear here.
I use Nutch 2.X, as I
Hi Mo,
On Thu, Aug 28, 2014 at 3:33 PM, user-digest-h...@nutch.apache.org wrote:
Sorry for the late reply.
Me included. This email was lost in the pile!
I use Nutch 2.x as it enables me to do analytics over the data I am
crawling. This is my justification for trying to maintain and further
Hi Azhar,
On Tue, Aug 19, 2014 at 5:16 AM, user-digest-h...@nutch.apache.org wrote:
As suggested in the report, I dropped my ~/.ivy2 folder, re-ran and was
able to build successfully. Once your repo is in a broken state then you're
stuck with the hostname added to revision issue.
Fantastic.
Hi Everyone,
The Apache Nutch PMC are pleased to announce the immediate release of
Apache Nutch v1.9, we advise all current users and developers of the 1.X
series to upgrade to this release.
Apache Nutch is a highly extensible and scalable open source web crawler
software project. Nutch is a
Hi Azhar,
On Mon, Aug 18, 2014 at 1:01 AM, user-digest-h...@nutch.apache.org wrote:
Subject: Nutch Ant-Ivy build issue resolving HBase dependencies
Hi
I'm having a problem with resolving dependencies while building Nutch
2.2.1. Have added the dependency in ivy.xml to use gora-hbase.
Afternoon Troops,
72hrs has come and gone therefore I am closing the VOTE'ing. Results can be
seen below.
[4] +1 Push the release, I am happy :)
Julien Nioche *
Lu Feng *
Sebastian Nagel *
Lewis John McGibbney *
[ ] +0 I am not bothered either way
[ ] -1 I am not happy with this release
Hi,
On Tue, Aug 12, 2014 at 1:33 AM, user-digest-h...@nutch.apache.org wrote:
Hi, everyone:
I integrated Nutch/Solr/HBase to construct a search engine. It works well,
except that some fields in the schema.xml are not indexed to Solr.
The fields in <!-- core fields --> and <!-- fields for
-10-28
Lewis John McGibbney (CODE SIGNING KEY) lewi...@apache.org
Please VOTE as follows
[ ] +1 Push the release, I am happy :)
[ ] +0 I am not bothered either way
[ ] -1 I am not happy with this release candidate (please state why)
Firstly thank you to everyone that contributed to Nutch, it is greatly
VOTE'ing will be open for 'at-least' 72 hours to allow people enough time
to cast their VOTE's.
Thanks
Lewis
On Tue, Aug 12, 2014 at 10:31 PM, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
Hi user@ dev@,
This thread is a VOTE for releasing Apache Nutch 1.9. The
release candidate
Hi Hung,
Nutch 2.X ships with an hbase-site.xml file.
https://svn.apache.org/repos/asf/nutch/branches/2.x/conf/hbase-site.xml.template
Can you not use that for your configuration?
On Thu, Aug 7, 2014 at 6:44 PM, user-digest-h...@nutch.apache.org wrote:
We have been trying Nutch for 2 days,
Hi Mohammed,
On Wed, Jul 30, 2014 at 11:46 AM, user-digest-h...@nutch.apache.org wrote:
Available at: https://github.com/momer/nutch-selenium-grid-plugin
This looks fantastic. Are you interested in bringing it into the codebase? I
think that this would be very useful to many users of
Hi Muhamad,
On Thu, Jul 24, 2014 at 4:25 AM, user-digest-h...@nutch.apache.org wrote:
Has anyone ever used MongoDB as storage in Nutch 2.2.1?
Please advise.
I personally have not, no, but there is absolutely no reason why you can't.
I know that the authors of the module e.g. Dictanova,
Hi Folks,
As part of an ongoing project this summer within my work, I am working with
Apache Nutch and therefore promoting the software as much as possible to
everyone I meet.
We received some feedback** on the site content and design and I wondered
if people would mind reviewing the feedback and
... and the link
https://docs.google.com/spreadsheets/d/1FKD30fzojIDrSQ03qPztLpjToaIohT9_FMuGca6kxug/edit?usp=sharing
On Mon, Jun 30, 2014 at 3:07 PM, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
Hi Folks,
As part of an ongoing project this summer within my work, I am working
Hi Alex,
On Thu, Jun 26, 2014 at 5:48 AM, user-digest-h...@nutch.apache.org wrote:
I already came up with changes to the code similar to those in this patch. My only
suggestion for this patch's code is to move the check for whether the url exists in
the datastore under
if (!additionsAllowed) {
Hi Folks,
For those that are interested, I would like to guide you towards the
growing documentation our student Fjodor is producing.
https://wiki.apache.org/nutch/NutchRESTAPI
Thanks Fjodor... keep this coming, this is great and extremely helpful.
Best
Lewis
--
*Lewis*
Hi Alex,
I am really sorry for not making the connection here.
On Tue, Jun 24, 2014 at 12:31 AM, user-digest-h...@nutch.apache.org wrote:
So far, this looks like a bug in updatedb when filtering with batchId.
I could only find one solution, to check if new pages are in the datastore
and
Hi Alex,
On Tue, Jun 17, 2014 at 2:06 PM, user-digest-h...@nutch.apache.org wrote:
I am using nutch-2.x with GORA_97.
You mean GORA-94, the Avro upgrade?
With which gora- backend please?
Further investigation shows that DbUpdateReducer
calls
inlinkedScoreData.clear();
I see this on
Hi Folks,
I've opened a channel on IRC for Nutch.
It's at #nutch
For those of you interested in joining the room via browser, you can do so
here
http://webchat.freenode.net/
Thanks
Lewis
--
*Lewis*
Yep
Do you fancy making your first commit to the new CMS?
;)
On Wed, Jun 18, 2014 at 10:51 AM, Markus Jelsma mar...@openindex.io wrote:
Cool Lewis. If this is there to stay, shouldn't we advertise it on our
homepage?
Markus
On Wednesday, June 18, 2014 10:34:13 AM Lewis John Mcgibbney
UPDATE
We are on #nutchbot
Someone took #nutch already!
See you there.
On Wed, Jun 18, 2014 at 10:34 AM, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
Hi Folks,
I've opened a channel on IRC for Nutch.
It's at #nutch
For those of you interested in joining the room via browser
Hi Folks,
I recently attacked [0] which now enables us to run our site as a content
management system as opposed to a static host for primitive documentation.
Hopefully the site is easier to navigate now and of course it is MUCH
easier for us to maintain as a community.
Please see the README
Hi,
On Thu, Jun 5, 2014 at 11:19 PM, user-digest-h...@nutch.apache.org wrote:
# skip URLs containing certain characters as probable queries, etc.
#-[?*!@=]
Is this your preference?
Anyways, you need to try and debug why there ends up being no Map Input
records for the Generate phase.
I
which version of Nutch are you using?
Nutch 2 what?
On Thu, Jun 5, 2014 at 12:14 PM, Manikandan Saravanan
manikan...@thesocialpeople.net wrote:
Dear Lewis,
I’m running Nutch 2 on a Hadoop 1.2.1 cluster (2 nodes). I’m using
Cassandra as my backend datastore . I’m trying to crawl one link as
It looks like the InjectorJob phase successfully injects your 1 URL in to
Cassandra Keyspace.
On Thu, Jun 5, 2014 at 12:14 PM, Manikandan Saravanan
manikan...@thesocialpeople.net wrote:
14/06/05 15:01:02 INFO mapred.JobClient: Map input records=1
...
14/06/05 15:01:02 INFO
Hi Karl-Philipp,
On Thu, May 29, 2014 at 12:50 AM, user-digest-h...@nutch.apache.org wrote:
2. downloaded and started HBase, shell is running, test database
creation successful
Please make sure for Nutch version 2.2.1 you are using HBase version 0.90.4.
Please also make sure that you use
Hi,
On Mon, May 26, 2014 at 11:20 AM, Manikandan Saravanan
manikan...@thesocialpeople.net wrote:
I’m running Nutch 2
Which version?
Do you have the code packaged in to the .job jar?
You need to look in there and see? It seems that it is not there.
Hi Folks,
Has anyone done this before?
Is email archiving something which we can do or not?
I've been playing around with Geronimo's Javamail library and wondered if
we could use it as Protocol extensions for the above protocols.
Any thoughts?
Lewis
--
*Lewis*
Hi BlackIce,
On Sun, May 11, 2014 at 9:20 AM, user-digest-h...@nutch.apache.org wrote:
Subject: Nutch 2.x from svn.
java.lang.Exception: java.lang.IllegalStateException: Target host must not
be null, or set in parameters.
Which version of httpcore is included in /lib or on classpath?
Hi BlackIce,
On Sun, May 11, 2014 at 9:20 AM, user-digest-h...@nutch.apache.org wrote:
You are correct, I did some research and found it to be a TIKA issue. It
is fixed by setting the Title field to multivalued in schema.xml. I think
by default the Nutch schema should be updated
Hi BlackIce,
On Sat, May 3, 2014 at 10:52 PM, user-digest-h...@nutch.apache.org wrote:
Any idea on when Nutch 2.3 will be released?
Thnx
We have a roadmap here
https://issues.apache.org/jira/browse/NUTCH/?selectedTab=com.atlassian.jira.jira-projects-plugin:roadmap-panel
What would be
Hi BlackIce,
On Sat, May 3, 2014 at 10:52 PM, user-digest-h...@nutch.apache.org wrote:
Does anyone have a good nutch/solr 4.7 schema file?
What about the one here
http://svn.apache.org/repos/asf/nutch/branches/2.x/conf/schema.xml
???
Lewis
Hi,
On Fri, Apr 25, 2014 at 11:15 AM, user-digest-h...@nutch.apache.org wrote:
From what you said earlier,
Isn't that the same as contentLength in index-more plugin which is
determined according
to the type of download page?
Pretty much ;)
It would be interesting to see if you could use
Good Afternoon Everyone,
The Apache Gora team are very proud to announce the immediate release of
Gora 0.4 which is a major release for the project.
The Apache Gora open source framework provides an in-memory data model and
persistence for big data. Gora supports persisting to column stores, key
Hi Julien,
On Wed, Apr 23, 2014 at 1:56 PM, Julien Nioche
lists.digitalpeb...@gmail.com wrote:
Great news! Well done and thanks to everyone involved. I am sure this will
be popular with the Nutch 2.x users.
+1
BTW I can smell a rematch of
Hi Folks,
A quick message to say that by the end of the summer it looks like we will
have a killer Wicket-based Web Application for Nutch 2.x branch as the
project was successfully accepted into this year's GSoC program. :) :) :)
Thanks
Lewis
--
*Lewis*
11th April 2014 - crawler-commons 0.4 is released
We are glad to announce the 0.4 release of Crawler Commons. Amongst other
improvements, this release includes support for Googlebot-compatible
regular expressions in URL specifications, further improvements to
robots.txt parsing and an upgrade of
Hi Alex,
Regarding your findings and the code you've posted.
Can you please open an issue on Jira, posting in the code changes (or even
better a patch :) )
If you could link it to the Gora 0.4 upgrade issue in Nutch Jira it would
be excellent.
Thanks v much Alex.
Lewis
On Fri, Apr 11, 2014 at
Hi,
On Wed, Apr 9, 2014 at 8:43 AM, user-digest-h...@nutch.apache.org wrote:
Subject: Re: Crawl Anonymously
Set up an anonymous proxy server and configure Nutch to crawl through the
proxy.
http://wiki.apache.org/nutch/SetupProxyForNutch
Hi
On Wed, Apr 9, 2014 at 8:43 AM, user-digest-h...@nutch.apache.org wrote:
user Digest 9 Apr 2014 14:43:51 - Issue 2188
I might not be thinking in the right direction so need some help. Is there
a way to find an approximate web content size of a particular website in
Nutch 2.2.1?
Hi reddibabu,
On Mon, Apr 7, 2014 at 7:20 AM, user-digest-h...@nutch.apache.org wrote:
I am using Nutch 1.7. Nutch is set up on a Linux box.
My requirement is to stop a Nutch crawl in the middle and restart it from where
it was stopped.
I don't want to lose any crawled data. Is there any
Hi Alex,
On Tue, Apr 1, 2014 at 7:34 AM, user-digest-h...@nutch.apache.org wrote:
I have applied the patch to the current trunk
Here is the output of ant
...SNIP...
Also, here are the files that import/use the StateManager class, which seems to
have been removed from GORA_94
...SNIP...
Thanks.
Hi Folks,
We are pleased to announce that the Nutch PMC recently VOTE'd to extend an
invitation to Talat inviting him to join our PMC. His ongoing mailing list
contributions and code contributions (mostly) to the 2.X branch have been
evident for some time now and we are really glad to have him on
Hi Shane,
On Sat, Mar 29, 2014 at 10:15 PM, user-digest-h...@nutch.apache.org wrote:
MYSQL version 5.6.16
Nutch version 2.2
I would always suggest that you use the most up-to-date version of Nutch.
For the 2.x branch that is 2.2.1. This may or may not have been fixed.
Support for gora-sql
Hi Shane,
On Sat, Mar 29, 2014 at 10:15 PM, user-digest-h...@nutch.apache.org wrote:
How do you use the readdb command when using MySQL? There is no crawldb
created.
A physical crawldb residing on HDFS is non-existent. Its equivalent in 2.x
is the WebPage table which you will see is created
Hi alxsss,
On Sat, Mar 29, 2014 at 10:15 PM, user-digest-h...@nutch.apache.org wrote:
I downloaded the GORA_94 branch and with libs from it I get
14/03/27 11:21:19 INFO crawl.InjectorJob: InjectorJob: Injecting urlDir:
test_urls
Exception in thread main java.lang.NoClassDefFoundError:
Hi everyone,
This is just a friendly reminder that there are just about 48 hours left
until the deadline for student applications this year.* Late proposals will
not be accepted for any reason.
It's great to see some applications coming in.
Best
Lewis
Hi Folks,
On Wed, Mar 19, 2014 at 7:49 PM, user-digest-h...@nutch.apache.org wrote:
Re: Book of Nutch
So what was the issue with this book?
You can see some rather interesting reviews online
http://s.apache.org/qy
Hi anupamk,
On Tue, Mar 18, 2014 at 2:45 AM, user-digest-h...@nutch.apache.org wrote:
While running the two crawlers concurrently I have run into problems,
and Nutch sometimes throws an IOException saying that the .locked file
exists in crawldb. While one of the crawl scripts tries to
Hi BlackIce,
On Wed, Mar 19, 2014 at 3:07 PM, user-digest-h...@nutch.apache.org wrote:
HI,
This is my first try at running Nutch in pseudo dist. When trying to run any nutch
command from the /runtime/deploy folder I get the following error:
Which version of Hadoop?
Check the classpath for the offending
Hi a.ciccia04,
On Wed, Mar 19, 2014 at 3:07 PM, user-digest-h...@nutch.apache.org wrote:
I'm working with apache-nutch-2.2.1, hbase-0.90.4 and solr-4.7.0
You've not stated how you've configured your stack.
You've not mentioned how many machines you run with.
This may simply be an IO problem
Hi anupamk,
On Sat, Mar 15, 2014 at 11:59 PM, user-digest-h...@nutch.apache.org wrote:
I would like to know if I can configure nutch to solrindex the Content::
part of the record rather than ParseText:: part.
I really don't know but it would be nice to make this configurable. It would
to everyone who
contributed to Nutch 1.8 development drive.
Best
Lewis
On Tue, Mar 11, 2014 at 10:17 PM, lewis john mcgibbney
lewi...@apache.org wrote:
Hi user@ dev@,
This thread is a VOTE for releasing Apache Nutch 1.8 RC#2. The release
candidate comprises the following components
Good Evening,
The Apache Nutch PMC are pleased to announce the immediate release of
Apache Nutch v1.8.
Apache Nutch is a highly extensible and scalable open source web crawler
software project. Stemming from Apache Lucene, the project has diversified
and now comprises two codebases, namely:
Hi,
On Wed, Mar 5, 2014 at 4:32 AM, user-digest-h...@nutch.apache.org wrote:
There is a property to set the precedence for the jars in the submitted
MapReduce job over the Hadoop jars. I don't recall the name of the property
but you can try googling it.
mapreduce.job.user.classpath.first =
Hi Markus,
On Wed, Mar 5, 2014 at 4:32 AM, user-digest-h...@nutch.apache.org wrote:
-1 we still need to fix the last issue of the new segment merger test and
the indexing issues. We also introduced the hostdb in 1.8 but it doesn't
seem to work. Releasing 1.8 now is releasing a broken Nutch
Hi All Nutch'ers,
This thread is a VOTE for releasing Apache Nutch 1.8. The release candidate
comprises the following components.
* A staging repository [0] containing various Maven artifacts
* A branch-1.8 of the trunk code [1]
* The tagged source upon which we are VOTE'ing [2]
* Finally, the
Hi Mateusz,
On Wed, Feb 26, 2014 at 9:59 AM, user-digest-h...@nutch.apache.org wrote:
I'm investigating Nutch REST api
Nice
(I couldn't find any documentation).
Yeah, this is something which will hopefully become part of this year's GSoC
if the project gets a go ahead. Right now it is
Hi Julien,
On Fri, Feb 21, 2014 at 3:09 PM, user-digest-h...@nutch.apache.org wrote:
Hi,
Just in case you missed it, here is a blog post from Jordan Mendelson on
how they moved to Nutch :
http://commoncrawl.org/common-crawl-move-to-nutch/
Julien
I think a significant portion of this
Hi Gavin,
On Wed, Feb 12, 2014 at 9:24 AM, user-digest-h...@nutch.apache.org wrote:
ParserJob: starting
ParserJob: resuming:false
ParserJob: forced reparse:false
ParserJob: parsing all
Parsing http://www.tianya.cn/
Parsing http://www.163.com/
Parsing http://www.hao123.com/
Hi Manikandan,
On Wed, Feb 5, 2014 at 7:36 AM, Manikandan Saravanan
manikan...@thesocialpeople.net wrote:
I'm getting this when running the crawl script right after the parse phase
Exception in thread main java.lang.IllegalArgumentException: usage:
(-crawlId id)
Something wrong with
://thesocialpeople.net
On 4 February 2014 at 1:28:39 am, Lewis John Mcgibbney
(lewis.mcgibb...@gmail.com) wrote:
Hi Manikandan,
On Mon, Feb 3, 2014 at 3:45 PM, user-digest-h...@nutch.apache.org
wrote:
And then, I'm running this:
$HADOOP_HOME/bin/hadoop jar /usr/local/nutch
Hi Manikandan,
On Mon, Feb 3, 2014 at 3:45 PM, user-digest-h...@nutch.apache.org wrote:
And then, I'm running this:
$HADOOP_HOME/bin/hadoop jar /usr/local/nutch/nutch.job
org.apache.nutch.crawl.Crawler dmoz -dir /user/hduser/crawl -depth 3 -topN
5000
You're using the Crawler class. This is
Hi Alberto,
On Fri, Jan 31, 2014 at 5:26 AM, user-digest-h...@nutch.apache.org wrote:
I noticed that the crawl script generates the batch id but the Crawler
class doesn't. So I changed Nutch's code to generate the batch id and the
problem was solved.
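
To make the batch id idea concrete, here is a minimal Python sketch. It is not code from Nutch or from the crawl script; it only assumes the shape seen in ids such as 1390426083-1144459470 quoted elsewhere on this list, i.e. epoch seconds, a hyphen, then a random non-negative integer. The function name and parameters are hypothetical.

```python
import random
import time


def make_batch_id(now=None, rng=None):
    """Return an id shaped like '<epoch seconds>-<random non-negative int>'.

    Assumption: this mirrors the look of ids such as 1390426083-1144459470;
    it is an illustration, not Nutch's own generator.
    """
    # Use the supplied timestamp if given, otherwise the current time.
    seconds = int(time.time() if now is None else now)
    # Allow a seeded generator to be passed in for reproducible runs.
    rng = rng or random.Random()
    # Random part is drawn from [0, 2**31), i.e. a non-negative 32-bit int.
    return f"{seconds}-{rng.randrange(2**31)}"


if __name__ == "__main__":
    print(make_batch_id())
```

The point is simply that each generate/fetch/parse/update cycle can be tied together by one id created up front and passed to every step.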
My questions:
1. I guess that there is a
Hi d_k,
On Wed, Jan 22, 2014 at 2:18 PM, user-digest-h...@nutch.apache.org wrote:
nekohtml-0.9.5.jar and nekohtml-1.9.17.jar and I suspect that the file
that's actually loaded is 0.9.5.
So do I
If I execute 'ant clean' followed by 'ant runtime' (the patch was already
applied) the only
Hi Teague,
On Wed, Jan 22, 2014 at 2:18 PM, user-digest-h...@nutch.apache.org wrote:
@Markus: When you say that the problem may be with url filters, what can I
do about that?
By default Nutch uses a regex urlfilter for filtering out URLs which we
assume will ultimately mess up your crawlDB.
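
As a rough illustration of the regex URL filter idea, here is a minimal Python sketch (not Nutch's Java implementation). It assumes rules in the regex-urlfilter.txt style, where a leading + accepts and a leading - rejects, the first matching rule decides, and a URL matching no rule is dropped. The rule text, function names and sample URLs are made up for illustration, and Python's re syntax differs from Java's in places.

```python
import re

# Hypothetical rule file content: '+' accepts, '-' rejects, '#' is a comment.
RULES = """
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# accept anything else
+.
"""


def parse_rules(text):
    """Turn rule text into a list of (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # ignore blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in ("+", "-"):
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(url, rules):
    """Return True if the first rule matching the URL is an accept rule."""
    for accept, pattern in rules:
        if pattern.search(url):  # substring search, not a full match
            return accept
    return False  # assumption: no matching rule means the URL is rejected


if __name__ == "__main__":
    rules = parse_rules(RULES)
    for url in ("http://example.com/page.html",
                "http://example.com/search?q=nutch"):
        print(url, accepts(url, rules))
```

Because the first matching rule wins, the order of lines matters: a broad accept rule placed above a reject rule will let through URLs the reject rule was meant to stop.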
Hi weishenyun,
On Wed, Jan 22, 2014 at 2:18 PM, user-digest-h...@nutch.apache.org wrote:
Hi lewis,
I also found that there is something wrong in the DBUpdaterReducer. See
below code block:
snip...
This sentence 'page.putToInlinks(new Utf8(inlink.getUrl()), new
Hi Folks,
Sorry for cross posting...
Simple question.
Is anyone using the above combination?
When I try and fetch a batchId e.g. ./bin/fetch 1390426083-1144459470,
sometimes I am unable to fetch pages and my logging indicates
0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s,
Hi d_k,
On Mon, Jan 20, 2014 at 11:39 AM, user-digest-h...@nutch.apache.org wrote:
Posting back as promised. :-)
Great
I just encountered the error java.lang.NoClassDefFoundError:
org/cyberneko/html/parsers/DOMFragmentParser and applied the patch
NUTCH-1253-2.x-v2.patch from NUTCH-1253
Hi user@/dev@,
Please see the message below regarding travel assistance opportunities for
people wishing to attend ApacheCon 2014 which will be held in Denver,
Colorado, April 7-9, 2014.
Kind Regards
Lewis
-- Forwarded message --
From: lewis john mcgibbney lewi...@apache.org
Hi Yann,
On Thu, Jan 16, 2014 at 6:09 AM, user-digest-h...@nutch.apache.org wrote:
Anything I can do on my side at this point?
If you can reproduce this at your end, and suspect that others should be
able to reproduce it as well, then I would suggest logging a ticket in our
Jira. Please make
Hi Jason,
On Thu, Jan 16, 2014 at 6:09 AM, user-digest-h...@nutch.apache.org wrote:
I had tried +^http://www.cancer.gov/cancertopics/druginfo
http://www.cancer.gov/cancertopics/druginfo/lungcancer.*
or +^http://www.cancer.gov/cancertopics/druginfo/
Hi weishenyun,
On Thu, Jan 16, 2014 at 6:09 AM, user-digest-h...@nutch.apache.org wrote:
I have tried to use Nutch 2.2.1 recently, using HBase as storage, and I found
that the column family il (inbound link) was missing. I have set
db.update.max.inlinks = 1000 but none of il was there. Do
Hi Folks,
It's become obvious that folks who are dabbling with Nutch 2.x deployments
are either using, or wish to use, stable Gora SNAPSHOTs e.g. Gora trunk
0.4-SNAPSHOT at time of writing.
I put together this wiki page [0] for those people.
Please send comments to the user@nutch list and we will
Hi Manikandan,
Please see
https://issues.apache.org/jira/browse/NUTCH-1696
Can you please try this patch out and comment on the issue.
I hope this works for you.
Thank you
Lewis
On Tue, Jan 7, 2014 at 6:05 PM, Manikandan Saravanan
manikan...@thesocialpeople.net wrote:
Hi,
I’m running Nutch
Hi Rajni,
On Sun, Dec 22, 2013 at 12:39 PM, user-digest-h...@nutch.apache.org wrote:
what can be error.
Please try updating the sql database with the updatedb command.
I would also advise you against using the crawl command and subsequent
class. Please chain together individual commands and
Hi rk_sharma,
On Wed, Dec 18, 2013 at 9:40 PM, user-digest-h...@nutch.apache.org wrote:
Hi i am using nutch on rhel-5 and facing an exception
[root@localhost local]# bin/nutch crawl urls -dir crawl -depth 3 -topN 5
InjectorJob: Using class org.apache.gora.memory.store.MemStore as the Gora
Hi Nibal,
On Sun, Dec 15, 2013 at 11:26 PM, user-digest-h...@nutch.apache.org wrote:
of Single Page Web-apps and JavaScript-only web-applications is
sky-rocketing. Well, isn't this a high priority issue?
It would appear not. Unless folk provide patches then core contributors
have not
Hi Rajni,
On Thu, Dec 19, 2013 at 3:53 PM, user-digest-h...@nutch.apache.org wrote:
Thanks for your suggestion.
I want to use MySQL as the storage backend for Gora in Nutch 2.2.1. Is there
any proper documentation that can be followed? For the last two days I have been searching
for that but could not find any
Hi Nguyen,
On Fri, Dec 13, 2013 at 4:28 AM, user-digest-h...@nutch.apache.org wrote:
I am crawling a list of home pages to discover new articles; the crawler will
stop at depth 1. But at depth 1, the crawler still adds many new URLs with depth
2, so even if I only crawl up to depth 1 the crawldb still
Hi,
In the process of addressing and porting NUTCH-840 [0], I've discovered a
couple of anomalies.
Within org.apache.nutch.tika.TestDOMContentutils#setup trunk uses
org.cyberneko.html.parsers.DOMFragmentParser like so
private static void setup() throws Exception {
conf =
Hi Folks,
A quick post here to promote an event Apostolos Giannakidis (Apache Gora's
GSoC student this year) and myself will be speaking at in Dublin this
coming Monday.
Event info and registration can be found below
Hi Talat,
On Sat, Dec 7, 2013 at 5:44 PM, user-digest-h...@nutch.apache.org wrote:
Hi Vangelis,
I drew a Nutch Software Architecture diagram. Maybe it can help you.
https://drive.google.com/file/d/0B2kKrOleEOkRQllaTGdRZGFMY2M/edit?usp=sharing
Talat
Would you be interested in
Hi d_k,
Can you please check out this issue
https://issues.apache.org/jira/browse/NUTCH-1253
I uploaded a patch on Feb 7th 2013 which has not been tested but which I
hope will fix this issue. Can you please read up on the Jira issue and test
the patch?
Please also see my comments below
On Tue,
Hi,
On Tue, Dec 10, 2013 at 8:46 PM, user-digest-h...@nutch.apache.org wrote:
So this leaves me with a question. Are there recommendations for a
properly
configured User-Agent string that identifies an instance of a Nutch Crawler
and does not run afoul of a firewall like this? Using the
Hi vagkarv,
On Wed, Nov 20, 2013 at 11:08 AM, user-digest-h...@nutch.apache.org wrote:
Hi! I use apache-nutch-2.2.1 with MySQL as my database. I run the crawl
script and each time I check my database, I see that the number of links
with status=1 is much larger than the number of links with