Hi Paddy,
Some comments in addition to my response. You should try upgrading to Nutch
1.10 when we release very shortly. There has been so much work done since
1.8 that you can benefit from. Keep your ears peeled here for a release
candidate and then eventual release.
Please see response below.
Hi David,
On Wed, Aug 26, 2015 at 5:05 AM, user-digest-h...@nutch.apache.org wrote:
Is there any general feeling towards how close the next 2.x release is?
Yes. I feel strongly about getting one out there as soon as we have
stabilized the remaining issues
understand it.
2015-08-25 8:02 GMT+03:00 Lewis John Mcgibbney lewis.mcgibb...@gmail.com:
Hi Cihad,
Which version of Nutch 2.X are you working with when you get these
errors?
On Sat, Aug 15, 2015 at 11:04 AM, user-digest-h
Hi Alp,
On Tue, Aug 11, 2015 at 10:03 PM, user-digest-h...@nutch.apache.org wrote:
While trying to use 2.4.x for Tika 1.8 (to use tesseract for ocr,
actually), tika could not parse application/pdf files. The mapping is
correct, in the plugin-xml, * are routed to tika, and the log states that
Hi Cihad,
Which version of Nutch 2.X are you working with when you get these errors?
On Sat, Aug 15, 2015 at 11:04 AM, user-digest-h...@nutch.apache.org wrote:
I ran TestInjector, but there is an exception as follows:
java.util.NoSuchElementException
at
Hi Amir,
On Fri, Jul 17, 2015 at 3:08 AM, user-digest-h...@nutch.apache.org wrote:
I'm trying to find a way to know, for every document, which crawl job
issued it.
I thought of indexing the crawlId as part of the indexed data, and I
thought of using the index-metadata plugin with index.db
Hi Divjot,
Please see reply below
On Wed, Jul 15, 2015 at 1:13 AM, user-digest-h...@nutch.apache.org wrote:
I have compiled nutch 2.3 code with gora 0.6 and using cloudera Hbase as
backend database. The code compiles fine and I am able to run it using the
bin/crawl command. The problem is
Hi Alp,
On Tue, Aug 11, 2015 at 10:03 PM, user-digest-h...@nutch.apache.org wrote:
Hello,
[snip]
1. Nutch 2.3 sets the timestamp to a month later. The date is 1970. I tried to use
index-more, but the lastmodified date is still null. Investigating the
elasticsearch map, date, tstamp fields are set
Hi Alp,
On Tue, Jul 21, 2015 at 10:20 PM, user-digest-h...@nutch.apache.org wrote:
I would like to use Tesseract OCR within nutch, in order to parse scanned
pdf files (assuming this is the correct (and only?) way of doing that).
Skimming through the previous emails, I noticed the support is
Hi Lê Văn Thiệp,
On Fri, Jul 10, 2015 at 4:43 AM, user-digest-h...@nutch.apache.org wrote:
Subject: Re: KeeperErrorCode = ConnectionLoss for /hbase/master
Hi Lewis John Mcgibbney
I am using Nutch 2.x, Gora 0.5, and HBase 0.9.4.x
Thanks for your help!
Did you ever get this sorted out
Hi Alexandre,
Apologies for the hellishly long time before I've picked up this message!
The current status of the 2.X branch is that it is in need of some attention and major
upgrades to key dependencies. This is inherited through the dependency upon
Apache Gora, as we need to release Apache Gora 0.6.1
Hi Markus,
On Tue, Jul 28, 2015 at 5:54 AM, user-digest-h...@nutch.apache.org wrote:
Hello - Nutch does not ship unit tests anymore as Maven artifacts, hence
we cannot use CrawlDBTestUtil in external projects. Should we ship them? Or
just copy the utils? What do you think?
Markus
I would
Hi Aditya,
The code and documentation you've referenced below is ancient.
If you want to use Nutch with a GUI, you need to use Nutch 2.X [0].
You need to investigate both the nutchserver [1] and webapp [2].
hth
Lewis
[0] http://svn.apache.org/repos/asf/nutch/branches/2.x/
[1]
Hi ThiepLV,
Which versions of Nutch, Gora, Hadoop and HBase are you on?
On Wed, Jul 8, 2015 at 4:23 AM, user-digest-h...@nutch.apache.org wrote:
I am running InjectorJob following the tutorial at
https://wiki.apache.org/nutch/RunNutchInEclipse, but I receive the following:
2015-07-07 22:33:38,269 ERROR
Hi Folks,
It is very common for us to see logging such as the following
fetching
http://www.eclap.eu/drupal/?q=en-US/node/65229/og/forumsort=ascorder=Topic
-activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
What I've noticed for some time is that fetchQueues.totalSize never seems
to
Hi Geoffry,
On Fri, Jun 19, 2015 at 10:06 AM, user-digest-h...@nutch.apache.org wrote:
I started with Nutch yesterday and have come up with four+ questions which,
if answered, will help me on my way.
1. Is it correct Nutch 2.3 does not work with Solr 5.2.1? There seems
to be a dependency
Hi Alex,
On Fri, Jun 19, 2015 at 10:06 AM, user-digest-h...@nutch.apache.org wrote:
Is there any recommended or better way of running Nutch 1.x jobs from a
Dynamic Web Project?
You mean using the Java APIs we provide on Maven Central?
Hi Jessica,
On Fri, Jun 19, 2015 at 10:06 AM, user-digest-h...@nutch.apache.org wrote:
I'm writing a Java application that uses the Nutch REST API to execute the
crawl cycle. I need to be able to call the next job only when the previous
job is finished.
Right now, the only way I know to
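One way to run the next job only after the previous one has finished is to poll the job status until it reaches a terminal state. The sketch below is a generic polling helper, not the Nutch client API: `get_state` is a hypothetical callable that you would implement yourself (for example, by fetching the job's status from the REST server), and the state names in `TERMINAL_STATES` are assumptions to check against what your server actually returns.

```python
import time

# Assumed names of terminal job states; verify against your server's responses.
TERMINAL_STATES = {"FINISHED", "FAILED", "KILLED"}


def wait_for_job(get_state, poll_seconds=5.0, timeout_seconds=3600.0,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll get_state() until it returns a terminal state, then return that state.

    get_state is a hypothetical zero-argument callable returning the current
    job state as a string. Raises TimeoutError if no terminal state is seen
    before timeout_seconds has elapsed.
    """
    deadline = clock() + timeout_seconds
    while True:
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        if clock() >= deadline:
            raise TimeoutError("job did not reach a terminal state in time")
        sleep(poll_seconds)
```

A caller would start a job, pass a function that looks up that job's state, and only start the next job when `wait_for_job` returns the state it expects for success.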
Hi Jessica and Brooks,
On Fri, Jun 19, 2015 at 10:06 AM, user-digest-h...@nutch.apache.org wrote:
[snip]
Notice the 'prevFetchTime' field has been updated to show the next
date when this URL should be crawled (30 days from now - July 19). I
assume this is exactly what SHOULD
Actually please just see
https://issues.apache.org/jira/browse/NUTCH-2045
If you guys could test it would be great.
lewis
On Mon, Jun 22, 2015 at 11:21 AM, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
Hi Jessica and Brooks,
On Fri, Jun 19, 2015 at 10:06 AM, user-digest-h
Hi Brooks
On Wed, Jun 17, 2015 at 7:59 AM, user-digest-h...@nutch.apache.org wrote:
Hi all,
[snip]
First things first, can you verify your Elasticsearch settings in
nutch-site.xml? e.g.
https://github.com/apache/nutch/blob/2.x/conf/nutch-default.xml#L1240-L1291
Please make sure that
Hi Jessica
On Wed, Jun 17, 2015 at 7:59 AM, user-digest-h...@nutch.apache.org wrote:
I'm having trouble understanding the concept of a batch and which elements
of the crawl cycle require a batchId.
A batch ID is essentially the same as a segment in the Nutch 1.X branch. It
defines a type of
Hi Ankit,
On Wed, Jun 17, 2015 at 7:59 AM, user-digest-h...@nutch.apache.org wrote:
Hi All,
Is it possible to store data into HDFS directly without using hbase while
crawling with Apache Nutch 2.3
Yes it is, please see the AvroStore and DataFileAvroStore Gora
implementations for writing
Hi Jessica,
On Fri, Jun 12, 2015 at 7:10 AM, user-digest-h...@nutch.apache.org wrote:
Hello. I am trying to test out the 2.3 REST API using curl, but I'm having
trouble with the commands.
[snip]
Did you get this issue sorted out?
Are there any more problems? The issue with casting to Long
Hi Ankit,
On Mon, Jun 8, 2015 at 2:13 AM, user-digest-h...@nutch.apache.org wrote:
I tried it with 1.10, but the shortened URLs still don't get followed
through.
Have you tried changing logging level to TRACE within
conf/log4j.properties? This may provide more detail for you.
I think
Hi Breno,
On Tue, Jun 2, 2015 at 1:38 AM, user-digest-h...@nutch.apache.org wrote:
We are indexing several domains for a specific project, which may contain
duplicated content (e.g. pdf files). The users of the system come from
different organisations and wonder why the content is not
Hi Folks,
I wanted to post to this list some observations and findings we've
experienced regarding the above topic and how Nutch is behaving. [0]
Essentially, this comes down to the following: by default, Vagrant maps the
'source' directory on the host machine to /vagrant on the client. This is
Hi Breno,
On Sun, May 31, 2015 at 12:30 AM, user-digest-h...@nutch.apache.org wrote:
I've implemented a custom domain aware Signature to be used in the
deduplication phase.
Nice! Out of curiosity can you share what your use case is? I would be
really interested to hear more as I am
Hi,
On Sun, May 31, 2015 at 12:30 AM, user-digest-h...@nutch.apache.org wrote:
Hi community.
I'm using Nutch 1.9 and Solr 4.10.
I use Nutch to parse zip documents, but the language field is empty in
Solr for all of these documents, and this is a problem for me.
The ParseZip plugin uses Tika to
Hi Chaushu,
On Sun, May 31, 2015 at 12:30 AM, user-digest-h...@nutch.apache.org wrote:
I'm using Nutch 1.9 with Solr 4.10
I wanted to ask what the advantages of Nutch 2 vs. Nutch 1 are and, if I
use Solr, whether there is a reason why I should use Nutch 2.
Nutch 1.X branch is the more maintained of
Hi Alex,
On Wed, May 20, 2015 at 1:03 AM, user-digest-h...@nutch.apache.org wrote:
Hi Lewis,
I am using Nutch 2.3
Grand. Thank you for the context. The patch is available at
https://issues.apache.org/jira/browse/NUTCH-2019
If you could test against Nutch 2.X HEAD it would be ideal.
Lewis
Hi Ralf,
On Wed, May 20, 2015 at 1:03 AM, user-digest-h...@nutch.apache.org wrote:
So by simply changing the Gora backend it should work? Thank you!
I'll try it out soon.
Yes. Exactly. If you have problems then get us here. We will be making a
release of Gora very soon and have fixed a
and my application is
accepted. The main reason why I have chosen the Nutch project for GSoC is
knowing Nutch closely. My subject is NUTCH-1741 - Support of Sitemaps
in Nutch 2.x[1] . Thanks Lewis John McGibbney and Talat Uyarer for being
my mentors on this process. I hope I can contribute
[3]
https://issues.apache.org/jira/secure/attachment/12707721/SitemapCrawlerLifeCycle.pdf
Kind Regards
2015-05-19 1:16 GMT+03:00 Cihad Guzel cguz...@gmail.com:
Ok Lewis,
I signed up to wiki, my wiki username: cihadguzel
Thanks
2015-05-18 23:44 GMT+03:00 Lewis John Mcgibbney
Hi Saurabh
On Wed, May 13, 2015 at 7:38 PM, user-digest-h...@nutch.apache.org wrote:
But when I run
runtime/local/bin/nutch index -all
I get:
SolrIndexerJob: java.lang.RuntimeException: job failed: name=Indexer,
jobid=job_local830597808_0001
at
Hi Halil,
On Wed, May 13, 2015 at 7:38 PM, user-digest-h...@nutch.apache.org wrote:
I applied to GSoC this year for the Nutch project. Recently I got an
email that my application is accepted. My subject is Giving HTML5 support
for Apache Nutch 2.x. Lewis John McGibbney and Talat Uyarer
Hi Alex,
Which version of Nutch 2.x are you using?
Yes I think this is a bug and a patch would be great.
Thanks
Lewis
On Sat, May 9, 2015 at 4:31 PM, user-digest-h...@nutch.apache.org wrote:
Hi Lewis,
Thanks for replying, I will try and open a ticket after I'm sure it's a
Nutch bug and
Hi Luigi,
On Mon, May 11, 2015 at 5:53 PM, user-digest-h...@nutch.apache.org wrote:
Hi Luigi,
Which type of static file are you talking about? In 2.x every file is stored in the data
store. IndexingJob can index the data store rows
The index-static plugin has not been ported to 2.X. If you would like to do
Hi Alex,
On Thu, May 7, 2015 at 11:44 AM, user-digest-h...@nutch.apache.org wrote:
Hi All,
I'm getting java.lang.ClassCastException: java.lang.Integer cannot be
cast to java.lang.Long when sending topN argument for /job/create using
Nutch 2.x REST API. Does anyone know how to fix that?
Hi Everyone,
The Apache Nutch PMC are pleased to announce the immediate release of
Apache Nutch v1.10, we advise all current users and developers of the 1.X
series to upgrade to this release.
Nutch is a well matured, production ready Web crawler. Nutch 1.x enables
fine grained configuration,
Hi Folks,
I would like to close off the VOTE'ing for the Nutch 1.10 release candidate
as below.
The VOTE'ing resulted in the following
[4] +1 Push the release, I am happy :)
Lewis John McGibbney
Sebastian Nagel
Jorge Luis Betancourt González
Chris Mattmann
[1] +0 I am not bothered either way
Hi Folks,
The results should have been
[4] +1 Push the release, I am happy :)
Lewis John McGibbney *
Sebastian Nagel *
Jorge Luis Betancourt González *
Chris Mattmann *
Julien Nioche *
Asitang Mishra
[1] +0 I am not bothered either way
John Lafitte
[0] -1 I am not happy with this release
hi user@ dev@,
Check out some of the services we've implemented in Nutch 1.x.
The blog post introduces how we can use MaxMind's GeoIP services to
implement reverse geocoding for server IP addresses.
Enjoy
http://blog.maxmind.com
Lewis
--
*Lewis*
Hi Dzmitry,
On Sat, Apr 25, 2015 at 5:08 AM, user-digest-h...@nutch.apache.org wrote:
I found out that there is a REST API for 2.x branch. However it works only
for local hadoop mode.
How did you verify this?
Is there any way to work with REST API in hadoop
distributed mode? I suppose
Dynamite Giuseppe!
On Sat, Apr 25, 2015 at 5:08 AM, user-digest-h...@nutch.apache.org wrote:
Hi Arthur,
On Sat, Apr 25, 2015 at 5:08 AM, user-digest-h...@nutch.apache.org wrote:
My Nutch is 2.3 with Gora and Hbase, below are the sample field values I
have scanned from HBase here:
[snip]
Q: Is there a way to configure Nutch/Gora/HBase so it will store the value
like following
Hi user@ dev@,
This thread is a VOTE for releasing Apache Nutch 1.10.
The release candidate comprises the following components.
* A staging repository [0] containing various Maven artifacts
* A branch-1.10 of the trunk code [1]
* The tagged source upon which we are VOTE'ing [2]
* Finally, the release
Hi Jeff,
On Thu, Apr 23, 2015 at 10:06 AM, user-digest-h...@nutch.apache.org wrote:
I am going through the nutch-default.xml file to learn and understand where
and how each of the config values are utilized.
The subcollection property in nutch-default.xml is:
property
Hi Okello,
On Fri, Apr 10, 2015 at 8:31 PM, user-digest-h...@nutch.apache.org wrote:
I've a setup of Nutch 2.3 with Cassandra 2.0.2. Everything seems to be
working fine save for one little issue. I'm using the crawl script. The 'p'
table is empty even though in the logs I can see there's no
Hi Andrzej,
On Fri, Apr 10, 2015 at 8:31 PM, user-digest-h...@nutch.apache.org wrote:
I can't find any information on how to do the correct setup with any SQL
database.
Does someone have any idea what I'm doing wrong? Is the setup using SQL
database actually possible?
It's safe to say that
Hi Melih,
On Fri, Apr 10, 2015 at 8:31 PM, user-digest-h...@nutch.apache.org wrote:
Based on https://wiki.apache.org/nutch/CommandLineOptions, bin/nutch
webgraph is not available for Nutch 2.x. I would like to use this feature;
how could I achieve this in Nutch 2.3?
Thanks
You would need
Hi All,
The deadline for this year's GSoC student submissions is approaching fast
and I would be very keen to see more proposals from the communities above.
I've been involved on and off with several students from across all of the
above communities, hence the reason I am emailing these lists.
I
Hi Tizy,
On Thu, Mar 12, 2015 at 12:20 AM, user-digest-h...@nutch.apache.org wrote:
Is there any detailed step by step explanation on how to implement
HTTPPostAuthentication on Nutch 1.10.?
https://github.com/apache/nutch/blob/trunk/conf/httpclient-auth.xml.template#L61-L105
Hi Arthur,
On Thu, Mar 12, 2015 at 12:20 AM, user-digest-h...@nutch.apache.org wrote:
I downloaded http://svn.apache.org/repos/asf/nutch/branches/2.x/
and re-ran the compilation, but still got the error
Question: Are the following dependencies correctly set in my ivy.xml?
dependency
Hopefully this makes better sense.
Lewis
On Thursday, March 5, 2015, Gaplan gap...@gmail.com wrote:
Thanks for the answer, Lewis.
I can't understand this:
Also please ensure that your urlfilter permits '?' in URL entries
How can I do that?
On Thu, Mar 5, 2015 at 10:17 PM, Lewis John Mcgibbney
Hi,
Please see
http://wiki.apache.org/nutch/FAQ#Nutch_doesn.27t_crawl_relative_URLs.3F_Some_pages_are_not_indexed_but_my_regex_file_and_everything_else_is_okay_-_what_is_going_on.3F
Also please ensure that your urlfilter permits '?' in URL entries
Hth
Lewis
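To make the '?' advice concrete, here is a minimal Python sketch of first-match-wins URL filtering in the style of a regex-urlfilter.txt file. It is an illustration only, not Nutch's implementation: the `accept` helper is hypothetical, and the rule lists assume a stock-style exclusion rule `-[?*!@=]` followed by a catch-all accept rule, which may differ from your conf/regex-urlfilter.txt.

```python
import re


def accept(url, rules):
    """Apply (sign, pattern) rules in order; the first pattern found in the URL decides.

    A '+' rule accepts the URL, a '-' rule rejects it. If no rule matches,
    the URL is rejected.
    """
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False


# Assumed stock-style rules: reject URLs containing typical query characters.
DEFAULT_RULES = [
    ("-", r"[?*!@=]"),
    ("+", r"."),
]

# Relaxed variant: '?' and '=' removed from the exclusion class so that
# query-string URLs are allowed through.
RELAXED_RULES = [
    ("-", r"[*!@]"),
    ("+", r"."),
]
```

With rules like these, a URL such as http://example.com/page?id=1 is rejected by `DEFAULT_RULES` and accepted by `RELAXED_RULES`, which is the effect of permitting '?' in the filter.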
On Thursday, March 5, 2015, Gaplan
Hi Folks Sumant,
On Sun, Mar 1, 2015 at 1:14 PM, user-digest-h...@nutch.apache.org wrote:
Do you think it's an issue with the fetch job and parser job?
It is a bug with gora-cassandra which I've logged at the issue below and I
am working on a fix right now.
Hi yeshwanth,
On Tue, Mar 3, 2015 at 1:48 PM, user-digest-h...@nutch.apache.org wrote:
any pointers on how to resolve this issue.
Yes, please see NUTCH-1946, I just uploaded another patch which is working
for me.
I am working my way through the Cassandra bug which is a real PITA.
Thanks
Hi Folks,
I was getting 500 internal server error using Nutch trunk when attempting
to fetch content from this domain.
http://www.nature.com
Just for detail, Nature.com is a catalogue of journals and science
resources, including the journal *Nature*. Publishes science news and
articles across a
Hi lujinhong,
On Wed, Feb 25, 2015 at 3:06 PM, user-digest-h...@nutch.apache.org wrote:
I found some code in the package “org.apache.nutch.webui” in the Nutch
source.
What is this code for?
It is for the Web Administration UI powered by the Nutch 2.X REST API
which is
Hi Jonathan,
There are another two threads ongoing, namely
http://www.mail-archive.com/user%40nutch.apache.org/msg13237.html
and
http://www.mail-archive.com/user%40nutch.apache.org/msg13235.html
Please monitor those links and we can take it from there.
I would strongly suggest that you set
Hi sumant,
I've pasted your Hadoop counters below.
It would appear that for the ParseJob task, no record is being passed as
the input to the MR framework. This is the issue. There is a problem
between FetcherJob and ParserJob.
Can you readdb between fetching and parsing?
If you get out a record
Hi Sumant,
Please see my replies below
On Mon, Feb 23, 2015 at 10:11 PM, user-digest-h...@nutch.apache.org wrote:
I am using Nutch 2.x using Cassandra as storage. Currently I am just
crawling only one website, and data is getting loaded to Cassandra in byte
code format. When I use readdb
Hi Chris,
Please see responses inline
On Mon, Feb 23, 2015 at 10:11 PM, user-digest-h...@nutch.apache.org wrote:
I am using Nutch with Cassandra to perform web crawling, both for the
first time. I have not yet been able to retrieve links beyond the initial
seed link.
I am using
Hi Arthur,
This is due to restlet removing some of their dependencies from public
consumption I think! It is out of our hands and happened after we released
the 2.3 release.
Without knowing which backend you are trying to use, I would suggest that
you upgrade to the 2.3.1 branch which is the live
Hi
On Fri, Feb 20, 2015 at 1:04 PM, user-digest-h...@nutch.apache.org wrote:
Thanks Lewis for your answer.
I have read the post and it is great that NUTCH-1480 was assigned to
Markus. I agree with you that maybe it will be done in Nutch 1.10 trunk,
however it is no problem if it is for 1.11.
I
Hi Folks,
The Apache Gora team are pleased to announce the immediate availability of
Apache Gora 0.6.
This release addresses a modest 47 issues http://s.apache.org/gora-0.6
with some being major improvements, new functionality and dependency
upgrades. Most notably the release involves key
Hi Eyeris,
On Wed, Feb 18, 2015 at 12:10 PM, user-digest-h...@nutch.apache.org wrote:
I have a question, and sorry if it is a trivial thing.
Is there any way to index into multiple Solr servers (at least 2) using Nutch
1.9?
I have configured Solr with one master and 2 slaves, but I need 2
Hi Kartik and Alexis,
On Fri, Feb 6, 2015 at 5:19 AM, user-digest-h...@nutch.apache.org wrote:
The site you're trying to crawl is a Flash website. Unfortunately that will
be a problem for Nutch.
Nutch doesn't render the page, only fetches it. It won't load Flash, CSS or
JS that are included
Hi Folks,
The Nutch team are currently on the lookout for interested students willing
to engage in this year's Google Summer of Code Program [0].
What is GSoC? A global program that offers students stipends to write code
for open source projects. In 2014 the Apache Nutch project participated in
a
Hi Alexis,
On Wed, Feb 4, 2015 at 5:14 AM, user-digest-h...@nutch.apache.org wrote:
I've had some luck compiling for Mongo, but I get a NullPointerException
while injecting seeds.
What version of MongoDB are you using?
Supported version is 2.12.2, this is suited to recent Nutch 2.3.
It
Hi Talat,
On Wed, Jan 28, 2015 at 10:07 PM, user-digest-h...@nutch.apache.org wrote:
Subject: Nutch IRI URIs
Hi all,
Do you have any idea how Nutch can handle IRI URIs?
My experience using IRIs is limited to the legal informatics domain where
they are used pretty extensively in legal
Hi Zein,
Please see the release announcement regarding versioning for backend
datastore support
http://nutch.apache.org/#22-january-2015-nutch-23-release
On Wed, Jan 28, 2015 at 10:07 PM, user-digest-h...@nutch.apache.org wrote:
I am trying to configure nutch2.3 with hbase 0.90.4 on ubuntu
Hey Yoniel,
On Thu, Jan 22, 2015 at 9:00 AM, user-digest-h...@nutch.apache.org wrote:
Lewis, I have reviewed the httpclient-configuration but the main problem
is that I can't crawl an HTTPS site that uses a self-signed certificate. How can
I fix this problem?
Did you see this thread and
Hi Folks,
I'm working on obtaining forum data posted for various topics from across a
number of web sites.
An example would be the technology-related posts from
http://www.hackforums.net.
If I take the above site as an example, and attempt to use parsechecker, I
get the following with protocol-http
Hi Adamantios,
On Sat, Jan 24, 2015 at 2:05 PM, user-digest-h...@nutch.apache.org wrote:
How to tell Apache Nutch 2.3 to go through all http://URL/?pg={X} pages,
with {X} going from 1 to 348,
^(0?[1-9]|[1-4][0-9]|348)$
Please try the above, substituting your variable with the proposed regex.
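It is worth checking a numeric-range regex before putting it into a URL filter. The Python sketch below compares the pattern quoted above with an alternative written for this note (not from the original reply) that is intended to match every integer from 1 to 348 without leading zeros. The helper `matched_pages` and both constant names are illustrative.

```python
import re

# Pattern as quoted in the reply above. It matches 1-49 (single digits may be
# zero-padded) and the literal 348, but nothing from 50 to 347.
QUOTED = re.compile(r"^(0?[1-9]|[1-4][0-9]|348)$")

# Alternative pattern covering 1-348:
#   [1-9]         -> 1-9
#   [1-9][0-9]    -> 10-99
#   [12][0-9]{2}  -> 100-299
#   3[0-3][0-9]   -> 300-339
#   34[0-8]       -> 340-348
PAGE_1_TO_348 = re.compile(r"^([1-9]|[1-9][0-9]|[12][0-9]{2}|3[0-3][0-9]|34[0-8])$")


def matched_pages(pattern, upper=400):
    """Return the integers in 0..upper whose decimal form matches the pattern."""
    return [n for n in range(0, upper + 1) if pattern.match(str(n))]
```

Comparing `matched_pages(QUOTED)` with `matched_pages(PAGE_1_TO_348)` shows which page numbers each pattern lets through. The chosen alternation can then stand in for {X} in the URL pattern, with the '?' escaped as `\?` since it is a regex metacharacter.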
Hi Hesham,
On Sat, Jan 24, 2015 at 2:05 PM, user-digest-h...@nutch.apache.org wrote:
Are all the configuration files for Nutch in the conf directory?
Yes
Also, if I want to have a set of configurations for some URLs and
another set of configurations for other URLs I have to create a new
Hi Folks,
Apache Nutch PMC are very pleased to announce the release of Apache Nutch
v2.3. This release bears the fruits of the first Nutch Google Summer of
Code program engagement resulting in a Web Application for the Nutch 2.3
REST API.
The release also includes upgrades to Gora dependencies
Hi Everyone,
I am closing off this VOTE thread.
The VOTE'ing progressed with the following outcome
[4] +1 Push the release, I am happy :)
Lewis John McGibbney *
Renato Marroquín Mogrovejo
Sebastian Nagel *
Talat Uyarer *
[1] +0 I am not bothered either way
John Lafitte
[0] -1 I am not happy
Hi Talat,
On Sun, Jan 18, 2015 at 3:49 AM, user-digest-h...@nutch.apache.org wrote:
I finish my review yet.
- AdaptiveFetchSchedule does not work. The default setting is a float, but it needs
an integer.
Please log an issue and set a fix version, this is trivial to fix but a big
which is essential to
Hi Shadi,
On Thu, Jan 15, 2015 at 6:30 AM, user-digest-h...@nutch.apache.org wrote:
Thanks, how can I add avro support?
In short you cannot. The next task, if you want to add SQL support back
into Gora and subsequently Nutch 2.X, is to write the SQL backend for Gora
as suggested here
Hi Yoniel,
Please read the following
https://wiki.apache.org/nutch/HttpAuthenticationSchemes#Need_Help.3F
If nothing here provides you with a better idea then please write back to
us here.
Put simply we need more information regarding how your
httpclient-configuration has been set up.
Thanks
Lewis
On Fri, Jan 9, 2015 at 3:58 AM, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
Hi user@ dev@,
This thread is a VOTE for releasing Apache Nutch 2.3.
Quite incredibly we addressed 143 issues as per the release report
http://s.apache.org/nutch_2.3
The release candidate comprises
Hi Tamer,
On Fri, Jan 9, 2015 at 6:38 PM, user-digest-h...@nutch.apache.org wrote:
Guys
I came across the thread from Feb 2013 at
http://mail-archives.apache.org/mod_mbox/nutch-user/201302.mbox/%3CCAPp-OAu2HT82H8hHZ4B=zxch2+29ncjbfv+wagfp3wdpzex...@mail.gmail.com%3E
As I'm trying to use
Hi Folks,
Is the aim to have identical output from parse-tika and parse-html for
rendering of parse metadata?
With Nutch 1.10-SNAPSHOT with no local source code modifications, if we
take the following page [0], and turn metatags.names to wildcard *, with
parse-tika I get
Parse Metadata:
BOOM
https://issues.apache.org/jira/browse/NUTCH-1815
On Sat, Jan 10, 2015 at 10:15 AM, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
Hi Folks,
Is the aim to have identical output from parse-tika and parse-html for
rendering of parse metadata?
With Nutch 1.10-SNAPSHOT with no local
Hi Folks,
Just wanted to make folk aware of some work Continuum Analytics have been
doing on bringing Nutch to the Python community.
https://github.com/ContinuumIO/nutchpy
Continuum are the folks behind most of the scientific Python stuff you've
ever used. If you've used Python before, then
Hi user@ dev@,
This thread is a VOTE for releasing Apache Nutch 2.3.
Quite incredibly we addressed 143 issues as per the release report
http://s.apache.org/nutch_2.3
The release candidate comprises the following components.
* A staging repository [0] containing various Maven artifacts
* A
Hi Markus,
On Wed, Jan 7, 2015 at 7:42 PM, user-digest-h...@nutch.apache.org wrote:
Hi - it is a strange piece indeed. You cannot just tell it where the
crawldb is, you need to tell it where the directory is, so specifying
current is ok, but not part-*
Thanks very much. I'll cook a patch up
Hi Renato,
On Thu, Dec 11, 2014 at 5:52 AM, user-digest-h...@nutch.apache.org wrote:
From quickly checking out the code (Host.java + HostDB +
HostDBUpdateReducer) it would seem like there is a bug exactly where you
pointed.
LOGGED!!!
https://issues.apache.org/jira/browse/NUTCH-1907
WOW
Hi Krishna,
On Thu, Dec 11, 2014 at 5:52 AM, user-digest-h...@nutch.apache.org wrote:
When I dump data from segments, I am getting the entire HTML data. Shouldn't it
be just the headings read from crawling? Why am I getting the entire data?
Please help me. Thanks in advance.
No this is
Hi Folks,
Does anyone else have problems with the DomainStatistics [0] tool?
I use it as follows
./bin/nutch domainstats /usr/local/.../crawldb/old/part-0/ output tld
Although it is generated, nothing is written to the output directory
./bin/nutch domainstats
Hi Folks,
I was looking into the code within Nutch 2.X HostDbUpdateReducer and
'think' I've discovered a bug in the way we output Host data.
https://github.com/apache/nutch/blob/2.x/src/java/org/apache/nutch/host/HostDbUpdateReducer.java#L87
I feel that the following code
Hey Arthur,
On Thu, Dec 4, 2014 at 4:27 AM, user-digest-h...@nutch.apache.org wrote:
Any idea why the field ‘host’ is not loaded by SOLR?
Apologies for missing this thread.
Have you tried restarting your Solr core? I would suggest that you move
your old log(s) to an archive directory then
Hi Arthur,
Additionally, I would suggest that you try both the parse checker and index
checker tools on the offending URL
http://nutch.apache.org/apidocs/apidocs-1.1/allclasses-frame.html] unknown
field 'host'
On Thu, Dec 4, 2014 at 4:27 AM, user-digest-h...@nutch.apache.org wrote:
user
Copying in user@
On Thu, Nov 6, 2014 at 6:37 PM, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
Hi amit,
On Thu, Nov 6, 2014 at 1:54 PM, dev-digest-h...@nutch.apache.org wrote:
I have a small question about Nutch 2.X source code, I hope this is the
right mailing list for
that. i
Hi Segar,
On Wed, Oct 29, 2014 at 11:40 PM, user-digest-h...@nutch.apache.org wrote:
Follow the following steps:
1. Execute 'ant job' i.e. open build.xml and execute 'runtime(default)'
target.
It will generate 'runtime' folder in project.
2. Open nutch-default.xml and update
Hi Pablo,
This question has been raised a number of times on the user@nutch list, you
can use the archives linked to from the Nutch website.
I would suggest that the seed be populated to a new page metadata, which
could then be added via an indexing filter.
There may be other ways for achieving
Hi ozzy19
On Fri, Oct 17, 2014 at 11:09 AM, user-digest-h...@nutch.apache.org wrote:
Running the code on this url:
http://wiki.apache.org/nutch/JavaDemoApplication I get the following
message:
Found 0 hits.
How so? Why does it not find pages that contain the keyword?
This page is very
15th October 2014 - crawler-commons 0.5 is released
We are glad to announce the 0.5 release of Crawler Commons. This release
mainly improves Sitemap parsing as well as an upgrade to Apache Tika 1.6
http://tika.apache.org.
See the CHANGES.txt