The individual command line options on the wiki deal with this, no?
http://wiki.apache.org/nutch/CommandLineOptions
I am a huge fan of making Nutch easier to use and more obvious
from a user perspective. However, I am also very keen to write and maintain
documentation which stands the
Hi,
Please see plugin central for all your plugin therapy ;)
https://wiki.apache.org/nutch/PluginCentral
Also for your dependencies please see, most noticeably, parse-tika. Take a
look at plugin.xml
http://svn.apache.org/repos/asf/nutch/trunk/src/plugin/parse-tika/
hth
Lewis
On Sun, Jun 30, 2013
Hi All,
The Apache Nutch PMC are extremely pleased to announce the immediate
release of Apache Nutch v1.7.
Apache Nutch is an open source web-search software project. Stemming
from Apache Lucene http://lucene.apache.org/java/, it now builds on Apache
Solr http://lucene.apache.org/solr/ adding
is the one I had not gone through until I checked your
response, but I do not find answers to any of my questions
(directly/indirectly) in it.
Ok
On Wed, Jun 26, 2013 at 7:44 PM, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
Hi Hemant,
I strongly advise you to take some
-site.xml was changed to use AvroStore as storage class and job
was rebuilt, and I reran inject, the output of which still shows that it is
trying to use Memstore.
On Wed, Jun 26, 2013 at 11:05 PM, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
The Gora MemStore was introduced
Hi,
It would be greatly appreciated if you could take some time to VOTE on the
release candidate for the Apache Nutch 2.2.1 artifacts. This candidate is
(amongst other things) a bug fix for NUTCH-1591 - Incorrect conversion of
ByteBuffer to String.
The bug fix solved 8 issues:
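For context, here is a sketch of one common way a ByteBuffer-to-String conversion can go wrong in Java: array() returns the whole backing array and ignores the buffer's position and limit. This is only an illustration with made-up class and method names; it is not claimed to be the exact defect or fix in NUTCH-1591.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only (not part of Nutch or Gora).
final class ByteBufferText {

    // Pitfall: array() exposes the entire backing array, so bytes outside
    // the buffer's position..limit window leak into the result.
    static String naive(ByteBuffer buffer) {
        return new String(buffer.array(), StandardCharsets.UTF_8);
    }

    // Decodes only the bytes between position and limit. duplicate() gives an
    // independent position, so the caller's buffer is left untouched.
    static String decode(ByteBuffer buffer) {
        return StandardCharsets.UTF_8.decode(buffer.duplicate()).toString();
    }

    public static void main(String[] args) {
        byte[] backing = "xxhttp://nutch.apache.org/yy".getBytes(StandardCharsets.UTF_8);
        // A view over the middle of the array, skipping 2 bytes at each end.
        ByteBuffer view = ByteBuffer.wrap(backing, 2, backing.length - 4);
        System.out.println(naive(view));   // includes the surrounding padding bytes
        System.out.println(decode(view));  // only the bytes inside the view
    }
}
```

The duplicate() call is a design choice: decoding advances the buffer position, and a shared buffer that has been silently consumed tends to cause confusing follow-on errors.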
Hi,
Can you please try this
http://s.apache.org/wIC
Thanks
Lewis
On Thu, Jun 27, 2013 at 8:01 AM, Jamshaid Ashraf jamshaid...@gmail.com wrote:
Hi,
I'm using Nutch 2.x with HBase and tried to crawl
http://www.halliburton.com/en-US/default.page site for depth level 5.
Following is the
Hi Martin,
Thanks for the mail. Please see my answers in line
On Thu, Jun 27, 2013 at 12:53 PM, Martin Aesch
martin.ae...@googlemail.com wrote:
Should I switch from 2.1/Cassandra to 2.2.1/Cassandra?
Once we release, yes. We are currently VOTE'ing on the release of 2.2.1. Some
background here, in
It looks like you're on a pre-1.3 version of Nutch here.
It is highly recommended to upgrade.
Thanks
Lewis
On Wednesday, June 26, 2013, Amit Sela am...@infolinks.com wrote:
I did succeed in parsing using content and iterating over every line, but
I'd prefer to do it with DocumentFragment.
my
ended up using
org.jsoup.nodes.Document document = Jsoup.parse(content.
List<org.jsoup.nodes.Node> childNodes = document.childNodes();
On Wed, Jun 26, 2013 at 7:19 PM, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
It looks like you're on a pre-1.3 version of Nutch here.
It is highly
Hi Tony,
On Wed, Jun 26, 2013 at 1:13 AM, Tony Mullins tonymullins...@gmail.com wrote:
As you said, UpdateDBJob doesn't expect crawlId,
No I didn't.
I said it doesn't *use* (well, not until yesterday it didn't) the crawlId
parameter.
This is now fixed within the crawl script and will be
On Wed, Jun 26, 2013 at 4:30 AM, Tony Mullins tonymullins...@gmail.com wrote:
Is it possible to crawl with crawlId but HBase only creates 'webpage' table
without crawlId prefix, just like Cassandra does?
I can't understand this question, Tony.
And my other problems of DBUpdateJob's
...@gmail.com wrote:
Hi Lewis,
Thanks for your reply
I just set the values:
gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
I already removed the Hbase table in the past. Can it be a cause?
Benjamin
On Tue, Jun 25, 2013 at 7:34 PM, Lewis John Mcgibbney
lewis.mcgibb
N.B. Previous message doesn't seem to have been mod'd through under my
@apache.org address so resending ;)
It has however been distributed to annou...@apache.org already
Hi All,
The Apache Nutch PMC are extremely pleased to announce the immediate
release of Apache Nutch v1.7.
Apache Nutch is
Hi Hemant,
I strongly advise you to take some time to look through the Nutch Tutorial
for 1.x and 2.x.
http://wiki.apache.org/nutch/NutchTutorial
http://wiki.apache.org/nutch/Nutch2Tutorial
Also please see the FAQs, which you will find very useful.
http://wiki.apache.org/nutch/FAQ
Thanks
Have you changed from the default MemStore gora storage to something else?
On Tuesday, June 25, 2013, Sznajder ForMailingList bs4mailingl...@gmail.com
wrote:
thanks Tejas
Yes, I checked the logs and no error appears in them
I left the http.content.limit and parser.html.impl with their
Hi Jamshaid,
There is/are identical thread(s) currently open about this exact issue by
Tony.
Please chime in on them instead of opening brand new threads.
Thank you
http://www.mail-archive.com/user%40nutch.apache.org/
On Tue, Jun 25, 2013 at 5:38 AM, Jamshaid Ashraf jamshaid...@gmail.com wrote:
Hi Tony,
On Tue, Jun 25, 2013 at 1:10 AM, Tony Mullins tonymullins...@gmail.com wrote:
So what should I do now to run my complete cycle of Nutch 2.x jobs and
insert my docs to Solr?
I'm not using HBase as backend however I know that as per the crawl script,
the updatedb doesn't use crawlId
org.apache.hadoop*hadoop*.mapreduce.Mapper this to import
org.apache.hadoop.mapreduce.Mapper.
Thanks & Regards,
Jamshaid
On Tue, Jun 25, 2013 at 12:13 AM, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
Hi Jamshaid,
Please see the Jira issue and patch for this.
https
the above field in Nutch's schema-solr4 will solve the
issue and it will work out of the box with the pre-defined examples of Solr
4.3.0.
Tony.
On Mon, Jun 17, 2013 at 11:04 AM, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
Please note that we will be dropping the schema-solr4 in Nutch
Hi Jamshaid,
Please see the Jira issue and patch for this.
https://issues.apache.org/jira/browse/NUTCH-1571
I would like to commit a patch for 2.x regarding how we were writing bytes.
If you can test this patch, then we can maybe add it as well and push 2.2.1
Thank you
Lewis
On Monday, June 24,
I may as well drop this one in here, I opened an issue a while back to
discuss inherent differences in ordering of filtering and normalization
between 1.x and 2.x codebases specifically within Generator* classes.
https://issues.apache.org/jira/browse/NUTCH-1373
I am not sure how/if this applies to
to fix few issues before...
[0] -1, nope, because... (and please explain why)
GREAT, I'll push the release artifacts, promote the Maven staging repos and
make the announcements.
Thank you very much to everyone that VOTE'd.
Great work
Lewis
On Thu, Jun 20, 2013 at 2:48 PM, lewis john mcgibbney lewi
Hi,
law@CEE279Law3-Linux:~/Downloads/asf/2.x$ find . | xargs grep _csh_
./src/plugin/scoring-opic/src/java/org/apache/nutch/scoring/opic/.svn/text-base/OPICScoringFilter.java.svn-base:
private final static Utf8 CASH_KEY = new Utf8("_csh_");
, Apr 23, 2013 at 1:07 AM, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
I agree.
I can sort this tomorrow.
@Kiran,
Are we still adding documentation contributors via the
contributors and admin groups since the most recent lockdown?
Tejas should be added to both groups
#What_do_the_numbers_in_the_fetcher_log_indicate_.3F
On Sat, Jun 22, 2013 at 12:54 AM, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
Sounds great Tejas.
Wow this is a late shift.
If you can commit your fetcher diagnostics it would be great Tejas.
On Saturday, June 22, 2013, Tejas Patil tejas.patil...@gmail.com wrote:
What
Hi Imran,
HBase 0.90.x
thank you
Lewis
On Saturday, June 22, 2013, imran khan imrankhan.x...@gmail.com wrote:
Greetings,
I have seen many mails here about people having different issues with
different backends with Nutch 2.x
So which backend is most suited /stable with Nutch 2.x and also
result.
I really don't know what else to do here!
Could you please try any simple ParseFilter with the latest Nutch 2.x?
Thanks,
Tony
On Fri, Jun 21, 2013 at 12:36 AM, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
And the rest of the webpage fields actually.
Are you
Hi Tony,
The second bullet point on the tutorial states that Gora works with 0.90.X
HBase branch (yes this is old)
It is known not to work with the 0.94.X branch.
Please try with the 0.90 branch.
Thanks
Lewis
On Fri, Jun 21, 2013 at 8:12 AM, Tony Mullins tonymullins...@gmail.com wrote:
Hi,
that it has been
committed. Would be good to fix it if we can.
The code compiles and passes the test. +1 to release
Thanks
Julien
On 20 June 2013 22:48, lewis john mcgibbney lewi...@apache.org wrote:
Hi,
Please VOTE on the release of the Apache Nutch 1.7 artifacts.
As always
On Fri, Jun 21, 2013 at 11:40 AM, Tony Mullins tonymullins...@gmail.com wrote:
Thanks guys for your help & support.
No hassle, great to have you poking around and using the software. We know
there is work to be done. Thank you
I'll try it now with HBase 0.90.x.
Let us know how you get on.
and passes the test. +1 to release
Thanks
Julien
On 20 June 2013 22:48, lewis john mcgibbney lewi...@apache.org wrote:
Hi,
Please VOTE on the release of the Apache Nutch 1.7 artifacts.
As always, we solved a bunch of issues:
http://s.apache.org/1zE http
Hi,
Nearly all of this page is generated by JS, right?
Right now my answer is no. We fetch then parse page source... which in this
case is mostly all JS. The magic happens in the browser.
...
Lewis
On Tue, Jun 18, 2013 at 10:59 PM, Deals Collect dealscoll...@gmail.com wrote:
Hi all,
Can Nutch
Forget this.
I am tripping and the low counters were directly in relation to NUTCH-1591
Sorry
Lewis
On Wed, Jun 19, 2013 at 5:04 PM, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
Hi,
We define the structure of ParseStatus [0] in our WebPage JSON schema [1].
All good so far.
What
Hi Joe,
In 1.x Markus and Julien IIRC committed a real nice patch a while back
which allows you to achieve what I think you are after.
Please look at this thread
http://www.mail-archive.com/user@nutch.apache.org/msg08738.html
You will find piles of stuff on the user archive about this kinda
Thanks Jason for posting this to user@ list.
For those using Cassandra (1.1.2) and gora-cassandra please patch your copy
of Nutch 2.x with Jason's patch.
It would be real great if we could get some feedback on this as I am of the
opinion that it certainly justifies a point oh release for 2.x.
Hi,
There is an open thread on the user list for this right now. Please look in
the recent archive.
I think it would be best to take this conversation over there.
Lewis
On Thursday, June 20, 2013, Jamshaid Ashraf jamshaid...@gmail.com wrote:
Hi All,
I'm using Nutch 2.x/Cassandra and I have 3 urls in
Maybe an obvious question Tony, but have you tried stepping through this
and debugging your code?
There is another thread which appeared today, which basically is the same
problem as you have.
I am struggling to see how there are parsefilter plugin implementations
shipped with 2.x which do not
And the rest of the webpage fields actually.
Are you getting multiple values for each field or is it just for content?
On Thursday, June 20, 2013, Tony Mullins tonymullins...@gmail.com wrote:
Hi,
Did anyone get a chance to look at the issue pointed out?
I would just like to know whether this is a
Hi,
Please VOTE on the release of the Apache Nutch 1.7 artifacts.
As always, we solved a bunch of issues:
http://s.apache.org/1zE
SVN source tag:
http://svn.apache.org/repos/asf/nutch/tags/release-1.7/
Staging repo:
https://repository.apache.org/content/repositories/orgapachenutch-044/
Hi Tony,
You are using the Cassandra backend, right?
I think it's safe to say that there are lingering bugs in gora-cassandra.
I am getting some dodgy behaviour using Cassandra 1.1.2 during large crawls.
On Tue, Jun 18, 2013 at 12:40 AM, Tony Mullins tonymullins...@gmail.com wrote:
I have debugged
Mustafa,
Please read this thoroughly
http://wiki.apache.org/nutch/Becoming_A_Nutch_Developer#Step_One:_Using_the_Mailing_Lists
Once we understand from a well detailed email, what the problem is, we are
more than willing to help.
Until then it is really difficult to help you out. Sorry.
Lewis
On
Hi,
We define the structure of ParseStatus [0] in our WebPage JSON schema [1].
All good so far.
What is not good (or not clear to me at least), is how we currently use
methods within this class to define Hadoop counters for the parsing tasks.
I parse large amounts of URLs, but the counters on one
Hi Tony,
On Tue, Jun 18, 2013 at 11:49 AM, Tony Mullins tonymullins...@gmail.com wrote:
...instead
of returning html of the current page it is returning me the url of all the
pages in seed.txt
I suspect that this should not be happening at all!
Could you please try entering 2 or more
This also happened during fetching stage as well.
JobTracker shows that Fetcher was successful, but that 98.30% of the Map
was complete.
It looks like user@nutch is the wrong list for this.
I will take it over to MR.
Lewis
On Tue, Jun 18, 2013 at 8:05 PM, Lewis John Mcgibbney
lewis.mcgibb
expects _version_ field and that was missing in my
schema. And the patch also doesn't include this field in
Schema-Solr4.xml.
Besides that I was also missing some .jars in my Solr class path.
Tony.
On Mon, Jun 17, 2013 at 12:14 AM, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote
Please note that we will be dropping the schema-solr4 in Nutch and merging
the content into schema.xml. It is not good for us to maintain two schemas.
Thanks
Lewis
On Sunday, June 16, 2013, Lewis John Mcgibbney lewis.mcgibb...@gmail.com
wrote:
Hi Tony,
If you're able to patch this up and submit
Hi Tony,
Which gora backend are you on, including the version of the backend itself
please?
I use Gora 0.3 with gora-cassandra on some cron jobs and injected your URLs
into my db. All works fine.
I did notice that these pages have a hellish lot of content which is not
displayed on the page. Loads
, Jun 17, 2013 at 11:21 PM, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
Hi Tony,
Which gora backend are you on, including the version of the backend itself
please?
I use Gora 0.3 with gora-cassandra on some cron jobs and injected your URLs
into my db. All works fine.
I
Please read the log message.
It needs to be a unique field.
Correct this in your schema and you should be good to go.
hth
On Mon, Jun 17, 2013 at 6:35 AM, kamal11 kkroyal@gmail.com wrote:
I am also facing the same error.
The nutch log says ERROR solr.SolrIndexer - java.io.IOException: Job
Yes Joe this is correct.
On Sun, Jun 16, 2013 at 12:03 PM, Joe Zhang smartag...@gmail.com wrote:
Thanks.
With regard to (2), is this score the boost we see in the Solr index?
On Sun, Jun 16, 2013 at 10:38 AM, Ahme Emre Aladağ
emre.ala...@agmlab.com wrote:
Note: I'm a newbie.
As far as
Hi,
On Thursday, June 13, 2013, RS tinyshr...@163.com wrote:
Thanks a lot
1. Is there a document describing the column symbols (like column=s:s)?
There are a lot of symbols I cannot understand.
Please check your gora-hbase-mapping.xml, this is where field names and
qualifiers are defined. It
I would not advise you to use the MemStore. Its purpose is not to store
persistent data; it is mainly used for testing... as explained
in nutch-default.xml.
There are other alternatives which will be much more useful for your
deployment.
On Thu, Jun 13, 2013 at 5:36 AM, Peter
Hi James,
Can you please point us to the patch and we will try to get it in to the
codebase?
Thanks... and sorry if we didn't look into the patch yet.
Lewis
On Thu, Jun 13, 2013 at 4:29 AM, James Sullivan
james.brian.sulli...@gmail.com wrote:
Please ignore this E-mail. It only happens in the
Hi Tony,
Please see
https://issues.apache.org/jira/browse/NUTCH-1486
Thanks
Lewis
On Thu, Jun 13, 2013 at 8:04 AM, Tony Mullins tonymullins...@gmail.com wrote:
Hi,
Does anyone have an updated Nutch-Solr schema for the new Solr 4.3?
The existing NutchconfSolr 4 schema doesn't work with Solr 4.3.
Hi,
It is (IMHO) kind of fruitless running the crawl class (which is deprecated
now and we highly suggest you use and amend the /src/bin/crawl script for
your usecase) within Eclipse. You will learn far more setting breakpoints
within individual classes and watching them execute on that basis. I
rebuild workspace. BTW: On which packages / classes do you see red dots?
On Sun, Jun 9, 2013 at 9:23 AM, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
Hi Tony,
This source has literally just been released. The tutorial on the Nutch
wiki has also just been updated
of Nutch? Which
config file should I touch?
On Sat, Jun 8, 2013 at 10:56 PM, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
Hi Joe,
https://issues.apache.org/jira/browse/NUTCH-961
On Saturday, June 8, 2013, Joe Zhang smartag...@gmail.com wrote:
Can somebody please point me
Hi Joe,
I've not used this feature, it would be great if one of the others could
chime in here.
From what I can infer from the correspondence on the issue, and the
available patches, you should be applying the most recent one uploaded by
Markus [0] as your starting point. This is dated as
Hi Julien,
Dynamite.
I will release today.
Lewis
On Saturday, June 8, 2013, Julien Nioche lists.digitalpeb...@gmail.com
wrote:
Hi Lewis,
The md5, asc and sha are now correct. Thanks for fixing it.
Have a nice weekend
Julien
On 7 June 2013 21:16, Lewis John Mcgibbney lewis.mcgibb
Good Afternoon Everyone,
The Apache Nutch PMC are extremely pleased to announce the immediate
release of Apache Nutch v2.2.
Apache Nutch is an open source web-search software project. Stemming
from Apache Lucene http://lucene.apache.org/java/, it now builds on Apache
Hi Joe,
https://issues.apache.org/jira/browse/NUTCH-961
On Saturday, June 8, 2013, Joe Zhang smartag...@gmail.com wrote:
Can somebody please point me to some sample code?
Thanks much!
--
*Lewis*
)
Kiran Chitturi
Feng Lu
Julien Nioche
Sebastian Nagel
Lewis John McGibbney
[0] +/-0, fine, but consider to fix few issues before...
[0] -1, nope, because... (and please explain why)
This is an excellent VOTE'ing count so thank you to everyone that took the
time to review the release candidate
Additionally we've harmonized the behaviour of 2.x and 1.x so that the next
releases will be consistent. We should hopefully be releasing 2.2 very very
soon. You can look on the recent archive of this list to find the release
candidate artifacts for 2.2
Lewis
On Wednesday, June 5, 2013, feng lu
Yes, I will defo try and fix this.
Could you please log a Jira issue for it and we will get round to it when we can.
Thanks for reporting the issue here so far.
On Wed, Jun 5, 2013 at 11:04 AM, stone2dbone antoinette.d.stan...@gmail.com
wrote:
Lewis,
I've been told by someone who knows Java (I
Hi,
On Tue, Jun 4, 2013 at 11:40 AM, stone2dbone antoinette.d.stan...@gmail.com
wrote:
Okay, IndexFiltersChecker shows the value of my added field is
'[Ljava.lang.String;@15d1c817'. What might be causing this?
So you've not added any code only configuration right?
It is kinda difficult
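For what it's worth, a value like '[Ljava.lang.String;@15d1c817' is what Java's default Object.toString() produces for a String[]: a type tag plus an identity hash rather than the elements. A minimal sketch of the effect follows; the class and method names are made up for illustration and this is not code from Nutch.

```java
import java.util.Arrays;

// Hypothetical helper, for illustration only (not part of Nutch).
final class ArrayFieldValue {

    // Renders a field value for display. Arrays are expanded element by
    // element instead of relying on the default Object.toString().
    static String render(Object value) {
        if (value instanceof Object[]) {
            return Arrays.toString((Object[]) value);
        }
        return String.valueOf(value);
    }

    public static void main(String[] args) {
        String[] values = {"foo", "bar"};
        // Default toString(): "[Ljava.lang.String;@" followed by a hash code.
        System.out.println(values.toString());
        // Element-wise rendering of the same array.
        System.out.println(render(values));
    }
}
```

So if a field shows up in this form, something upstream is most likely handing over a String[] where a single String (or one value per array element) was expected.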
Nutch jobs locally.
On Tue, Jun 4, 2013 at 3:25 AM, Yves S. Garret
yoursurrogate...@gmail.com wrote:
One more question, would it matter what version of Hadoop that I have?
On Thu, May 30, 2013 at 6:57 PM, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
In all
Hi,
On Tue, Jun 4, 2013 at 7:42 PM, RS tinyshr...@163.com wrote:
InjectorJob: total number of urls injected after normalization and
filtering: 0
Nothing is injected here.
Please review your URL filters and try again.
Lewis
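One way to sanity-check a seed against filter rules before injecting is sketched below. It assumes rules in the style of regex-urlfilter.txt (a leading + or - followed by a regular expression, first match wins, no match means reject). The class name, method names and example rules are hypothetical; this is not Nutch's own filter implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-alone checker, for illustration only.
final class SeedFilterCheck {

    static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(boolean accept, Pattern pattern) {
            this.accept = accept;
            this.pattern = pattern;
        }
    }

    // Parses rule lines; blank lines and lines starting with '#' are skipped.
    static List<Rule> parse(List<String> lines) {
        List<Rule> rules = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            char sign = trimmed.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("Rule must start with + or -: " + trimmed);
            }
            rules.add(new Rule(sign == '+', Pattern.compile(trimmed.substring(1))));
        }
        return rules;
    }

    // The first rule whose pattern is found in the URL decides the outcome;
    // a URL that matches no rule is rejected.
    static boolean accepts(List<Rule> rules, String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Example rules (assumed): only example.org hosts are allowed.
        List<Rule> rules = parse(Arrays.asList(
                "# accept example.org only",
                "+^http://([a-z0-9]*\\.)*example\\.org/",
                "-."));
        System.out.println(accepts(rules, "http://nutch.apache.org/"));
    }
}
```

With rules like these, a seed on any other host is dropped, which would leave nothing to inject.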
.
[local ]$ cat urls/seed.txt
http://nutch.apache.org/
Thanks
hechuan
At 2013-06-05 11:43:38, Lewis John Mcgibbney lewis.mcgibb...@gmail.com
wrote:
Hi,
On Tue, Jun 4, 2013 at 7:42 PM, RS tinyshr...@163.com wrote:
InjectorJob: total number of urls injected after normalization and
filtering
Hi,
On Tue, Jun 4, 2013 at 3:21 PM, stone2dbone
antoinette.d.stan...@gmail.com wrote:
I must still add a field to each document? Please clarify.
index-static should work as described out of the box.
Make sure that you have a comma-separated list of fields in the form
name:value within
Hi,
It is clear that for the configuration you are running NTLM is not
authenticating properly.
I would run the Http class with TRACE logging activated, this will show the
credentials you are after.
You should also note the documentation in nutch-default.xml which
explicitly states NOTE: For
Hi,
You can quickly check which fields will be added to your NutchDocument
before being passed to Solr using the IndexFiltersChecker tool. This tool
is available in both trunk and 2.x codebases and can be invoked from the
command line interface.
On Monday, June 3, 2013, stone2dbone
Good Friday Everyone,
Glad to get to a stage where we can VOTE on the release of the Apache Nutch
2.2 artifacts.
We solved a stack of issues:
http://s.apache.org/LPB
SVN source tag:
http://svn.apache.org/repos/asf/nutch/tags/release-2.2/
Staging repo:
Seems like a small cli syntax bug.
Please submit a patch and we can commit.
Thanks
Lewis
On Friday, May 31, 2013, Bai Shen baishen.li...@gmail.com wrote:
Two quick questions.
1. Why is the parameter -adddays and not -addDays?
2. Should it be changed to match the other parameters or is it
+1
On Fri, May 31, 2013 at 12:13 PM, Markus Jelsma
markus.jel...@openindex.io wrote:
Please don't break existing scripts and support lower and uppercase.
Markus
-Original message-
From: Lewis John Mcgibbney lewis.mcgibb...@gmail.com
Sent: Fri 31-May-2013 19:11
To:
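A rough sketch of accepting both spellings without breaking existing scripts: compare the flag case-insensitively. The class and method names are hypothetical and this is not the actual Nutch option parser.

```java
// Hypothetical option parser, for illustration only (not part of Nutch).
final class AddDaysOption {

    // Returns the integer following -adddays, matched in any letter case
    // (so -addDays works too), or the default when the flag is absent.
    // A trailing flag with no value after it is ignored.
    static int parseAddDays(String[] args, int defaultDays) {
        for (int i = 0; i < args.length - 1; i++) {
            if ("-adddays".equalsIgnoreCase(args[i])) {
                return Integer.parseInt(args[i + 1]);
            }
        }
        return defaultDays;
    }

    public static void main(String[] args) {
        System.out.println(parseAddDays(new String[] {"-adddays", "3"}, 0));
        System.out.println(parseAddDays(new String[] {"-addDays", "7"}, 0));
    }
}
```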
+1 dynamite Tejas
On Fri, May 31, 2013 at 2:57 PM, Tejas Patil tejas.patil...@gmail.com wrote:
Hi Kiran,
Happy to know :)
Have you faced any problems with it? I am in the middle of editing the wiki
page and your comments might help me do that.
Thanks,
Tejas
On Fri, May 31, 2013 at 2:55
Hi,
Just a heads up: in the event that you do need to add to the nested Metadata
structure within WebPage.avsc, you can merely write your changes and
use the ant 'generate-gora-src' target from the build script. The
GoraCompiler will then compile everything in /src/gora to the path you
, similar issue:
http://bin.cakephp.org/view/180499048
I've left the defaults for config as they were, except for this in
gora.properties in Apache Nutch:
gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
On Wed, May 29, 2013 at 7:40 PM, Lewis John Mcgibbney
around some jar files?
On Thu, May 30, 2013 at 6:35 PM, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
Make sure that everything is compiled and you are running from runtime or
with the Jar in hadoop
On Thu, May 30, 2013 at 3:00 PM, Yves S. Garret
yoursurrogate...@gmail.com wrote
Hi Chris,
Please check out NUTCH-1545
We'll hopefully be committing this today(ish) and it will hopefully be
included in the 2.2 RC which I am about to cut.
Your feedback would be great.
Thanks
On Wednesday, May 29, 2013, Christopher Gross cogr...@gmail.com wrote:
I did make some modifications
This is incompatible.
On Wed, May 29, 2013 at 1:59 PM, Yves S. Garret
yoursurrogate...@gmail.com wrote:
Hi all, I'm using HBase 0.94.7 and Nutch 2.1.
On Wed, May 29, 2013 at 4:55 PM, Adriana Farina
adriana.farin...@gmail.com wrote:
Hi Yves,
as Tejas said, your issue is almost
.X and Nutch 2.1 work?
On Wed, May 29, 2013 at 5:05 PM, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
This is incompatible.
On Wed, May 29, 2013 at 1:59 PM, Yves S. Garret
yoursurrogate...@gmail.com wrote:
Hi all, I'm using HBase 0.94.7 and Nutch 2.1.
On Wed, May
This is most certainly better aimed at either Gora or HBase lists.
Obtaining better (and consistent) understanding and of course abstracting
users from such data structures is what we have been addressing in current
Gora development. (See GORA-174)
You will want to look specifically at some of the
OH, BTW I meant to refer you to the test in line 178 of [0]. testPutNested
hth
Lewis
On Wed, May 29, 2013 at 7:07 PM, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
This is most certainly better aimed at either Gora or HBase lists.
Obtaining better (and consistent) understanding
Hi All,
A really nice aspect of the regex (urlfilter-automaton and urlfilter-regex)
plugin implementations in Nutch is that there is a small but very useful
RegexURLFilterBaseTest [0] which compares benchmarks for simple regex
parsing.
The results we get are as follows
urls automaton
, yadda, yadda.
On Thu, May 23, 2013 at 1:57 PM, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
Hi All,
A really nice aspect of the regex (urlfilter-automaton and
urlfilter-regex)
plugin implementations in Nutch is that there is a small but very useful
RegexURLFilterBaseTest [0
Hi Kirby,
On Thu, May 23, 2013 at 6:36 PM, Kirby Bohling kirby.bohl...@gmail.com wrote:
Not that I think you need them in particular, but it seems like Nutch could
be doing plenty of benchmarking, and micro benchmarking in particular.
I agree with this. It is not my goal to attack this head
Hi Martin,
I am struggling to understand how the DocumentFragment (populated either by
private methods parseTagSoup or parseNeko depending on your config in
nutch-site.xml) is null!
What you don't mention is what problem you are having.
I can't DEBUG the code tonight but I am interested to see
Hi Feng,
Where is the patch please?
Thank you very much
Lewis
On Wednesday, May 22, 2013, feng lu amuseme...@gmail.com wrote:
Hi Daniel
Currently Nutch 2.x does not support Solr authentication. I have already opened an
issue and added a patch; you can apply it and try again.
Thanks
On Wed, May
Please search the mailing list for the HBase logging. There was a
conversation on this reasonably recently.
Please see my other response for the rest.
hth
Lewis
On Monday, May 20, 2013, Christopher Gross cogr...@gmail.com wrote:
Ok, so the crawlId isn't like the directories used in the 1.x
Hi Chris,
Please see the documentation I put up on the wiki for this phenomenon
http://wiki.apache.org/nutch/ErrorMessagesInNutch2#Nutch_logging_shows_Skipping_http:.2F.2FmyurlForParsing.com.3B_different_batch_id_.28null.29
Also, please search the mailing list for a recent discussion on the
Hi Chris,
On Mon, May 20, 2013 at 10:21 AM, Christopher Gross cogr...@gmail.com wrote:
Lewis --
Is the DEBUG something set in the conf/log4j.properties file? I have the
rootLogger set to INFO,DRFA and the threshold is ALL. Everything else is
INFO or WARN (no DEBUGs to be found.)
Well yes
Hi All,
I submitted a patch to upgrade the Nutch 2.x Branch codebase to the newly
released Gora 0.3.
The patch can be found here [0].
It would be excellent if folks could please test this patch and provide
feedback to the dev@ list.
The feedback will be very helpful in allowing us to progress
You need to follow the tutorial here
http://wiki.apache.org/nutch/RunNutchInEclipse
If after reading this thoroughly you have some problems please let us know
about them.
Thank you
Lewis
On Thu, May 16, 2013 at 12:07 PM, harsh yadav harsh.m...@gmail.com wrote:
2013-05-17 00:33:09,376 WARN
Hi Chris,
Thanks for getting on the list and discussing these aspects of development
:0)
From my perspective there are a number of observations
BRANCH 2.x
* NUTCH-1568 [0] is ripe for development. My sole justification for not
addressing this is that we wish to push Nutch 2.2 and it is safe to
Hi Shobha,
This is merely a class loading problem. You need to ensure that the class is
available on your classpath.
Although the problem you are having has nothing to do with this, my advice
is to not use Nutch 1.2.
Best
Lewis
On Wed, May 15, 2013 at 5:01 AM, Shobha shobhaendig...@gmail.com
solrindex -all it still indexes everything, not just the newly parsed
items.
On Wed, May 1, 2013 at 2:13 PM, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
What version are you using?
If you can I would advise you to upgrade to 2.x HEAD.
On Wed, May 1, 2013 at 4:32 AM, Bai
there is a good chance that the dump is in a foreign language, however this
depends on which language you consider as foreign and what language it
actually is.
AFAIK the encoding should be inferred directly from the page or document
markup, however failing this there is a default fallback of
Hi Kneerosh,
Golden rule of posting.
Which Nutch and which Solr versions are you using?
Add index-more to your plugin configuration and it will get you two out of
three.
author... if the markup is there, it is trivial.
Lewis
On Tuesday, May 7, 2013, kneerosh roshni_rajago...@yahoo.co.in wrote:
Hi,
Have you looked at the patch for NUTCH-1486?
This is not just schema changes.
The patch is for 2.x but the process of porting it to new pluggable
indexing architecture for trunk is trivial.
Can you do this?
Currently the patch uses ConcurrentUpdateSolrServer where it should use
just solr