Hi Madhvi,
After adding urlfilter-validator to nutch-site.xml, did you truncate your webpages?
Talat
On 09-11-2013 00:35, [email protected] wrote:
Hi Talat,
I can re-create this exception. It starts happening as soon as I index from
outside Nutch; SolrDeleteDuplicates works fine as long as the whole Solr
index came from Nutch.
I haven't found out yet which specific field might be causing it, but
looking at the issue below, it might be because the digest field is not
there.
https://issues.apache.org/jira/browse/NUTCH-1100
Can it be some other field?
Also, there is a patch for the digest field. How should I apply it? Any help
will be great!
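I am guessing a Jira patch is applied from the Nutch source root with
something like this (assuming the attachment is saved as NUTCH-1100.patch;
the path is just an example), but please correct me if that is wrong:

cd apache-nutch-1.5.1
patch -p0 < NUTCH-1100.patch
ant clean runtime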
Madhvi
On 11/6/13 2:19 PM, "Talat UYARER" <[email protected]> wrote:
You wrote it wrong. You should write it like this:
<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-(regex|validator)|parse-(html|tika|metatags|js|swf)|index-(basic|anchor|metadata|more)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
Write this in nutch-site.xml, and after that you should rebuild with ant
clean runtime.
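For example, from your Nutch source directory (the path is just an example):

cd apache-nutch-1.5.1
ant clean runtime

The rebuilt runtime then ends up under runtime/local (or runtime/deploy for
Hadoop).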
Talat
[email protected] şunu yazdı:
Hi Talat,
No, I am not using the urlfilter-validator plugin. Here is my list of
plugins:
<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-(html|tika|metatags|js|swf)|index-(basic|anchor|metadata|more)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
Do I just need to change this to:
<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse|validator-(html|tika|metatags|js|swf)|index-(basic|anchor|metadata|more)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
Thank you so much,
Madhvi
On 11/6/13 1:08 PM, "Talat UYARER" <[email protected]> wrote:
Hi Madhvi,
Can you tell me what the active plugins in your nutch-site.xml are? I am
not sure, but we have an issue similar to this. If your Solr returns null,
it will be because of this issue. Please check the data your Solr returns.
You can look at https://issues.apache.org/jira/browse/NUTCH-1100
If yours is the same, you should use the urlfilter-validator plugin.
urlfilter-validator has lots of benefits; I described them in
http://mail-archives.apache.org/mod_mbox/nutch-user/201310.mbox/%[email protected]%3e
Talat
[email protected] şunu yazdı:
I am going to start my own thread rather than being under javozzo's
thread :)!
Hi,
I am using Nutch 1.5.1 and Solr 3.6 and having a problem with the
SolrDeleteDuplicates command. Looking at the Hadoop logs, I am getting this
error:
java.lang.NullPointerException
at org.apache.hadoop.io.Text.encode(Text.java:388)
at org.apache.hadoop.io.Text.set(Text.java:178)
at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:270)
at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:241)
at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:236)
at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:216)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
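From the stack trace, my reading is that the record reader hands Hadoop's
Text.set(String) a null value coming back from Solr, and Text.encode() then
dereferences it. Here is a minimal sketch that reproduces the same failure
(which Solr field is actually null is only my guess):

import org.apache.hadoop.io.Text;

public class TextNpeDemo {
    public static void main(String[] args) {
        // Stand-in for a field value missing from a Solr document,
        // e.g. (my guess) the digest field on documents not indexed by Nutch.
        String value = null;
        Text text = new Text();
        text.set(value); // java.lang.NullPointerException in Text.encode()
    }
}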
Also, I had another question about updating Nutch to 1.6 or 1.7. I had
tried updating to a newer version of Nutch but got an exception while
deleting duplicates in Solr. After a lot of research online, I found that a
field had changed: a few said the digest field, and others said that the
url field is no longer there. So here are my questions:
1: Is there a newer Solr mapping file that needs to be used?
2: Can the Solr index from 1.5.1 and the index from a newer version
co-exist, or do we need to re-index everything from one version of Nutch?
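For what it's worth, the conf/schema.xml that ships with Nutch seems to
define the digest field roughly like this (I am quoting from memory, so
please double-check your copy):

<field name="digest" type="string" stored="true" indexed="true"/>

If SolrDeleteDuplicates reads that field, I assume documents indexed
outside Nutch would need it populated too.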
I will really appreciate any help with this.
Thanks in advance,
Madhvi
Madhvi Arora
AutomationDirect
The #1 Best Mid-Sized Company to work for in
Atlanta<http://www.ajc.com/business/topworkplaces/automationdirect-com-top-midsize-1421260.html> 2012