[jira] [Commented] (NUTCH-1478) Parse-metatags and index-metadata plugin for Nutch 2.x series

2013-03-06 Thread Roland (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13594500#comment-13594500
 ] 

Roland commented on NUTCH-1478:
---

+1
works fine for me.
Thank you kiran

 Parse-metatags and index-metadata plugin for Nutch 2.x series 
 --

 Key: NUTCH-1478
 URL: https://issues.apache.org/jira/browse/NUTCH-1478
 Project: Nutch
  Issue Type: Improvement
  Components: parser
Affects Versions: 2.1
Reporter: kiran
 Fix For: 2.2

 Attachments: metadata_parseChecker_sites.png, Nutch1478.patch, 
 Nutch1478.zip


 I have ported the parse-metatags and index-metadata plugins to the Nutch 2.x series.
 They accept multiple values for the same tag and index them in Solr, as I patched
 before (https://issues.apache.org/jira/browse/NUTCH-1467).
 The usage is the same as described here
 (http://wiki.apache.org/nutch/IndexMetatags), with one change: there is
 no need to give the 'metatag' keyword before metatag names. For example, my
 configuration looks like this
 (https://github.com/salvager/NutchDev/blob/master/runtime/local/conf/nutch-site.xml).

 This is only the first version and does not include the JUnit test. I will
 update it with a new version soon.
 This will parse the tags and index them in Solr. Make sure the fields listed under
 'index.parse.md' in nutch-site.xml are also defined in schema.xml in Solr.
 Please let me know if you have any suggestions.
 This is supported by DLA (Digital Library and Archives) of Virginia Tech.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


Re: [DISCUSS] Google Summer of Code

2013-03-06 Thread Lewis John Mcgibbney
Hi Kiran

On Tue, Mar 5, 2013 at 8:11 PM, dev-digest-h...@nutch.apache.org wrote:

 [DISCUSS] Google Summer of Code
 22402 by: Lewis John Mcgibbney
 22403 by: kiran chitturi


Please see

http://s.apache.org/1sM

Also, I would ask you to consider the following (BTW this is direct
feedback I got from Apache GSoC Admins)

1. What is the likelihood/danger of you being too busy in a new job (post
graduation) to do GSoC? You can think about this, but I suppose we can only
make a judgement call after having discussed it with you.
2. GSoC is designed as a full-time program, so even an additional
internship or a part-time job, let alone a full-time job, is a danger to
successful participation and is generally discouraged by Apache admins.

I personally would like to get your opinions on the above before we
progress with this. I have confidence in your work and work ethic, but I
suppose it's just a case of determining whether you can fit this in around
your graduation life?

Thanks
Lewis


[jira] [Resolved] (NUTCH-842) AutoGenerate WebPage code

2013-03-06 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-842.


Resolution: Fixed

Committed @revision 1453593 in 2.x HEAD

 AutoGenerate WebPage code
 -

 Key: NUTCH-842
 URL: https://issues.apache.org/jira/browse/NUTCH-842
 Project: Nutch
  Issue Type: Improvement
Affects Versions: nutchgora
Reporter: Doğacan Güney
Assignee: Doğacan Güney
 Fix For: 2.2

 Attachments: NUTCH-842.patch, NUTCH-842-v2.patch


 This issue will track the addition of an ant task that will automatically 
 generate o.a.n.storage.WebPage (and ProtocolStatus and ParseStatus) from 
 src/gora/webpage.avsc.



[jira] [Created] (NUTCH-1540) Add Gora buffered read and write maximum limits to nutch-default.xml configuration.

2013-03-06 Thread Lewis John McGibbney (JIRA)
Lewis John McGibbney created NUTCH-1540:
---

 Summary: Add Gora buffered read and write maximum limits to 
nutch-default.xml configuration.
 Key: NUTCH-1540
 URL: https://issues.apache.org/jira/browse/NUTCH-1540
 Project: Nutch
  Issue Type: Bug
  Components: storage
Affects Versions: 2.1
Reporter: Lewis John McGibbney
 Fix For: 2.2


I've been experimenting by using this via the command line for some time. It is 
starting to annoy me, so I wanted to make this more accessible to us all.
You can now easily set this in nutch-site.xml

Patch coming up.



[jira] [Updated] (NUTCH-1540) Add Gora buffered read and write maximum limits to nutch-default.xml configuration.

2013-03-06 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1540:


Attachment: NUTCH-1540.patch

Patch for 2.x HEAD

 Add Gora buffered read and write maximum limits to nutch-default.xml 
 configuration.
 ---

 Key: NUTCH-1540
 URL: https://issues.apache.org/jira/browse/NUTCH-1540
 Project: Nutch
  Issue Type: Bug
  Components: storage
Affects Versions: 2.1
Reporter: Lewis John McGibbney
 Fix For: 2.2

 Attachments: NUTCH-1540.patch


 I've been experimenting by using this via the command line for some time. It 
 is starting to annoy me, so I wanted to make this more accessible to us all.
 You can now easily set this in nutch-site.xml
 Patch coming up.



[jira] [Resolved] (NUTCH-1540) Add Gora buffered read and write maximum limits to nutch-default.xml configuration.

2013-03-06 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-1540.
-

Resolution: Fixed

Committed @revision 1453600 in 2.x HEAD

 Add Gora buffered read and write maximum limits to nutch-default.xml 
 configuration.
 ---

 Key: NUTCH-1540
 URL: https://issues.apache.org/jira/browse/NUTCH-1540
 Project: Nutch
  Issue Type: Bug
  Components: storage
Affects Versions: 2.1
Reporter: Lewis John McGibbney
 Fix For: 2.2

 Attachments: NUTCH-1540.patch


 I've been experimenting by using this via the command line for some time. It 
 is starting to annoy me, so I wanted to make this more accessible to us all.
 You can now easily set this in nutch-site.xml
 Patch coming up.



[jira] [Created] (NUTCH-1541) Indexer plugin to write CSV

2013-03-06 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-1541:
--

 Summary: Indexer plugin to write CSV
 Key: NUTCH-1541
 URL: https://issues.apache.org/jira/browse/NUTCH-1541
 Project: Nutch
  Issue Type: New Feature
  Components: indexer
Affects Versions: 1.7
Reporter: Sebastian Nagel
Priority: Minor


With the new pluggable indexer a simple plugin would be handy to write 
configurable fields into a CSV file - for further analysis or just for export.



[jira] [Updated] (NUTCH-1541) Indexer plugin to write CSV

2013-03-06 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-1541:
---

Attachment: NUTCH-1541-v1.patch

First version.
NOTE: NUTCH-1047 is required, the targets for indexer-csv must be added 
manually to main build.xml

 Indexer plugin to write CSV
 ---

 Key: NUTCH-1541
 URL: https://issues.apache.org/jira/browse/NUTCH-1541
 Project: Nutch
  Issue Type: New Feature
  Components: indexer
Affects Versions: 1.7
Reporter: Sebastian Nagel
Priority: Minor
 Attachments: NUTCH-1541-v1.patch


 With the new pluggable indexer a simple plugin would be handy to write 
 configurable fields into a CSV file - for further analysis or just for export.



[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2013-03-06 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13595252#comment-13595252
 ] 

Sebastian Nagel commented on NUTCH-1047:


Hi Julien,

Overall, all looks good. A first version of the CSV indexer is ready 
(NUTCH-1541) and works well with the latest v5 patch.

One point we should improve is the command-line help. I agree with Tejas that 
the help should list all required arguments. Of course, you are right that the 
index/cleaning jobs are backend-neutral, but then it would be preferable to 
have new commands index and indexclean. They are also required if other 
indexer back-ends are used. We can keep the solr* commands for legacy reasons and 
because they are handy. A few additional lines to generate the prior help text 
are tolerable and could avoid unnecessary user requests on the mailing list.

The describe() method is a good idea. The new commands will then show 
sufficient help but IndexingJob/CleaningJob should also call describe() when 
help is shown!

Some trivialities to get the Java docs right:
* default.properties - need to add the new plugins.indexer group with 
indexer-solr as member
* build.xml - add group referring to plugins.indexer, add Java doc targets 
for indexer-solr


 Pluggable indexing backends
 ---

 Key: NUTCH-1047
 URL: https://issues.apache.org/jira/browse/NUTCH-1047
 Project: Nutch
  Issue Type: New Feature
  Components: indexer
Reporter: Julien Nioche
Assignee: Julien Nioche
  Labels: indexing
 Fix For: 1.7

 Attachments: NUTCH-1047-1.x-v1.patch, NUTCH-1047-1.x-v2.patch, 
 NUTCH-1047-1.x-v3.patch, NUTCH-1047-1.x-v4.patch, NUTCH-1047-1.x-v5.patch


 One possible feature would be to add a new endpoint for indexing-backends and 
 make the indexing pluggable. At the moment we are hardwired to SOLR - which is 
 OK - but as other resources like ElasticSearch are becoming more popular it 
 would be better to handle this as plugins. Not sure about the name of the 
 endpoint though : we already have indexing-plugins (which are about 
 generating fields sent to the backends) and moreover the backends are not 
 necessarily for indexing / searching but could be just an external storage 
 e.g. CouchDB. The term backend on its own would be confusing in 2.0 as this 
 could be pertaining to the storage in GORA. 'indexing-backend' is the best 
 name that came to my mind so far - please suggest better ones.
 We should come up with generic map/reduce jobs for indexing, deduplicating 
 and cleaning and maybe add a Nutch extension point there so we can easily 
 hook up indexing, cleaning and deduplicating for various backends.



[jira] [Commented] (NUTCH-1541) Indexer plugin to write CSV

2013-03-06 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13595263#comment-13595263
 ] 

Sebastian Nagel commented on NUTCH-1541:


Yes, the fields dumped are configurable. Of course, they must be available (i.e., 
some indexing filter must add them before). E.g., this will dump the fields url 
and title in default CSV format (there will be a new output directory 
csvindexwriter):
{code}
 bin/nutch org.apache.nutch.indexer.IndexingJob -Dindexer.csv.fields=url,title \
   crawldb/ -linkdb linkdb/ -dir segments/
{code}
Don't forget to activate the plugin indexer-csv. To dump in tab-separated 
format:
{code}
 bin/nutch org.apache.nutch.indexer.IndexingJob \
   -Dindexer.csv.separator=$'\t' -Dindexer.csv.quotechar= \
   -Dindexer.csv.recordsep=$'\n' \
   crawldb/ -linkdb linkdb/ -dir segments/
{code}
So the output is quite configurable.
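
For the "further analysis" use case, here is a minimal sketch of reading such a tab-separated dump with standard tools. The two sample rows are made up and stand in for real output; the temporary file is an assumption and is not something the plugin writes:

```shell
#!/bin/sh
# Hypothetical sample standing in for a tab-separated dump of the fields url and title.
sample=$(mktemp)
printf 'http://example.org/\tExample Home\n' > "$sample"
printf 'http://example.org/about\tAbout Us\n' >> "$sample"

# Extract the second column (title); cut splits on tab by default.
titles=$(cut -f2 "$sample")
echo "$titles"

# Count the rows whose url column starts with a given prefix.
count=$(awk -F '\t' '$1 ~ /^http:\/\/example\.org\// { n++ } END { print n + 0 }' "$sample")
echo "$count"

rm -f "$sample"
```

With the quote character switched off as in the command above, plain cut and awk are enough; the default CSV format with quoting would need a CSV-aware reader instead.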

 Indexer plugin to write CSV
 ---

 Key: NUTCH-1541
 URL: https://issues.apache.org/jira/browse/NUTCH-1541
 Project: Nutch
  Issue Type: New Feature
  Components: indexer
Affects Versions: 1.7
Reporter: Sebastian Nagel
Priority: Minor
 Attachments: NUTCH-1541-v1.patch


 With the new pluggable indexer a simple plugin would be handy to write 
 configurable fields into a CSV file - for further analysis or just for export.



[jira] [Commented] (NUTCH-842) AutoGenerate WebPage code

2013-03-06 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13595264#comment-13595264
 ] 

Hudson commented on NUTCH-842:
--

Integrated in Nutch-nutchgora #519 (See 
[https://builds.apache.org/job/Nutch-nutchgora/519/])
NUTCH-842 AutoGenerate WebPage code (Revision 1453593)

 Result = SUCCESS
lewismc : http://svn.apache.org/viewvc/nutch/branches/2.x/?view=rev&rev=1453593
Files : 
* /nutch/branches/2.x/CHANGES.txt
* /nutch/branches/2.x/build.xml


 AutoGenerate WebPage code
 -

 Key: NUTCH-842
 URL: https://issues.apache.org/jira/browse/NUTCH-842
 Project: Nutch
  Issue Type: Improvement
Affects Versions: nutchgora
Reporter: Doğacan Güney
Assignee: Doğacan Güney
 Fix For: 2.2

 Attachments: NUTCH-842.patch, NUTCH-842-v2.patch


 This issue will track the addition of an ant task that will automatically 
 generate o.a.n.storage.WebPage (and ProtocolStatus and ParseStatus) from 
 src/gora/webpage.avsc.



[jira] [Commented] (NUTCH-1540) Add Gora buffered read and write maximum limits to nutch-default.xml configuration.

2013-03-06 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13595265#comment-13595265
 ] 

Hudson commented on NUTCH-1540:
---

Integrated in Nutch-nutchgora #519 (See 
[https://builds.apache.org/job/Nutch-nutchgora/519/])
NUTCH-1540 Add Gora buffered read and write maximum limits to 
nutch-default.xml configuration. (Revision 1453600)

 Result = SUCCESS
lewismc : http://svn.apache.org/viewvc/nutch/branches/2.x/?view=rev&rev=1453600
Files : 
* /nutch/branches/2.x/CHANGES.txt
* /nutch/branches/2.x/conf/nutch-default.xml


 Add Gora buffered read and write maximum limits to nutch-default.xml 
 configuration.
 ---

 Key: NUTCH-1540
 URL: https://issues.apache.org/jira/browse/NUTCH-1540
 Project: Nutch
  Issue Type: Bug
  Components: storage
Affects Versions: 2.1
Reporter: Lewis John McGibbney
 Fix For: 2.2

 Attachments: NUTCH-1540.patch


 I've been experimenting by using this via the command line for some time. It 
 is starting to annoy me, so I wanted to make this more accessible to us all.
 You can now easily set this in nutch-site.xml
 Patch coming up.



[jira] [Commented] (NUTCH-1541) Indexer plugin to write CSV

2013-03-06 Thread kiran (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13595284#comment-13595284
 ] 

kiran commented on NUTCH-1541:
--

Great! I will give it a try sometime soon this week. 

 Indexer plugin to write CSV
 ---

 Key: NUTCH-1541
 URL: https://issues.apache.org/jira/browse/NUTCH-1541
 Project: Nutch
  Issue Type: New Feature
  Components: indexer
Affects Versions: 1.7
Reporter: Sebastian Nagel
Priority: Minor
 Attachments: NUTCH-1541-v1.patch


 With the new pluggable indexer a simple plugin would be handy to write 
 configurable fields into a CSV file - for further analysis or just for export.



[jira] [Commented] (NUTCH-842) AutoGenerate WebPage code

2013-03-06 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13595364#comment-13595364
 ] 

Hudson commented on NUTCH-842:
--

Integrated in Nutch-2.x-Windows #56 (See 
[https://builds.apache.org/job/Nutch-2.x-Windows/56/])
NUTCH-842 AutoGenerate WebPage code (Revision 1453593)

 Result = FAILURE
lewismc : http://svn.apache.org/viewvc/nutch/branches/2.x/?view=rev&rev=1453593
Files : 
* /nutch/branches/2.x/CHANGES.txt
* /nutch/branches/2.x/build.xml


 AutoGenerate WebPage code
 -

 Key: NUTCH-842
 URL: https://issues.apache.org/jira/browse/NUTCH-842
 Project: Nutch
  Issue Type: Improvement
Affects Versions: nutchgora
Reporter: Doğacan Güney
Assignee: Doğacan Güney
 Fix For: 2.2

 Attachments: NUTCH-842.patch, NUTCH-842-v2.patch


 This issue will track the addition of an ant task that will automatically 
 generate o.a.n.storage.WebPage (and ProtocolStatus and ParseStatus) from 
 src/gora/webpage.avsc.



[jira] [Commented] (NUTCH-1540) Add Gora buffered read and write maximum limits to nutch-default.xml configuration.

2013-03-06 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13595365#comment-13595365
 ] 

Hudson commented on NUTCH-1540:
---

Integrated in Nutch-2.x-Windows #56 (See 
[https://builds.apache.org/job/Nutch-2.x-Windows/56/])
NUTCH-1540 Add Gora buffered read and write maximum limits to 
nutch-default.xml configuration. (Revision 1453600)

 Result = FAILURE
lewismc : http://svn.apache.org/viewvc/nutch/branches/2.x/?view=rev&rev=1453600
Files : 
* /nutch/branches/2.x/CHANGES.txt
* /nutch/branches/2.x/conf/nutch-default.xml


 Add Gora buffered read and write maximum limits to nutch-default.xml 
 configuration.
 ---

 Key: NUTCH-1540
 URL: https://issues.apache.org/jira/browse/NUTCH-1540
 Project: Nutch
  Issue Type: Bug
  Components: storage
Affects Versions: 2.1
Reporter: Lewis John McGibbney
 Fix For: 2.2

 Attachments: NUTCH-1540.patch


 I've been experimenting by using this via the command line for some time. It 
 is starting to annoy me, so I wanted to make this more accessible to us all.
 You can now easily set this in nutch-site.xml
 Patch coming up.



Build failed in Jenkins: Nutch-trunk #2143

2013-03-06 Thread Apache Jenkins Server
See https://builds.apache.org/job/Nutch-trunk/2143/

--
[...truncated 5503 lines...]

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 
/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/ivy/ivysettings.xml

compile:
 [echo] Compiling plugin: urlfilter-suffix
[javac] 
/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/src/plugin/build-plugin.xml:117:
 warning: 'includeantruntime' was not set, defaulting to 
build.sysclasspath=last; set to false for repeatable builds

compile-test:
[javac] 
/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/src/plugin/build-plugin.xml:180:
 warning: 'includeantruntime' was not set, defaulting to 
build.sysclasspath=last; set to false for repeatable builds
[javac] Compiling 1 source file to 
/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/build/urlfilter-suffix/test
[javac] warning: [options] bootstrap class path not set in conjunction with 
-source 1.6
[javac] 1 warning

jar:

deps-test:

deploy:

copy-generated-lib:

test:
 [echo] Testing plugin: urlfilter-suffix
[junit] WARNING: multiple versions of ant detected in path for junit 
[junit]  
jar:file:/home/hudson/tools/ant/latest/lib/ant.jar!/org/apache/tools/ant/Project.class
[junit]  and 
jar:file:/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/build/lib/ant-1.6.5.jar!/org/apache/tools/ant/Project.class
[junit] Running org.apache.nutch.urlfilter.suffix.TestSuffixURLFilter
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 4.411 sec
[junit] Tests run: 6, Failures: 0, Errors: 0, Time elapsed: 0.176 sec

init:

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 
/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/ivy/ivysettings.xml

init:

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 
/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/ivy/ivysettings.xml

compile:
 [echo] Compiling plugin: urlnormalizer-basic
[javac] 
/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/src/plugin/build-plugin.xml:117:
 warning: 'includeantruntime' was not set, defaulting to 
build.sysclasspath=last; set to false for repeatable builds

compile:
 [echo] Compiling plugin: urlfilter-validator
[javac] 
/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/src/plugin/build-plugin.xml:117:
 warning: 'includeantruntime' was not set, defaulting to 
build.sysclasspath=last; set to false for repeatable builds

compile-test:

compile-test:
[javac] 
/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/src/plugin/build-plugin.xml:180:
 warning: 'includeantruntime' was not set, defaulting to 
build.sysclasspath=last; set to false for repeatable builds
[javac] 
/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/src/plugin/build-plugin.xml:180:
 warning: 'includeantruntime' was not set, defaulting to 
build.sysclasspath=last; set to false for repeatable builds
[javac] Compiling 1 source file to 
/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/build/urlfilter-validator/test
[javac] Compiling 1 source file to 
/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/build/urlnormalizer-basic/test
[javac] warning: [options] bootstrap class path not set in conjunction with 
-source 1.6
[javac] warning: [options] bootstrap class path not set in conjunction with 
-source 1.6
[javac] 1 warning

jar:

deps-test:

deploy:

copy-generated-lib:

test:
 [echo] Testing plugin: urlfilter-validator
[javac] 1 warning

jar:

deps-test:

deploy:

copy-generated-lib:

test:
 [echo] Testing plugin: urlnormalizer-basic
[junit] WARNING: multiple versions of ant detected in path for junit 
[junit]  
jar:file:/home/hudson/tools/ant/latest/lib/ant.jar!/org/apache/tools/ant/Project.class
[junit]  and 
jar:file:/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/build/lib/ant-1.6.5.jar!/org/apache/tools/ant/Project.class
[junit] WARNING: multiple versions of ant detected in path for junit 
[junit]  
jar:file:/home/hudson/tools/ant/latest/lib/ant.jar!/org/apache/tools/ant/Project.class
[junit]  and 
jar:file:/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/build/lib/ant-1.6.5.jar!/org/apache/tools/ant/Project.class
[junit] Running org.apache.nutch.urlfilter.validator.TestUrlValidator
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.021 sec
[junit] Running 

[jira] [Updated] (NUTCH-1542) adddays param for generator not present in 2.x

2013-03-06 Thread Tejas Patil (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil updated NUTCH-1542:
---

Summary: adddays param for generator not present in 2.x  (was: -adddays 
param for generator not present in 2.x)

 adddays param for generator not present in 2.x
 --

 Key: NUTCH-1542
 URL: https://issues.apache.org/jira/browse/NUTCH-1542
 Project: Nutch
  Issue Type: Bug
  Components: generator
Affects Versions: 2.1
Reporter: Tejas Patil
Assignee: Tejas Patil
Priority: Minor
 Fix For: 2.2


 In 1.x, Generator had this param which could be used as a hack to crawl urls 
 which were due to be fetched in the future. In 2.x, this param is not present. It's not 
 clear why this was not ported from 1.x to 2.x. Unless it was left out for a 
 strong reason, we should have it in 2.x as well.



[jira] [Created] (NUTCH-1542) -adddays param for generator not present in 2.x

2013-03-06 Thread Tejas Patil (JIRA)
Tejas Patil created NUTCH-1542:
--

 Summary: -adddays param for generator not present in 2.x
 Key: NUTCH-1542
 URL: https://issues.apache.org/jira/browse/NUTCH-1542
 Project: Nutch
  Issue Type: Bug
  Components: generator
Affects Versions: 2.1
Reporter: Tejas Patil
Assignee: Tejas Patil
Priority: Minor
 Fix For: 2.2


In 1.x, Generator had this param which could be used as a hack to crawl urls 
which were due to be fetched in the future. In 2.x, this param is not present. It's not 
clear why this was not ported from 1.x to 2.x. Unless it was left out for a 
strong reason, we should have it in 2.x as well.
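
To illustrate why the param is useful, here is a rough sketch of the selection logic, assuming adddays simply moves the reference time forward by whole days. The function name is_due and the timestamps are hypothetical and are not taken from Generator or GeneratorJob:

```shell
#!/bin/sh
# Sketch: a URL counts as due when its fetch time is at or before now + adddays.
SECONDS_PER_DAY=86400

is_due() {
  # $1 = fetch time (epoch seconds), $2 = current time (epoch seconds), $3 = days to add
  horizon=$(( $2 + $3 * $SECONDS_PER_DAY ))
  [ "$1" -le "$horizon" ]
}

now=1362528000                                   # illustrative fixed timestamp
fetch_time=$(( now + 5 * $SECONDS_PER_DAY ))     # a URL that becomes due in 5 days

if is_due "$fetch_time" "$now" 0; then echo "selected with adddays=0"; else echo "skipped with adddays=0"; fi
if is_due "$fetch_time" "$now" 7; then echo "selected with adddays=7"; else echo "skipped with adddays=7"; fi
```

Under that assumption, a URL scheduled five days ahead is skipped by a normal run and picked up once the horizon is pushed out by seven days.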



[jira] [Updated] (NUTCH-1542) adddays param for generator not present in 2.x

2013-03-06 Thread Tejas Patil (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil updated NUTCH-1542:
---

Attachment: NUTCH-1542.patch

Patch for changes in GeneratorJob and the crawl script.

 adddays param for generator not present in 2.x
 --

 Key: NUTCH-1542
 URL: https://issues.apache.org/jira/browse/NUTCH-1542
 Project: Nutch
  Issue Type: Bug
  Components: generator
Affects Versions: 2.1
Reporter: Tejas Patil
Assignee: Tejas Patil
Priority: Minor
 Fix For: 2.2

 Attachments: NUTCH-1542.patch


 In 1.x, Generator had this param which could be used as a hack to crawl urls 
 which were due to be fetched in the future. In 2.x, this param is not present. It's not 
 clear why this was not ported from 1.x to 2.x. Unless it was left out for a 
 strong reason, we should have it in 2.x as well.



[jira] [Updated] (NUTCH-961) Expose Tika's boilerpipe support

2013-03-06 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-961:
---

Fix Version/s: 2.2

 Expose Tika's boilerpipe support
 

 Key: NUTCH-961
 URL: https://issues.apache.org/jira/browse/NUTCH-961
 Project: Nutch
  Issue Type: New Feature
  Components: parser
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.7, 2.2

 Attachments: BoilerpipeExtractorRepository.java, 
 NUTCH-961-1.3-3.patch, NUTCH-961-1.3-tikaparser1.patch, 
 NUTCH-961-1.3-tikaparser.patch, NUTCH-961-1.4-dombuilder-1.patch, 
 NUTCH-961-1.5-1.patch, NUTCH-961-2.1-v1.patch, NUTCH-961-2.1-v2.patch, 
 NUTCH-961v2.patch


 Tika 0.8 comes with the Boilerpipe content handler which can be used to 
 extract boilerplate content from HTML pages. We should see how we can expose 
 Boilerpipe in the Nutch configuration.



[jira] [Updated] (NUTCH-1393) Display consistent usage of GeneratorJob with 1.X

2013-03-06 Thread lufeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lufeng updated NUTCH-1393:
--

Attachment: NUTCH-1393.patch

Add help information when no params are given.
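
Roughly, the intended behaviour could be sketched as below. This is an illustration only, not the contents of the patch; run_generate and the usage text are placeholders:

```shell
#!/bin/sh
# Hypothetical wrapper: print usage instead of starting a job when no arguments are given.
usage_text='Usage: generate [options]   (placeholder text, not the real GeneratorJob usage)'

run_generate() {
  if [ "$#" -eq 0 ]; then
    echo "$usage_text" >&2
    return 2
  fi
  # A real script would hand the arguments to the job here; this sketch only echoes them.
  echo "would run generate with: $*"
}

# Example arguments only.
run_generate -topN 10
# With no arguments the function prints usage to stderr and returns non-zero.
run_generate || echo "no arguments: usage printed, exit status $?"
```

The point is only that the argument check happens before any work starts, so a bare invocation cannot generate a batch by accident.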

 Display consistent usage of GeneratorJob with 1.X
 -

 Key: NUTCH-1393
 URL: https://issues.apache.org/jira/browse/NUTCH-1393
 Project: Nutch
  Issue Type: Bug
  Components: administration gui, generator
Affects Versions: nutchgora
Reporter: Lewis John McGibbney
 Fix For: 2.2

 Attachments: NUTCH-1393.patch


 If we pass the generate argument to the nutch script, the Generator 
 auto-springs into action and begins generating fetchlists. This should not be 
 the case, instead it should print traditional usage to stdout. An example is 
 below
 {code}
 lewis@lewis:~/ASF/nutchgora/runtime/local$ ./bin/nutch generate
 GeneratorJob: Selecting best-scoring urls due for fetch.
 GeneratorJob: starting
 GeneratorJob: filtering: true
 GeneratorJob: done
 GeneratorJob: generated batch id: 1339628223-1694200031
 {code}
 All I wanted to do was get the usage params printed to stdout but instead it 
 generated my batch willy nilly.  
