[jira] [Commented] (NUTCH-1478) Parse-metatags and index-metadata plugin for Nutch 2.x series
[ https://issues.apache.org/jira/browse/NUTCH-1478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13594500#comment-13594500 ] Roland commented on NUTCH-1478: --- +1, works fine for me. Thank you, kiran. Parse-metatags and index-metadata plugin for Nutch 2.x series -- Key: NUTCH-1478 URL: https://issues.apache.org/jira/browse/NUTCH-1478 Project: Nutch Issue Type: Improvement Components: parser Affects Versions: 2.1 Reporter: kiran Fix For: 2.2 Attachments: metadata_parseChecker_sites.png, Nutch1478.patch, Nutch1478.zip I have ported the parse-metatags and index-metadata plugins to the Nutch 2.x series. The port takes multiple values of the same tag and indexes them in Solr, as in my earlier patch (https://issues.apache.org/jira/browse/NUTCH-1467). The usage is the same as described here (http://wiki.apache.org/nutch/IndexMetatags), with one change: there is no need to put the 'metatag' keyword before metatag names. For example, my configuration looks like this (https://github.com/salvager/NutchDev/blob/master/runtime/local/conf/nutch-site.xml). This is only the first version and does not include the JUnit test. I will upload an updated version soon. The plugins parse the tags and index them in Solr. Make sure that the fields listed under 'index.parse.md' in nutch-site.xml are also defined in Solr's schema.xml. Please let me know if you have any suggestions. This is supported by DLA (Digital Library and Archives) of Virginia Tech. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
Re: [DISCUSS] Google Summer of Code
Hi Kiran, On Tue, Mar 5, 2013 at 8:11 PM, dev-digest-h...@nutch.apache.org wrote: [DISCUSS] Google Summer of Code 22402 by: Lewis John Mcgibbney 22403 by: kiran chitturi Please see http://s.apache.org/1sM. Also, I would ask you to consider the following (BTW, this is direct feedback I got from the Apache GSoC admins): 1. What is the likelihood/danger of you being too busy in a new job (post graduation) to do GSoC? You can think about this, but I suppose we can only make a judgement call after having discussed it with you. 2. GSoC is designed as a full-time program, so even an additional internship or a part-time job, let alone a full-time job, is a danger to successful participation and is generally discouraged by Apache admins. I personally would like to get your opinions on the above before we progress with this. I have confidence in your work and work ethic, but I suppose it's just a case of determining whether you can fit this in around your life after graduation. Thanks, Lewis
[jira] [Resolved] (NUTCH-842) AutoGenerate WebPage code
[ https://issues.apache.org/jira/browse/NUTCH-842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney resolved NUTCH-842. Resolution: Fixed Committed @revision 1453593 in 2.x HEAD AutoGenerate WebPage code - Key: NUTCH-842 URL: https://issues.apache.org/jira/browse/NUTCH-842 Project: Nutch Issue Type: Improvement Affects Versions: nutchgora Reporter: Doğacan Güney Assignee: Doğacan Güney Fix For: 2.2 Attachments: NUTCH-842.patch, NUTCH-842-v2.patch This issue will track the addition of an ant task that will automatically generate o.a.n.storage.WebPage (and ProtocolStatus and ParseStatus) from src/gora/webpage.avsc.
[jira] [Created] (NUTCH-1540) Add Gora buffered read and write maximum limits to nutch-default.xml configuration.
Lewis John McGibbney created NUTCH-1540: --- Summary: Add Gora buffered read and write maximum limits to nutch-default.xml configuration. Key: NUTCH-1540 URL: https://issues.apache.org/jira/browse/NUTCH-1540 Project: Nutch Issue Type: Bug Components: storage Affects Versions: 2.1 Reporter: Lewis John McGibbney Fix For: 2.2 I've been setting these limits via the command line for some time. It is starting to annoy me, so I wanted to make them more accessible to us all. You can now easily set them in nutch-site.xml. Patch coming up.
[jira] [Updated] (NUTCH-1540) Add Gora buffered read and write maximum limits to nutch-default.xml configuration.
[ https://issues.apache.org/jira/browse/NUTCH-1540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1540: Attachment: NUTCH-1540.patch Patch for 2.x HEAD Add Gora buffered read and write maximum limits to nutch-default.xml configuration. --- Key: NUTCH-1540 URL: https://issues.apache.org/jira/browse/NUTCH-1540 Project: Nutch Issue Type: Bug Components: storage Affects Versions: 2.1 Reporter: Lewis John McGibbney Fix For: 2.2 Attachments: NUTCH-1540.patch I've been setting these limits via the command line for some time. It is starting to annoy me, so I wanted to make them more accessible to us all. You can now easily set them in nutch-site.xml. Patch coming up.
[jira] [Resolved] (NUTCH-1540) Add Gora buffered read and write maximum limits to nutch-default.xml configuration.
[ https://issues.apache.org/jira/browse/NUTCH-1540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney resolved NUTCH-1540. - Resolution: Fixed Committed @revision 1453600 in 2.x HEAD Add Gora buffered read and write maximum limits to nutch-default.xml configuration. --- Key: NUTCH-1540 URL: https://issues.apache.org/jira/browse/NUTCH-1540 Project: Nutch Issue Type: Bug Components: storage Affects Versions: 2.1 Reporter: Lewis John McGibbney Fix For: 2.2 Attachments: NUTCH-1540.patch I've been setting these limits via the command line for some time. It is starting to annoy me, so I wanted to make them more accessible to us all. You can now easily set them in nutch-site.xml. Patch coming up.
[jira] [Created] (NUTCH-1541) Indexer plugin to write CSV
Sebastian Nagel created NUTCH-1541: -- Summary: Indexer plugin to write CSV Key: NUTCH-1541 URL: https://issues.apache.org/jira/browse/NUTCH-1541 Project: Nutch Issue Type: New Feature Components: indexer Affects Versions: 1.7 Reporter: Sebastian Nagel Priority: Minor With the new pluggable indexer a simple plugin would be handy to write configurable fields into a CSV file - for further analysis or just for export.
[jira] [Updated] (NUTCH-1541) Indexer plugin to write CSV
[ https://issues.apache.org/jira/browse/NUTCH-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1541: --- Attachment: NUTCH-1541-v1.patch First version. NOTE: NUTCH-1047 is required; the targets for indexer-csv must be added manually to the main build.xml. Indexer plugin to write CSV --- Key: NUTCH-1541 URL: https://issues.apache.org/jira/browse/NUTCH-1541 Project: Nutch Issue Type: New Feature Components: indexer Affects Versions: 1.7 Reporter: Sebastian Nagel Priority: Minor Attachments: NUTCH-1541-v1.patch With the new pluggable indexer a simple plugin would be handy to write configurable fields into a CSV file - for further analysis or just for export.
[jira] [Commented] (NUTCH-1047) Pluggable indexing backends
[ https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13595252#comment-13595252 ] Sebastian Nagel commented on NUTCH-1047: Hi Julien, overall, all looks good. A first version of the CSV indexer is ready (NUTCH-1541) and works well with the last v5 patch. One point we should improve is the command-line help. I agree with Tejas that the help should list all required arguments. Of course, you are right that the index/cleaning jobs are backend-neutral, but then it would be preferable to have new commands index and indexclean. They are also required if other indexer back-ends are used. We can keep the solr* commands for legacy and because they are handy. A few additional lines to generate the prior help text are tolerable and could avoid unnecessary user requests on the mailing list. The describe() method is a good idea. The new commands will then show sufficient help, but IndexingJob/CleaningJob should also call describe() when help is shown! Some trivialities to get the Java docs right: * default.properties - need to add the new plugins.indexer group with indexer-solr as member * build.xml - add group referring to plugins.indexer, add Java doc targets for indexer-solr Pluggable indexing backends --- Key: NUTCH-1047 URL: https://issues.apache.org/jira/browse/NUTCH-1047 Project: Nutch Issue Type: New Feature Components: indexer Reporter: Julien Nioche Assignee: Julien Nioche Labels: indexing Fix For: 1.7 Attachments: NUTCH-1047-1.x-v1.patch, NUTCH-1047-1.x-v2.patch, NUTCH-1047-1.x-v3.patch, NUTCH-1047-1.x-v4.patch, NUTCH-1047-1.x-v5.patch One possible feature would be to add a new endpoint for indexing-backends and make the indexing pluggable. At the moment we are hardwired to SOLR - which is OK - but as other resources like ElasticSearch are becoming more popular it would be better to handle this as plugins.
Not sure about the name of the endpoint though: we already have indexing-plugins (which are about generating fields sent to the backends) and moreover the backends are not necessarily for indexing / searching but could be just an external storage, e.g. CouchDB. The term backend on its own would be confusing in 2.0 as this could be pertaining to the storage in GORA. 'indexing-backend' is the best name that came to my mind so far - please suggest better ones. We should come up with generic map/reduce jobs for indexing, deduplicating and cleaning and maybe add a Nutch extension point there so we can easily hook up indexing, cleaning and deduplicating for various backends.
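The suggestion in the comment above, that the backend-neutral jobs should call describe() when help is shown, could look roughly like the following. This is only a minimal sketch: BackendSketch and IndexHelpSketch are hypothetical names invented for illustration and are not the interface or classes from the NUTCH-1047 patches.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for an indexing backend; not Nutch's actual interface. */
interface BackendSketch {
    /** Returns a short human-readable description of the backend and its settings. */
    String describe();
}

final class IndexHelpSketch {

    /** Builds the help text: the job's own usage line plus one line per active backend. */
    static String usage(String baseUsage, List<BackendSketch> activeBackends) {
        StringBuilder sb = new StringBuilder(baseUsage).append('\n');
        if (activeBackends.isEmpty()) {
            sb.append("No indexing backend is active.\n");
        } else {
            sb.append("Active indexing backends:\n");
            for (BackendSketch backend : activeBackends) {
                sb.append("  ").append(backend.describe()).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // A made-up backend description, used only to exercise the sketch.
        BackendSketch csv = () -> "csv-sketch: writes the configured fields to a CSV file";
        System.out.print(usage("Usage: <index command> <arguments>", Arrays.asList(csv)));
    }
}
```

The point of the design is that the job itself stays backend-neutral while the help output still tells the user which arguments or properties each active backend expects.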
[jira] [Commented] (NUTCH-1541) Indexer plugin to write CSV
[ https://issues.apache.org/jira/browse/NUTCH-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13595263#comment-13595263 ] Sebastian Nagel commented on NUTCH-1541: Yes, the fields dumped are configurable. Of course, they must be available (i.e., some indexing filter must add them before). E.g., this will dump the fields url and title in default CSV format (there will be a new output directory csvindexwriter):
{code}
bin/nutch org.apache.nutch.indexer.IndexingJob -Dindexer.csv.fields=url,title \
  crawldb/ -linkdb linkdb/ -dir segments/
{code}
Don't forget to activate the plugin indexer-csv. To dump in tab-separated format:
{code}
bin/nutch org.apache.nutch.indexer.IndexingJob \
  -Dindexer.csv.separator=$'\t' -Dindexer.csv.quotechar= -Dindexer.csv.recordsep=$'\n' \
  crawldb/ -linkdb linkdb/ -dir segments/
{code}
So the output is quite configurable. Indexer plugin to write CSV --- Key: NUTCH-1541 URL: https://issues.apache.org/jira/browse/NUTCH-1541 Project: Nutch Issue Type: New Feature Components: indexer Affects Versions: 1.7 Reporter: Sebastian Nagel Priority: Minor Attachments: NUTCH-1541-v1.patch With the new pluggable indexer a simple plugin would be handy to write configurable fields into a CSV file - for further analysis or just for export.
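To make the effect of the separator, quote character and record separator settings in the comment above more concrete, here is a minimal, self-contained sketch of how one record might be assembled. It is not the code from NUTCH-1541-v1.patch: the class name CsvRecordSketch and the quoting rule (quote only when needed, double embedded quote characters) are assumptions chosen for illustration.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch; not the code from NUTCH-1541-v1.patch. */
final class CsvRecordSketch {

    /**
     * Joins the field values into one record.
     * An empty quoteChar disables quoting, like -Dindexer.csv.quotechar= in the comment above.
     */
    static String formatRecord(List<String> values, String separator,
                               String quoteChar, String recordSeparator) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < values.size(); i++) {
            if (i > 0) {
                sb.append(separator);
            }
            sb.append(quote(values.get(i), separator, quoteChar, recordSeparator));
        }
        return sb.append(recordSeparator).toString();
    }

    /** Assumed rule: quote only when needed and double any embedded quote character. */
    private static String quote(String value, String separator,
                                String quoteChar, String recordSeparator) {
        if (quoteChar.isEmpty()) {
            return value; // quoting disabled
        }
        boolean needsQuoting = value.contains(separator)
                || value.contains(quoteChar)
                || value.contains(recordSeparator);
        if (!needsQuoting) {
            return value;
        }
        return quoteChar + value.replace(quoteChar, quoteChar + quoteChar) + quoteChar;
    }

    public static void main(String[] args) {
        List<String> row = Arrays.asList("http://example.org/", "Hello, \"Nutch\"");
        // Comma-separated with double quotes as the quote character.
        System.out.print(formatRecord(row, ",", "\"", "\r\n"));
        // Tab-separated without quoting.
        System.out.print(formatRecord(row, "\t", "", "\n"));
    }
}
```

With quoting disabled, a value that itself contains the separator would corrupt the record, so the tab-separated variant is only safe when the dumped fields are known not to contain tabs or newlines.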
[jira] [Commented] (NUTCH-842) AutoGenerate WebPage code
[ https://issues.apache.org/jira/browse/NUTCH-842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13595264#comment-13595264 ] Hudson commented on NUTCH-842: -- Integrated in Nutch-nutchgora #519 (See [https://builds.apache.org/job/Nutch-nutchgora/519/]) NUTCH-842 AutoGenerate WebPage code (Revision 1453593) Result = SUCCESS lewismc : http://svn.apache.org/viewvc/nutch/branches/2.x/?view=rev&rev=1453593 Files : * /nutch/branches/2.x/CHANGES.txt * /nutch/branches/2.x/build.xml AutoGenerate WebPage code - Key: NUTCH-842 URL: https://issues.apache.org/jira/browse/NUTCH-842 Project: Nutch Issue Type: Improvement Affects Versions: nutchgora Reporter: Doğacan Güney Assignee: Doğacan Güney Fix For: 2.2 Attachments: NUTCH-842.patch, NUTCH-842-v2.patch This issue will track the addition of an ant task that will automatically generate o.a.n.storage.WebPage (and ProtocolStatus and ParseStatus) from src/gora/webpage.avsc.
[jira] [Commented] (NUTCH-1540) Add Gora buffered read and write maximum limits to nutch-default.xml configuration.
[ https://issues.apache.org/jira/browse/NUTCH-1540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13595265#comment-13595265 ] Hudson commented on NUTCH-1540: --- Integrated in Nutch-nutchgora #519 (See [https://builds.apache.org/job/Nutch-nutchgora/519/]) NUTCH-1540 Add Gora buffered read and write maximum limits to nutch-default.xml configuration. (Revision 1453600) Result = SUCCESS lewismc : http://svn.apache.org/viewvc/nutch/branches/2.x/?view=rev&rev=1453600 Files : * /nutch/branches/2.x/CHANGES.txt * /nutch/branches/2.x/conf/nutch-default.xml Add Gora buffered read and write maximum limits to nutch-default.xml configuration. --- Key: NUTCH-1540 URL: https://issues.apache.org/jira/browse/NUTCH-1540 Project: Nutch Issue Type: Bug Components: storage Affects Versions: 2.1 Reporter: Lewis John McGibbney Fix For: 2.2 Attachments: NUTCH-1540.patch I've been setting these limits via the command line for some time. It is starting to annoy me, so I wanted to make them more accessible to us all. You can now easily set them in nutch-site.xml. Patch coming up.
[jira] [Commented] (NUTCH-1541) Indexer plugin to write CSV
[ https://issues.apache.org/jira/browse/NUTCH-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13595284#comment-13595284 ] kiran commented on NUTCH-1541: -- Great! I will give it a try sometime soon this week. Indexer plugin to write CSV --- Key: NUTCH-1541 URL: https://issues.apache.org/jira/browse/NUTCH-1541 Project: Nutch Issue Type: New Feature Components: indexer Affects Versions: 1.7 Reporter: Sebastian Nagel Priority: Minor Attachments: NUTCH-1541-v1.patch With the new pluggable indexer a simple plugin would be handy to write configurable fields into a CSV file - for further analysis or just for export.
[jira] [Commented] (NUTCH-842) AutoGenerate WebPage code
[ https://issues.apache.org/jira/browse/NUTCH-842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13595364#comment-13595364 ] Hudson commented on NUTCH-842: -- Integrated in Nutch-2.x-Windows #56 (See [https://builds.apache.org/job/Nutch-2.x-Windows/56/]) NUTCH-842 AutoGenerate WebPage code (Revision 1453593) Result = FAILURE lewismc : http://svn.apache.org/viewvc/nutch/branches/2.x/?view=rev&rev=1453593 Files : * /nutch/branches/2.x/CHANGES.txt * /nutch/branches/2.x/build.xml AutoGenerate WebPage code - Key: NUTCH-842 URL: https://issues.apache.org/jira/browse/NUTCH-842 Project: Nutch Issue Type: Improvement Affects Versions: nutchgora Reporter: Doğacan Güney Assignee: Doğacan Güney Fix For: 2.2 Attachments: NUTCH-842.patch, NUTCH-842-v2.patch This issue will track the addition of an ant task that will automatically generate o.a.n.storage.WebPage (and ProtocolStatus and ParseStatus) from src/gora/webpage.avsc.
[jira] [Commented] (NUTCH-1540) Add Gora buffered read and write maximum limits to nutch-default.xml configuration.
[ https://issues.apache.org/jira/browse/NUTCH-1540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13595365#comment-13595365 ] Hudson commented on NUTCH-1540: --- Integrated in Nutch-2.x-Windows #56 (See [https://builds.apache.org/job/Nutch-2.x-Windows/56/]) NUTCH-1540 Add Gora buffered read and write maximum limits to nutch-default.xml configuration. (Revision 1453600) Result = FAILURE lewismc : http://svn.apache.org/viewvc/nutch/branches/2.x/?view=rev&rev=1453600 Files : * /nutch/branches/2.x/CHANGES.txt * /nutch/branches/2.x/conf/nutch-default.xml Add Gora buffered read and write maximum limits to nutch-default.xml configuration. --- Key: NUTCH-1540 URL: https://issues.apache.org/jira/browse/NUTCH-1540 Project: Nutch Issue Type: Bug Components: storage Affects Versions: 2.1 Reporter: Lewis John McGibbney Fix For: 2.2 Attachments: NUTCH-1540.patch I've been setting these limits via the command line for some time. It is starting to annoy me, so I wanted to make them more accessible to us all. You can now easily set them in nutch-site.xml. Patch coming up.
Build failed in Jenkins: Nutch-trunk #2143
See https://builds.apache.org/job/Nutch-trunk/2143/ -- [...truncated 5503 lines...] deps-jar: clean-lib: resolve-default: [ivy:resolve] :: loading settings :: file = /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/ivy/ivysettings.xml compile: [echo] Compiling plugin: urlfilter-suffix [javac] /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/src/plugin/build-plugin.xml:117: warning: 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set to false for repeatable builds compile-test: [javac] /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/src/plugin/build-plugin.xml:180: warning: 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set to false for repeatable builds [javac] Compiling 1 source file to /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/build/urlfilter-suffix/test [javac] warning: [options] bootstrap class path not set in conjunction with -source 1.6 [javac] 1 warning jar: deps-test: deploy: copy-generated-lib: test: [echo] Testing plugin: urlfilter-suffix [junit] WARNING: multiple versions of ant detected in path for junit [junit] jar:file:/home/hudson/tools/ant/latest/lib/ant.jar!/org/apache/tools/ant/Project.class [junit] and jar:file:/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/build/lib/ant-1.6.5.jar!/org/apache/tools/ant/Project.class [junit] Running org.apache.nutch.urlfilter.suffix.TestSuffixURLFilter [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 4.411 sec [junit] Tests run: 6, Failures: 0, Errors: 0, Time elapsed: 0.176 sec init: init-plugin: deps-jar: clean-lib: resolve-default: [ivy:resolve] :: loading settings :: file = /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/ivy/ivysettings.xml init: init-plugin: deps-jar: clean-lib: resolve-default: [ivy:resolve] :: loading settings :: file = 
/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/ivy/ivysettings.xml compile: [echo] Compiling plugin: urlnormalizer-basic [javac] /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/src/plugin/build-plugin.xml:117: warning: 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set to false for repeatable builds compile: [echo] Compiling plugin: urlfilter-validator [javac] /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/src/plugin/build-plugin.xml:117: warning: 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set to false for repeatable builds compile-test: compile-test: [javac] /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/src/plugin/build-plugin.xml:180: warning: 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set to false for repeatable builds [javac] /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/src/plugin/build-plugin.xml:180: warning: 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set to false for repeatable builds [javac] Compiling 1 source file to /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/build/urlfilter-validator/test [javac] Compiling 1 source file to /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/build/urlnormalizer-basic/test [javac] warning: [options] bootstrap class path not set in conjunction with -source 1.6 [javac] warning: [options] bootstrap class path not set in conjunction with -source 1.6 [javac] 1 warning jar: deps-test: deploy: copy-generated-lib: test: [echo] Testing plugin: urlfilter-validator [javac] 1 warning jar: deps-test: deploy: copy-generated-lib: test: [echo] Testing plugin: urlnormalizer-basic [junit] WARNING: multiple versions of ant detected in path for junit [junit] 
jar:file:/home/hudson/tools/ant/latest/lib/ant.jar!/org/apache/tools/ant/Project.class [junit] and jar:file:/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/build/lib/ant-1.6.5.jar!/org/apache/tools/ant/Project.class [junit] WARNING: multiple versions of ant detected in path for junit [junit] jar:file:/home/hudson/tools/ant/latest/lib/ant.jar!/org/apache/tools/ant/Project.class [junit] and jar:file:/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/build/lib/ant-1.6.5.jar!/org/apache/tools/ant/Project.class [junit] Running org.apache.nutch.urlfilter.validator.TestUrlValidator [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.021 sec [junit] Running
[jira] [Updated] (NUTCH-1542) adddays param for generator not present in 2.x
[ https://issues.apache.org/jira/browse/NUTCH-1542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil updated NUTCH-1542: --- Summary: adddays param for generator not present in 2.x (was: -adddays param for generator not present in 2.x) adddays param for generator not present in 2.x -- Key: NUTCH-1542 URL: https://issues.apache.org/jira/browse/NUTCH-1542 Project: Nutch Issue Type: Bug Components: generator Affects Versions: 2.1 Reporter: Tejas Patil Assignee: Tejas Patil Priority: Minor Fix For: 2.2 In 1.x, Generator had this param, which could be used as a hack to crawl URLs that were due to be fetched in the future. In 2.x, this param is not present. It's not clear why it was not ported from 1.x to 2.x. Unless it was left out for a strong reason, we should have it in 2.x as well.
[jira] [Created] (NUTCH-1542) -adddays param for generator not present in 2.x
Tejas Patil created NUTCH-1542: -- Summary: -adddays param for generator not present in 2.x Key: NUTCH-1542 URL: https://issues.apache.org/jira/browse/NUTCH-1542 Project: Nutch Issue Type: Bug Components: generator Affects Versions: 2.1 Reporter: Tejas Patil Assignee: Tejas Patil Priority: Minor Fix For: 2.2 In 1.x, Generator had this param, which could be used as a hack to crawl URLs that were due to be fetched in the future. In 2.x, this param is not present. It's not clear why it was not ported from 1.x to 2.x. Unless it was left out for a strong reason, we should have it in 2.x as well.
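The "hack" described in the issue above can be pictured as shifting the generator's notion of the current time forward by the given number of days, so that URLs scheduled for the near future already count as due. The following is a minimal sketch under that assumption; AddDaysSketch and its methods are hypothetical names and are not taken from the 1.x Generator or from NUTCH-1542.patch.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of an adddays-style time shift; not Nutch's actual code. */
final class AddDaysSketch {

    /** Returns the reference time that scheduled fetch times are compared against. */
    static long effectiveTime(long nowMillis, int addDays) {
        return nowMillis + TimeUnit.DAYS.toMillis(addDays);
    }

    /** A URL counts as due when its scheduled fetch time is not after the reference time. */
    static boolean isDue(long fetchTimeMillis, long nowMillis, int addDays) {
        return fetchTimeMillis <= effectiveTime(nowMillis, addDays);
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        long inThreeDays = now + TimeUnit.DAYS.toMillis(3);
        System.out.println(isDue(inThreeDays, now, 0)); // prints false: not yet due
        System.out.println(isDue(inThreeDays, now, 7)); // prints true: due once shifted by 7 days
    }
}
```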
[jira] [Updated] (NUTCH-1542) adddays param for generator not present in 2.x
[ https://issues.apache.org/jira/browse/NUTCH-1542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil updated NUTCH-1542: --- Attachment: NUTCH-1542.patch Patch for changes in GeneratorJob and the crawl script. adddays param for generator not present in 2.x -- Key: NUTCH-1542 URL: https://issues.apache.org/jira/browse/NUTCH-1542 Project: Nutch Issue Type: Bug Components: generator Affects Versions: 2.1 Reporter: Tejas Patil Assignee: Tejas Patil Priority: Minor Fix For: 2.2 Attachments: NUTCH-1542.patch In 1.x, Generator had this param, which could be used as a hack to crawl URLs that were due to be fetched in the future. In 2.x, this param is not present. It's not clear why it was not ported from 1.x to 2.x. Unless it was left out for a strong reason, we should have it in 2.x as well.
[jira] [Updated] (NUTCH-961) Expose Tika's boilerpipe support
[ https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-961: --- Fix Version/s: 2.2 Expose Tika's boilerpipe support Key: NUTCH-961 URL: https://issues.apache.org/jira/browse/NUTCH-961 Project: Nutch Issue Type: New Feature Components: parser Reporter: Markus Jelsma Assignee: Markus Jelsma Fix For: 1.7, 2.2 Attachments: BoilerpipeExtractorRepository.java, NUTCH-961-1.3-3.patch, NUTCH-961-1.3-tikaparser1.patch, NUTCH-961-1.3-tikaparser.patch, NUTCH-961-1.4-dombuilder-1.patch, NUTCH-961-1.5-1.patch, NUTCH-961-2.1-v1.patch, NUTCH-961-2.1-v2.patch, NUTCH-961v2.patch Tika 0.8 comes with the Boilerpipe content handler, which can be used to extract boilerplate content from HTML pages. We should see how we can expose Boilerpipe in the Nutch configuration.
[jira] [Updated] (NUTCH-1393) Display consistent usage of GeneratorJob with 1.X
[ https://issues.apache.org/jira/browse/NUTCH-1393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lufeng updated NUTCH-1393: -- Attachment: NUTCH-1393.patch Adds help information when no parameters are given. Display consistent usage of GeneratorJob with 1.X - Key: NUTCH-1393 URL: https://issues.apache.org/jira/browse/NUTCH-1393 Project: Nutch Issue Type: Bug Components: administration gui, generator Affects Versions: nutchgora Reporter: Lewis John McGibbney Fix For: 2.2 Attachments: NUTCH-1393.patch If we pass the generate argument to the nutch script, the Generator auto-springs into action and begins generating fetchlists. This should not be the case; instead it should print traditional usage to stdout. An example is below:
{code}
lewis@lewis:~/ASF/nutchgora/runtime/local$ ./bin/nutch generate
GeneratorJob: Selecting best-scoring urls due for fetch.
GeneratorJob: starting
GeneratorJob: filtering: true
GeneratorJob: done
GeneratorJob: generated batch id: 1339628223-1694200031
{code}
All I wanted to do was get the usage params printed to stdout, but instead it generated my batch willy-nilly.
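The behaviour requested in the issue above, printing usage instead of starting a job when no arguments are given, comes down to a guard at the top of the command-line entry point. A minimal sketch follows; UsageGuardSketch, its USAGE placeholder text and the return codes are assumptions for illustration and are not taken from NUTCH-1393.patch or from GeneratorJob itself.

```java
import java.io.PrintStream;

/** Hypothetical sketch of a no-argument usage guard; not the code from NUTCH-1393.patch. */
final class UsageGuardSketch {

    /** Placeholder usage text; the real option list is deliberately not reproduced here. */
    static final String USAGE = "Usage: <job> [options]";

    /** Returns -1 after printing usage when no arguments are given, otherwise 0. */
    static int run(String[] args, PrintStream out) {
        if (args.length == 0) {
            out.println(USAGE);
            return -1; // do not start the job
        }
        // In a real job, argument parsing and the actual work would start here.
        return 0;
    }

    public static void main(String[] args) {
        System.exit(run(args, System.out) == 0 ? 0 : 1);
    }
}
```

One consequence of such a guard is that a job whose options are all optional can no longer be started with an empty argument list, so callers such as crawl scripts would need to pass at least one option explicitly.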