[jira] [Resolved] (NUTCH-1284) Add site fetcher.max.crawl.delay as log output by default.
[ https://issues.apache.org/jira/browse/NUTCH-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tejas Patil resolved NUTCH-1284.
--------------------------------
    Resolution: Fixed

Add site fetcher.max.crawl.delay as log output by default.
----------------------------------------------------------

                 Key: NUTCH-1284
                 URL: https://issues.apache.org/jira/browse/NUTCH-1284
             Project: Nutch
          Issue Type: New Feature
          Components: fetcher
    Affects Versions: nutchgora, 1.5
            Reporter: Lewis John McGibbney
            Assignee: Tejas Patil
            Priority: Trivial
             Fix For: 1.7, 2.2
         Attachments: NUTCH-1284-2.x.v1.patch, NUTCH-1284.patch, NUTCH-1284-trunk.v1.patch

Currently, when manually scanning our log output we cannot infer which pages are governed by a crawl delay between successive fetch attempts of any given page within the site. The value should be made available as something like:

{code}
2012-02-19 12:33:33,031 INFO fetcher.Fetcher - fetching http://nutch.apache.org/ (crawl.delay=XXXms)
{code}

This way we can easily and quickly determine whether the fetcher is having to use this functionality or not.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
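The requested behavior can be sketched as a small helper (hypothetical class and method names, not the committed patch) that appends the delay to the fetch message only when one applies:

```java
// Hypothetical sketch of the proposed log line for NUTCH-1284;
// class and method names are invented for illustration.
public class FetchLogSketch {

    // Builds the fetch log message, appending the crawl delay only
    // when robots.txt actually imposes one (delay > 0).
    static String fetchMessage(String url, long crawlDelayMs) {
        StringBuilder sb = new StringBuilder("fetching ").append(url);
        if (crawlDelayMs > 0) {
            sb.append(" (crawl.delay=").append(crawlDelayMs).append("ms)");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // With a 5-second Crawl-Delay the suffix appears; without one it does not.
        System.out.println(fetchMessage("http://nutch.apache.org/", 5000));
        System.out.println(fetchMessage("http://example.org/", 0));
    }
}
```

This makes it trivial to grep the fetcher log for `crawl.delay=` to see which hosts are throttling the crawl.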
[jira] [Commented] (NUTCH-1284) Add site fetcher.max.crawl.delay as log output by default.
[ https://issues.apache.org/jira/browse/NUTCH-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13564107#comment-13564107 ]

Tejas Patil commented on NUTCH-1284:
------------------------------------

Committed @revision 1439289 in trunk
Committed @revision 1439291 in 2.x
[jira] [Resolved] (NUTCH-1042) Fetcher.max.crawl.delay property not taken into account correctly when set to -1
[ https://issues.apache.org/jira/browse/NUTCH-1042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tejas Patil resolved NUTCH-1042.
--------------------------------
    Resolution: Fixed

The fix for NUTCH-1284 takes care of this.

Fetcher.max.crawl.delay property not taken into account correctly when set to -1
--------------------------------------------------------------------------------

                 Key: NUTCH-1042
                 URL: https://issues.apache.org/jira/browse/NUTCH-1042
             Project: Nutch
          Issue Type: Bug
          Components: fetcher
    Affects Versions: 1.3
            Reporter: Nutch User - 1
            Assignee: Lewis John McGibbney
             Fix For: 1.7, 2.2

[Originally: http://lucene.472066.n3.nabble.com/A-possible-bug-or-misleading-documentation-td3162397.html ]

From nutch-default.xml:

{code}
<property>
  <name>fetcher.max.crawl.delay</name>
  <value>30</value>
  <description>If the Crawl-Delay in robots.txt is set to greater than this value (in seconds)
  then the fetcher will skip this page, generating an error report. If set to -1 the fetcher
  will never skip such pages and will wait the amount of time retrieved from robots.txt
  Crawl-Delay, however long that might be.</description>
</property>
{code}

Fetcher.java (http://svn.apache.org/viewvc/nutch/branches/branch-1.3/src/java/org/apache/nutch/fetcher/Fetcher.java?view=markup), line 554:

{code}
this.maxCrawlDelay = conf.getInt("fetcher.max.crawl.delay", 30) * 1000;
{code}

Lines 615-616:

{code}
if (rules.getCrawlDelay() > 0) {
  if (rules.getCrawlDelay() > maxCrawlDelay) {
{code}

Now, the documentation states that, if fetcher.max.crawl.delay is set to -1, the crawler will always wait the amount of time the Crawl-Delay parameter specifies. However, as you can see, if it really is negative the condition on line 616 is always true, which leads to skipping the page whose Crawl-Delay is set.
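The arithmetic behind the report: with fetcher.max.crawl.delay = -1, maxCrawlDelay becomes -1000 ms, so any positive robots.txt Crawl-Delay exceeds it. A standalone sketch (hypothetical helper names, not Fetcher.java itself) contrasting the buggy check with a fix that treats a negative limit as "never skip":

```java
// Hypothetical standalone sketch of the NUTCH-1042 condition; the method
// names are invented, only the comparison logic mirrors the report.
public class MaxCrawlDelaySketch {

    // Buggy check as described: a negative maxCrawlDelay makes the
    // second comparison true for every positive Crawl-Delay.
    static boolean skipBuggy(long crawlDelayMs, long maxCrawlDelayMs) {
        return crawlDelayMs > 0 && crawlDelayMs > maxCrawlDelayMs;
    }

    // Fixed check: a negative limit means "never skip, wait instead",
    // matching the nutch-default.xml documentation.
    static boolean skipFixed(long crawlDelayMs, long maxCrawlDelayMs) {
        return maxCrawlDelayMs >= 0
                && crawlDelayMs > 0
                && crawlDelayMs > maxCrawlDelayMs;
    }

    public static void main(String[] args) {
        long max = -1 * 1000;       // fetcher.max.crawl.delay = -1, times 1000
        long robotsDelay = 10_000;  // robots.txt Crawl-Delay: 10 (seconds)
        System.out.println(skipBuggy(robotsDelay, max)); // page wrongly skipped
        System.out.println(skipFixed(robotsDelay, max)); // fetcher waits instead
    }
}
```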
[jira] [Commented] (NUTCH-1284) Add site fetcher.max.crawl.delay as log output by default.
[ https://issues.apache.org/jira/browse/NUTCH-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13564110#comment-13564110 ]

Hudson commented on NUTCH-1284:
-------------------------------

Integrated in Nutch-trunk-Windows #18 (See [https://builds.apache.org/job/Nutch-trunk-Windows/18/])
NUTCH-1284 Add site fetcher.max.crawl.delay as log output by default (Revision 1439289)

Result = FAILURE
tejasp : http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1439289
Files :
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/java/org/apache/nutch/fetcher/Fetcher.java
[jira] [Commented] (NUTCH-1284) Add site fetcher.max.crawl.delay as log output by default.
[ https://issues.apache.org/jira/browse/NUTCH-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13564111#comment-13564111 ]

Hudson commented on NUTCH-1284:
-------------------------------

Integrated in Nutch-2.x-Windows #18 (See [https://builds.apache.org/job/Nutch-2.x-Windows/18/])
NUTCH-1284 Add site fetcher.max.crawl.delay as log output by default (Revision 1439291)

Result = FAILURE
tejasp : http://svn.apache.org/viewvc/nutch/branches/2.x/?view=rev&rev=1439291
Files :
* /nutch/branches/2.x/CHANGES.txt
* /nutch/branches/2.x/src/java/org/apache/nutch/fetcher/FetcherReducer.java
[jira] [Commented] (NUTCH-1284) Add site fetcher.max.crawl.delay as log output by default.
[ https://issues.apache.org/jira/browse/NUTCH-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13564113#comment-13564113 ]

Hudson commented on NUTCH-1284:
-------------------------------

Integrated in Nutch-nutchgora #478 (See [https://builds.apache.org/job/Nutch-nutchgora/478/])
NUTCH-1284 Add site fetcher.max.crawl.delay as log output by default (Revision 1439291)

Result = FAILURE
tejasp : http://svn.apache.org/viewvc/nutch/branches/2.x/?view=rev&rev=1439291
Files :
* /nutch/branches/2.x/CHANGES.txt
* /nutch/branches/2.x/src/java/org/apache/nutch/fetcher/FetcherReducer.java
Build failed in Jenkins: Nutch-nutchgora #478
See https://builds.apache.org/job/Nutch-nutchgora/478/changes

Changes:

[tejasp] NUTCH-1284 Add site fetcher.max.crawl.delay as log output by default

------------------------------------------
[...truncated 3497 lines...]
init-plugin:
deps-jar:
clean-lib:
resolve-default:
[ivy:resolve] :: loading settings :: file = /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/2.x/ivy/ivysettings.xml
compile:
     [echo] Compiling plugin: protocol-file
    [javac] /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/2.x/src/plugin/build-plugin.xml:117: warning: 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set to false for repeatable builds
jar:
deps-test:
deploy:
copy-generated-lib:
deploy:
copy-generated-lib:
test:
     [echo] Testing plugin: parse-js
    [junit] Running org.apache.nutch.parse.js.TestJSParseFilter
    [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.407 sec
init:
init-plugin:
deps-jar:
clean-lib:
resolve-default:
[ivy:resolve] :: loading settings :: file = /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/2.x/ivy/ivysettings.xml
compile:
     [echo] Compiling plugin: index-anchor
    [javac] /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/2.x/src/plugin/build-plugin.xml:117: warning: 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set to false for repeatable builds
compile-test:
    [javac] /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/2.x/src/plugin/build-plugin.xml:180: warning: 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set to false for repeatable builds
    [javac] Compiling 1 source file to /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/2.x/build/index-anchor/test
    [javac] warning: [options] bootstrap class path not set in conjunction with -source 1.6
    [javac] 1 warning
jar:
deps-test:
deploy:
copy-generated-lib:
test:
     [echo] Testing plugin: index-anchor
    [junit] Running org.apache.nutch.indexer.anchor.TestAnchorIndexingFilter
    [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.733 sec
init:
init-plugin:
deps-jar:
clean-lib:
resolve-default:
[ivy:resolve] :: loading settings :: file = /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/2.x/ivy/ivysettings.xml
compile:
     [echo] Compiling plugin: index-basic
    [javac] /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/2.x/src/plugin/build-plugin.xml:117: warning: 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set to false for repeatable builds
compile-test:
    [javac] /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/2.x/src/plugin/build-plugin.xml:180: warning: 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set to false for repeatable builds
    [javac] Compiling 1 source file to /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/2.x/build/index-basic/test
    [javac] warning: [options] bootstrap class path not set in conjunction with -source 1.6
    [javac] 1 warning
jar:
deps-test:
deploy:
copy-generated-lib:
test:
     [echo] Testing plugin: index-basic
    [junit] Running org.apache.nutch.indexer.basic.TestBasicIndexingFilter
    [junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 0.989 sec
init:
init-plugin:
deps-jar:
clean-lib:
resolve-default:
[ivy:resolve] :: loading settings :: file = /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/2.x/ivy/ivysettings.xml
compile:
     [echo] Compiling plugin: index-more
    [javac] /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/2.x/src/plugin/build-plugin.xml:117: warning: 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set to false for repeatable builds
compile-test:
    [javac] /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/2.x/src/plugin/build-plugin.xml:180: warning: 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set to false for repeatable builds
    [javac] Compiling 1 source file to /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/2.x/build/index-more/test
    [javac] warning: [options] bootstrap class path not set in conjunction with -source 1.6
    [javac] 1 warning
jar:
deps-test:
deploy:
copy-generated-lib:
test:
     [echo] Testing plugin: index-more
    [junit] Running org.apache.nutch.indexer.more.TestMoreIndexingFilter
    [junit] Tests run: 3, Failures: 0, Errors: 0, Time elapsed: 1.307 sec
init:
init-plugin:
     [echo] Copying language profiles
     [echo] Copying test files
deps-jar:
clean-lib:
resolve-default:
[ivy:resolve] :: loading settings :: file =
[jira] [Commented] (NUTCH-1284) Add site fetcher.max.crawl.delay as log output by default.
[ https://issues.apache.org/jira/browse/NUTCH-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13564114#comment-13564114 ]

Hudson commented on NUTCH-1284:
-------------------------------

Integrated in Nutch-trunk #2103 (See [https://builds.apache.org/job/Nutch-trunk/2103/])
NUTCH-1284 Add site fetcher.max.crawl.delay as log output by default (Revision 1439289)

Result = SUCCESS
tejasp : http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1439289
Files :
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/java/org/apache/nutch/fetcher/Fetcher.java
[jira] [Assigned] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tejas Patil reassigned NUTCH-1465:
----------------------------------
    Assignee: Tejas Patil

Support sitemaps in Nutch
-------------------------

                 Key: NUTCH-1465
                 URL: https://issues.apache.org/jira/browse/NUTCH-1465
             Project: Nutch
          Issue Type: New Feature
          Components: parser
            Reporter: Lewis John McGibbney
            Assignee: Tejas Patil
             Fix For: 1.7
         Attachments: NUTCH-1465-trunk.v1.patch

I recently came across this rather stagnant codebase [0] which is ASL v2.0 licensed and appears to have been used successfully to parse sitemaps, as per the discussion here [1].

[0] http://sourceforge.net/projects/sitemap-parser/
[1] http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html
[jira] [Commented] (NUTCH-1047) Pluggable indexing backends
[ https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13564173#comment-13564173 ]

Julien Nioche commented on NUTCH-1047:
--------------------------------------

@tejasp I can reproduce the issue and am looking into it, thanks. Somehow the configuration does not get passed on properly when using the crawl command.

@lufeng Thanks.

{quote}But I don't know why not add an option to set IndexerUrl, such as bin/nutch solrindex -indexurl http://localhost:8983/solr/.{quote}

Whether it is passed as a parameter or via the configuration should not make much of a difference. Your suggestion also assumes that the indexing backend can be reached via a single URL, which is not necessarily the case: it might not need a URL at all or, at the opposite extreme, need multiple URLs. Better to leave that logic in the configuration and assume that the backends will find whatever they need there.

{quote}The correct command to invoke the IndexingJob is bin/nutch solrindex http://localhost:8983/solr/ crawldb/ segments/20130121115214/ -filter.{quote}

As explained above, we want to keep compatibility with the existing solrindex command and not change its syntax. Underneath it uses the new plugin-based code but sets the value of the solr config. There is no shortcut for the generic indexing job command in the nutch script yet, but we could add one. For now it has to be called in full, e.g. bin/nutch org.apache.nutch.indexer.IndexingJob ..., which will make sense when we have other indexing backends and not just SOLR. Think of 'nutch solrindex' as a shortcut for the generic command.
Pluggable indexing backends
---------------------------

                 Key: NUTCH-1047
                 URL: https://issues.apache.org/jira/browse/NUTCH-1047
             Project: Nutch
          Issue Type: New Feature
          Components: indexer
            Reporter: Julien Nioche
            Assignee: Julien Nioche
              Labels: indexing
             Fix For: 1.7
         Attachments: NUTCH-1047-1.x-v1.patch, NUTCH-1047-1.x-v2.patch, NUTCH-1047-1.x-v3.patch, NUTCH-1047-1.x-v4.patch

One possible feature would be to add a new endpoint for indexing-backends and make the indexing pluggable. At the moment we are hardwired to SOLR - which is OK - but as other resources like ElasticSearch are becoming more popular it would be better to handle this as plugins. Not sure about the name of the endpoint though: we already have indexing-plugins (which are about generating the fields sent to the backends), and moreover the backends are not necessarily for indexing / searching but could be just an external storage, e.g. CouchDB. The term "backend" on its own would be confusing in 2.0, as this could be pertaining to the storage in GORA. 'indexing-backend' is the best name that came to my mind so far - please suggest better ones. We should come up with generic map/reduce jobs for indexing, deduplicating and cleaning, and maybe add a Nutch extension point there so we can easily hook up indexing, cleaning and deduplicating for various backends.
[jira] [Commented] (NUTCH-1047) Pluggable indexing backends
[ https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13564187#comment-13564187 ]

Tejas Patil commented on NUTCH-1047:
------------------------------------

Hi Julien,
After the reply from @lufeng, I was able to perform indexing with the crawl command. Here is a summary of what I observed:

||solr.server.url in nutch-site.xml||-D in crawl command||Works?||
|no|no|RuntimeException: Missing SOLR URL|
|no|yes|yes|
|yes|no|yes|
|yes|yes|yes|

Note that I had to pass -solr and the solr url every time; otherwise it didn't invoke indexing.
[jira] [Commented] (NUTCH-1047) Pluggable indexing backends
[ https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13564196#comment-13564196 ]

Julien Nioche commented on NUTCH-1047:
--------------------------------------

Hi Tejas,
It will work every time you set it in nutch-site.xml. As for setting it with -D in the crawl command - you definitely should not have to do that, and this is where the bug is. The problem is that, for some reason, the value we take from the crawl command is correctly set in the configuration object; however, the latter is reloaded or overridden during the call to JobClient.runJob(job) (IndexingJob line 120). BTW the crawl command is deprecated and should be removed at some point, as we have the crawl script. Could you try using the solrindex command as well as the crawl script while I try to solve the problem with the crawl command?
Thanks
Julien
[jira] [Commented] (NUTCH-1047) Pluggable indexing backends
[ https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13564252#comment-13564252 ]

Tejas Patil commented on NUTCH-1047:
------------------------------------

Hi Julien,
The solrindex command and the crawl script work fine after setting solr.server.url in nutch-site.xml. I did not use the -D option during these runs.
[jira] [Commented] (NUTCH-1047) Pluggable indexing backends
[ https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13564263#comment-13564263 ]

Julien Nioche commented on NUTCH-1047:
--------------------------------------

Tejas,
The crawl script and the solrindex command should work without setting solr.server.url in nutch-site.xml or using -D, as this is handled for you in the nutch script. Can you please test without specifying solr.server.url in nutch-site.xml?
Thanks
[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13564274#comment-13564274 ]

Sebastian Nagel commented on NUTCH-1465:
----------------------------------------

Hi Tejas, thanks! A few comments on the patch:

??for a given host, sitemaps are processed just once??
But they are not cached over cycles because the cache is bound to the protocol object. Is this correct? So a sitemap is fetched and processed every cycle for every host? If yes, and sitemaps are large (they can be!), this would cause a lot of extra traffic. Shouldn't sitemap URLs be handled the same way as any other URL: add them to the CrawlDb, fetch and parse once, add found links to the CrawlDb, cf. [Ken's post at CC|https://groups.google.com/forum/?fromgroups#!topic/crawler-commons/DrAX4Th1A4I].

There are some complications:
- due to their size, sitemaps may require larger values regarding size and time limits
- sitemaps may require more frequent re-fetching (e.g. by MimeAdaptiveFetchSchedule)
- the current Outlink class cannot hold the extra information contained in sitemaps (lastmod, changefreq, etc.)

There is another way, which we use for several customers: a SitemapInjector fetches the sitemaps, extracts the URLs and injects them with all extra information. It's a simple use case for a customized site-search: there is a sitemap and it shall be used as the seed list or even the exclusive list of documents to be crawled. Is there any interest in this solution? It's not a general solution and not adaptable to a large web crawl.
[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13564768#comment-13564768 ]

Sebastian Nagel commented on NUTCH-1465:
----------------------------------------

Yes, SitemapInjector is a map-reduce job. The scenario for its use is the following:
- a small set of sites to be crawled (eg, to feed a site-search index)
- you can think of sitemaps as remote seed lists. Because many content management systems can generate sitemaps, it is convenient for the site owners to publish seeds. The URLs contained in the sitemap can also be the complete and exclusive set of URLs to be crawled (you can use the plugin scoring-depth to limit the crawl to seed URLs).
- because you can trust the sitemap's content:
-* checks for cross submissions are not necessary
-* extra information (lastmod, changefreq, priority) can be used

That's how we use sitemaps: remote seed lists, maintained by customers, quite convenient if you run a crawler as a service.

For large web crawls there is also another aspect: detection of sitemaps, which is bound to the processing of robots.txt. Processing of sitemaps can (and should?) be done the usual Nutch way:
- detection is done in the protocol plugin (see Tejas' patch)
- record in CrawlDb: done by Fetcher (cross-submission information can be added)
- fetch (if not yet done), parse (a plugin parse-sitemap based on crawler-commons?) and extract outlinks: sitemaps may require special treatment here because they can be large in size and usually contain many outlinks. Also, the Outlink class needs to be extended to deal with the extra info relevant for scheduling.

To use an extra tool (such as the SitemapInjector) for processing the sitemaps has the disadvantage that we first must get all sitemap URLs out of the CrawlDb. On the contrary, special treatment can easily be realized in a separate map-reduce job. Comments?!

Thanks, Tejas: the feature is moving forward thanks to your initiative!
[jira] [Commented] (NUTCH-1047) Pluggable indexing backends
[ https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13564827#comment-13564827 ] Sebastian Nagel commented on NUTCH-1047: As a test for the interface I started to implement a CSV indexer - useful for exporting crawled data or for quick analysis. A first working version (draft, still a lot to do) fits within 100+ lines of code. +1 for the interface / extension point. Some concerns about the usability of IndexingJob as a daily tool:
- it is not really transparent which indexer is run (Solr, Elasticsearch, etc.): you have to look into the property plugin.includes
- options must be passed to indexer plugins as properties: complicated, and there is no help to get a list of available properties
Pluggable indexing backends --- Key: NUTCH-1047 URL: https://issues.apache.org/jira/browse/NUTCH-1047 Project: Nutch Issue Type: New Feature Components: indexer Reporter: Julien Nioche Assignee: Julien Nioche Labels: indexing Fix For: 1.7 Attachments: NUTCH-1047-1.x-v1.patch, NUTCH-1047-1.x-v2.patch, NUTCH-1047-1.x-v3.patch, NUTCH-1047-1.x-v4.patch One possible feature would be to add a new endpoint for indexing-backends and make the indexing pluggable. At the moment we are hardwired to SOLR - which is OK - but as other resources like ElasticSearch are becoming more popular it would be better to handle this as plugins. Not sure about the name of the endpoint though: we already have indexing-plugins (which are about generating fields sent to the backends) and moreover the backends are not necessarily for indexing / searching but could be just external storage, e.g. CouchDB. The term backend on its own would be confusing in 2.0 as this could be pertaining to the storage in GORA. 'indexing-backend' is the best name that came to my mind so far - please suggest better ones.
We should come up with generic map/reduce jobs for indexing, deduplicating and cleaning and maybe add a Nutch extension point there so we can easily hook up indexing, cleaning and deduplicating for various backends.
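The extension-point idea can be illustrated with a toy "indexing backend" contract: each backend (Solr, Elasticsearch, a CSV export, ...) implements the same open/write/close interface and is selected by configuration instead of being hardwired. A minimal Python sketch (illustration only; the real Nutch extension point, class names and configuration mechanism differ):

```python
import csv
import io

class IndexWriter:
    """Toy backend contract: open with a config, write documents, close."""
    def open(self, conf): ...
    def write(self, doc): ...
    def close(self): ...

class CSVIndexWriter(IndexWriter):
    """Export documents as CSV rows -- handy for quick analysis of a crawl."""
    def open(self, conf):
        self.fields = conf["fields"]
        self.buf = io.StringIO()
        self.writer = csv.writer(self.buf)
        self.writer.writerow(self.fields)  # header row

    def write(self, doc):
        self.writer.writerow([doc.get(f, "") for f in self.fields])

    def close(self):
        return self.buf.getvalue()

# A generic indexing job looks a backend up by name instead of
# depending on one search engine directly.
BACKENDS = {"csv": CSVIndexWriter}

def run_indexing_job(backend_name, conf, docs):
    writer = BACKENDS[backend_name]()
    writer.open(conf)
    for doc in docs:
        writer.write(doc)
    return writer.close()

out = run_indexing_job("csv", {"fields": ["url", "title"]},
                       [{"url": "http://nutch.apache.org/", "title": "Nutch"}])
print(out)
```

This also shows the usability concern raised above: which backend runs, and which options it accepts, are visible only in the configuration passed in, not on the command line.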
[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13564836#comment-13564836 ] Markus Jelsma commented on NUTCH-1465: -- Thanks all for your interesting comments. It's a complicated issue. On the one hand, host data should be stored in NUTCH-1325, but that would require additional logic and sending each segment's output to the HostDb in case a sitemap was crawled. On the other hand, it is ideal to store host data: it is also easy to use in jobs such as the indexer and generator. I don't yet favour a specific approach, but storing sitemap data in a HostDb may be something to think about. Cheers
[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13564883#comment-13564883 ] Tejas Patil commented on NUTCH-1465: Hi Sebastian, So we are looking at 2 things here:
- a standalone utility for injecting sitemaps into the crawldb:
1. The user starts off with URLs of sitemap pages
2. SitemapInjector fetches these seeds and parses them (with a parse plugin based on crawler-commons)
3. SitemapInjector updates the crawldb with the sitemap entries.
- handling of sitemaps within the Nutch cycle (fetch, parse and update phases):
1. Robots parsing will populate a table of host -> list of links to sitemap pages
2. These will be added to the fetcher queue and will be fetched
3. A parser plugin based on crawler-commons will parse the sitemap page
4. The Outlink class needs to be extended to store the meta obtained from the sitemap
5. Write this into the segment
6. The update phase needs to update the crawl frequency of already existing URLs in the crawldb based on what we got from the sitemap, or else just add new entries to the crawldb.
I am not clear about the extending-Outlink part. The normal outlink extraction need not be done, as crawler-commons will already do that for us. The sitemap parser plugin must do this and create objects of our specialized sitemap link. While writing, where is the CrawlDatum generated from the outlink? The mime type that we get is text/xml, which can also mean a normal XML file. How will Nutch identify that it's a sitemap page and invoke the correct parser plugin? (I know that this magic is done by the feed parser, but I am not sure which part of the code is doing that. Just point me to that code.)
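On the text/xml ambiguity raised above: one cheap way to tell a sitemap apart from an arbitrary XML file is to sniff the document root, since a sitemap's root element is always urlset or sitemapindex in the sitemaps.org namespace. A simplified sketch in Python (illustration only; Nutch's actual parser dispatch goes through mime types and plugin mappings, not this function):

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def looks_like_sitemap(xml_text):
    """Content sniffing: text/xml alone is ambiguous, but a sitemap's root
    element is <urlset> or <sitemapindex> in the sitemaps.org namespace."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    # ElementTree renders namespaced tags as "{namespace}localname".
    return root.tag in ("{%s}urlset" % SITEMAP_NS,
                        "{%s}sitemapindex" % SITEMAP_NS)

sitemap = '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"/>'
plain = '<rss version="2.0"/>'
print(looks_like_sitemap(sitemap), looks_like_sitemap(plain))
```

A robust detector would also consider the URL path (robots.txt Sitemap: directives point at the file directly), so the namespace check is a fallback rather than the primary signal.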
[jira] [Commented] (NUTCH-945) Indexing to multiple SOLR Servers
[ https://issues.apache.org/jira/browse/NUTCH-945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13564953#comment-13564953 ] Alexander Kingson commented on NUTCH-945: - I see that the issue is unresolved. Is this patch working? Indexing to multiple SOLR Servers - Key: NUTCH-945 URL: https://issues.apache.org/jira/browse/NUTCH-945 Project: Nutch Issue Type: Improvement Components: indexer Affects Versions: 1.2 Reporter: Charan Malemarpuram Fix For: 2.2 Attachments: MurmurHashPartitioner.java, NonPartitioningPartitioner.java, patch-NUTCH-945.txt It would be nice to have a default indexer in Nutch which can submit docs to multiple SOLR servers. Partitioning is always the question when writing to multiple SOLR servers. Default partitioning can be a simple hashcode-based distribution with additional hooks for customization.
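The hashcode-based distribution mentioned in the issue can be sketched in a few lines: hash each document's key and take it modulo the number of servers, so the same URL always lands on the same Solr instance. A Python sketch (illustration only; the attached MurmurHashPartitioner uses a different hash function, and the server URLs here are made up):

```python
import zlib

SERVERS = ["http://solr1:8983/solr", "http://solr2:8983/solr",
           "http://solr3:8983/solr"]

def partition(doc_url, num_shards):
    """Stable hashcode-based partitioning: the same URL always maps to the
    same shard (crc32 is deterministic, unlike Python's salted str hash())."""
    return zlib.crc32(doc_url.encode("utf-8")) % num_shards

def server_for(doc_url):
    return SERVERS[partition(doc_url, len(SERVERS))]

for u in ["http://nutch.apache.org/", "http://lucene.apache.org/",
          "http://www.apache.org/"]:
    print(u, "->", server_for(u))
```

The "additional hooks" would let users swap this function out, e.g. for a NonPartitioningPartitioner that broadcasts every document to all servers.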
Jenkins build is back to normal : Nutch-nutchgora #479
See https://builds.apache.org/job/Nutch-nutchgora/479/