[jira] [Commented] (NUTCH-2531) Unclear steps in Nutch2 Tutorial

2019-11-13 Thread Sebastian Nagel (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16973428#comment-16973428
 ] 

Sebastian Nagel commented on NUTCH-2531:


Hi [~balaShashanka], there are no plans, as there are indeed zero committers 
actively working on 2.x right now. Of course, there is a small chance that new 
(or old) contributors start working on the 2.x branch again. But the entire 
question is better discussed on the Nutch mailing list, not here (this is a bug 
tracker). Thanks!

> Unclear steps in Nutch2 Tutorial
> 
>
> Key: NUTCH-2531
> URL: https://issues.apache.org/jira/browse/NUTCH-2531
> Project: Nutch
> Issue Type: Improvement
> Reporter: Krzysztof Madejski
> Priority: Minor
> Fix For: 2.5
>
>
> I was trying to install Nutch based on this tutorial: 
> [https://wiki.apache.org/nutch/Nutch2Tutorial]
>  
> Issues I've found:
> In Obtaining Software and Configuration:
> 1. _"Specify the [...] along with all of the other Configuration options 
> suggested within the [Nutch 1.x 
> tutorial|http://wiki.apache.org/nutch/NutchTutorial]."_
>   It would be better to copy the necessary configuration. I have no idea which 
> settings exactly should be copied.
> 2. _"In addition add the missing hbase-common-0.98.8-hadoop2.jar transitive 
> dependency, this is a bug in gora-hbase 0.6.1 as described 
> [here|https://github.com/apache/gora/pull/21]. This bug is removed in current 
> Gora development."_
>   What does this step require from me? Should I add something to the 
> dependencies? In which file? This point is written as information, not as an 
> instruction. It should either be deleted or replaced with a clear instruction.
> 3. _"*N.B.* It's probably worth checking and setting all your usual 
> configuration settings within $NUTCH_HOME/conf/nutch-site.xml etc. before 
> progressing."_
>   It's my first install. There is no such thing as "usual configuration".
> In "Invoke Nutch":
> 1. "nutch readdb" doesn't return anything meaningful apart from Usage. 
> ./nutch readdb
> Usage: WebTableReader (-stats | -url [url] | -dump  [-regex regex]) 
>  [-crawlId ] [-content] [-headers] [-links] [-text]
>  -crawlId  - the id to prefix the schemas to operate on, 
>  (default: storage.crawl.id)
>  -stats [-sort] - print overall statistics to System.out
>  [-sort] - list status sorted by host
>  -url  - print information on  to System.out
>  -dump  [-regex regex] - dump the webtable to a text file in 
>  
>  -content - dump also raw content
>  -headers - dump protocol headers
>  -links - dump links
>  -text - dump extracted text
>  [-regex] - filter on the URL of the webtable entry



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (NUTCH-2531) Unclear steps in Nutch2 Tutorial

2019-11-12 Thread Shashanka Balakuntala Srinivasa (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16972981#comment-16972981
 ] 

Shashanka Balakuntala Srinivasa commented on NUTCH-2531:


Hi [~snagel], is there a future plan for maintaining the 2.x branch? I've read 
the PMC announcement on stopping further development because there are fewer 
committers. But is there any plan to continue development, as 2.x provides a lot 
of flexibility in the storage of crawl data?

 






[jira] [Commented] (NUTCH-2531) Unclear steps in Nutch2 Tutorial

2018-03-12 Thread Krzysztof Madejski (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16395516#comment-16395516
 ] 

Krzysztof Madejski commented on NUTCH-2531:

And finally:

1. What is the process that scrapes the websites? Does it run while "nutch 
inject" is running?
2. Where can the web interface be accessed? It is mentioned in a link leading 
to a non-existent page: "You may want to check out the documentation for the 
[Nutch Web Application|https://wiki.apache.org/nutch/TODO]".

An overview of the architecture and components would help a lot!



