RE: Need Tutorial on Nutch

2018-03-06 Thread Markus Jelsma
Hi, yes, you are going to need code, and a lot more than just that, probably including dropping the 'every two hours' requirement. For your case you need either site-specific price extraction, which is easy but a lot of work for 500+ sites, or a more complicated generic algorithm,

Re: Need Tutorial on Nutch

2018-03-06 Thread Eric Valencia
Yash, well, I want to monitor the price of every item on the top 500 retail websites every two hours, 24/7/365. Is Java needed? On Tue, Mar 6, 2018 at 12:15 PM, Yash Thenuan Thenuan <rit2014...@iiita.ac.in> wrote: > If you want simple crawling then not at all. > But having experience with

Re: Need Tutorial on Nutch

2018-03-06 Thread Yash Thenuan Thenuan
If you want simple crawling then not at all. But having experience with Java will help you to fulfil your personal requirements. On 7 Mar 2018 01:42, "Eric Valencia" wrote: > Does this require knowing Java proficiently? > > On Tue, Mar 6, 2018 at 10:51 AM Semyon

Re: Need Tutorial on Nutch

2018-03-06 Thread Eric Valencia
Does this require knowing Java proficiently? On Tue, Mar 6, 2018 at 10:51 AM Semyon Semyonov wrote: > Here is an unpleasant truth - there is no up-to-date tutorial for Nutch. > To make it even more interesting, sometimes the tutorial can contradict > real behavior of

RE: Regarding Internal Links

2018-03-06 Thread Yossi Tamari
Regarding the configuration parameter, your Parse Filter should expose a setConf method that receives a conf parameter. Keep that as a member variable and pass it where necessary. Regarding parsestatus, contentmeta and parsemeta, you're going to have to look at them yourself (probably in a
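The setConf advice above can be sketched roughly as follows. This is a minimal illustration, not Nutch code: `ConfigurableSketch` is a hypothetical stand-in for the Configurable interface a real parse filter would implement, `java.util.Properties` stands in for the configuration object, and the property name `example.segments.max` is invented for the example.

```java
import java.util.Properties;

// Hypothetical stand-in for the Configurable contract; a real plugin would
// implement the Nutch parse filter interface and receive a Hadoop configuration.
interface ConfigurableSketch {
    void setConf(Properties conf);
    Properties getConf();
}

final class SegmentParseFilterSketch implements ConfigurableSketch {
    // Kept as a member so it can be passed on wherever it is needed later.
    private Properties conf;
    // Value read once when the configuration is injected.
    private int maxSegments;

    @Override
    public void setConf(Properties conf) {
        this.conf = conf;
        // "example.segments.max" is an invented property name (assumption).
        this.maxSegments = Integer.parseInt(conf.getProperty("example.segments.max", "10"));
    }

    @Override
    public Properties getConf() {
        return conf;
    }

    int maxSegments() {
        return maxSegments;
    }
}
```

The point of the sketch is only the pattern: the framework calls setConf once, and the filter stores the object instead of trying to look the configuration up again elsewhere.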

Re: Need Tutorial on Nutch

2018-03-06 Thread Semyon Semyonov
Here is an unpleasant truth - there is no up-to-date tutorial for Nutch. To make it even more interesting, sometimes the tutorial can contradict the real behavior of Nutch, because of recently introduced features/bugs. If you find such cases, please try to fix them and contribute to the project. Welcome

Re: Need Tutorial on Nutch

2018-03-06 Thread Yash Thenuan Thenuan
Start with Nutch 1.x if you are having trouble. It's easier to configure, and by following the Nutch 1.x tutorial you will be able to crawl your first website easily. On 7 Mar 2018 00:13, "Eric Valencia" wrote: > Thank you kindly Yash. Yes, I did try some of the

Re: Need Tutorial on Nutch

2018-03-06 Thread Eric Valencia
Thank you kindly Yash. Yes, I did try some of the tutorials actually, but they seem to be missing some of the steps required to successfully scrape in Nutch. On Tue, Mar 6, 2018 at 10:37 AM Yash Thenuan Thenuan wrote: > I would suggest to start with the

Re: Need Tutorial on Nutch

2018-03-06 Thread Yash Thenuan Thenuan
I would suggest to start with the documentation on Nutch's website. You can get an idea of how to start crawling and so on. Apart from that, there are no proper tutorials as such. Just start crawling; if you get stuck somewhere, try to find something related to that on Google and the Nutch mailing list

Need Tutorial on Nutch

2018-03-06 Thread Eric Valencia
I'm a beginner in Nutch and need the best tutorials to get started. Can you guys let me know how you would advise yourselves if starting today (like me)? Eric

RE: Regarding Internal Links

2018-03-06 Thread Yossi Tamari
You should go over each segment, and for each one produce a ParseText and a ParseData. This is basically what the HTML Parser does for the whole document, which is why I suggested you should dive into its code. A ParseText is basically just a String containing the actual content of the segment
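A rough sketch of the per-segment idea described above, under stated assumptions: `SegmentParse` is a hypothetical stand-in that pairs a text string (the role of ParseText) with a title (one small piece of what ParseData would carry), and keying each segment by `baseUrl#anchor` is an assumption made for illustration. None of these names are Nutch classes.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for one parsed segment.
final class SegmentParse {
    final String text;   // plays the role of ParseText: the segment's plain content
    final String title;  // plays the role of a single ParseData field

    SegmentParse(String text, String title) {
        this.text = text;
        this.title = title;
    }
}

final class SegmentSplitterSketch {
    // Produces one entry per segment, keyed by a per-segment URL (assumed scheme: base + "#" + anchor).
    static Map<String, SegmentParse> parseSegments(String baseUrl, Map<String, String> segmentsByAnchor) {
        Map<String, SegmentParse> result = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : segmentsByAnchor.entrySet()) {
            String key = baseUrl + "#" + entry.getKey();
            String text = entry.getValue().trim();
            result.put(key, new SegmentParse(text, entry.getKey()));
        }
        return result;
    }
}
```

In a real plugin the values would be built from Nutch's own parse classes, as the HTML parser does for the whole document; the sketch only shows the loop shape of "one text plus one data object per segment".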

RE: Regarding Internal Links

2018-03-06 Thread Yash Thenuan Thenuan
> I am able to get the content corresponding to each internal link by > writing a parse filter plugin. Now I don't understand how to proceed > further. How can I parse them as separate documents, and what should > my ParseResult filter return?

Re: Internal links appear to be external in Parse. Improvement of the crawling quality

2018-03-06 Thread Sebastian Nagel
Hi Semyon, > We apply logical AND here, which is not really reasonable here. Until now there was only a single exemption filter, so it made no difference. But yes, it sounds plausible to change this to an OR, that is, to return true as soon as one of the filters accepts/exempts the URL. Please open an issue to

Re: Internal links appear to be external in Parse. Improvement of the crawling quality

2018-03-06 Thread Semyon Semyonov
I have proposed a solution for this problem: https://issues.apache.org/jira/browse/NUTCH-2522. The other question is how the voting mechanism of UrlExemptionFilters should work. UrlExemptionFilters.java, lines 60-65: //An URL is exempted when all the filters accept it to pass through for
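To make the voting question concrete, here is a minimal sketch contrasting the two rules discussed in this thread: AND (exempt only when all filters accept) and OR (exempt as soon as one filter accepts). It is not the UrlExemptionFilters source: filters are modelled as plain `BiPredicate<String, String>` over (fromUrl, toUrl), the method names are hypothetical, and treating an empty filter list as "not exempted" is an assumption.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.BiPredicate;

// Hypothetical sketch of the two voting rules; not Nutch code.
final class ExemptionVotingSketch {

    // AND voting: the URL is exempted only if every filter accepts it.
    // Assumption: with no filters configured, nothing is exempted.
    static boolean exemptedByAll(List<BiPredicate<String, String>> filters, String fromUrl, String toUrl) {
        if (filters.isEmpty()) {
            return false;
        }
        for (BiPredicate<String, String> filter : filters) {
            if (!filter.test(fromUrl, toUrl)) {
                return false;
            }
        }
        return true;
    }

    // OR voting: the URL is exempted as soon as one filter accepts it.
    static boolean exemptedByAny(List<BiPredicate<String, String>> filters, String fromUrl, String toUrl) {
        for (BiPredicate<String, String> filter : filters) {
            if (filter.test(fromUrl, toUrl)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Two invented example filters: one exempts images, one exempts CDN hosts.
        BiPredicate<String, String> isImage = (from, to) -> to.endsWith(".png");
        BiPredicate<String, String> isCdn = (from, to) -> to.contains("//cdn.");
        List<BiPredicate<String, String>> filters = Arrays.asList(isImage, isCdn);

        String from = "http://example.com/";
        String to = "http://example.com/logo.png";
        System.out.println(exemptedByAll(filters, from, to)); // prints false
        System.out.println(exemptedByAny(filters, from, to)); // prints true
    }
}
```

With a single filter the two rules give the same answer, which matches the remark that the choice made no difference while only one exemption filter existed.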