That could be it indeed. I googled it for you; this is the first hit searching for "nutch crawl query pages":
http://stackoverflow.com/questions/7045716/nutch-1-2-why-wont-nutch-crawl-url-with-query-strings
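
In a stock Nutch 1.x setup the usual culprit is the URL filter: conf/regex-urlfilter.txt (and conf/crawl-urlfilter.txt, which the one-step crawl command reads in some 1.x releases) ships with a rule that drops any URL containing query-string characters. A minimal sketch of the change, assuming the stock filter file:

    # default rule that skips probable query URLs -- comment it out,
    # or narrow it so '?' and '=' are allowed through:
    # -[?*!@=]
    -[*!@]

    # keep the usual catch-all accept rule at the end of the file
    +.

Links rejected by the old rule are not in the crawldb, so they are only picked up in later crawl rounds (or a fresh crawl) once the filter lets them through.
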
On Thu, May 24, 2012 at 1:52 PM, Tolga <[email protected]> wrote:

> I might have figured out why. Our website has a lot of query strings in
> addresses. One example is
> http://www.sabanciuniv.edu/eng/?genel_bilgi/yonetim/yonetim_kapak/yonetim_kapak.html.
> Could this be why? If that's the case, how do I crawl it?
>
> Regards,
>
>
> On 5/24/12 11:28 AM, Piet van Remortel wrote:
>
>> I googled for you:
>>
>> "Typically one starts testing one's configuration by crawling at shallow
>> depths, sharply limiting the number of pages fetched at each level (-topN),
>> and watching the output to check that desired pages are fetched and
>> undesirable pages are not. Once one is confident of the configuration, then
>> an appropriate depth for a full crawl is around 10. The number of pages per
>> level (-topN) for a full crawl can be from tens of thousands to millions,
>> depending on your resources."
>>
>> Also, as the nutch documentation shows, the topN parameter is optional.
>>
>> Can I respectfully suggest that you go through the basic information that
>> is available online to get familiar with Nutch. Copying the online
>> information into this mailing list is not helping anybody.
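
To make the tutorial guidance quoted above concrete, with the one-step crawl command shipped in Nutch 1.x it would look roughly like this (directory names and counts are placeholders, not recommendations):

    # shallow sanity-check crawl: few levels, few pages per level
    bin/nutch crawl urls -dir crawl-test -depth 3 -topN 50

    # fuller crawl once the configuration looks right
    bin/nutch crawl urls -dir crawl-full -depth 10 -topN 100000

-topN caps how many of the highest-scoring URLs are fetched in each round, so a crawl with -topN 5, as in the original mail further below, fetches at most 5 pages per level regardless of depth.
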
>> On Thu, May 24, 2012 at 10:19 AM, Tolga <[email protected]> wrote:
>>
>>> On 5/24/12 11:00 AM, Piet van Remortel wrote:
>>>
>>>> On Thu, May 24, 2012 at 9:35 AM, Tolga <[email protected]> wrote:
>>>>
>>>>> - I don't fully understand the use of the topN parameter. Should I
>>>>> increase it?
>>>>
>>>> yes
>>>
>>> What would a sensible topN value be for a large university website?
>>>
>>>>> - You mean the parse-pdf thing? I've got that in my nutch-default.xml.
>>>>
>>>> good, should work then
>>>
>>>>> - I looked for the link, it was there. Besides, that was for another
>>>>> website I was experimenting on.
>>>>> - How do I check segments?
>>>>
>>>> e.g. with segmentreader, a hadoop access command built into nutch
>>>
>>>>> - I didn't check filenames, but I've tried searching for a word in that
>>>>> PDF file.
>>>>
>>>> then the reason could also be indexing
>>>
>>>>> - I've got more than 50 GB free.
>>>>> - I'm not sure about the webserver kicking me off; I'll have to check
>>>>> that with the sysadmin.
>>>>
>>>> should be visible as something like timeouts or a similar message in the
>>>> hadoop logs
>>>
>>>>> Regards,
>>>>>
>>>>> On 5/24/12 10:25 AM, Piet van Remortel wrote:
>>>>>
>>>>>> - your topN parameter limited the crawl: see the info at
>>>>>> http://wiki.apache.org/nutch/NutchTutorial
>>>>>>
>>>>>> or:
>>>>>>
>>>>>> - file filters
>>>>>> - there is no link to the files (as you suggested yourself already)
>>>>>> - did you check the correct/all segments?
>>>>>> - did you check the fully correct filenames? wildcards don't work on
>>>>>> all segmentreader approaches
>>>>>> - size limits of the crawler (see previous discussion)
>>>>>> - did you check file presence in the segment, or the parse result?
>>>>>> i.e. parsing could have failed (cf. the previous discussion of the
>>>>>> last few days)
>>>>>> - your disk got full and crawling stopped
>>>>>> - the webserver(s) kicked you off
>>>>>> - your hadoop logs have overrun the local disk on which the crawler
>>>>>> was running (i.e. disk full)
>>>>>>
>>>>>> Piet
>>>>>>
>>>>>> On Thu, May 24, 2012 at 9:17 AM, Tolga <[email protected]> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I am crawling a large website, which is our university's. From the
>>>>>>> logs and some grep'ing, I see that some pdf files were not crawled.
>>>>>>> Why could this happen? I'm crawling with -depth 100 -topN 5.
>>>>>>>
>>>>>>> Regards,
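
On the "how do I check segments" question quoted above: the segmentreader is exposed as the readseg command in Nutch 1.x. A rough sketch, assuming a stock layout (the segment timestamp and the PDF name are placeholders):

    # see what a segment contains
    bin/nutch readseg -list crawl/segments/20120524103000

    # dump the segment to plain text and grep for the exact URL of the PDF
    # (exact string match, no wildcards)
    bin/nutch readseg -dump crawl/segments/20120524103000 segdump
    grep 'some-missing-file.pdf' segdump/dump

If the URL never shows up in any segment, it was not fetched at all (filters, topN, or no link to it); if it shows up but with a parse failure, the problem is on the parsing or indexing side instead, as suggested in the replies above.
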

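Two configuration details touched on in the thread, sketched here with illustrative values rather than anything taken from the mails: local overrides normally belong in conf/nutch-site.xml rather than nutch-default.xml, and in 1.x releases of this vintage PDF parsing goes through parse-tika, so plugin.includes has to cover it. Large PDFs are also silently truncated at http.content.limit (64 kB by default), and a truncated PDF usually fails to parse.

    <!-- inside the <configuration> element of conf/nutch-site.xml -->
    <property>
      <name>plugin.includes</name>
      <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    </property>

    <property>
      <name>http.content.limit</name>
      <!-- bytes; -1 disables the limit so large PDFs are fetched whole -->
      <value>-1</value>
    </property>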
