And, sorry for the spam, another question: as far as I understand, another, even better option could be the following.

In prefix-urlfilter.txt:

    http://www.xyz.com/book/

and in regex-urlfilter.txt:

    +.

But it doesn't work... it still crawls everything, including URLs (belonging to the same xyz domain) that don't have that prefix.
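(If it helps in reasoning about it: as far as I understand, a URL must be accepted by every active filter, so with the two files above the prefix file should become the effective restriction. The one thing I'm not sure about is whether urlfilter-prefix is active at all: if I read nutch-default.xml correctly, the default plugin.includes only lists urlfilter-regex, so I suppose something like this has to go into nutch-site.xml (an untested guess on my side, value copied from the 1.11 default plus the prefix plugin):

    <!-- nutch-site.xml: guess, adds urlfilter-prefix next to the default urlfilter-regex -->
    <property>
      <name>plugin.includes</name>
      <value>protocol-http|urlfilter-(prefix|regex)|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    </property>

Without that, I'd expect the prefix file to be silently ignored, which would explain why everything is still crawled.)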
Best,
Andrea

On Sat, Apr 16, 2016 at 7:58 AM, Andrea Gazzarini <[email protected]> wrote:

> Hi Furkan,
> I'm not able to get it working. Maybe I misunderstood your email.
>
> Simplifying, let's assume my website has the following structure:
>
> http://www.xyz.com/book/1
>
> which contains links towards
>
> http://www.xyz.com/book/2
> http://www.xyz.com/book/3
>
> The /2 and /3 pages also contain some outlinks, so the site map is the following:
>
> - http://www.xyz.com/book/1
>   - http://www.xyz.com/book/2
>     - http://www.xyz.com/book/5
>     - http://www.xyz.com/book/6
>   - http://www.xyz.com/book/3
>     - http://www.xyz.com/book/7
>     - http://www.xyz.com/book/8
>
> I put
>
> http://www.xyz.com/book/1
>
> in the seed file and the following line in regex-urlfilter.txt (the only
> uncommented line):
>
> +^http://www.xyz.com/book/([0-9]*\.)
>
> Running
>
> bin/crawl -i -D solr.server.url=http://localhost:8983/solr/woozlee urls/few/captain-gazza.txt TestCrawl x
>
> gives:
>
> Injector: Total number of urls rejected by filters: 0
> Injector: Total number of urls after normalization: 1
> ...
> Indexing 1 documents
> Indexer: number of documents indexed, deleted, or skipped:
> Indexer: 1 indexed (add/update)
> Indexer: finished at 2016-04-16 07:54:35, elapsed: 00:00:04
> Cleaning up index if possible
> /home/solr/apache-nutch-1.11/bin/nutch clean -Dsolr.server.url=http://localhost:8983/solr/woozlee TestCrawl/crawldb
> Sat Apr 16 07:54:39 CEST 2016 : Iteration 2 of 5
> Generating a new segment
> /home/solr/apache-nutch-1.11/bin/nutch generate -D mapreduce.job.reduces=2 -D mapred.child.java.opts=-Xmx1000m -D mapreduce.reduce.speculative=false -D mapreduce.map.speculative=false -D mapreduce.map.output.compress=true TestCrawl/crawldb TestCrawl/segments -topN 50000 -numFetchers 1 -noFilter
> Generator: starting at 2016-04-16 07:54:40
> Generator: Selecting best-scoring urls due for fetch.
> Generator: filtering: false
> Generator: normalizing: true
> Generator: topN: 50000
> Generator: 0 records selected for fetching, exiting ...
> Generate returned 1 (no new segments created)
> Escaping loop: no more URLs to fetch now
>
> Whatever x is, the cycle completes shortly and indexes only the URL in
> the seed list (i.e. I have one record indexed in Solr).
>
> Again, many thanks for your help.
>
> Best,
> Andrea
>
> On Fri, Apr 15, 2016 at 9:23 PM, Andrea Gazzarini <[email protected]> wrote:
>
>> Hi Furkan,
>> many thanks, I'm going to try it and I'll let you know.
>>
>> As for the first question, I'm not sure about the overall size, but
>> we're talking about 2 million (and growing) pages; in general, nothing
>> that can be easily handled with a from-scratch, custom solution.
>>
>> I was wondering if, from a functional perspective, Nutch is a good fit
>> for "automating" the periodic indexing (in Solr, that is my ultimate
>> goal) of that website. If that works, the same mechanism will be used
>> for other websites as well.
>>
>> Best,
>> Andrea
>> On 15 Apr 2016 18:16, "Furkan KAMACI" <[email protected]> wrote:
>>
>>> Hi Andrea,
>>>
>>> The Regex URL Filter works like this.
>>>
>>> This rule accepts anything else:
>>>
>>> +.
>>>
>>> Let's assume that you want to crawl Nutch's website.
>>> If you wished to limit the crawl to the nutch.apache.org domain, then
>>> the definition should be:
>>>
>>> +^http://([a-z0-9]*\.)*nutch.apache.org/
>>>
>>> So, if your "more like this" section has this pattern:
>>>
>>> http://www.xyz.com/book/{book_id}
>>>
>>> then your definition should be:
>>>
>>> +^http://www.xyz.com/book/([0-9]*\.)
>>>
>>> For your first question, you should tell us the approximate size of
>>> the data you will crawl, etc., and whether you have any other needs.
>>>
>>> Kind Regards,
>>> Furkan KAMACI
>>>
>>>
>>> On Fri, Apr 15, 2016 at 4:17 PM, Andrea Gazzarini <[email protected]> wrote:
>>>
>>> > Hi guys,
>>> > just playing as a Nutch newbie with a simple (at least I think so)
>>> > use case:
>>> >
>>> > I have a website (e.g. http://www.xyz.com) that allows searching for
>>> > books. Here, as on any straight search website, I have two kinds of
>>> > pages:
>>> >
>>> > * a page that shows search results (depending on the user-entered
>>> >   search terms)
>>> > * a details page about a given book. Each details page is a permalink
>>> >   which follows a given naming convention (e.g.
>>> >   http://www.xyz.com/book/{book id})
>>> >
>>> > The details page has something like a "more like this" section that
>>> > contains permalinks to other (similar) books.
>>> > Now, my requirement is to index in Solr *all* the details pages of
>>> > such a website.
>>> >
>>> > If Nutch is a suitable tool for doing that (and this is actually the
>>> > first question), could you please give me some hints about how to
>>> > configure it?
>>> >
>>> > Specifically, I tried putting in the seed file just one entry
>>> >
>>> > http://www.xyz.com/book/1
>>> >
>>> > and then I configured my regex-urlfilter.txt with
>>> >
>>> > +^http://www.xyz.com/book
>>> >
>>> > But it indexes only the /1 page. I imagined that the "more like this"
>>> > section of the /1 page would act as a set of outlinks for getting
>>> > further details pages (where in turn there are further MLT sections,
>>> > and so on).
>>> >
>>> > Best,
>>> > Andrea
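P.S. Looking again at the filter line I actually tested, I suspect the pattern itself is part of the problem:

    +^http://www.xyz.com/book/([0-9]*\.)

requires a literal dot after the digits, so a URL like http://www.xyz.com/book/2 can never match it. That would explain why the outlinks of the seed page all get dropped and the generator finds nothing to fetch. Here is the quick standalone check I put together (plain java.util.regex, which is what urlfilter-regex relies on if I'm not mistaken; the class name and URLs are just made up for the test):

    import java.util.regex.Pattern;

    public class FilterRegexCheck {
        public static void main(String[] args) {
            // The pattern from the thread: note the mandatory "\." after the digits.
            Pattern suggested = Pattern.compile("^http://www.xyz.com/book/([0-9]*\\.)");
            // A candidate replacement: one or more digits after /book/, nothing else.
            Pattern candidate = Pattern.compile("^http://www\\.xyz\\.com/book/[0-9]+$");

            String[] urls = {
                "http://www.xyz.com/book/1",
                "http://www.xyz.com/book/2",
                "http://www.xyz.com/search?q=pirates"
            };
            for (String url : urls) {
                // find() performs an unanchored search; the leading "^" does the anchoring.
                System.out.printf("%-42s suggested=%-5b candidate=%b%n",
                        url, suggested.matcher(url).find(), candidate.matcher(url).find());
            }
        }
    }

With this, "suggested" rejects every /book/N URL while "candidate" accepts exactly those, so I'm going to retry the crawl with +^http://www\.xyz\.com/book/[0-9]+$ as the only uncommented line in regex-urlfilter.txt.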