RE: Preserve HTML that is being crawled from Nutch?

2013-11-14 Thread Markus Jelsma
By default the title is indexed in the title field, and with the headings plugin h1, h2, etc. are indexed as fields h1, h2 and so on as well, optionally as multi-valued fields. Also by default, all text is indexed into a content field, including title and headings. You can try the NUTCH-961 issue for actual

Get original URL from crawldb in case of redirect

2013-11-14 Thread Amit Sela
Hi all, I'm reading the crawldb as CrawledPage and I see the fetched URL, content etc. In case of a redirection (I allow 10 redirections in nutch-site.xml) the fetched URL is not the original URL the Fetcher turned to, and I would like to get that as well. Does Nutch store it somewhere, I'm

All in one Crawl class

2013-11-14 Thread Julien Nioche
See https://issues.apache.org/jira/browse/NUTCH-1621 It has now been removed from both trunk and 2.x. I will update the Wiki pages accordingly over the next couple of days to reflect this change. As of the next releases of Nutch the crawl script will have to be used instead. It works just as

Performing Web Scraping within the content of fetched html pages

2013-11-14 Thread Alex McLintock
Hi Folks, I'm reasonably familiar with older versions of Nutch - but have been out of the loop for a bit. I've done some googling, and reading docs, and have not really understood everything yet. Would someone please summarise the state of play if I want to do web scraping with Nutch - eg to

RE: Performing Web Scraping within the content of fetched html pages

2013-11-14 Thread Markus Jelsma
Hi Alex, -Original message- From:Alex McLintock a...@owal.co.uk Sent: Thursday 14th November 2013 14:34 To: user@nutch.apache.org Subject: Performing Web Scraping within the content of fetched html pages Hi Folks, I'm reasonably familiar with older versions of Nutch - but have

Unable to inject seeds with

2013-11-14 Thread Jon Uhal
First, here is my environment: Hadoop 1.2.1 Accumulo 1.4.4 Zookeeper 3.4.5 Gora 0.3 Solr 4.5.1 I have been trying to get a 4 node Hadoop cluster to start a distributed crawl but have been running into some issues with just injecting the seeds. I have successfully been able to get the

Re: Unable to inject seeds with

2013-11-14 Thread Lewis John Mcgibbney
Hi Jon, On Thu, Nov 14, 2013 at 4:15 PM, user-digest-h...@nutch.apache.org wrote: Unable to inject seeds with 29017 by: Jon Uhal First, here is my environment: Hadoop 1.2.1 Accumulo 1.4.4 Zookeeper 3.4.5 Gora 0.3 Solr 4.5.1 All software revisions look fine so good start :)

Re: Nutch cluster

2013-11-14 Thread A Laxmi
Hi Julien- From the link you provided ( http://wiki.apache.org/nutch/NutchHadoopSingleNodeTutorial) for Nutch 1.x - how and where is the crawled data stored? Thanks! On Wed, Nov 13, 2013 at 4:58 AM, Julien Nioche lists.digitalpeb...@gmail.com wrote: Just to add to what Markus said : see

Re: hBase + Nutch - timeout or session expiration while injecting

2013-11-14 Thread A Laxmi
Honza - You got to dig into those weird messages to understand the issue :) This might be the one that could be causing the issue - 2013-09-21 17:40:04,138 INFO org.apache.zookeeper.server.NIOServerCnxn: Established session 0x141412cd9bc0006 with negotiated timeout 4 for client

Re: hBase + Nutch - timeout or session expiration while injecting

2013-11-14 Thread glumet
Hello, I resolved it (as far as I can remember) by killing all processes and then running rolling-restart.sh.

Re: Unable to inject seeds with

2013-11-14 Thread Jon Uhal
So I broke down and tried to use the 1.7 release to do the inject step just to see if it would work. It did. So there is something that I either broke with my 2.2.1 setup or I'm doing something wrong with my 2.2.1 configs. Going to try to re-build 2.2.1 from source and try again. On Thu, Nov 14,

Re: Preserve HTML that is being crawled from Nutch?

2013-11-14 Thread Reyes, Mark
RE: https://issues.apache.org/jira/browse/NUTCH-961 Are there usage instructions on how to do this? The JIRA ticket shows several attachments. Is there a specific attachment to download? Please keep in mind that I am running my Solr 4.5 instance and Nutch 1.7 crawl ‘almost’ as described from

RE: Preserve HTML that is being crawled from Nutch?

2013-11-14 Thread Markus Jelsma
You need my latest patch: 17/Jun/13 16:34. This is for trunk (1.8) but also works on 1.7 and 1.6. Set the following options in your nutch-site: tika.use_boilerpipe=true tika.boilerpipe.extractor=ArticleExtractor or CanolaExtractor ArticleExtractor works best for, well, article style pages. The

Re: Unable to inject seeds with

2013-11-14 Thread Jon Uhal
So I think it has to do with Accumulo somehow. I reverted the conf/gora.properties setting for mock from false to: gora.datastore.accumulo.mock=true and, after re-building and re-running, the runtime deploy job completed successfully. Trying to see if I can track down the issue. On Thu, Nov 14, 2013

Re: Unable to inject seeds with

2013-11-14 Thread Jon Uhal
Is there a chance that since I am using ZooKeeper 3.4.5 and Nutch 2.2.1 builds with ZooKeeper 3.3.1, there is a version issue? On Thu, Nov 14, 2013 at 3:44 PM, Jon Uhal jonu...@gmail.com wrote: So I think it has to do with Accumulo somehow. I reverted the conf/gora.properties setting for mock

RE: Nutch 1.7 and Solr 4.4.0 Integrate

2013-11-14 Thread Luis Armando Roca Fumero
Hello friends: I'm crawling with Nutch, and I don't want to crawl images at all, and I don't want to crawl URLs with ? or strange characters. When I search for *.gif, this is a fragment of my Solr search response: <lst name="responseHeader"><int name="status">0</int><int name="QTime">73</int><lst name="params"><str

Re: Preserve HTML that is being crawled from Nutch?

2013-11-14 Thread Reyes, Mark
How could I download the latest patch? I've enabled nutch-site.xml with, <property> <name>tika.use_boilerpipe</name> <value>true</value> </property> <property> <name>tika.boilerpipe.extractor</name> <value>ArticleExtractor</value> </property> On

Re: Unable to inject seeds with

2013-11-14 Thread Lewis John Mcgibbney
Hi Jon, Glad to hear that you're making some more progress! On Thu, Nov 14, 2013 at 8:45 PM, user-digest-h...@nutch.apache.org wrote: So I think it has to do with Accumulo somehow. I reverted the conf/gora.properties setting for mock from false to: gora.datastore.accumulo.mock=true and

RE: Preserve HTML that is being crawled from Nutch?

2013-11-14 Thread Markus Jelsma
That is the latest patch I referred to. Download it and get yourself a copy of the 1.7 sources or do an svn export of trunk. Put the patch in the root folder of the sources and apply it with patch -p0 < file.patch. Build with $ ant and you've got yourself an extracting Nutch in runtime/local/. For more info
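
For readers unfamiliar with the patch step described in the message above, here is a minimal sketch of how patch -p0 behaves when run from the root of a source tree. It does not use the real NUTCH-961 patch or the Nutch sources: the scratch directory, src/demo.properties and the contents of file.patch are hypothetical stand-ins invented for illustration only.

```shell
#!/bin/sh
set -eu

# Work in a throwaway directory so no real checkout is touched.
work=$(mktemp -d)
cd "$work"

# Hypothetical stand-in for the root of the unpacked sources.
mkdir -p src
printf 'extractor=none\n' > src/demo.properties

# Hypothetical stand-in for the downloaded patch. Its paths are written
# relative to the source root, which is why -p0 (strip no leading path
# components) is the matching option.
cat > file.patch <<'EOF'
--- src/demo.properties
+++ src/demo.properties
@@ -1 +1 @@
-extractor=none
+extractor=ArticleExtractor
EOF

# Run from the root folder of the sources; patch reads the diff on stdin.
patch -p0 < file.patch

# Show the patched file.
cat src/demo.properties
```

On a real tree the same command would be run from the root of the 1.7 sources (or the trunk export) with the downloaded patch placed there, followed by ant as the message describes. If the paths in a patch carried an extra leading directory, a different -p level would be needed, so it is worth checking the header lines of the patch before applying it.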