No. of documents decreasing in 2nd fetch | Nutch 2.3.1 + Hadoop 2.7.1 + MongoDB

2017-05-16 Thread shubham.gupta
Hey, I have a batch of 5000 seed URLs. I am trying to crawl these URLs using the Nutch job file created after the command "ant clean runtime" is executed. The first 2 cycles of the Nutch workflow, i.e. inject->generate->fetch->parse->updatedb, work fine. Also, it is able to fetch
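
For reference, one cycle of that workflow against the deployed job file looks like this in Nutch 2.x (the seed directory and -topN value are examples, not the poster's settings):

    bin/nutch inject urls/
    bin/nutch generate -topN 5000
    bin/nutch fetch -all
    bin/nutch parse -all
    bin/nutch updatedb -all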

Unable to parse a huge list of seed URLs | Nutch 2.3.1 + MongoDB + Hadoop 2.7.1

2017-04-11 Thread shubham.gupta
Hey, I have around 5000 URLs in my seed URL list. If I inject the whole list, it fails to fetch and parse all documents. The depth is set to 1. But when the list is divided into batches of 1000 URLs, it is able to fetch and parse all documents successfully. In the former case 5141
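
One way to reproduce the batching workaround described above, assuming the seeds live in a single plain-text file (file and directory names are examples):

    # break the 5000-URL list into 1000-URL batch files
    split -l 1000 urls/seed.txt batches/seed_
    # then inject and crawl each batches/seed_* file as its own seed list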

Re: All nutch jobs Failing | Nutch 2.3.1 + MongoDB

2017-03-15 Thread shubham.gupta
, shubham.gupta wrote: Hey, while I am running the whole Nutch process flow, i.e. Inject, Generate, Fetch, Parse, Update, the following errors are being logged: *Generator Job* java.lang.Exception: java.lang.ClassCastException: org.bson.types.ObjectId cannot be cast to java.lang.String

All nutch jobs Failing | Nutch 2.3.1 + MongoDB

2017-03-07 Thread shubham.gupta
Hey, while I am running the whole Nutch process flow, i.e. Inject, Generate, Fetch, Parse, Update, the following errors are being logged: *Generator Job* java.lang.Exception: java.lang.ClassCastException: org.bson.types.ObjectId cannot be cast to java.lang.String at
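
One thing worth verifying for this ClassCastException is that the webpage rows were written with String keys: the WebPage entry in conf/gora-mongodb-mapping.xml must declare keyClass="java.lang.String" (the collection name shown is the usual default), and a collection populated earlier with ObjectId _id values would need to be re-created. A sketch:

    <gora-otd>
      <class name="org.apache.nutch.storage.WebPage"
             keyClass="java.lang.String" document="webpage">
        <!-- field mappings unchanged -->
      </class>
    </gora-otd>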

Inserting Nutch (2.3.1) crawled data into Accumulo 1.7.1 with Gora 0.7.1

2017-02-16 Thread shubham.gupta
Hey, I have to add the Gora backend with Accumulo in Nutch. Currently, in the ivy/ivy.xml file, the Gora version used is 0.6.1, which uses Accumulo 1.5.1. We are porting Accumulo to version 1.7.1, hence we have to build Gora from source with the updated Accumulo code and plug that version back to
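
After publishing the custom Gora build to a local repository, the plug-back itself is a rev bump in ivy/ivy.xml; a sketch, where the rev value is an assumption about what the local build is versioned as:

    <dependency org="org.apache.gora" name="gora-core" rev="0.7.1" conf="*->default"/>
    <dependency org="org.apache.gora" name="gora-accumulo" rev="0.7.1" conf="*->default"/>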

All the jobs failing while running in Hadoop (local) | Nutch 2.3.1 + Hadoop 2.7.1 + MongoDB

2017-01-17 Thread shubham.gupta
Hey, I am running the jobs by recreating the local environment on Hadoop. While running the jobs, the following errors occur: DEFINE GeneratorJob: java.lang.RuntimeException: job failed: name=[rss_new]generate: 1484716707-832889027, jobid=job_local1136172300_0001 at

Re: Changing date format while page is parsed

2017-01-16 Thread shubham.gupta
Check field 24; it is the dt_stamp field. Also, why has the time been hardcoded to 1000L and 2000L? Thanks and Regards, Shubham Gupta On Saturday 14 January 2017 12:06 PM, vickyk wrote: shubham.gupta wrote Hey, When a webpage is parsed it stores the date in the dt_stamp in Long

Insert custom field in the webpage table | Nutch 2.3.1 + MongoDB

2017-01-16 Thread shubham.gupta
Hey, I am trying to insert a custom field when the parsing step is executed. That is, when the webpage table which is given as an argument is formed, I am trying to insert into the same table. But on checking, the field is not being inserted. I have created a custom plugin to do so. It is being
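
A minimal sketch of such a plugin, assuming the Nutch 2.x ParseFilter extension point; the class and field names are examples, and the plugin must also be registered in its plugin.xml and in plugin.includes. Note the custom field only persists if it is representable in the Gora mapping (e.g. conf/gora-mongodb-mapping.xml) and schema, which is a common reason a field never shows up:

    import java.nio.ByteBuffer;
    import java.util.Collection;

    import org.apache.avro.util.Utf8;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.parse.HTMLMetaTags;
    import org.apache.nutch.parse.Parse;
    import org.apache.nutch.parse.ParseFilter;
    import org.apache.nutch.storage.WebPage;
    import org.w3c.dom.DocumentFragment;

    // Example plugin class; "my_field" is a made-up field name.
    public class CustomFieldFilter implements ParseFilter {
      private Configuration conf;

      @Override
      public Parse filter(String url, WebPage page, Parse parse,
                          HTMLMetaTags metaTags, DocumentFragment doc) {
        // Write into the row's metadata map; the storage layer persists
        // metadata entries alongside the rest of the webpage row.
        page.getMetadata().put(new Utf8("my_field"),
            ByteBuffer.wrap("my_value".getBytes()));
        return parse;
      }

      @Override
      public Collection<WebPage.Field> getFields() {
        return null; // no extra fields required from the storage layer
      }

      @Override
      public void setConf(Configuration conf) { this.conf = conf; }

      @Override
      public Configuration getConf() { return conf; }
    }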

Re: Changing date format while page is parsed

2017-01-13 Thread shubham.gupta
org.apache.commons.lang.time.FastDateFormat.getInstance("yyyy-MM-dd'T'HH:mm:ss'Z'").format(date); Where do I have to make these changes? Thanks and Regards, Shubham Gupta On Friday 13 January 2017 09:18 AM, shubham.gupta wrote: Also, there is a problem that all documents have a dt_

Re: Changing date format while page is parsed

2017-01-12 Thread shubham.gupta
can store it as Long and convert to ISO Date whenever you want. You can follow that: org.apache.commons.lang.time.FastDateFormat.getInstance("yyyy-MM-dd'T'HH:mm:ss'Z'").format(date); Kind Regards, Furkan KAMACI On Thu, Jan 12, 2017 at 1:50 PM, shubham.gupta <shubham.gu...@orkash.com&

Changing date format while page is parsed

2017-01-12 Thread shubham.gupta
Hey, When a webpage is parsed it stores the date in the dt_stamp in Long format, whereas I want to store it in ISODate format. I have tried to change the type java.lang.Long to java.util.Long and have changed it while setting dtStamp and getting dtStamp. But after I build the project,
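
As the replies above suggest, the Long can stay in storage and be rendered as ISO-8601 at read time; a minimal sketch (the dtStamp value is an example):

    import java.util.Date;
    import org.apache.commons.lang.time.FastDateFormat;

    long dtStamp = 1484211600000L; // the Long value Nutch stores
    String iso = FastDateFormat.getInstance("yyyy-MM-dd'T'HH:mm:ss'Z'")
                               .format(new Date(dtStamp));
    // yields an ISO-style string such as 2017-01-12T...,
    // formatted in the JVM's default time zone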

Nutch 2.3.1 + Hadoop 2.7.1 | How to set priority on custom HtmlParseFilter plugins

2016-12-16 Thread shubham.gupta
Hi, We are using two plugin extensions, i.e. the language-identifier and our custom plugin. We want to call the language-identifier plugin first and then our custom plugin, so that the identified language is added along with the parsed content. I am not able to identify any configuration
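
If the Nutch 2.x ParseFilters class honours an ordering property the way 1.x HtmlParseFilters honours htmlparsefilter.order (worth confirming against the ParseFilters source of your build), a nutch-site.xml entry along these lines would pin the order. Both the property name and the class names below are assumptions; the real class names come from each plugin's plugin.xml:

    <property>
      <name>parsefilter.order</name>
      <value>org.apache.nutch.analysis.lang.HTMLLanguageParser com.example.parse.CustomParseFilter</value>
    </property>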

Re: Very few documents fetched

2016-12-14 Thread shubham.gupta
Also, does the fetch command not support the "-all" argument? Thanks and Regards, Shubham Gupta On Wednesday 14 December 2016 06:08 PM, shubham.gupta wrote: Hey, I am running Nutch 2.3.1 on Hadoop 2.7.1. After running 1 whole cycle, a huge number of documents are created with a stat

Very few documents fetched

2016-12-14 Thread shubham.gupta
Hey, I am running Nutch 2.3.1 on Hadoop 2.7.1. After running 1 whole cycle, a huge number of documents are created with status 1. When the fetch job is run with the argument "-all", very few documents with status 1 are fetched, whereas when the batchId of status 1 documents is specified, the
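
For reference, the two invocations being compared (Nutch 2.x FetcherJob syntax):

    bin/nutch fetch -all          # fetch all generated batches
    bin/nutch fetch <batchId>     # fetch only rows marked with that batch id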

Re: Injector and Generator Job Failing

2016-10-14 Thread shubham.gupta
Due to the huge amount of Hadoop logging, I had only allowed logging of ERROR messages and above for both Hadoop and Nutch. Also, I enabled periodic deletion of logs as a lot of disk was being utilized. So, I am kind of in the dark here. Thanks and Regards, Shubham Gupta On Friday 14 October

Injector and Generator Job Failing

2016-10-14 Thread shubham.gupta
Hey, whenever I run the Nutch application, only the injector and generator jobs fail. The path of the plugin folders in conf/nutch-site.xml is correct. The following error occurs: INFO mapreduce.Job: Job job_1476273924585_1272 failed with state FAILED due to: Task failed

Re: 90% of URLs rejected by filtering (Nutch 2.3.1)

2016-10-05 Thread shubham.gupta
you can comment out this line -^.{513,}$ and check. Regards, Sachin Shaju sachi...@mstack.com +919539887554 On Wed, Oct 5, 2016 at 11:41 AM, shubham.gupta <shubham.gu...@orkash.com> wrote: my current regex-urlfilter properties are as follows: # skip file: ftp: and mailto: urls #-^(fi

Re: 90% of URLs rejected by filtering (Nutch 2.3.1)

2016-10-05 Thread shubham.gupta
my current regex-urlfilter properties are as follows: # skip file: ftp: and mailto: urls #-^(file|ftp|mailto): # skip image and other suffixes we can't yet parse # for a more extensive coverage use the urlfilter-suffix plugin #-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|

Re: 90% of URLs rejected by filtering (Nutch 2.3.1)

2016-10-04 Thread shubham.gupta
The problem is not yet solved. Thanks and Regards Shubham Gupta On Monday 03 October 2016 11:12 AM, shubham.gupta wrote: After doing this, 3 fewer URLs were rejected. Thanks and Regards, Shubham Gupta On Monday 03 October 2016 10:28 AM, Sachin Shaju wrote: You may check by commenting all

Re: 90% of URLs rejected by filtering (Nutch 2.3.1)

2016-10-02 Thread shubham.gupta
...@mstack.com On Mon, Oct 3, 2016 at 10:05 AM, shubham.gupta <shubham.gu...@orkash.com> wrote: Hey, when the inject job is run, 90% of my seed URLs get rejected. Therefore, very few URLs get crawled and the crawl does not give proper output. My regex-urlfilter properties are as follows: # skip fil

90% of URLs rejected by filtering (Nutch 2.3.1)

2016-10-02 Thread shubham.gupta
Hey, when the inject job is run, 90% of my seed URLs get rejected. Therefore, very few URLs get crawled and the crawl does not give proper output. My regex-urlfilter properties are as follows: # skip file: ftp: and mailto: urls -^(file|ftp|mailto): # skip image and other suffixes we can't yet parse #
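
A quick way to find the offending rule is to cut conf/regex-urlfilter.txt down to a near-pass-through configuration and reinstate rules one at a time (a debugging sketch, not a production filter):

    # skip file: ftp: and mailto: urls
    -^(file|ftp|mailto):
    # accept everything else while debugging
    +.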

UpdateDb job fails every time

2016-09-15 Thread shubham.gupta
Hey, whenever the update job is executed, the following errors occur: INFO mapreduce.Job: Task Id : attempt_1473832356852_0104_m_00_2, Status : FAILED Error: java.net.MalformedURLException: no protocol:

Re: Application creating huge amount of logs : Nutch 2.3.1 + Hadoop 2.7.1

2016-09-11 Thread shubham.gupta
The Nutch job started producing this huge amount of logs after the fetcher.parse property was set to TRUE. So, is there any relation to that? Shubham On Thursday 08 September 2016 10:28 AM, shubham.gupta wrote: What changes can be made in Nutch's log4j.properties to reduce the size of Nutch

Application failing due to physical container storage overflow (Nutch 2.3.1 + Hadoop 2.7.1 + Yarn)

2016-09-08 Thread shubham.gupta
Hey, the running Nutch job fails due to the following error: Container [pid=6179,containerID=container_1473334555047_0003_01_15] is running beyond physical memory limits. Current usage: 4.1 GB of 4 GB physical memory used; 8.4 GB of 8.4 GB virtual memory used. Killing container. The
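
The usual response to "running beyond physical memory limits" is to raise the container size and keep the JVM heap roughly 20% below it; a mapred-site.xml sketch (the values are illustrative, sized against the 4 GB report above, not recommendations):

    <property>
      <name>mapreduce.map.memory.mb</name>
      <value>6144</value>
    </property>
    <property>
      <name>mapreduce.map.java.opts</name>
      <value>-Xmx4915m</value>
    </property>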

Re: Application creating huge amount of logs : Nutch 2.3.1 + Hadoop 2.7.1

2016-09-07 Thread shubham.gupta
What changes can be made in Nutch's log4j.properties to reduce the size of Nutch logging? Shubham On Wednesday 07 September 2016 04:04 AM, Markus Jelsma wrote: I've seen Hadoop not honouring some log settings before. Are you really sure these are org.apache.nutch.* logs? If so, and as
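
A sketch of the kind of conf/log4j.properties change being asked about, raising per-package thresholds so only WARN and above is written:

    log4j.logger.org.apache.nutch=WARN
    log4j.logger.org.apache.hadoop=WARN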

Re: Application creating huge amount of logs : Nutch 2.3.1 + Hadoop 2.7.1

2016-09-05 Thread shubham.gupta
Hey, I have changed the user.log_retain size to 10 MB but it is still creating a huge amount of logs. This leads to the failure of the datanode and the job fails. And, if the logs are deleted periodically, then the fetch phase takes a lot of time and it is uncertain whether it will complete or

Re: Application creating huge amount of logs : Nutch 2.3.1 + Hadoop 2.7.1

2016-08-24 Thread shubham.gupta
Hey, logs are created when spills of the map job are written during the FETCH job and are stored in /home/hadoop/nodelogs/usercache/root/appcache. The total size of the logs sums to over 13GB, which occupies a lot of the datanode's disk space, and I have to delete those logs for smooth functioning
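
Since the data in appcache is map spill/intermediate output rather than log4j output, compressing map output shrinks it directly; a mapred-site.xml sketch (SnappyCodec assumes native Snappy is installed, otherwise fall back to org.apache.hadoop.io.compress.DefaultCodec):

    <property>
      <name>mapreduce.map.output.compress</name>
      <value>true</value>
    </property>
    <property>
      <name>mapreduce.map.output.compress.codec</name>
      <value>org.apache.hadoop.io.compress.SnappyCodec</value>
    </property>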

Application creating huge amount of logs : Nutch 2.3.1 + Hadoop 2.7.1

2016-08-23 Thread shubham.gupta
Hey, I have integrated Nutch 2.3.1 with Hadoop 2.7.1, the fetcher.parse property is set to TRUE, and the database used is MongoDB. While the Nutch map job runs, it creates node logs over 13GB in size. The cause of such a huge amount of files is unknown. Any suggestion would

Re: Nutch is taking very long time to complete crawl job: Nutch 2.3.1 + Hadoop 2.7.1 + Yarn

2016-08-01 Thread shubham.gupta
Hey Markus, What I am trying to do is perform RSS crawling using Nutch, therefore I require the whole process to complete within 1 hour. Following your suggestion, I set fetcher.parse = true, which reduced the fetching time to 44 minutes and fetched 9195 pages. But,
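
The setting referred to, as it would appear in conf/nutch-site.xml:

    <property>
      <name>fetcher.parse</name>
      <value>true</value>
      <description>Parse content while fetching, saving the separate parse pass.</description>
    </property>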

Nutch is taking very long time to complete crawl job: Nutch 2.3.1 + Hadoop 2.7.1 + Yarn

2016-07-28 Thread shubham.gupta
Hi, I am trying to use Nutch 2.3.1 with a 3-datanode (4GB RAM each) Hadoop 2.7.1 cluster. The seed list provided consists of around 5000 URLs. I am using 60 threads and 5 numTasks for crawling these URLs at a distance of 1, but it is taking 1 day to complete the crawl job (Inject: 1 minute 35
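
For reference, the "60 threads" maps to this nutch-site.xml property (it can also be passed as -threads on the fetch command, and -numTasks controls the number of fetch tasks, per the FetcherJob usage in this Nutch version):

    <property>
      <name>fetcher.threads.fetch</name>
      <value>60</value>
    </property>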
