Hey
I have a batch of 5000 seed URLs. I am trying to crawl these URLs
using the Nutch job created after the command "ant clean runtime"
is executed.
In the first 2 cycles of the Nutch workflow, i.e.
inject->generate->fetch->parse->updatedb, it works fine. Also, it
is able to fetch
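For reference, one cycle of that workflow as it is usually driven from the local runtime build (flag spellings vary slightly between 2.x releases; values such as the thread count are illustrative):

    bin/nutch inject urls/seed.txt
    bin/nutch generate -topN 5000
    bin/nutch fetch -all -threads 60
    bin/nutch parse -all
    bin/nutch updatedb -all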
Hey
I have around 5000 URLs in my seed URL list. If I inject the whole list,
it fails to fetch and parse all the documents. The depth is set to 1.
But when the list is divided into batches of 1000 URLs, it is able
to fetch and parse all documents successfully.
In the former case 5141
, shubham.gupta wrote:
Hey
While I am running the whole process flow of Nutch, i.e.
Inject, Generate, Fetch, Parse, Update,
the following errors are being logged:
*Generator Job*
java.lang.Exception: java.lang.ClassCastException:
org.bson.types.ObjectId cannot be cast to java.lang.String
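One plausible cause, assuming the stock Gora MongoDB backend: Gora maps the WebPage key to the Mongo _id and declares it a String in conf/gora-mongodb-mapping.xml, so any document in the collection whose _id is a native ObjectId (for example, one inserted outside Nutch) breaks that cast when the Generator scans the table. A minimal sketch of the relevant mapping, with field mappings elided and the collection name illustrative:

    <gora-otd>
      <class name="org.apache.nutch.storage.WebPage"
             keyClass="java.lang.String" document="webpage">
        <!-- field mappings elided; the key/_id must stay a String -->
      </class>
    </gora-otd>

Documents with an ObjectId _id can be spotted from the mongo shell with db.webpage.find({_id: {$type: 7}}).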
Hey
I have to add the Gora backend with Accumulo in Nutch. Currently, in the
ivy/ivy.xml file, the Gora version used is 0.6.1, which uses Accumulo
1.5.1. We are porting Accumulo to version 1.7.1, hence we have to build Gora
with the updated Accumulo code from source and plug that version back to
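One hedged way to go about this, assuming Gora builds from source with Maven and that its Accumulo version is set in the top-level pom.xml (paths and versions here are illustrative):

    # build Gora against the newer Accumulo and install it locally
    git clone https://github.com/apache/gora.git
    cd gora
    # edit the Accumulo version property in pom.xml, e.g. to 1.7.1
    mvn clean install -DskipTests

Afterwards, point the gora-core/gora-accumulo revisions in Nutch's ivy/ivy.xml at the locally installed build and rerun "ant clean runtime"; Ivy has to be able to resolve from the local Maven repository for this to work.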
Hey
I am running the jobs by recreating the local environment on Hadoop.
While running the jobs, the following errors come up:
GeneratorJob: java.lang.RuntimeException: job failed:
name=[rss_new]generate: 1484716707-832889027, jobid=job_local1136172300_0001
at
Check field 24. It is the dt_stamp field. Also, why has the time
been hardcoded to 1000L and 2000L?
Thanks and Regards,
Shubham Gupta
On Saturday 14 January 2017 12:06 PM, vickyk wrote:
shubham.gupta wrote
Hey,
When a webpage is parsed, it stores the date in dt_stamp in Long
Hey,
I am trying to insert a custom field when the parsing step is executed.
That is, I am trying to insert into the same webpage table that is
passed as an argument when it is formed. But, as checked, the field is not
being inserted. I have created a custom plugin to do so. It is being
org.apache.commons.lang.time.FastDateFormat.getInstance("yyyy-MM-dd'T'HH:mm:ss'Z'").format(date);
Where do I have to make these changes?
Thanks and Regards,
Shubham Gupta
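A minimal sketch of the shape such a plugin can take against the Nutch 2.x ParseFilter extension point (the class and field names here are illustrative, not the actual plugin). One thing worth noting: a value written into the page's metadata map is persisted with the page, whereas a brand-new top-level column must also be declared in the Avro schema and the Gora mapping file, otherwise the store can silently drop it:

    import java.nio.ByteBuffer;
    import java.util.Collection;
    import org.apache.avro.util.Utf8;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.parse.HTMLMetaTags;
    import org.apache.nutch.parse.Parse;
    import org.apache.nutch.parse.ParseFilter;
    import org.apache.nutch.storage.WebPage;
    import org.w3c.dom.DocumentFragment;

    public class CustomFieldFilter implements ParseFilter {
      private Configuration conf;

      @Override
      public Parse filter(String url, WebPage page, Parse parse,
                          HTMLMetaTags metaTags, DocumentFragment doc) {
        // Stash the custom value in the page's metadata map; the surrounding
        // parse job writes the page (including metadata) back to the store.
        page.getMetadata().put(new Utf8("customField"),
            ByteBuffer.wrap("customValue".getBytes()));
        return parse;
      }

      @Override
      public Collection<WebPage.Field> getFields() {
        return null; // no extra columns needed beyond the defaults
      }

      @Override
      public void setConf(Configuration conf) { this.conf = conf; }

      @Override
      public Configuration getConf() { return conf; }
    }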
On Friday 13 January 2017 09:18 AM, shubham.gupta wrote:
Also, there is a problem that all documents have a dt_
can store it as
Long and convert to an ISO date whenever you want. You can use this:
org.apache.commons.lang.time.FastDateFormat.getInstance("yyyy-MM-dd'T'HH:mm:ss'Z'").format(date);
Kind Regards,
Furkan KAMACI
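Following that suggestion, a minimal self-contained sketch of the conversion (class and variable names are illustrative):

    import java.util.Date;
    import org.apache.commons.lang.time.FastDateFormat;

    public class DtStampDemo {
      public static void main(String[] args) {
        // dt_stamp stays a Long (epoch millis) in the schema; convert to an
        // ISO-8601 string only where the readable form is needed.
        long dtStamp = System.currentTimeMillis();
        String isoDate = FastDateFormat
            .getInstance("yyyy-MM-dd'T'HH:mm:ss'Z'")
            .format(new Date(dtStamp));
        System.out.println(isoDate); // e.g. 2017-01-12T13:50:00Z
      }
    }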
On Thu, Jan 12, 2017 at 1:50 PM, shubham.gupta <shubham.gu...@orkash.com>
Hey,
When a webpage is parsed, it stores the date in dt_stamp in Long
format, whereas I want to store it in ISODate format. I have tried to
change the type java.lang.Long to java.util.Long and have changed
the code that sets and gets dtStamp accordingly.
But after I build the project,
Hi
We are using two plugin extensions, i.e. the language-identifier and our
custom plugin. We want the language-identifier plugin to be called first and
then our custom plugin, so that the identified language is added along
with the parsed content.
I am not able to identify any configuration
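For reference, a nutch-site.xml fragment with both filters enabled (the custom plugin id is illustrative). Whether the order inside plugin.includes is honoured for parse filters depends on the Nutch version; Nutch 1.x exposes a dedicated htmlparsefilter.order property for this, while 2.x may need the equivalent checked in its ParseFilters implementation:

    <property>
      <name>plugin.includes</name>
      <value>protocol-http|urlfilter-regex|parse-(html|tika)|language-identifier|parsefilter-custom|index-(basic|anchor)</value>
    </property>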
Also, does the fetch command not support the "-all" argument?
Thanks and Regards,
Shubham Gupta
On Wednesday 14 December 2016 06:08 PM, shubham.gupta wrote:
Hey,
I am running Nutch 2.3.1 on Hadoop 2.7.1. After one whole process runs,
a humongous number of documents is created with status 1. When the fetch job is
run with the argument "-all", very few documents with status 1 are
fetched, whereas when the batchId of the status-1 documents is specified,
the
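For comparison, the two invocations being contrasted (the batch id value is illustrative; exact flags vary slightly across 2.x releases):

    bin/nutch fetch -all -threads 60
    bin/nutch fetch 1481698043-123456789 -threads 60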
Due to the huge amount of Hadoop logging, I had only allowed logging of
ERROR messages and above for both Hadoop and Nutch. I also enabled
periodic deletion of the logs, as a lot of disk was being used. So I am
kind of in the dark here.
Thanks and Regards,
Shubham Gupta
On Friday 14 October
Hey
Whenever I run the Nutch application, only the inject and generate jobs
fail.
The path of the plugin folders in conf/nutch-site.xml is correct.
The following error occurs:
INFO mapreduce.Job: Job job_1476273924585_1272 failed with state FAILED
due to: Task failed
You can comment out this line, -^.{513,}$, and check.
Regards,
Sachin Shaju
sachi...@mstack.com
+919539887554
On Wed, Oct 5, 2016 at 11:41 AM, shubham.gupta <shubham.gu...@orkash.com>
wrote:
my current regex-urlfilter properties are as follows:
# skip file: ftp: and mailto: urls
#-^(file|ftp|mailto):
# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
#-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|
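For context, the tail of a stock regex-urlfilter.txt, with the 512-character cap that Sachin refers to shown commented out; the final catch-all accept rule has to stay last, otherwise every URL is rejected:

    # skip URLs longer than 512 characters
    #-^.{513,}$
    # accept anything else
    +.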
The problem is not yet solved.
Thanks and Regards
Shubham Gupta
On Monday 03 October 2016 11:12 AM, shubham.gupta wrote:
After doing this, 3 fewer URLs are rejected.
Thanks and Regards,
Shubham Gupta
On Monday 03 October 2016 10:28 AM, Sachin Shaju wrote:
You may check by commenting all
...@mstack.com
On Mon, Oct 3, 2016 at 10:05 AM, shubham.gupta <shubham.gu...@orkash.com>
wrote:
Hey
When the inject job is run, 90% of my seed URLs get rejected. Therefore,
very few URLs get crawled, and the crawl does not give proper output.
my regex-urlfilter properties are as follows:
# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):
# skip image and other suffixes we can't yet parse
#
Hey,
Whenever the update job is executed, the following errors occur:
INFO mapreduce.Job: Task Id : attempt_1473832356852_0104_m_00_2,
Status : FAILED
Error: java.net.MalformedURLException: no protocol:
The Nutch job started producing this huge amount of logs after the
fetcher.parse property was set to TRUE. So, is there any relation to that?
Shubham
On Thursday 08 September 2016 10:28 AM, shubham.gupta wrote:
What changes can be made in Nutch log4j.properties to reduce the size
of Nutch
Hey
The running Nutch job fails due to the following error:
Container [pid=6179,containerID=container_1473334555047_0003_01_15]
is running beyond physical memory limits. Current usage: 4.1 GB of 4 GB
physical memory used; 8.4 GB of 8.4 GB virtual memory used. Killing
container.
The
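A hedged mapred-site.xml fragment of the kind of change this error usually calls for, i.e. raising the container size and keeping the JVM heap comfortably below it (the values are illustrative, not a recommendation):

    <property>
      <name>mapreduce.map.memory.mb</name>
      <value>6144</value>
    </property>
    <property>
      <name>mapreduce.map.java.opts</name>
      <value>-Xmx4915m</value>
    </property>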
What changes can be made in Nutch's log4j.properties to reduce the size of
the Nutch logs?
Shubham
On Wednesday 07 September 2016 04:04 AM, Markus Jelsma wrote:
I've seen Hadoop not honouring some log settings before. Are you really sure
these are org.apache.nutch.* logs? If so, and as
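A sketch of the kind of log4j.properties change under discussion, assuming the log4j 1.x setup that Nutch ships with (the appender name is illustrative and may differ from Nutch's default):

    log4j.logger.org.apache.nutch=ERROR
    log4j.logger.org.apache.hadoop=ERROR
    log4j.appender.DRFA=org.apache.log4j.RollingFileAppender
    log4j.appender.DRFA.MaxFileSize=10MB
    log4j.appender.DRFA.MaxBackupIndex=5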
Hey,
I have changed the user.log_retain size to 10 MB, but it still creates a
huge amount of logs. This leads to the failure of the datanode, and the job
fails. And if the logs are deleted periodically, then the fetch phase
takes a lot of time, and it is uncertain whether it will complete or
Hey
Logs are created when the map job spills during the FETCH job,
and they are stored in /home/hadoop/nodelogs/usercache/root/appcache. The
total size of the logs sums to over 13 GB, which occupies a lot of the
datanode's disk space, and I have to delete those logs for smooth
functioning
Hey
I have integrated Nutch 2.3.1 with Hadoop 2.7.1; the fetcher.parse
property is set to TRUE, and the database used is MongoDB. While the Nutch
map job runs, it creates node logs over 13 GB in size, and
the cause of such a huge amount of files is unknown. Any suggestion would
Hey Markus,
What I am trying to do is perform RSS crawling using Nutch; therefore I
require that the whole process complete within 1 hour.
Following your suggestion, I set fetcher.parse = true, which reduced
the fetch time to 44 minutes and fetched 9195 pages. But,
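For reference, the property in question as it would sit in nutch-site.xml (fetcher.parse is a stock Nutch property; when true, pages are parsed inside the fetch job instead of in a separate parse job):

    <property>
      <name>fetcher.parse</name>
      <value>true</value>
    </property>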
Hi
I am trying to use Nutch 2.3.1 on a 3-datanode (4 GB RAM each) Hadoop
2.7.1 cluster. The seed list provided consists of around 5000 URLs. I
am using 60 threads and 5 numTasks for crawling these URLs at a distance
of 1, but it is taking 1 day to complete the crawl job (Inject: 1
minute 35
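A hedged nutch-site.xml fragment of the knobs that usually dominate fetch time with a few thousand distinct hosts; the property names are stock Nutch settings, but the values are only starting points to experiment with:

    <property>
      <name>fetcher.threads.fetch</name>
      <value>60</value>
    </property>
    <property>
      <name>fetcher.threads.per.queue</name>
      <value>2</value>
    </property>
    <property>
      <name>fetcher.server.delay</name>
      <value>1.0</value>
    </property>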