Re: nutch crawling

2012-03-01 Thread Elisabeth Adler

Hi,
A similar question was posted yesterday (the "Query in nutch" thread) - as
Lewis suggested, NUTCH-585 [1] might be what you need.


Best,
Elisabeth

 [1] https://issues.apache.org/jira/browse/NUTCH-585

On 29.02.2012 12:15, sanjay87 wrote:

Hi Techies,

I have some questions about Nutch, the web crawler. I have finished
crawling the website and indexing it in Solr, but the problem is that the
Nutch crawler crawls at a domain level, i.e. it picks up the menu items,
anchor text and everything else that is not actually needed.

I only need to crawl the legitimate content present on the site.

I tried the localhost:8080/solr/admin page and the response is not
legitimate content.

The content field contains all the data that is not actually needed.

We have tried a lot of options and are still unable to find a solution;
please provide your valuable input.

Thanks





Distributed Indexing on MapReduce

2012-03-01 Thread Frank Scholten
Hi all,

I am looking into reusing some existing code for distributed indexing
to test a Mahout tool I am working on
https://issues.apache.org/jira/browse/MAHOUT-944

What I want is to index the Apache Public Mail Archives dataset (200G)
via MapReduce on Hadoop.

I have been going through the Nutch and contrib/index code and from my
understanding I have to:

* Create an InputFormat / RecordReader / InputSplit class for
splitting the e-mails across mappers
* Create a Mapper which emits the e-mails as key value pairs
* Create a Reducer which indexes the e-mails on the local filesystem
(or straight to HDFS?)
* Copy these indexes from local filesystem to HDFS. In the same Reducer?

I am unsure about the final steps and how to get to the end result: a
bunch of index shards on HDFS. It seems that each reducer needs to know
the directory on HDFS it will eventually write to, but I don't see how to
get each reducer to copy its shard there.

How do I set this up?
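
To make the question concrete, here is roughly what I have in mind for the
reduce side (just a sketch, assuming Lucene 3.x and Hadoop's old mapred API;
paths, field names and the close() copy step are guesses on my part):

import java.io.File;
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

// Sketch: each reducer builds one Lucene shard on local disk and copies it to HDFS when done.
public class MailIndexingReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {

  private JobConf conf;
  private File localShard;
  private IndexWriter writer;

  @Override
  public void configure(JobConf job) {
    conf = job;
    try {
      // One shard per reduce task, named after the task's partition number.
      localShard = new File("/tmp/shard-" + job.getInt("mapred.task.partition", 0));
      writer = new IndexWriter(FSDirectory.open(localShard),
          new IndexWriterConfig(Version.LUCENE_35, new StandardAnalyzer(Version.LUCENE_35)));
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }

  @Override
  public void reduce(Text messageId, Iterator<Text> bodies,
      OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
    // Index each e-mail; the mapper emitted (message id, message body) pairs.
    while (bodies.hasNext()) {
      Document doc = new Document();
      doc.add(new Field("id", messageId.toString(), Field.Store.YES, Field.Index.NOT_ANALYZED));
      doc.add(new Field("body", bodies.next().toString(), Field.Store.YES, Field.Index.ANALYZED));
      writer.addDocument(doc);
    }
  }

  @Override
  public void close() throws IOException {
    writer.close();
    // Copy the finished shard from local disk to a per-reducer directory on HDFS.
    FileSystem fs = FileSystem.get(conf);
    fs.copyFromLocalFile(new Path(localShard.getAbsolutePath()),
        new Path("/indexes/shard-" + conf.getInt("mapred.task.partition", 0)));
  }
}

The part I am least sure about is whether doing the copy in close() like this
is sane, or whether existing code already handles that step.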

Cheers,

Frank


Re: http.redirect.max

2012-03-01 Thread alxsss

 Hello,

I tried 1, 2, and -1 for the http.redirect.max config, but nutch still postpones
redirected urls to later depths.
What is the correct config setting to have nutch crawl redirected urls
immediately? I need it because I have a restriction that depth be at most 2.
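
For reference, this is roughly how I set it in nutch-site.xml (a sketch; the
value shown is just one of the ones I tried):

<property>
  <name>http.redirect.max</name>
  <value>2</value>
</property>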

Thanks.
Alex.

 

 

-Original Message-
From: xuyuanme xuyua...@gmail.com
To: user user@nutch.apache.org
Sent: Fri, Feb 24, 2012 1:31 am
Subject: Re: http.redirect.max


The config file is used for some proof-of-concept testing, so the content
might be confusing; please ignore the incorrect parts.

Yes, from my end I can see that the crawl of http://www.scotland.gov.uk
is redirected as expected.

However, the website I tried to crawl is a bit trickier.

Here's what I want to do:

1. Set
http://www.accessdata.fda.gov/scripts/cder/drugsatfda/index.cfm?fuseaction=Search.SearchResults_Browse&DrugInitial=B
as the seed page

2. And try to crawl one of the link
(http://www.accessdata.fda.gov/scripts/cder/drugsatfda/index.cfm?fuseaction=Search.Overview&DrugName=BACIGUENT)
as a test

If you click the link, you'll find that the website uses redirects and cookies to
control page navigation. So I used the protocol-httpclient plugin instead of
protocol-http to handle the cookies.

However, the redirect does not happen as expected. The only way I can fetch the
second link is to manually change the response = getResponse(u, datum,
false) call to response = getResponse(u, datum, true) in the
org.apache.nutch.protocol.http.api.HttpBase class and recompile the
lib-http plugin.
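
For clarity, the change I made looks roughly like this (paraphrased; the
surrounding code in HttpBase may differ between Nutch versions):

// In org.apache.nutch.protocol.http.api.HttpBase#getProtocolOutput (paraphrased):
Response response = getResponse(u, datum, false); // default: redirects are left for the fetcher
// My local change: let the protocol plugin follow the redirect itself
// Response response = getResponse(u, datum, true);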

So my issue is related to this specific site:
http://www.accessdata.fda.gov/scripts/cder/drugsatfda/index.cfm?fuseaction=Search.SearchResults_Browse&DrugInitial=B


lewis john mcgibbney wrote
 
 I've checked working with redirects and everything seems to work fine for
 me.
 
 The site I checked on
 
 http://www.scotland.gov.uk
 
 temp redirect to
 
 http://home.scotland.gov.uk/home
 
 Nutch gets this fine when I do some tweaking with nutch-site.xml
 
 redirects property -1 (just to demonstrate, I would usually not set it so)
 
 Lewis
 


 


Re: Featured link support in Nutch

2012-03-01 Thread Markus Jelsma

Hi

What is a featured link? Maybe Solr's elevation component is what you
are looking for?
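
That component is driven by an elevate.xml file in the Solr conf directory; a
minimal sketch (the query text and document id are made up):

<elevate>
  <query text="featured page">
    <doc id="http://www.example.com/featured" />
  </query>
</elevate>

For the matching query, the listed documents are forced to the top of the
results.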


cheers

On Thu, 1 Mar 2012 11:59:00 -0800, Stany Fargose 
stannyfarg...@gmail.com wrote:

Hi All,

I am working on replacing our current site search with Nutch-Solr. I am
very new to these technologies, but I like what they offer. I got the
basic setup working.

I was wondering how we would implement a 'featured link' using Nutch-Solr. I
would like to hear your thoughts.

Thanks in advance.

-Stan


--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536600 / 06-50258350


Re: multiple small crawlers on single machine conflict at /tmp/hadoop-username/mapred

2012-03-01 Thread Markus Jelsma

you can either:

1. run on hadoop
2. not run multiple concurrent jobs on a local machine
3. set a hadoop.tmp.dir per job
4. merge all crawls to a single crawl

On Thu, 1 Mar 2012 16:26:00 -0500, Jeremy Villalobos 
jeremyvillalo...@gmail.com wrote:

Hello:

I am running multiple small crawls on one machine. I notice that they are
conflicting because they all access

/tmp/hadoop-username/mapred

How do I change the location of this folder?

Do I have to use Hadoop to run multiple crawlers, each specific to a site?


thanks

Jeremy


--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536600 / 06-50258350


Re: multiple small crawlers on single machine conflict at /tmp/hadoop-username/mapred

2012-03-01 Thread Jeremy Villalobos
That is what I was looking for, thank you.

this property was added to:
$NUTCH_DIR/runtime/local/conf/nutch-site.xml
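
Roughly like this (a sketch; the directory is just what I picked for one of
the crawls):

<property>
  <name>hadoop.tmp.dir</name>
  <value>/tmp/hadoop-username/crawl1</value>
</property>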

Jeremy

On Thu, Mar 1, 2012 at 7:01 PM, Markus Jelsma markus.jel...@openindex.io wrote:

 you can either:

 1. run on hadoop
 2. not run multiple concurrent jobs on a local machine
 3. set a hadoop.tmp.dir per job
 4. merge all crawls to a single crawl


 On Thu, 1 Mar 2012 16:26:00 -0500, Jeremy Villalobos 
 jeremyvillalo...@gmail.com wrote:

 Hello:

 I am running multiple small crawls on one machine.  I notice that they are
 conflicting because they all access

 /tmp/hadoop-username/mapred

 How do I change the location of this folder?

 Do I have to use Hadoop to run multiple crawlers, each specific to a site?

 thanks

 Jeremy


 --
 Markus Jelsma - CTO - Openindex
 http://www.linkedin.com/in/markus17
 050-8536600 / 06-50258350



Only fetching initial seedlist

2012-03-01 Thread James Ford
Hello,

I am having a problem getting Nutch to crawl and fetch the initial seedlist
only. It seems like Nutch tends to skip some URLs, or it does not parse some
of them.

For example with the following seedlist:

http://www.domain.com/?_PageId=492&AreaId=441
http://www.domain.com/?_PageId=631&AreaId=11
http://www.domain.com/?_PageId=490&AreaId=19

Nutch does not fetch and parse all of the URLs. I am not that interested in
the outlinks; my general purpose is to crawl, fetch and parse the seedlist
ONLY.

I am using the crawl command with a depth of 1 and infinite topN. I have
also tried injecting manually.

Thanks,
James Ford



Re: multiple small crawlers on single machine conflict at /tmp/hadoop-username/mapred

2012-03-01 Thread remi tassing
How did you define that property so it's different for each job?

Remi

On Friday, March 2, 2012, Jeremy Villalobos jeremyvillalo...@gmail.com
wrote:
 That is what I was looking for, thank you.

 this property was added to:
 $NUTCH_DIR/runtime/local/conf/nutch-site.xml

 Jeremy

 On Thu, Mar 1, 2012 at 7:01 PM, Markus Jelsma markus.jel...@openindex.io
wrote:

 you can either:

 1. run on hadoop
 2. not run multiple concurrent jobs on a local machine
 3. set a hadoop.tmp.dir per job
 4. merge all crawls to a single crawl


 On Thu, 1 Mar 2012 16:26:00 -0500, Jeremy Villalobos 
 jeremyvillalo...@gmail.com wrote:

 Hello:

 I am running multiple small crawls on one machine.  I notice that they
are
 conflicting because they all access

 /tmp/hadoop-username/mapred

  How do I change the location of this folder?

  Do I have to use Hadoop to run multiple crawlers, each specific to a site?

 thanks

 Jeremy


 --
 Markus Jelsma - CTO - Openindex
  http://www.linkedin.com/in/markus17

 050-8536600 / 06-50258350




Re: Only fetching initial seedlist

2012-03-01 Thread remi tassing
This question comes up a lot; try searching the mailing list archive.

On Friday, March 2, 2012, James Ford simon.fo...@gmail.com wrote:
 Hello,

 I am having a problem getting Nutch to crawl and fetch the initial seedlist
 only. It seems like Nutch tends to skip some URLs, or it does not parse some
 of them.

 For example with the following seedlist:

 http://www.domain.com/?_PageId=492&AreaId=441
 http://www.domain.com/?_PageId=631&AreaId=11
 http://www.domain.com/?_PageId=490&AreaId=19

 Nutch does not fetch and parse all of the URLs. I am not that interested in
 the outlinks; my general purpose is to crawl, fetch and parse the seedlist
 ONLY.

 I am using the crawl command with a depth of 1 and infinite topN. I have
 also tried injecting manually.

 Thanks,
 James Ford




different fetch interval for each depth urls

2012-03-01 Thread alxsss
Hello,

I need to have different fetch intervals for the initial seed urls and the urls
extracted from them at depth 1. How can this be achieved? I tried the -adddays
option of the generate command, but it seems it cannot be used to solve this issue.

Thanks in advance.
Alex.


Re: multiple small crawlers on single machine conflict at /tmp/hadoop-username/mapred

2012-03-01 Thread Jeremy Villalobos
It is a small number of crawlers, so I copied a runtime directory for each,
and therefore they have different configuration files.

Jeremy

On Thu, Mar 1, 2012 at 10:57 PM, remi tassing tassingr...@gmail.com wrote:

 How did you define that property so it's different for each job?

 Remi

 On Friday, March 2, 2012, Jeremy Villalobos jeremyvillalo...@gmail.com
 wrote:
  That is what I was looking for, thank you.
 
  this property was added to:
  $NUTCH_DIR/runtime/local/conf/nutch-site.xml
 
  Jeremy
 
  On Thu, Mar 1, 2012 at 7:01 PM, Markus Jelsma 
 markus.jel...@openindex.io
 wrote:
 
  you can either:
 
  1. run on hadoop
  2. not run multiple concurrent jobs on a local machine
  3. set a hadoop.tmp.dir per job
  4. merge all crawls to a single crawl
 
 
  On Thu, 1 Mar 2012 16:26:00 -0500, Jeremy Villalobos 
  jeremyvillalo...@gmail.com wrote:
 
  Hello:
 
  I am running multiple small crawls on one machine.  I notice that they
 are
  conflicting because they all access
 
  /tmp/hadoop-username/mapred
 
  How do I change the location of this folder?

  Do I have to use Hadoop to run multiple crawlers, each specific to a site?
 
  thanks
 
  Jeremy
 
 
  --
  Markus Jelsma - CTO - Openindex
  http://www.linkedin.com/in/markus17
 
  050-8536600 / 06-50258350
 
 



Re: multiple small crawlers on single machine conflict at /tmp/hadoop-username/mapred

2012-03-01 Thread Markus Jelsma
You can also pass it to most jobs with $ nutch job -Dhadoop.tmp.dir=bla args.
This can even be automated with some shell scripting.
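
Something along these lines, for example (untested sketch; the crawl
directories and the command are placeholders, and the -D generic option has to
come before the job's own arguments):

for crawl in crawl1 crawl2 crawl3; do
  # give each crawl its own temp dir so concurrent local jobs do not collide
  bin/nutch generate -Dhadoop.tmp.dir=/tmp/$crawl $crawl/crawldb $crawl/segments
done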



On Fri, 2 Mar 2012 00:49:36 -0500, Jeremy Villalobos 
jeremyvillalo...@gmail.com wrote:

It is a small number of crawlers, so I copied a runtime directory for each,
and therefore they have different configuration files.

Jeremy

On Thu, Mar 1, 2012 at 10:57 PM, remi tassing  wrote:
 How did you define that property so it's different for each job?

 Remi

 On Friday, March 2, 2012, Jeremy Villalobos wrote:

  That is what I was looking for, thank you.

 
  this property was added to:
  $NUTCH_DIR/runtime/local/conf/nutch-site.xml
 
  Jeremy
 
  On Thu, Mar 1, 2012 at 7:01 PM, Markus Jelsma wrote:
 
  you can either:
 
  1. run on hadoop
  2. not run multiple concurrent jobs on a local machine
  3. set a hadoop.tmp.dir per job
  4. merge all crawls to a single crawl
 
 
  On Thu, 1 Mar 2012 16:26:00 -0500, Jeremy Villalobos 
  jeremyvillalo...@gmail.com wrote:
 
  Hello:
 
  I am running multiple small crawls on one machine.  I notice that they are
  conflicting because they all access
 
  /tmp/hadoop-username/mapred
 
  How do I change the location of this folder?

  Do I have to use Hadoop to run multiple crawlers, each specific to a site?
 
  thanks
 
  Jeremy
 
 
  --
  Markus Jelsma - CTO - Openindex
  http://www.linkedin.com/in/markus17
  050-8536600 / 06-50258350
 
 





--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536600 / 06-50258350


Re: Only fetching initial seedlist

2012-03-01 Thread Markus Jelsma

Indeed. Check your URL filters and plugins.
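
In particular, the default regex-urlfilter.txt contains a rule like this
(quoted from memory) that throws away URLs with query strings such as yours;
comment it out or adjust it if you need those URLs:

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]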

On Fri, 2 Mar 2012 05:59:20 +0200, remi tassing tassingr...@gmail.com 
wrote:

This question comes a lot, try searching the mailinglist archive

On Friday, March 2, 2012, James Ford simon.fo...@gmail.com wrote:

Hello,

I am having a problem getting Nutch to crawl and fetch the initial seedlist
only. It seems like Nutch tends to skip some URLs, or it does not parse some
of them.

For example with the following seedlist:

http://www.domain.com/?_PageId=492&AreaId=441
http://www.domain.com/?_PageId=631&AreaId=11
http://www.domain.com/?_PageId=490&AreaId=19

Nutch does not fetch and parse all of the URLs. I am not that interested in
the outlinks; my general purpose is to crawl, fetch and parse the seedlist
ONLY.

I am using the crawl command with a depth of 1 and infinite topN. I have
also tried injecting manually.

Thanks,
James Ford




--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536600 / 06-50258350


Re: different fetch interval for each depth urls

2012-03-01 Thread Markus Jelsma
Well, you could set a new default fetch interval in your configuration
after the first crawl cycle, but the depth information is lost if you
continue crawling, so there is no real solution.
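
The property I am referring to is db.fetch.interval.default in nutch-site.xml;
for example (value in seconds, picked arbitrarily):

<property>
  <name>db.fetch.interval.default</name>
  <value>604800</value>
</property>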


What problem are you trying to solve anyway?

On Fri, 2 Mar 2012 00:19:34 -0500 (EST), alx...@aim.com wrote:

Hello,

I need to have different fetch intervals for the initial seed urls and the
urls extracted from them at depth 1. How can this be achieved? I tried the
-adddays option of the generate command, but it seems it cannot be used to
solve this issue.

Thanks in advance.
Alex.


--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536600 / 06-50258350


Re: Featured link support in Nutch

2012-03-01 Thread Stany Fargose
That's what I was looking for. Thanks Markus!

I also have another question (I did a lot of searching on this already). We want
to get results by using a 'starts with' or prefix query,
e.g. return all results where the url starts with http://auto.yahoo.com.

Thanks again!


On Thu, Mar 1, 2012 at 3:59 PM, Markus Jelsma markus.jel...@openindex.io wrote:

 Hi

 What is a featured link? Maybe Solr's elevation component is what you are
 looking for?

 cheers


 On Thu, 1 Mar 2012 11:59:00 -0800, Stany Fargose stannyfarg...@gmail.com
 wrote:

 Hi All,

 I am working on replacing our current site search with Nutch-Solr. I am
 very new to these technologies, but I like what they offer. I got the
 basic setup working.

 I was wondering how we would implement a 'featured link' using Nutch-Solr. I
 would like to hear your thoughts.

 Thanks in advance.

 -Stan


 --
 Markus Jelsma - CTO - Openindex
  http://www.linkedin.com/in/markus17
 050-8536600 / 06-50258350



Re: Featured link support in Nutch

2012-03-01 Thread Markus Jelsma
A wildcard query or an edge n-gram filter. The terms component can also do
this, or even facet.prefix!
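
For the edge n-gram route, a schema.xml field type along these lines could
work (a sketch only; the name and gram sizes are arbitrary):

<fieldType name="url_prefix" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="200"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
</fieldType>

Index the url into a field of that type and a plain term query for
http://auto.yahoo.com will match every url that starts with it.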


On Thu, 1 Mar 2012 23:15:23 -0800, Stany Fargose 
stannyfarg...@gmail.com wrote:

That's what I was looking for. Thanks Markus!

I also have another question (I did a lot of searching on this already). We
want to get results by using a 'starts with' or prefix query,
e.g. return all results where the url starts with http://auto.yahoo.com.

Thanks again!

On Thu, Mar 1, 2012 at 3:59 PM, Markus Jelsma  wrote:
 Hi

 What is a featured link? Maybe Solr's elevation component is what
 you are looking for?

 cheers

 On Thu, 1 Mar 2012 11:59:00 -0800, Stany Fargose  wrote:
  Hi All,

 I am working on replacing our current site search with Nutch-Solr. I am
 very new to these technologies, but I like what they offer. I got the
 basic setup working.

 I was wondering how we would implement a 'featured link' using Nutch-Solr. I
 would like to hear your thoughts.

 Thanks in advance.

 -Stan

 --
 Markus Jelsma - CTO - Openindex
 http://www.linkedin.com/in/markus17
 050-8536600 / 06-50258350





--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536600 / 06-50258350