Hi,
We are crawling a site using nutch 1.6 and indexing into solr.
However, we need to rewrite the URLs that are indexed. For instance, Nutch crawls a page such as http://www.example.com/article=xxx, but when moving the data to the index we would like to use a different URL.
If you are using a Solr version in the 4.x series, you could update the fields [1] once the data is indexed. This is not doing it the Nutch way, but it is something that came to mind and can work right away.
[1] http://wiki.apache.org/solr/UpdateJSON#Atomic_Updates
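The suggestion above can be sketched with Solr 4.x's atomic-update syntax. A minimal sketch; the core URL, the field name `url`, the document id, and the rewritten URL are all hypothetical placeholders, not taken from the thread:

```shell
# Build an atomic-update payload. The doc id, the stored field 'url',
# and the rewritten URL below are hypothetical examples.
PAYLOAD='[{"id": "http://www.example.com/article=xxx",
           "url": {"set": "http://www.example.com/rewritten-url"}}]'

# Sanity-check that the payload is valid JSON before sending it:
echo "$PAYLOAD" | python3 -m json.tool

# Against a running Solr 4.x core (not executed here):
# curl 'http://localhost:8983/solr/update?commit=true' \
#      -H 'Content-Type: application/json' -d "$PAYLOAD"
```

Note that atomic updates only work when the other fields are stored, since Solr rebuilds the document from stored values.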
On Mon, Apr 22, 2013 at 9:56
I noticed that when I crawl a website using Nutch and index it in Solr, searching for words in the content of a page returns no results, though I do get the title and text from the header.
I then took the HTML source, validated it at http://xmlgrid.net/, and found that it is not well-formed.
Hi,
Check the following properties in your nutch-site.xml if you have overridden them; if not, you may have to, as they determine the amount of content that is downloaded and parsed. Anything beyond the limits in these parameters will be truncated by Nutch:
property
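The property snippet above is truncated; it most likely refers to the content-limit settings, since http.content.limit is named later in this thread. A hedged sketch of such an override in conf/nutch-site.xml (the value is an example; -1 disables truncation):

```xml
<!-- Example override; -1 means "do not truncate downloaded content". -->
<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>Maximum number of bytes to download per page; content
  beyond this limit is truncated before parsing.</description>
</property>
```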
logging explicitly states that no solrUrl is set.
On Sunday, April 21, 2013, kiran chitturi chitturikira...@gmail.com wrote:
Hi Mick,
Since this is an indexing error, can you check the logs on the Solr side?
On Sun, Apr 21, 2013 at 4:15 AM, micklai lailixi...@gmail.com wrote:
HI
Hi,
The 1.x indexer takes a -normalize parameter, and there you can rewrite your URLs. Judging from your patterns, the RegexURLNormalizer should be sufficient.
Make sure you use the config file containing that pattern only when indexing, otherwise the rewrites will end up in the CrawlDB and segments. URLNormalizers can have a scope, see
http://nutch.apache.org/apidocs-1.6/org/apache/nutch/net/URLNormalizers.html#SCOPE_INDEXER.
That should help to normalise only at indexing time.
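As a sketch, a regex-normalize.xml rule matching the example URL from this thread could look like the following; the substitution pattern is hypothetical, since the desired target URL was never quoted in full:

```xml
<?xml version="1.0"?>
<regex-normalize>
  <!-- Hypothetical rewrite: turn article=xxx style URLs into a cleaner form. -->
  <regex>
    <pattern>article=([0-9a-zA-Z]+)$</pattern>
    <substitution>articles/$1</substitution>
  </regex>
</regex-normalize>
```

Applied only with indexer scope (or via the indexer's -normalize parameter mentioned above), the rewritten form stays out of the CrawlDB and segments.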
On 22 April 2013 16:56, Markus Jelsma markus.jel...@openindex.io wrote:
Hello guys:
I am trying to run Nutch over Hadoop. Everything was OK: I modified the files according to the tutorial I had read, but the moment I ran a first crawl, I got the following:
[root@Hadoop nutch]# hadoop jar build/apache-nutch-2.1.jar
org.apache.nutch.crawl.Crawl /home/nutch/data
run your job jar from within the runtime/deploy directory.
On Monday, April 22, 2013, Maximiliano Marin conta...@maximilianomarin.com
wrote:
I'm crawling a local server. I have Nutch 2 working on a local machine
with the default 1G heap size. I got several OOM errors, but the fetch
eventually finished.
To get rid of the OOM errors, I moved everything to a machine with
more memory and increased the heap size to 8G. However,
Hi,
more information would be useful:
- exact Nutch version (2.?)
- how Nutch is called (eg, via bin/crawl)
- details of the configuration, esp.
-depth
-topN
http.content.limit
fetcher.parse
- storage back-end
In general, something there is wrong. Maybe some oversized documents
are being crawled.
Nutch 2.1
bin/nutch fetch -all
No depth
No topN. I'm only pulling around 600 documents at the current round.
http.content.limit is -1
fetcher.parse is the default
HBase
It's not the documents AFAIK. I'm crawling the same server and it works on
my local machine, but not on the server with more
Okay, I changed my server Nutch back to 1G of heap. Now it fails with a
RuntimeException. Prior to that, I get the following exception.
org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed
3 actions: servers with issues: localhost:44776
On Mon, Apr 22, 2013 at 2:58 PM,
In runtime/deploy, I only found:
- apache-nutch-2.1.job
- bin/nutch
I tried to execute the job file as a jar file using hadoop -jar
apache-nutch-2.1.job and I had the same result.
thank you.
Atte,
Maximiliano Marin Bustos
MCTS: Windows Server 2008 R2, Virtualization
MCTS: SQL Server 2008,
Make sure that the hadoop/bin directory is on your PATH, then go to
runtime/deploy and run something like this:
bin/nutch crawl ...
That being said, don't use the nutch crawl command. As much as the
alternative (the crawl script in the deploy/bin folder) is primitive,
it is still
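The advice above can be sketched as a shell session. The Hadoop install path and the crawl arguments are hypothetical, and the exact flags vary by Nutch version (check the usage output of bin/nutch crawl):

```shell
# Hypothetical Hadoop install location; adjust to your environment.
export PATH="$PATH:/usr/local/hadoop/bin"

# From runtime/deploy, bin/nutch detects the .job file and submits it
# to Hadoop instead of running locally (not executed here):
#   cd runtime/deploy
#   bin/nutch crawl urls -depth 3 -topN 1000

# Confirm the Hadoop binaries are now visible on the PATH:
echo "$PATH"
```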
Could someone please tell me one more time, in this line:
0/20 spinwaiting/active, 53852 pages, 7612 errors, 4.1 12 pages/s, 2632 7346 kb/s, 989 URLs in 5 queues
what are the two numbers before pages/s and the two numbers before kb/s?
thanks,
Can you look into the job archive and see what is wrong?
Maybe you need to rebuild the job archive:
ant job from ${NUTCH_HOME}
On Mon, Apr 22, 2013 at 1:09 PM, Maximiliano Marin
conta...@maximilianomarin.com wrote:
Fetcher threads try to get a fetch item (URL) from a queue of all the fetch
items (this queue is actually a queue of queues; for details see [0]). If a
thread doesn't get a fetch item, it spin-waits for 500 ms before polling the
queue again.
The '*spinWaiting*' count tells us how many threads are currently in that spin-wait state.
Hi Tejas,
this is a really excellent reply and very useful.
It would be really great if we could somehow have this kind of low-level
information readily available on the Nutch wiki.
On Monday, April 22, 2013, Tejas Patil tejas.patil...@gmail.com wrote:
Hi Lewis,
Thanks !!
I have huge respect for those who engineered the Fetcher class (esp. the
1.x one), as it is simply an *awesome* and complex piece of code.
I can polish my post so that it comes up to wiki quality. I don't
have access to the wiki; can you provide me with access?
Thanks,
Tejas
On Mon,
I agree.
I can sort this tomorrow.
@Kiran,
Are we still working on adding documentation contributors, via wiki uid
entries, to the Contributor and Admin groups?
Tejas can and should be added to both groups.
thanks
lewis
On Monday, April 22, 2013, Tejas Patil tejas.patil...@gmail.com wrote: