rewriting urls that are indexed

2013-04-22 Thread Niels Boldt
Hi, We are crawling a site using Nutch 1.6 and indexing into Solr. However, we need to rewrite the urls that are indexed in the following way: for instance, Nutch crawls a page http://www.example.com/article=xxx, but when moving data to the index we would like to use the url

Re: rewriting urls that are indexed

2013-04-22 Thread kiran chitturi
If you are using Solr versions in the 4.x series, then you could update the fields [1] once the data is indexed. This is not the Nutch way, but it is something that came to mind and can work right away. [1] http://wiki.apache.org/solr/UpdateJSON#Atomic_Updates On Mon, Apr 22, 2013 at 9:56
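
A minimal sketch of such an atomic update, assuming the core has a unique key field id and a separate stored url field (both field names are assumptions, not from the thread; atomic updates also require the other fields to be stored so Solr can reconstruct the document):

  curl 'http://localhost:8983/solr/update?commit=true' \
       -H 'Content-Type: application/json' \
       -d '[{"id": "doc-123",
             "url": {"set": "http://www.example.com/articles/xxx"}}]'

Note that an atomic update cannot change the unique key field itself, so this only works if the rewritten url is not the key.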

Nutch - not getting all content of page

2013-04-22 Thread kneerosh
I noticed that when I'm crawling a website using Nutch and indexing it in Solr, when I search for words in the content of the page I wasn't getting results, though I get the title and text from the header. I then took the html source and validated it at http://xmlgrid.net/ and found that it's not well

Re: Nutch - not getting all content of page

2013-04-22 Thread chethan
Hi, Check the following properties in your nutch-site.xml if you have overridden them; else you may have to, as they determine the amount of content that is downloaded and parsed. Anything beyond what is set in these parameters will be truncated by Nutch: property
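
The property chethan is most likely referring to is http.content.limit (there are matching file.content.limit and ftp.content.limit settings); overriding it in nutch-site.xml along these lines disables truncation:

  <property>
    <name>http.content.limit</name>
    <!-- bytes of content to download per page; -1 disables the limit -->
    <value>-1</value>
  </property>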

Re: [Exception in thread main java.io.IOException: Job failed!]

2013-04-22 Thread Lewis John Mcgibbney
The logging explicitly states that no solrUrl is set. On Sunday, April 21, 2013, kiran chitturi chitturikira...@gmail.com wrote: Hi Mick, Since this is an error with indexing, can you check the logs from the Solr side? On Sun, Apr 21, 2013 at 4:15 AM, micklai lailixi...@gmail.com wrote: HI
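
If memory serves, that warning comes from the 1.x Crawl class when no -solr argument is given, in which case indexing is skipped; so the usual fix is to pass the Solr URL explicitly. A sketch, with URL and paths as placeholders:

  # all-in-one crawl with indexing into Solr (1.x syntax; paths are examples)
  bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 50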

RE: rewriting urls that are indexed

2013-04-22 Thread Markus Jelsma
Hi, The 1.x indexer takes a -normalize parameter, and there you can rewrite your URLs. Judging from your patterns, the RegexURLNormalizer should be sufficient. Make sure you use the config file containing that pattern only when indexing, otherwise the rewritten URLs will end up in the CrawlDB and segments. Use
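
As a sketch of what such a rule could look like in conf/regex-normalize.xml (the substitution below is purely hypothetical, since the target URL form was cut off in the original question):

  <?xml version="1.0"?>
  <regex-normalize>
    <regex>
      <!-- hypothetical rewrite: article=123 -> articles/123 -->
      <pattern>article=(\d+)</pattern>
      <substitution>articles/$1</substitution>
    </regex>
  </regex-normalize>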

Re: rewriting urls that are indexed

2013-04-22 Thread Julien Nioche
URLNormalizers can have a scope, see http://nutch.apache.org/apidocs-1.6/org/apache/nutch/net/URLNormalizers.html#SCOPE_INDEXER. That should help you normalise only at indexing time. On 22 April 2013 16:56, Markus Jelsma markus.jel...@openindex.io wrote: Hi, The 1.x indexer takes a -normalize
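
If I remember the RegexURLNormalizer configuration correctly, a scope-specific rule file can be wired up in nutch-site.xml roughly like this (the file name is an example), so the rewrite only applies in the indexer scope:

  <property>
    <name>urlnormalizer.regex.file.indexer</name>
    <value>regex-normalize-indexer.xml</value>
  </property>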

Crawling and Hadoop problem

2013-04-22 Thread Maximiliano Marin
Hello guys: I am trying to run Nutch on Hadoop. Everything was OK; I modified the files according to the tutorial I had read, but at the moment of making the first crawl I get the following: [root@Hadoop nutch]# hadoop jar build/apache-nutch-2.1.jar org.apache.nutch.crawl.Crawl /home/nutch/data

Re: Crawling and Hadoop problem

2013-04-22 Thread Lewis John Mcgibbney
Run your job jar from within the runtime/deploy directory. On Monday, April 22, 2013, Maximiliano Marin conta...@maximilianomarin.com wrote: Hello guys: I am trying to run Nutch on Hadoop. Everything was OK; I modified the files according to the tutorial I had read, but at the moment of making
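
For context: in deploy mode, the bin/nutch wrapper submits the apache-nutch-2.1.job archive to Hadoop itself, provided the hadoop binary is on the PATH. A sketch, with the seed path and options as placeholders only:

  cd runtime/deploy
  bin/nutch crawl /home/nutch/data/urls -depth 3 -topN 1000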

Nutch 2 hanging after aborting hung threads

2013-04-22 Thread Bai Shen
I'm crawling a local server. I have Nutch 2 working on a local machine with the default 1G heap size. I got several OOM errors, but the fetch eventually finishes. In order to get rid of the OOM errors, I moved everything to a machine with more memory and increased the heap size to 8G. However,
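
One detail worth noting here (an aside, not from the thread): when Nutch runs locally through bin/nutch, the JVM heap is sized by the NUTCH_HEAPSIZE environment variable, in MB, so an 8G heap would be set along these lines:

  # heap for the local bin/nutch JVM, in MB
  export NUTCH_HEAPSIZE=8000
  bin/nutch fetch -all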

Re: Nutch 2 hanging after aborting hung threads

2013-04-22 Thread Sebastian Nagel
Hi, more information would be useful: - exact Nutch version (2.?) - how Nutch is called (e.g., via bin/crawl) - details of the configuration, esp. -depth, -topN, http.content.limit, fetcher.parse - storage back-end. In general, something is wrong; maybe some oversized documents are being crawled.

Re: Nutch 2 hanging after aborting hung threads

2013-04-22 Thread Bai Shen
Nutch 2.1. bin/nutch fetch -all. No depth, no topN. I'm only pulling around 600 documents in the current round. http.content.limit is -1, fetcher.parse is the default, HBase. It's not the documents AFAIK. I'm crawling the same server, and it works on my local machine but not on the server with more

Re: Nutch 2 hanging after aborting hung threads

2013-04-22 Thread Bai Shen
Okay, I changed my server Nutch back to 1G of heap. Now it fails with a RuntimeException; prior to that, I get the following exception: org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed 3 actions: servers with issues: localhost:44776 On Mon, Apr 22, 2013 at 2:58 PM,

Re: Crawling and Hadoop problem

2013-04-22 Thread Maximiliano Marin
In runtime/deploy, I only found: - apache-nutch-2.1.job - bin/nutch. I tried to execute the job file as a jar file using hadoop -jar apache-nutch-2.1.job and I had the same result. Thank you.

Re: Crawling and Hadoop problem

2013-04-22 Thread kaveh minooie
Make sure that the hadoop/bin directory is on your path, then go to runtime/deploy and run something like: bin/nutch crawl ... That being said, don't use the nutch crawl command; as primitive as the alternative (the crawl script in the deploy/bin folder) is, it is still
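
From memory, the crawl script of that era took a seed directory, a crawl id (a crawl dir in 1.x), a Solr URL, and a number of rounds, so a deploy-mode invocation would look roughly like this, with every argument a placeholder:

  cd runtime/deploy
  bin/crawl urls/ mycrawl http://localhost:8983/solr/ 2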

need legends for fetch reduce jobtracker output

2013-04-22 Thread kaveh minooie
Could someone please tell me one more time, in this line from the jobtracker output of the fetch reduce task: 0/20 spinwaiting/active, 53852 pages, 7612 errors, 4.1 12 pages/s, 2632 7346 kb/s, 989 URLs in 5 queues what are the two numbers before pages/s and the two numbers before kb/s? Thanks,

Re: Crawling and Hadoop problem

2013-04-22 Thread Lewis John Mcgibbney
Can you look into the job archive and see what is wrong? Maybe you need to rebuild the job archive ('ant job' from ${NUTCH_HOME}). On Mon, Apr 22, 2013 at 1:09 PM, Maximiliano Marin conta...@maximilianomarin.com wrote: In runtime/deploy, I only found: - apache-nutch-2.1.job - bin/nutch I tried
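
For completeness: 'ant job' is the build target that regenerates the .job archive; from memory it lands under build/ and is what runtime/deploy ships:

  cd ${NUTCH_HOME}
  ant clean job   # rebuilds apache-nutch-2.1.job under build/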

Re: need legends for fetch reduce jobtracker output

2013-04-22 Thread Tejas Patil
Fetcher threads try to get a fetch item (url) from a queue of all the fetch items (this queue is actually a queue of queues; for details see [0]). If a thread doesn't get a fetch item, it spin-waits for 500ms before polling the queue again. The '*spinWaiting*' count tells us how many threads are in
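
To spell out the rest of the line kaveh asked about, here is my reading of the Fetcher's status reporting (worth double-checking against the Fetcher source): in each rate pair, the first number should be the average since the fetch started and the second the rate over the last reporting interval:

  0/20 spinwaiting/active  -> threads spin-waiting / threads running
  53852 pages, 7612 errors -> totals so far
  4.1 12 pages/s           -> average pages/s, then pages/s over the
                              last reporting interval
  2632 7346 kb/s           -> the same pair for bandwidth
  989 URLs in 5 queues     -> fetch items still queued, and the number
                              of per-host queues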

Re: need legends for fetch reduce jobtracker output

2013-04-22 Thread Lewis John Mcgibbney
Hi Tejas, this is a really excellent reply and very useful. It would be really great if we could somehow have this kind of low-level information readily available on the Nutch wiki. On Monday, April 22, 2013, Tejas Patil tejas.patil...@gmail.com wrote: Fetcher threads try to get a fetch item (url)

Re: need legends for fetch reduce jobtracker output

2013-04-22 Thread Tejas Patil
Hi Lewis, Thanks!! I have huge respect for those who engineered the Fetcher class (esp. in 1.x), as it's simply an *awesome* and complex piece of code. I can polish my post more so that it comes up to wiki quality. I don't have access to the wiki; can you provide it? Thanks, Tejas On Mon,

Re: need legends for fetch reduce jobtracker output

2013-04-22 Thread Lewis John Mcgibbney
I agree. I can sort this tomorrow. @Kiran, are we still working on adding documentation contributors via wiki uid entry to the Contributor and Admin groups? Tejas can and should be added to both groups. Thanks, Lewis