Re: Bad Request in nutch when i use parsechecker?

2012-02-02 Thread mina
[via Lucene] ml-node+s472066n3707012...@n3.nabble.com wrote: Nutch cannot do this right now. However, there's a patch that does the encoding. https://issues.apache.org/jira/browse/NUTCH-1098 On Wednesday 01 February 2012 16:26:06 mina wrote: how can i force nutch to encode this url? i

how can i use patch-with-utf8-encoding.diff in https://issues.apache.org/jira/browse/NUTCH-1098?

2012-02-02 Thread mina
i want to use https://issues.apache.org/jira/browse/NUTCH-1098 in my nutch to encode urls, but i don't know what i should do. how can i use patch-with-utf8-encoding.diff in my nutch? it is in .diff format.

Re: why nutch doesn't crawl Arabic sites well?

2012-02-01 Thread mina
don't send every message twice or more. On Tuesday 31 January 2012 10:51:06 mina wrote: i can crawl an arabic site like: http://www.sahafa.com/ but i can't crawl another site like: http://www.aljazeera.net/Portal/ help me please.

Re: Bad Request in nutch when i use parsechecker?

2012-02-01 Thread mina
%D8%B1-%D9%85%D9%86%D8%A7%D8%B7%D9%82-%D9%85%D8%AD%D8%B1%D9%88%D9%85/%D8%B3%D9%8A%D8%A7%D8%B3%D9%8A/ encoding, encoding, encoding On Wednesday 01 February 2012 14:14:55 mina wrote: hi, i use this command: bin/nutch parsechecker -dumpText http://www.irna.ir/News/30786427/سوء-استفاده-از-نام
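The reply above comes down to percent-encoding: the non-ASCII part of the path has to be sent as UTF-8 bytes written as %XX escapes. A minimal Python sketch of that step, done outside Nutch before passing the URL to parsechecker. `encode_url` is a hypothetical helper name; it is not part of Nutch or of the NUTCH-1098 patch.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def encode_url(url: str) -> str:
    """Percent-encode non-ASCII characters in the path and query as UTF-8."""
    parts = urlsplit(url)
    # Keep "/" and existing "%XX" escapes, so an already-encoded URL is unchanged.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%")
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))


if __name__ == "__main__":
    # Last path segment of the URL from this thread, written with \u escapes.
    url = "http://www.irna.ir/News/30786427/\u0633\u064a\u0627\u0633\u064a/"
    print(encode_url(url))
```

Because "%" is left untouched, running the helper twice gives the same result, but a literal "%" in a path would not be escaped; that trade-off is an assumption of this sketch.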

Re: why nutch doesn't crawl all links

2012-02-01 Thread mina
hi, i use this command: bin/nutch parsechecker -dumpText http://www.irna.ir/News/30786427/سوء-استفاده-از-نام-كمیته-امداد-برای-جمع-آوری-رای-در-مناطق-محروم/سياسي/ and see log: fetching: http://www.irna.ir/News/30786427/سوء-استفاده-از-نام-كمیته-امداد-برای-جمع-آوری-رای-در-مناطق-محروم/سياسي/

Re: why nutch doesn't crawl all links

2012-01-31 Thread mina
? Remi On Tuesday, January 31, 2012, mina tahereganji...@gmail.com wrote: i crawl a site with nutch 1.4, and i see that nutch doesn't crawl all links in this site. i have no filter and no limit rule for crawling. for example nutch never crawls this link: http://www.irna.ir/News/30786427

Re: why nutch doesn't crawl all links

2012-01-31 Thread mina
i can crawl an arabic site like: http://www.sahafa.com/ but i can't crawl another site like: http://www.aljazeera.net/portal/ help me please. -- View this message in context: http://lucene.472066.n3.nabble.com/why-nutch-dosen-t-crawl-all-links-tp3702031p3702593.html

why nutch doesn't crawl Arabic sites well?

2012-01-31 Thread mina
i can crawl an arabic site like: http://www.sahafa.com/ but i can't crawl another site like: http://www.aljazeera.net/Portal/ help me please. -- View this message in context: http://lucene.472066.n3.nabble.com/why-nutch-dosen-t-crawl-Arabic-sites-well-tp3702769p3702769.html

Re: error in crawl all link in no English language sites

2012-01-31 Thread mina
outlinks will be processed for a page; otherwise, all outlinks will be processed. </description> </property> Julien On 31 January 2012 02:56, mina tahereganji...@gmail.com wrote: i crawl a site with nutch 1.4. but nutch doesn't crawl all links in this site. the language of this site

Re: why nutch doesn't crawl all links

2012-01-31 Thread mina
test with parsechecker and indexchecker tools. On Tuesday 31 January 2012 09:29:39 mina wrote: i can crawl an arabic site like: http://www.sahafa.com/ but i can't crawl another site like: http://www.aljazeera.net/portal/ help me please.

error in crawl all link in no English language sites

2012-01-30 Thread mina
i crawl a site with nutch 1.4. but nutch doesn't crawl all links in this site. the language of this site is not English. for example nutch doesn't crawl this link: http://www.irna.ir/News/30786427/سوء-استفاده-از-نام-كمیته-امداد-برای-جمع-آوری-رای-در-مناطق-محروم/سياسي/ how can i solve this

use stop words in schema in nutch

2012-01-08 Thread mina
hi markus. i have a problem in nutch. i want to use stopwords in nutch. when i crawl sites and use solr to index them, any word in stopwords.txt can still be searched. help me. -- View this message in context: http://lucene.472066.n3.nabble.com/use-stop-words-in-schema-in-nutch-tp3641820p3641820.html

how can crawl .js files with nutch?

2012-01-08 Thread mina
i want to crawl .js files because in .js files i add some links to a site. how can i configure nutch to crawl .js files? i use nutch 1.4 -- View this message in context: http://lucene.472066.n3.nabble.com/how-can-crawl-js-files-with-nutch-tp3642613p3642613.html

how can parse .js files in nutch?

2012-01-08 Thread mina
i use nutch 1.4 and i want to parse .js files because some links are added to sites through .js files. help me. how can i configure nutch? -- View this message in context: http://lucene.472066.n3.nabble.com/how-can-parse-js-files-in-nutch-tp3642673p3642673.html

Re: fill up /tmp when crawl with nutch 1.3

2012-01-02 Thread mina
of space. On Sunday 01 January 2012 10:40:49 mina wrote: hi, i set up nutch 1.3 without hadoop. when i crawl 4 sites with depth 10 and topN 1 my /tmp is filled up to 100% and my crawl fails. how can i tell nutch not to do that and to use another directory? or how can i empty my /tmp? help

fill up /tmp when crawl with nutch 1.3

2012-01-01 Thread mina
hi, i set up nutch 1.3 without hadoop. when i crawl 4 sites with depth 10 and topN 1 my /tmp is filled up to 100% and my crawl fails. how can i tell nutch not to do that and to use another directory? or how can i empty my /tmp? help me.

Re: error in solrindex command in nutch 1.4

2011-12-27 Thread mina
i solved this problem. i read the nutch doc for solrindex at: http://wiki.apache.org/nutch/bin/nutch%20solrindex this isn't correct. the correct command for solrindex is: sh nutch solr index crawldb -linkdb linkdb segments/*. thanks for your answer Markus. On Mon, Dec 26, 2011 at 7:23 AM, Markus

error in solrindex command in nutch 1.4

2011-12-26 Thread mina
i crawl my sites with nutch 1.4 and when i want to index sites with solr3.3 i get some errors. i use this command: sh nutch solrindex crawldb linkdb segments/* my errors: Input path does not exist: file:/linkdb/crawl_parse Input path does not exist: file:/linkdb/parse_data Input path does not

error in topN

2011-12-20 Thread mina
hi, i crawl one site that has 100 links in depth 1 and 100 links in depth 2, but nutch only crawls 23 links from depth 1 and 30 from depth 2. how can i force nutch to crawl all links in depth 1 and 2? i use nutch 1.3 with topN=1 and depth=2, and in my nutch-site.xml: <property>

Malformed URL: '', skipping (java.net.MalformedURLException

2011-12-15 Thread mina
i crawl sites with nutch 1.3. i see this exception in my log when nutch crawls my sites: Malformed URL: '', skipping (java.net.MalformedURLException: no protocol: at java.net.URL.<init>(URL.java:567) at java.net.URL.<init>(URL.java:464) at java.net.URL.<init>(URL.java:413)
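"no protocol" is what java.net.URL reports when a string has no scheme such as http://, and here the string is empty. A rough Python sketch for checking a seed list for such entries before injecting it. It assumes, which this thread does not confirm, that the empty URL comes from the seed file rather than from a parsed outlink. `find_bad_lines` and the scheme list are hypothetical.

```python
from urllib.parse import urlsplit

# Schemes treated as acceptable here; an assumption, adjust to your setup.
KNOWN_SCHEMES = {"http", "https", "ftp", "file"}


def find_bad_lines(lines):
    """Return (line_number, text) for entries that are empty or lack a known scheme."""
    bad = []
    for number, raw in enumerate(lines, start=1):
        url = raw.strip()
        if not url or urlsplit(url).scheme.lower() not in KNOWN_SCHEMES:
            bad.append((number, url))
    return bad


if __name__ == "__main__":
    seeds = ["http://www.sahafa.com/", "", "www.aljazeera.net/Portal/"]
    for number, url in find_bad_lines(seeds):
        print(f"line {number}: {url!r}")
```

If the seed list turns out to be clean, the empty string more likely comes from a link extracted during parsing, and this check will not help.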

error java.net.SocketTimeoutException: Read timed out in crawl with nutch?

2011-12-05 Thread mina
hi, i crawl 4 sites with topN=100 and depth=3 with nutch 1.3. i get a java.net.SocketTimeoutException: Read timed out error in the crawl log. which property should i set in nutch-site.xml?

error java.net.SocketException: Connection reset in crawl with nutch

2011-12-05 Thread mina
hi, i crawl 4 sites with: topN=100 depth=3 http.max.delays=1000 http.timeout=8 nutch 1.3. i get a java.net.SocketException: Connection reset error in the crawl log. help me.

Re: how give several sites to nutch to crawl?

2011-12-04 Thread mina
i added this property to nutch-site.xml but my problem isn't resolved. which property should i use? help me. it's important for me. -- View this message in context: http://lucene.472066.n3.nabble.com/how-give-several-sites-to-nutch-to-crawl-tp3556697p3559106.html

Re: how give several sites to nutch to crawl?

2011-12-03 Thread mina
thanks for your answer. i use this script to crawl my sites: $NUTCH_HOME/bin/nutch inject $NUTCH_HOME/bin/crawl1/crawldb $NUTCH_HOME/bin/seedUrls for((i=0; i < $depth; i++)) do echo --- Beginning crawl at depth `expr $i + 1` of $depth --- $NUTCH_HOME/bin/nutch generate

delete url from crawldb in nutch 1.3?

2011-11-14 Thread mina
i crawl sites with nutch 1.3. now i want to delete a url from crawldb. how can i do this? how can i see the urls in crawldb? -- View this message in context: http://lucene.472066.n3.nabble.com/delete-url-from-crawldb-in-nutch-1-3-tp3506106p3506106.html

Re: crawl sites in nutch 1.3?

2011-11-11 Thread mina
thanks for your answer. i think topN caused this problem, because when nutch fetches a url, it will fetch any links that exist in the page. the maximum number of links that will be fetched from a page is equal to topN. i think if nutch has fetched as many urls as topN, it will not fetch another url from sites.txt. please give

recrawl sites with a scheduled crawling

2011-11-02 Thread mina
hi, i want to recrawl my sites every hour. i wrote a script for this and edited some properties in nutch-site.xml, but my recrawler fetches urls only 3 times and after that it stops fetching. it means that my nutch doesn't update after 3 hours. these are my changes in nutch-site.xml: <property>

recrawl sites in nutch 1.3

2011-10-24 Thread mina
hi all. i have a script that recrawls a site, but this recrawler fetches URLs only 3 times and doesn't get updates after that. i want the recrawler to fetch and crawl this site every day. which property should i set in nutch-site.xml? help me.

how can i crawl pdfs?

2011-09-24 Thread mina
hi all, when i crawl pdfs, nutch fetches any link in the pdfs. how can i omit this? thanks a lot. -- View this message in context: http://lucene.472066.n3.nabble.com/how-can-i-crawl-pdfs-tp3364549p3364549.html Sent from the Nutch - User mailing list archive at Nabble.com.

Re: how do recrawl sites and filesystems?

2011-09-24 Thread mina
thanks for your answer. how can i use Jira? i don't know it. please help me.

Re: how do recrawl sites and filesystems?

2011-09-24 Thread mina
On Sat, Sep 24, 2011 at 9:23 AM, tahere ganjiyar tahereganji...@gmail.com wrote: thanks for your answer. how can i use Jira? i don't know it. please help me.

Re: how do recrawl sites and filesystems?

2011-09-24 Thread mina
how should i use this? On Sat, Sep 24, 2011 at 9:46 AM, Markus Jelsma-2 [via Lucene] ml-node+s472066n3364686...@n3.nabble.com wrote: No need to send multiple messages. Here's Nutch's Jira issue tracker: https://issues.apache.org/jira/browse/NUTCH thanks for your answer. how can i use