from:"mina"

Re: Bad Request in nutch when i use parsechecker?

2012-02-02 Thread mina

[via Lucene] ml-node+s472066n3707012...@n3.nabble.com wrote: Nutch cannot do this right now. However, there's a patch that does the encoding. https://issues.apache.org/jira/browse/NUTCH-1098 On Wednesday 01 February 2012 16:26:06 mina wrote: how can i force nutch to encode this url? i

how can i use patch-with-utf8-encoding.diff in https://issues.apache.org/jira/browse/NUTCH-1098?

2012-02-02 Thread mina

i want to use https://issues.apache.org/jira/browse/NUTCH-1098 in my nutch to encode urls, but i don't know what i should do. how can i use patch-with-utf8-encoding.diff in my nutch? it is in .diff format. -- View this message in context:

Re: why nutch dosen't crawl Arabic sites well?

2012-02-01 Thread mina

don't send every message twice or more. On Tuesday 31 January 2012 10:51:06 mina wrote: i can crawl an arabic site like: http://www.sahafa.com/ but i can't crawl another site like: http://www.aljazeera.net/Portal/ help me please. -- View this message in context: http://lucene.472066.n3

Re: Bad Request in nutch when i use parsechecker?

2012-02-01 Thread mina

%D8%B1-%D9%85%D9%86%D8%A7%D8%B7%D9%82-%D9%85%D8%AD%D8%B1%D9%88%D9%85/%D8%B3%D9%8A%D8%A7%D8%B3%D9%8A/ encoding, encoding, encoding On Wednesday 01 February 2012 14:14:55 mina wrote: hi, i use this command: bin/nutch parsechecker -dumpText http://www.irna.ir/News/30786427/سوء-استفاده-از-نام
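
The percent-encoded form quoted in this reply can be reproduced outside Nutch. The following is a minimal sketch, assuming the goal is only to UTF-8 percent-encode the path and query of a URL before passing it to a tool such as parsechecker. The helper name `percent_encode_url` is hypothetical; it is not part of Nutch and is not what the NUTCH-1098 patch implements. It does not handle non-ASCII host names.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def percent_encode_url(url: str) -> str:
    """Percent-encode non-ASCII characters in the path and query as UTF-8."""
    parts = urlsplit(url)
    # Keep "/" as the path separator and "%" so existing escapes are not re-encoded.
    path = quote(parts.path, safe="/%")
    # Keep the usual query delimiters intact.
    query = quote(parts.query, safe="=&%")
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))


if __name__ == "__main__":
    # The escaped string decodes from the final path segment of the encoded URL above.
    print(percent_encode_url("http://www.irna.ir/News/30786427/\u0633\u064a\u0627\u0633\u064a/"))
```

Because "%" is treated as safe, running the helper on an already encoded URL leaves it unchanged, so it can be applied to a seed list that mixes encoded and unencoded entries.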

Re: why nutch dosen't crawl all links

2012-02-01 Thread mina

hi, i use this command: bin/nutch parsechecker -dumpText http://www.irna.ir/News/30786427/سوء-استفاده-از-نام-كمیته-امداد-برای-جمع-آوری-رای-در-مناطق-محروم/سياسي/ and see log: fetching: http://www.irna.ir/News/30786427/سوء-استفاده-از-نام-كمیته-امداد-برای-جمع-آوری-رای-در-مناطق-محروم/سياسي/

Re: why nutch dosen't crawl all links

2012-01-31 Thread mina

? Remi On Tuesday, January 31, 2012, mina tahereganji...@gmail.com wrote: i crawl a site with nutch 1.4, i understand that nutch doesn't crawl all links in this site. i have no filter and no limit rule for crawling. for example nutch never crawls this link: http://www.irna.ir/News/30786427

Re: why nutch dosen't crawl all links

2012-01-31 Thread mina

i can crawl an arabic site like: http://www.sahafa.com/ but i can't crawl another site like: http://www.aljazeera.net/portal/ help me please. -- View this message in context: http://lucene.472066.n3.nabble.com/why-nutch-dosen-t-crawl-all-links-tp3702031p3702593.html Sent from the Nutch - User

why nutch dosen't crawl Arabic sites well?

2012-01-31 Thread mina

i can crawl an arabic site like: http://www.sahafa.com/ but i can't crawl another site like: http://www.aljazeera.net/Portal/ help me please. -- View this message in context: http://lucene.472066.n3.nabble.com/why-nutch-dosen-t-crawl-Arabic-sites-well-tp3702769p3702769.html Sent from the Nutch

Re: error in crawl all link in no English language sites

2012-01-31 Thread mina

outlinks will be processed for a page; otherwise, all outlinks will be processed. </description> </property> Julien On 31 January 2012 02:56, mina tahereganji...@gmail.com wrote: i crawl a site with nutch 1.4. but nutch doesn't crawl all links in this site. the language of this site

Re: why nutch dosen't crawl all links

2012-01-31 Thread mina

test with parsechecker and indexchecker tools. On Tuesday 31 January 2012 09:29:39 mina wrote: i can crawl an arabic site like: http://www.sahafa.com/ but i can't crawl another site like: http://www.aljazeera.net/portal/ help me please. -- View this message in context: http://lucene

error in crawl all link in no English language sites

2012-01-30 Thread mina

i crawl a site with nutch 1.4. but nutch dosen't crawl all links in this site. the language of this site is not English. for example nutch dosen't crawl this link: http://www.irna.ir/News/30786427/سوء-استفاده-از-نام-كمیته-امداد-برای-جمع-آوری-رای-در-مناطق-محروم/سياسي/ what can i solve this

use stop words in schema in nutch

2012-01-08 Thread mina

hi markus. i have a problem in nutch. i want to use stopwords in nutch. when i crawl sites and use solr to index them, any word in stopwords.txt can still be searched. help me. -- View this message in context: http://lucene.472066.n3.nabble.com/use-stop-words-in-schema-in-nutch-tp3641820p3641820.html

how can crawl .js files with nutch?

2012-01-08 Thread mina

i want to crawl .js files because i add some links to sites in .js files. how can i configure nutch to crawl .js files? i use nutch 1.4 -- View this message in context: http://lucene.472066.n3.nabble.com/how-can-crawl-js-files-with-nutch-tp3642613p3642613.html Sent from the Nutch - User mailing

how can parse .js files in nutch?

2012-01-08 Thread mina

i use nutch 1.4 and i want to parse .js files because some links are added to sites through .js files. help me. how can i configure nutch? -- View this message in context: http://lucene.472066.n3.nabble.com/how-can-parse-js-files-in-nutch-tp3642673p3642673.html Sent from the Nutch - User mailing list

Re: fill up /tmp when crawl with nutc1.3

2012-01-02 Thread mina

of space. On Sunday 01 January 2012 10:40:49 mina wrote: hi, i set up nutch 1.3 without hadoop. when i crawl 4 sites with depth 10 and topN 1 my /tmp is filled up to 100% and my crawl fails. how can i tell nutch not to do that and to use another directory? or how can i empty my /tmp? help

fill up /tmp when crawl with nutc1.3

2012-01-01 Thread mina

hi, i set up nutch 1.3 without hadoop. when i crawl 4 sites with depth 10 and topN 1 my /tmp is filled up to 100% and my crawl fails. how can i tell nutch not to do that and to use another directory? or how can i empty my /tmp? help me. -- View this message in context:

Re: error in solrindex command in nutch 1.4

2011-12-27 Thread mina

i solved this problem. i read the nutch doc for solrindex at: http://wiki.apache.org/nutch/bin/nutch%20solrindex this isn't correct. the correct command for solrindex is: sh nutch solr index crawldb -linkdb linkdb segments/*. thanks for your answer Markus. On Mon, Dec 26, 2011 at 7:23 AM, Markus

error in solrindex command in nutch 1.4

2011-12-26 Thread mina

i crawl my sites with nutch 1.4 and when i want to index sites with solr3.3 i get some errors. i use this command: sh nutch solrindex crawldb linkdb segments/* my errors: Input path does not exist: file:/linkdb/crawl_parse Input path does not exist: file:/linkdb/parse_data Input path does not

error in topN

2011-12-20 Thread mina

hi, i crawl one site that has 100 links in depth 1, and 100 links in depth 2, but nutch only crawls 23 links from depth 1 and 30 from depth 2. how can i force nutch to crawl all links in depth 1 and 2? i use nutch 1.3 topN=1 depth =2 and in my nutch-site.xml: property

Malformed URL: '', skipping (java.net.MalformedURLException

2011-12-15 Thread mina

i crawl sites with nutch 1.3. i see this exception in my log when nutch crawls my sites: Malformed URL: '', skipping (java.net.MalformedURLException: no protocol: at java.net.URL.<init>(URL.java:567) at java.net.URL.<init>(URL.java:464) at java.net.URL.<init>(URL.java:413)

error java.net.SocketTimeoutException: Read timed out in crawl with nutch?

2011-12-05 Thread mina

hi, i crawl 4 sites with topN=100 and depth=3 with nutch 1.3. i have a java.net.SocketTimeoutException: Read timed out error in the crawl log. which property should i set in nutch-site.xml? -- View this message in context:

error java.net.SocketException: Connection reset in crawl with nutch

2011-12-05 Thread mina

hi, i crawl 4 sites with: topN=100 depth=3 http.max.delays=1000 http.timeout=8 nutch1.3 i have java.net.SocketException: Connection reset error in crawl log. help me. -- View this message in context:

Re: how give several sites to nutch to crawl?

2011-12-04 Thread mina

i added this property in nutch-site.xml but my problem isn't resolved. which property should i use? help me. it's important for me. -- View this message in context: http://lucene.472066.n3.nabble.com/how-give-several-sites-to-nutch-to-crawl-tp3556697p3559106.html Sent from the Nutch - User mailing

Re: how give several sites to nutch to crawl?

2011-12-03 Thread mina

thanks for your answer. i use this script to crawl my sites: $NUTCH_HOME/bin/nutch inject $NUTCH_HOME/bin/crawl1/crawldb $NUTCH_HOME/bin/seedUrls for((i=0; i < $depth; i++)) do echo --- Beginning crawl at depth `expr $i + 1` of $depth --- $NUTCH_HOME/bin/nutch generate

delete url from crawldb in nutch 1.3?

2011-11-14 Thread mina

i crawl sites with nutch 1.3. now i want to delete a url from crawldb. how can i do this? how can i see the urls in crawldb? -- View this message in context: http://lucene.472066.n3.nabble.com/delete-url-from-crawldb-in-nutch-1-3-tp3506106p3506106.html Sent from the Nutch - User mailing list archive at

Re: crawl sites in nutch 1.3?

2011-11-11 Thread mina

thanks for your answer. i think topN caused this problem, because when nutch fetches a url, it will fetch any links that exist in the page. the maximum number of links that it will fetch from a page is equal to topN. i think if nutch fetches as many urls as topN it will not fetch another url from sites.txt. please give

recrawl sites with a scheduled crawling

2011-11-02 Thread mina

hi, i want to re_crawl my sites every hour. i wrote a script for this. i edited some properties in nutch-site.xml. but my re_crawler fetches urls only 3 times and after that it stops fetching. it means that my nutch doesn't update after 3 hours. these are my changes in nutch-site.xml: property

recrawl sites in nutch 1.3

2011-10-24 Thread mina

hi all. i have a script that re_crawls a site but this re_crawler fetches the URL only 3 times and doesn't get its updates. i want the re_crawler to fetch and crawl this site every day. which property should i set in nutch-site.xml? help me. -- View this message in context:

how can i crawl pdfs?

2011-09-24 Thread mina

hi all, when i crawl pdfs, nutch fetches any link in the pdfs. how can i omit this? thanks a lot. -- View this message in context: http://lucene.472066.n3.nabble.com/how-can-i-crawl-pdfs-tp3364549p3364549.html Sent from the Nutch - User mailing list archive at Nabble.com.

Re: how do recrawl sites and filesystems?

2011-09-24 Thread mina

thanks for your answer. how can i use Jira? i don't know it. please help me. -- If you reply to this email, your message will be added to the discussion below: http://lucene.472066.n3.nabble.com/how-do-recrawl-sites-and-filesystems-tp3364532p3364600.html To

Re: how do recrawl sites and filesystems?

2011-09-24 Thread mina

On Sat, Sep 24, 2011 at 9:23 AM, tahere ganjiyar tahereganji...@gmail.com wrote: thanks for your answer. how can i use Jira? i don't know it. please help me. -- If you reply to this email, your message will be added to the discussion below:

Re: how do recrawl sites and filesystems?

2011-09-24 Thread mina

how should i use this? On Sat, Sep 24, 2011 at 9:46 AM, Markus Jelsma-2 [via Lucene] ml-node+s472066n3364686...@n3.nabble.com wrote: No need to send multiple messages. Here's Nutch's Jira issue tracker: https://issues.apache.org/jira/browse/NUTCH thanks for your answer. how i can use

32 matches
