[via Lucene]
ml-node+s472066n3707012...@n3.nabble.com wrote:
Nutch cannot do this right now. However, there's a patch that does the
encoding.
https://issues.apache.org/jira/browse/NUTCH-1098
On Wednesday 01 February 2012 16:26:06 mina wrote:
How can I force Nutch to encode this URL?
I want to use https://issues.apache.org/jira/browse/NUTCH-1098
in my Nutch to encode URLs, but I don't know what I should do.
How can I use patch-with-utf8-encoding.diff in my Nutch? It is in .diff
format.
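Applying a .diff file generally comes down to the patch tool. The sketch below is a self-contained demo in a scratch directory; src/Example.txt and change.diff are made-up names for illustration, not files from NUTCH-1098.

```shell
#!/bin/sh
# Demo only: build a tiny file and a unified diff in a scratch directory,
# then apply the diff with patch. All file names here are hypothetical.
work="$(mktemp -d)"
cd "$work" || exit 1
mkdir -p src
printf 'hello\n' > src/Example.txt

cat > change.diff <<'EOF'
--- src/Example.txt
+++ src/Example.txt
@@ -1 +1 @@
-hello
+hello, patched
EOF

# Check that the diff applies cleanly without touching any file.
patch -p0 --dry-run < change.diff
# Apply it for real; -p0 keeps the paths exactly as written in the diff.
patch -p0 < change.diff
cat src/Example.txt
```

For the Nutch case the same pattern would presumably be run from the root of the Nutch source tree, with patch-with-utf8-encoding.diff in place of change.diff, followed by a rebuild with ant. Whether -p0 or -p1 is needed depends on the paths in the diff header, so the dry run is worth doing first.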
--
View this message in context:
don't send every message twice or more.
On Tuesday 31 January 2012 10:51:06 mina wrote:
I can crawl an Arabic site like: http://www.sahafa.com/
but I can't crawl another site like: http://www.aljazeera.net/Portal/
Please help me.
--
View this message in context:
http://lucene.472066.n3
%D8%B1-%D9%85%D9%86%D8%A7%D8%B7%D9%82-%D9%85%D8%AD%D8%B1%D9%88%D9%85/%D8%B3%D9%8A%D8%A7%D8%B3%D9%8A/
encoding, encoding, encoding
On Wednesday 01 February 2012 14:14:55 mina wrote:
Hi, I use this command:
bin/nutch parsechecker -dumpText
http://www.irna.ir/News/30786427/سوء-استفاده-از-نام
Hi, I use this command:
bin/nutch parsechecker -dumpText
http://www.irna.ir/News/30786427/سوء-استفاده-از-نام-كمیته-امداد-برای-جمع-آوری-رای-در-مناطق-محروم/سياسي/
and I see this in the log:
fetching:
http://www.irna.ir/News/30786427/سوء-استفاده-از-نام-كمیته-امداد-برای-جمع-آوری-رای-در-مناطق-محروم/سياسي/
?
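The percent-encoded form of such a URL can be produced by hand to test whether encoding is the issue. This is a minimal shell sketch, not what Nutch or the NUTCH-1098 patch does internally; urlencode_path is a made-up helper name, and it assumes the input path is already UTF-8.

```shell
#!/bin/sh
# urlencode_path (hypothetical helper): percent-encode a URL path byte by byte.
# Letters, digits, '-', '.', '_', '~' and '/' are kept; every other byte
# becomes %XX. A '%' in the input is encoded too, so do not pass a path
# that is already percent-encoded.
urlencode_path() {
  # od prints each input byte as two hex digits; tr upper-cases them.
  for hex in $(printf '%s' "$1" | od -An -v -tx1 | tr 'a-f' 'A-F'); do
    case "$hex" in
      2D|2E|2F|5F|7E|3[0-9]|4[1-9A-F]|5[0-9A]|6[1-9A-F]|7[0-9A])
        # Safe ASCII byte: convert hex to octal and print the raw character.
        printf "\\$(printf '%03o' "0x$hex")" ;;
      *)
        # Anything else (including each byte of a non-ASCII character).
        printf '%%%s' "$hex" ;;
    esac
  done
  printf '\n'
}

urlencode_path '/News/30786427/سياسي/'
```

Only the path should be passed in; the scheme and host stay as they are. The encoded URL can then be given to bin/nutch parsechecker to compare against the unencoded one.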
Remi
On Tuesday, January 31, 2012, mina tahereganji...@gmail.com wrote:
I crawled a site with Nutch 1.4 and found that Nutch doesn't crawl all
links on this site. I have no filters and no limit rules for crawling. For
example, Nutch never crawls this link:
http://www.irna.ir/News/30786427
I can crawl an Arabic site like: http://www.sahafa.com/
but I can't crawl another site like: http://www.aljazeera.net/portal/
Please help me.
--
View this message in context:
http://lucene.472066.n3.nabble.com/why-nutch-dosen-t-crawl-all-links-tp3702031p3702593.html
Sent from the Nutch - User
I can crawl an Arabic site like: http://www.sahafa.com/
but I can't crawl another site like: http://www.aljazeera.net/Portal/
Please help me.
--
View this message in context:
http://lucene.472066.n3.nabble.com/why-nutch-dosen-t-crawl-Arabic-sites-well-tp3702769p3702769.html
Sent from the Nutch
outlinks
will be processed for a page; otherwise, all outlinks will be processed.
</description>
</property>
Julien
On 31 January 2012 02:56, mina tahereganji...@gmail.com wrote:
I crawl a site with Nutch 1.4, but Nutch doesn't crawl all links on this
site. The language of this site
test with
parsechecker and indexchecker tools.
On Tuesday 31 January 2012 09:29:39 mina wrote:
I can crawl an Arabic site like: http://www.sahafa.com/
but I can't crawl another site like: http://www.aljazeera.net/portal/
Please help me.
--
View this message in context:
http://lucene
I crawl a site with Nutch 1.4, but Nutch doesn't crawl all links on this
site. The language of this site is not English. For example, Nutch doesn't
crawl this link:
http://www.irna.ir/News/30786427/سوء-استفاده-از-نام-كمیته-امداد-برای-جمع-آوری-رای-در-مناطق-محروم/سياسي/
How can I solve this
Hi Markus. I have a problem with Nutch. I want to use stopwords in Nutch.
When I crawl sites and use Solr to index them, any word in
stopwords.txt can still be searched. Help me.
--
View this message in context:
http://lucene.472066.n3.nabble.com/use-stop-words-in-schema-in-nutch-tp3641820p3641820.html
I want to crawl .js files because I add some links to sites in .js files. How
can I configure Nutch to crawl .js files?
I use Nutch 1.4.
--
View this message in context:
http://lucene.472066.n3.nabble.com/how-can-crawl-js-files-with-nutch-tp3642613p3642613.html
Sent from the Nutch - User mailing
I use Nutch 1.4 and I want to parse .js files because some links are added to
sites through .js files. Help me. How can I configure Nutch?
--
View this message in context:
http://lucene.472066.n3.nabble.com/how-can-parse-js-files-in-nutch-tp3642673p3642673.html
Sent from the Nutch - User mailing list
of space.
On Sunday 01 January 2012 10:40:49 mina wrote:
Hi, I set up Nutch 1.3 without Hadoop. When I crawl 4 sites with depth 10
and topN 1, my /tmp fills up to 100% and my crawl fails. How
can I tell Nutch not to do that and to use another directory? Or how can I empty
my /tmp? Help
Hi, I set up Nutch 1.3 without Hadoop. When I crawl 4 sites with depth 10 and
topN 1, my /tmp fills up to 100% and my crawl fails. How can I
tell Nutch not to do that and to use another directory? Or how can I empty my
/tmp? Help me.
--
View this message in context:
I solved this problem. I read the Nutch doc for solrindex at:
http://wiki.apache.org/nutch/bin/nutch%20solrindex
It isn't correct. The correct command for solrindex is:
sh nutch solrindex crawldb -linkdb linkdb segments/*
Thanks for your answer, Markus.
On Mon, Dec 26, 2011 at 7:23 AM, Markus
I crawl my sites with Nutch 1.4, and when I try to index them with Solr 3.3 I
get some errors. I use this command:
sh nutch solrindex crawldb linkdb segments/*
My errors:
Input path does not exist: file:/linkdb/crawl_parse
Input path does not exist: file:/linkdb/parse_data
Input path does not
Hi, I crawl one site that has 100 links at depth 1 and 100 links at depth
2, but Nutch only crawls 23 links from depth 1 and 30 from depth 2. How can I
force Nutch to crawl all links at depths 1 and 2? I use Nutch 1.3.
topN=1
depth =2
and in my nutch-site.xml:
<property>
I crawl sites with Nutch 1.3. I see this exception in my log when Nutch crawls
my sites:
Malformed URL: '', skipping (java.net.MalformedURLException: no
protocol:
at java.net.URL.<init>(URL.java:567)
at java.net.URL.<init>(URL.java:464)
at java.net.URL.<init>(URL.java:413)
Hi, I crawl 4 sites with topN=100 and depth=3 with Nutch 1.3. I get a
java.net.SocketTimeoutException: Read timed out error in the crawl log. What
property should I set in nutch-site.xml?
--
View this message in context:
Hi, I crawl 4 sites with:
topN=100
depth=3
http.max.delays=1000
http.timeout=8
Nutch 1.3
I get a java.net.SocketException: Connection reset error in the crawl log. Help
me.
--
View this message in context:
I added this property to nutch-site.xml but my problem isn't resolved. Which
property should I use? Help me. It's important to me.
--
View this message in context:
http://lucene.472066.n3.nabble.com/how-give-several-sites-to-nutch-to-crawl-tp3556697p3559106.html
Sent from the Nutch - User mailing
Thanks for your answer. I use this script to crawl my sites:
$NUTCH_HOME/bin/nutch inject $NUTCH_HOME/bin/crawl1/crawldb $NUTCH_HOME/bin/seedUrls
for((i=0; i < $depth; i++))
do
echo "--- Beginning crawl at depth `expr $i + 1` of $depth ---"
$NUTCH_HOME/bin/nutch generate
I crawl sites with Nutch 1.3. Now I want to delete a URL from the crawldb. How can I
do this? How can I see the URLs in the crawldb?
--
View this message in context:
http://lucene.472066.n3.nabble.com/delete-url-from-crawldb-in-nutch-1-3-tp3506106p3506106.html
Sent from the Nutch - User mailing list archive at
Thanks for your answer. I think topN caused this problem, because when Nutch
fetches a URL, it will fetch any links that exist in the page. The maximum number
of links that will be fetched from a page is equal to topN. I think that if Nutch
fetches a number of URLs equal to topN, it will not fetch another URL from sites.txt. Please give
Hi, I want to re-crawl my sites every hour. I wrote a script for this and edited
some properties in nutch-site.xml, but my re-crawler fetches URLs only 3
times and after that it stops fetching. That means my Nutch doesn't update
after 3 hours. These are my changes in nutch-site.xml:
<property>
Hi all. I have a script that re-crawls a site, but this re-crawler fetches URLs
only 3 times and doesn't get updates after that. I want the re-crawler to fetch and
crawl this site every day. What property should I set in nutch-site.xml?
Help me.
--
View this message in context:
Hi all,
when I crawl PDFs, Nutch fetches every link in the PDFs.
How can I prevent this?
Thanks a lot.
--
View this message in context:
http://lucene.472066.n3.nabble.com/how-can-i-crawl-pdfs-tp3364549p3364549.html
Sent from the Nutch - User mailing list archive at Nabble.com.
Thanks for your answer.
How can I use Jira?
I don't know it.
Please help me.
--
If you reply to this email, your message will be added to the discussion
below:
http://lucene.472066.n3.nabble.com/how-do-recrawl-sites-and-filesystems-tp3364532p3364600.html
To
On Sat, Sep 24, 2011 at 9:23 AM, tahere ganjiyar
tahereganji...@gmail.com wrote:
Thanks for your answer.
How can I use Jira?
I don't know it.
Please help me.
--
If you reply to this email, your message will be added to the discussion
below:
How should I use this?
On Sat, Sep 24, 2011 at 9:46 AM, Markus Jelsma-2 [via Lucene]
ml-node+s472066n3364686...@n3.nabble.com wrote:
No need to send multiple messages. Here's Nutch's Jira issue tracker:
https://issues.apache.org/jira/browse/NUTCH
Thanks for your answer.
How can I use
32 matches