Re: Nutch crawling other countries domain despite db.ignore.external.links

2016-05-31 Thread Sebastian Nagel
Hi Jean, db.ignore.external.links=true should work. Which version of Nutch are you using? How is the property set? Does your seed list only contain URLs from mysite.com, and none from mysite.es? Regards, Sebastian On 05/25/2016 11:44 AM, Jean Vence wrote: > I am trying to crawl a single site

Nutch crawling other countries domain despite db.ignore.external.links

2016-05-25 Thread Jean Vence
I am trying to crawl a single site and have set the db.ignore.external.links=true flag. But it seems to fail, because it also crawls sites with a different country extension. For example, if the seed is mysite.com, it will crawl mysite.com, mysite.es & mysite.it - I don't want to use a regex to
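
For readers following this thread, here is a minimal sketch (in Python, not Nutch's own Java code) of the kind of host comparison that an "ignore external links" setting implies. The helper name is_external is hypothetical and only for illustration; it is not part of Nutch. Under a host-by-host comparison like this, mysite.es and mysite.it count as external to mysite.com, which is why the reply above asks whether the seed list itself contains URLs from those hosts.

```python
from urllib.parse import urlparse


def is_external(from_url: str, to_url: str) -> bool:
    """Return True when the two URLs point at different host names.

    Illustrative assumption: hosts are compared as whole, lower-cased strings.
    """
    from_host = (urlparse(from_url).hostname or "").lower()
    to_host = (urlparse(to_url).hostname or "").lower()
    return from_host != to_host


seed = "http://mysite.com/"
for link in ("http://mysite.com/about", "http://mysite.es/", "http://mysite.it/"):
    # A link on another country domain has a different host than the seed.
    print(link, "external" if is_external(seed, link) else "internal")
```

If links to the other country domains are still being fetched, the sketch suggests checking where they enter the crawl (for example the seed list) before assuming the setting itself is broken.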

Re: Can Nutch crawling shortened url?

2015-06-13 Thread Lewis John Mcgibbney
Hi Ankit, On Mon, Jun 8, 2015 at 2:13 AM, user-digest-h...@nutch.apache.org wrote: I tried it with 1.10, but the shortened URLs still don't get followed through. Have you tried changing the logging level to TRACE within conf/log4j.properties? This may provide more detail for you. I think

Re: [MASSMAIL]Can Nutch crawling shortened url?

2015-06-04 Thread Ankit Goel
- Original Message - From: Ankit Goel ankitgoel2...@gmail.com To: user@nutch.apache.org Sent: Tuesday, June 2, 2015 9:59:40 PM Subject: [MASSMAIL]Can Nutch crawling shortened url? Hi, I was playing around with nutch 1.9 when I came across some twitter t.co links. When I ran it through

Can Nutch crawling shortened url?

2015-06-02 Thread Ankit Goel
Hi, I was playing around with Nutch 1.9 when I came across some Twitter t.co links. When I ran one through parsechecker, I got a failed fetch with protocol status: moved(12). I have set my http.redirect.max count to 5 (experimented with 10), which works for other links, but it didn't seem to redirect me. I
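
To make the idea of a redirect limit concrete, here is a small sketch in Python. It does not touch the network and is not Nutch code: the redirects table, the resolve helper, and the example URLs are all hypothetical. It only shows how a hop limit such as the one described above decides whether a chain of redirects is followed to the end or abandoned.

```python
def resolve(url, redirects, max_redirects=5):
    """Follow entries in `redirects` until a final URL is reached.

    Returns the final URL, or None when the chain needs more than
    `max_redirects` hops (an assumed give-up behaviour for this sketch).
    """
    hops = 0
    while url in redirects:
        if hops >= max_redirects:
            return None  # limit reached before the chain ended
        url = redirects[url]
        hops += 1
    return url


# Hypothetical two-hop chain, standing in for a shortened link.
table = {
    "https://t.co/abc": "http://example.com/a",
    "http://example.com/a": "http://example.com/final",
}

print(resolve("https://t.co/abc", table, max_redirects=5))  # chain fits within the limit
print(resolve("https://t.co/abc", table, max_redirects=1))  # chain is longer than the limit
```

When a limit that works for other links still fails for a shortener, the chain length is usually not the cause, so the TRACE-level logging suggested elsewhere in this thread is a reasonable next step.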

Re: [MASSMAIL]Can Nutch crawling shortened url?

2015-06-02 Thread Jorge Luis Betancourt González
URLs, but it should work as a normal redirect. Regards, [1] https://issues.apache.org/jira/browse/NUTCH-1939 - Original Message - From: Ankit Goel ankitgoel2...@gmail.com To: user@nutch.apache.org Sent: Tuesday, June 2, 2015 9:59:40 PM Subject: [MASSMAIL]Can Nutch crawling shortened url

Nutch crawling advice

2015-01-02 Thread Tigran Tsaturyan
Hello, I am a new user of Nutch and, though I have looked through several manuals on the web, I still have questions. I hope you will be able to give answers or point me to a manual. My questions: . I intend to use Nutch to crawl several particular sites and (as I know data structure inside

nutch crawling issues

2013-07-10 Thread devang pandey
I have a website, e.g. www.example.com. When I crawl it using Nutch 1.4, the problem is duplicated crawling. There are a number of pages like www.example.com/s38r84rejkfndn/xyz.aspx. The token s38r84rejkfndn keeps changing every time you visit the page, and hence the crawler

Re: nutch crawling issues

2013-07-10 Thread devang pandey
-Original message- From:devang pandey devangpande...@gmail.com Sent: Wednesday 10th July 2013 10:29 To: user@nutch.apache.org Subject: nutch crawling issues I have a website eg . www.example.com. Now when I am crawling this using nutch 1.4 problem is that of duplicated crawling

RE: nutch crawling issues

2013-07-10 Thread Markus Jelsma
Hi - conf/regex-urlfilter.txt, and make sure the urlfilter-regex plugin is enabled in your nutch-site plugin.includes config. -Original message- From:devang pandey devangpande...@gmail.com Sent: Wednesday 10th July 2013 11:51 To: user@nutch.apache.org Subject: Re: nutch crawling issues

Re: nutch crawling issues

2013-07-10 Thread devang pandey
in your nutch-site plugin.includes config. -Original message- From:devang pandey devangpande...@gmail.com Sent: Wednesday 10th July 2013 11:51 To: user@nutch.apache.org Subject: Re: nutch crawling issues hello markus I have one confusion should i implement changes in crawl

nutch crawling same page manytimes

2013-07-09 Thread devang pandey
I am crawling a website with Nutch 1.2. The problem I am facing is that the website generates a different URL for the same page every time you open it. Because of this, Nutch crawls the same page again and again. Please suggest how to resolve this issue.
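
One common way to approach this kind of problem is to normalize URLs before they are deduplicated. The sketch below is in Python and is not Nutch configuration; the normalize helper is hypothetical, and the URL shape (a changing token as the first path segment, as in www.example.com/s38r84rejkfndn/xyz.aspx) is an assumption borrowed from the related "nutch crawling issues" thread. In Nutch, a rewrite like this would typically be expressed as a URL normalizer rule rather than as code.

```python
import re

# Assumption: the changing token is the first path segment and consists of
# 10 or more lower-case letters and digits. A real site may need a stricter
# pattern, since a long ordinary directory name would also match.
TOKEN_SEGMENT = re.compile(r"^(https?://[^/]+)/[a-z0-9]{10,}/")


def normalize(url: str) -> str:
    """Drop the changing token segment so that variants collapse to one URL."""
    return TOKEN_SEGMENT.sub(r"\1/", url)


seen = set()
for url in (
    "http://www.example.com/s38r84rejkfndn/xyz.aspx",
    "http://www.example.com/a1b2c3d4e5f6g7/xyz.aspx",
):
    seen.add(normalize(url))

# Both variants map to the same normalized URL, so only one entry remains.
print(seen)
```

This only helps if the site still serves the page at the normalized URL; if the server requires the token, the duplicates have to be handled after fetching instead.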

Continue Nutch Crawling After Exception

2013-03-05 Thread raviksingh
checked org.apache.nutch.net.URLFilter. I was unable to make it work. Please ask any details if required. -- View this message in context: http://lucene.472066.n3.nabble.com/Continue-Nutch-Crawling-After-Exception-tp4044888.html Sent from the Nutch - User mailing list archive at Nabble.com.

Re: Continue Nutch Crawling After Exception

2013-03-05 Thread Lewis John Mcgibbney
Hi, On Tue, Mar 5, 2013 at 7:22 AM, raviksingh ravisingh.air...@gmail.com wrote: I am new to Nutch. I have already configured Nutch with MySQL. I have a few questions: I would like to start by saying that this is not a great idea. If you read this list you will see why. 1. Currently I am

Re: Why is my Nutch-crawling so slow?

2013-02-19 Thread Tejas Patil
in context: http://lucene.472066.n3.nabble.com/Why-is-my-Nutch-crawling-so-slow-tp4037964p4041290.html Sent from the Nutch - User mailing list archive at Nabble.com.

Why is my Nutch-crawling so slow?

2013-02-02 Thread imehesz
://([a-z0-9]*\.)*MYDOMAIN.COM/ Is there a way I can speed this up? thanks, --i -- View this message in context: http://lucene.472066.n3.nabble.com/Why-is-my-Nutch-crawling-so-slow-tp4037964.html Sent from the Nutch - User mailing list archive at Nabble.com.

Re: Why is my Nutch-crawling so slow?

2013-02-02 Thread Tejas Patil
this message in context: http://lucene.472066.n3.nabble.com/Why-is-my-Nutch-crawling-so-slow-tp4037964.html Sent from the Nutch - User mailing list archive at Nabble.com.

Re: nutch crawling file system SOLVED

2012-09-11 Thread Lewis John Mcgibbney
. Getting the index with pdf's file name but not the content in those -- View this message in context: http://lucene.472066.n3.nabble.com/Re-nutch-crawling-file-system-SOLVED-tp3815336p4006754.html Sent from the Nutch - User mailing list archive at Nabble.com. -- Lewis

Re: nutch crawling file system SOLVED

2012-09-11 Thread dpverma
. is that the reason no text is getting extracted from the pdf? If rebuilding nutch is crucial step...can you pls guide me as to how to do it. Thanks -- View this message in context: http://lucene.472066.n3.nabble.com/Re-nutch-crawling-file-system-SOLVED-tp3815336p4007024.html Sent from the Nutch - User

Re: nutch crawling file system SOLVED

2012-09-10 Thread dpverma
Can you pls let me know how you solved your problem? I am also getting the same error which you had. Getting the index with pdf's file name but not the content in those -- View this message in context: http://lucene.472066.n3.nabble.com/Re-nutch-crawling-file-system-SOLVED-tp3815336p4006754

Re: Nutch Crawling for Videos

2012-08-22 Thread Lewis John Mcgibbney
Hi Robert, There is a parse-swf plugin for Nutch which uses the JavaSWF library [0] to parse such files (of what version I am not currently aware) and I can confirm that it does work e.g. when used from command line I can obtain parse data from within a local swf file. I am not sure if this

Nutch Crawling for Videos

2012-08-20 Thread Robert Irribarren
I am wondering how I would make Nutch crawl for videos on, say, YouTube. I also want this to be a separate section in Solr, so that when I search I can filter the results into videos and plain websites.

Re: nutch crawling file system SOLVED

2012-03-17 Thread alessio crisantemi
this is the return after crawling with nutch and indexing on solr: doc float name=boost0.298293/float - str name=content Index of C:\Documents and Settings\Alessio\Documenti Index of C:\Documents and Settings\Alessio\Documenti ../ - - - 003_C_001_Alessio_2004_08_13.dvf Tue, 17 Aug 2004 20:09:52

Re: nutch crawling file system SOLVED

2012-03-17 Thread Lewis John Mcgibbney
Hi Alessio, On Sat, Mar 17, 2012 at 5:31 PM, alessio crisantemi alessio.crisant...@gmail.com wrote: suggestions? For what?

Re: nutch crawling file system SOLVED

2012-03-17 Thread alessio crisantemi
I would like the result of my search to be the text of my PDF file, not the list of documents in the directory and the path address. On 17 March 2012 21:11, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi Alessio, On Sat, Mar 17, 2012 at 5:31 PM, alessio crisantemi

Re: nutch crawling file system SOLVED

2012-03-12 Thread alessio crisantemi
I added the path of my directory to regex-urlfilter, but Nutch also crawls other directories... And more: I followed your suggestions and indexed my root again, but I still have an index with the names of my PDF files and not their content. I don't understand. alessio On 12 March 2012

Re: nutch crawling file system SOLVED

2012-03-11 Thread Lewis John Mcgibbney
Hi Alessio, If you check out our official tutorial you will see no mention of crawl-urlfilter, this was deprecated after Nutch 1.2 IIRC. I can only suggest that any other tutorial you are using is in need of an update. http://wiki.apache.org/nutch/NutchTutorial On Sat, Mar 10, 2012 at 4:42 PM,

Re: nutch crawling file system SOLVED

2012-03-11 Thread Lewis John Mcgibbney
Please see below On Sun, Mar 11, 2012 at 5:10 PM, alessio crisantemi alessio.crisant...@gmail.com wrote: [1]http://wiki.apache.org/nutch/FAQ#How_do_I_index_my_local_file_system.3F I've now updated this link, thanks for pointing this out. And Now, I have another problem: I crawled my

Re: nutch crawling file system SOLVED

2012-03-11 Thread remi tassing
You're probably looking for the Highlighting feature http://wiki.apache.org/solr/HighlightingParameters Remi On Sun, Mar 11, 2012 at 6:10 PM, alessio crisantemi alessio.crisant...@gmail.com wrote: Thank you Lewis for your explanation: I supposed this fact and I post on mailing list my

Re: nutch crawling file system SOLVED

2012-03-11 Thread alessio crisantemi
Thank you, Remi, for your precious help. I will try again and write you the results. But I have another small question: how can I limit the crawl to my selected root only? Every time, Nutch also crawls the parent directories. I read that the code that is responsible for this is in

Re: nutch crawling file system SOLVED

2012-03-11 Thread remi tassing
Using crawl-urlfilter (or regex-urlfilter depending on which one you're using), you should be able to solve this. Unless you're not clear on what folders to exclude...? On Sunday, March 11, 2012, alessio crisantemi alessio.crisant...@gmail.com wrote: thank you Remi for your precious help. I try

Re: nutch crawling file system SOLVED

2012-03-10 Thread alessio crisantemi
It's partially solved. Following the tutorial, I configured Nutch to crawl a local file system, thank you. But I have a doubt: why do all the tutorials and guides about Nutch talk about the crawl-urlfilter.txt file, when the default config of Nutch doesn't have this file? But if I insert the code that the

Re: nutch crawling

2012-03-01 Thread Elisabeth Adler
this message in context: http://lucene.472066.n3.nabble.com/nutch-crawling-tp3786913p3786913.html Sent from the Nutch - User mailing list archive at Nabble.com.

nutch crawling

2012-02-29 Thread sanjay87
a solution, please provide your valuable inputs. Thanks -- View this message in context: http://lucene.472066.n3.nabble.com/nutch-crawling-tp3786913p3786913.html Sent from the Nutch - User mailing list archive at Nabble.com.

Specialized Nutch Crawling

2012-01-04 Thread niviksha
infrastructure to do so? Any gotchas? Thanks! Venkat -- View this message in context: http://lucene.472066.n3.nabble.com/Specialized-Nutch-Crawling-tp3633342p3633342.html Sent from the Nutch - User mailing list archive at Nabble.com.

Re: Specialized Nutch Crawling

2012-01-04 Thread Gora Mohanty
On Thu, Jan 5, 2012 at 4:42 AM, niviksha nivik...@gmail.com wrote: Hi all, this is my first post. I've used lucene extensively in the past, but am just getting my feet wet with Nutch. The problem I have is to use Nutch to crawl relational (sql) databases. Is this possible via the current plug

Re: nutch crawling arabic pdf site

2011-02-20 Thread hala
Thanks a lot for your help, you have wide experience, but the problem still exists and I don't know what I can do. -- View this message in context: http://lucene.472066.n3.nabble.com/nutch-crawling-arabic-pdf-site-tp2485360p2538118.html Sent from the Nutch - User mailing list archive

Re: nutch crawling arabic pdf site

2011-02-20 Thread Markus Jelsma
The problem isn't fixed in the 0.9 release of Tika, so you're still stuck here, and there is no other parse-pdf plugin which you can use. There is, however, the parse-ext plugin [1] which you perhaps could use to execute pdf2text and return the parsed content. I haven't used this plugin and I

Re: nutch crawling arabic pdf site

2011-02-16 Thread hala
words in the PDF, Nutch doesn't return them to me. Please give me any help. -- View this message in context: http://lucene.472066.n3.nabble.com/nutch-crawling-arabic-pdf-site-tp2485360p2507554.html Sent from the Nutch - User mailing list archive at Nabble.com.

Re: nutch crawling arabic pdf site

2011-02-16 Thread Markus Jelsma
Hi, 1. Is the PDF actually fetched, parsed and indexed? Doesn't your regex-urlfilter skip PDF? 2. Is the PDF too large, is it being truncated by Nutch? 3. Does Tika actually parse the PDF as you expect? There may be issues at separate locations. You can use the parser checker to confirm Tika's

Re: nutch crawling arabic pdf site

2011-02-16 Thread Markus Jelsma
I dug a bit deeper and i now believe you're the victim of TIKA-469 https://issues.apache.org/jira/browse/TIKA-469 On Sunday 13 February 2011 13:47:11 hala wrote: when i crawl a site with pdf link contain arabic words it dont return me the arabic word in pdf when i search with nutch on it

Re: nutch crawling arabic pdf site

2011-02-14 Thread Markus Jelsma
Hi, What configuration are you using? Did you actually complete a full crawl (generate, fetch, update, index) cycle? Are you using Nutch's internal search or are you using Solr as the search backend? Can your servlet container handle non-Latin input for GET requests? Using Nutch and Solr I can

nutch crawling arabic pdf site

2011-02-13 Thread hala
When I crawl a site with PDF links containing Arabic words, searching with Nutch does not return the Arabic words from the PDF. What can I do? Please help me. -- View this message in context: http://lucene.472066.n3.nabble.com/nutch-crawling-arabic-pdf-site-tp2485360p2485360.html Sent from

RE: How to speed up nutch crawling!

2011-02-02 Thread McGibbney, Lewis John
@nutch.apache.org Subject: How to speed up nutch crawling! Hi list, I am Arjun. I am trying to develop an application in which I'll give a constrained set of urls to the urls file in Nutch. I am able to crawl these urls and get the contents of them by reading the data from the segments. I have crawled

Re: How to speed up nutch crawling!

2011-02-02 Thread Adam Estrada
Kumar Reddy [mailto:charjunkumar.re...@iiitb.net] Sent: 02 February 2011 07:52 To: user@nutch.apache.org Subject: How to speed up nutch crawling! Hi list, I am Arjun. I am trying to develop an application in which I'll give a constrained set of urls to the urls file in Nutch. I am able

How to speed up nutch crawling!

2011-02-01 Thread Arjun Kumar Reddy
Hi list, I am Arjun. I am trying to develop an application in which I'll give a constrained set of urls to the urls file in Nutch. I am able to crawl these urls and get the contents of them by reading the data from the segments. I have crawled by giving the depth 1 as I am no way concerned

Re: nutch crawling with java (not shellscript)

2010-11-24 Thread Daniel Martin
/org/apache/nutch/fetcher/TestFetcher.java?spec=svn77r=77 http://code.google.com/p/daicaheb/source/browse/trunk/nutch-trunk-bb/src/test/org/apache/nutch/fetcher/TestFetcher.java?spec=svn77r=77 -- View this message in context: http://lucene.472066.n3.nabble.com/nutch-crawling-with-java

Re: nutch crawling with java (not shellscript)

2010-11-23 Thread Daniel Martin
Hiya Matthias, did you find a tutorial about running the nutch crawler via java? I am very interested because I am working in the university and nobody in the department knows about that Thanks -- View this message in context: http://lucene.472066.n3.nabble.com/nutch-crawling-with-java

Re: nutch crawling with java (not shellscript)

2010-11-23 Thread Eddie Drapkin
On 11/23/2010 12:55 PM, Daniel Martin wrote: Hiya Matthias, did you find a tutorial about running the nutch crawler via java? I am very interested because I am working in the university and nobody in the department knows about that Thanks Why not just read the shell script and decipher what

Re: nutch crawling with java (not shellscript)

2010-11-23 Thread Scott Gonyea
in the university and nobody in the department knows about that Thanks -- View this message in context: http://lucene.472066.n3.nabble.com/nutch-crawling-with-java-not-shellscript-tp617212p1955611.html Sent from the Nutch - User mailing list archive at Nabble.com.