Re: Ignore navigation during index

2015-03-26 Thread remi tassing
You will probably need to customize the parse-html plugin for your purpose On Mar 26, 2015 4:20 PM, Richardson, Jacquelyn F. fluke...@ornl.gov wrote: Hi, Is there a way to tell nutch to ignore the navigation or footer parts of an html page during the crawl process? Specifically I do not want

Re: Scheduling multiple possibly parallel nutch crawls based on different configurations?

2015-03-15 Thread remi tassing
I have a similar need with an additional requirement whereby the crawlDB should be merged at the end. The best solution I could think of,so far, is having independent instances of nutch. Remi On Mar 14, 2015 9:08 PM, steve labar steve.labarbera@gmail.com wrote: Hi, I have a use case where

Re: How to verify URLFilterChecker

2015-02-09 Thread remi tassing
Search this mailing list archive for 'URLFilterChecker documentation' and you'll find the following: From: Markus Jelsma markus.jel...@openindex.io Date: Dec 9, 2011 2:02 PM Subject: Re: URLFilterChecker documentation To: remi tassing tassingr...@gmail.com Cc: That's not stdin is it? echo http

Re: HttpPostAuthentication

2014-12-16 Thread remi tassing
I have been doing a lot of POST authentication while crawling corporate stuff. Since POST methods may vary drastically between sites (e.g. typical JIRA to POST+JS redirection, NTLMv2...) it's hard not to extend the crawler with some additional Java. So what I've ended up doing is to build a

Re: Unable to crawl a URL unless session cookies are set

2014-12-04 Thread remi tassing
do you force the crawler to crawl the same URL? If I were to check for certain cookie values, and they match, I would like to be able to crawl the same URL again. Kartik -Original Message- From: remi tassing [mailto:tassingr...@gmail.com] Sent: Tuesday, December 02, 2014 5:24 PM

Re: Unable to crawl a URL unless session cookies are set

2014-12-02 Thread remi tassing
Hi Kartik, I had a similar enquiry a long time ago and from what I remember, Nutch will save the new URL and crawl it in the future...which is not the needed behavior here. To solve this problem, I've customized my protocol-httpclient (HttpResponse class) to just open the 2nd URL right after the

Re: When to delete the segments?

2014-11-02 Thread remi tassing
The next fetching time is computed after updatedb is issued with that segment. So as long as you don't need the parsed data anymore, you can delete the segment (e.g. after indexing through Solr...). On Mon, Nov 3, 2014 at 8:41 AM, Meraj A. Khan mera...@gmail.com wrote: Hi All, I am

Re: When to delete the segments?

2014-11-02 Thread remi tassing
for in my script ? On Sun, Nov 2, 2014 at 7:58 PM, remi tassing tassingr...@gmail.com wrote: The next fetching time is computed after updatedb is issued with that segment. So as long as you don't need the parsed data anymore, you can delete the segment (e.g. after indexing through Solr

Re: Ignoring parts of a URL like certain query parameters

2014-11-01 Thread remi tassing
Hi John, Have a look at some regex tutorials. What you are asking for is absolutely doable. E.g.:

<regex>
  <pattern>^(http://www.test.com?.*)query2=.*(.*)</pattern>
  <substitution>$1$2</substitution>
</regex>

Plz double check if the ampersand should be escaped or not. I'm
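The substitution being suggested can be sketched outside Nutch. Below is a minimal Python sketch of the same idea (the pattern, parameter name `query2`, and URL are hypothetical; in Nutch the rule would live in regex-normalize.xml and use Java's `$1$2` replacement syntax instead of `\1\2`):

```python
import re

# Hypothetical rule in the spirit of the regex-normalize.xml snippet above:
# group 1 keeps everything before the unwanted parameter, group 2 everything after it.
rule = re.compile(r"^(http://www\.test\.com/\?.*?)&?query2=[^&]*(.*)$")

url = "http://www.test.com/?query1=a&query2=b&query3=c"
print(rule.sub(r"\1\2", url))  # http://www.test.com/?query1=a&query3=c
```

The lazy `.*?` in group 1 keeps the match from swallowing the `query2=` part it is supposed to isolate.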

Re: Nutch returns empty result set for some websites

2014-07-18 Thread remi tassing
Can you check the log file for more info? default location: $NUTCH_HOME/logs/hadoop.log Ref: http://www.opensourceconnections.com/blog/2014/05/24/crawling-with-nutch/ On Fri, Jul 18, 2014 at 8:52 PM, Ankur Dulwani dulwani_anku...@yahoo.co.in wrote: Hi, I am using Nutch to crawl data from

Re: Nutch use a Browser or phantomjs as fetcher

2014-06-21 Thread remi tassing
/Fetcher.java as hook, if it contains html and head in the first 500 characters. Regards, Patrick HTH Julien On 7 June 2014 11:35, remi tassing tassingr...@gmail.com wrote: I'm currently looking at those separately but an integrated option would be more efficient

Re: Nutch use a Browser or phantomjs as fetcher

2014-06-07 Thread remi tassing
I'm currently looking at those separately but an integrated option would be more efficient. Looking forward for any experience sharing On Sat, Jun 7, 2014 at 6:25 PM, Patrick Kirsch pkir...@zscho.de wrote: Hey list, I'm sure this issue was asked several times, but a quick look in the nutch

Re: Nutch 1.7 - deleting segments

2014-05-03 Thread remi tassing
you are correct On Fri, May 2, 2014 at 7:46 PM, chethan chethan.p...@gmail.com wrote: Hi, I have a Nutch crawl with 4 segments which are fully indexed using the bin/nutch solrindexcommand. Now I'm all out of storage on the box, so can I delete the 4 segments and retain only the crawldb

Re: Nutch 1.8 Solrindexer failing

2014-05-03 Thread remi tassing
Could you provide the complete stack trace? Probably add more debug info in. This could be due to some disk size issue... On Sat, May 3, 2014 at 8:51 PM, BlackIce blackice...@gmail.com wrote: HI, playing around with Nutch 1.8 in localmode on Solr 4.7.. When indexing larger crawls 10k and up

Re: Nutch 2.2.1: PDF issue

2014-04-13 Thread remi tassing
Hi Laxmi, Could you provide some examples? On Mon, Apr 14, 2014 at 2:31 AM, A Laxmi a.lakshmi...@gmail.com wrote: Hi Sebastian, Yes, you are right, there is *no *title defined in the PDF's info container and that is when Nutch is returning empty titles where as Google somehow returns the

Re: Nutch 2.2.1: PDF issue

2014-04-13 Thread remi tassing
the title - https://www.google.com/#q=http:%2F%2Fwww.srs.fs.usda.gov%2Fecon%2Fdata%2Fforestincentives%2Fgreene-etal-sofew2006proc.pdf Thanks.. On Sun, Apr 13, 2014 at 8:08 PM, remi tassing tassingr...@gmail.com wrote: Hi Laxmi, Could you provide some examples

Re: One site only index.

2014-04-02 Thread remi tassing
Hi Shane, You could use the same scripts as before but just modify the regex-urlfilter.txt to restrict the crawling scope. BR, Remi On Thu, Apr 3, 2014 at 10:52 AM, Shane Wood sh...@cbm8bit.com wrote: I have indexed several site successfully. Now i wish too index a new site and not update

Re: Crawling an authenticated site

2014-03-22 Thread remi tassing
Hi, If it's a form-based authentication where you need to send Http POST requests, then I would suggest you modify HttpResponse.java for the purpose Remi On Sat, Mar 22, 2014 at 2:31 AM, John Lafitte jlafi...@brandextract.comwrote: I haven't done it myself but it's documented here:

Re: Unable to crawl and index pdf metadata into Solr from Nutch

2014-03-21 Thread remi tassing
Hi, modify the default value of http.content.limit and/or ftp.content.limit value accordingly. This problem has nothing to do with the format but the content size Remi On Fri, Mar 21, 2014 at 4:52 PM, reddibabu reddybabu...@gmail.com wrote: Hi, I am using Nutch 1.7 and Solr 4.5 I can
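For reference, raising the cap is a nutch-site.xml override. A sketch, assuming you want to remove the limit entirely (pick a concrete byte count instead of -1 if you still want a cap):

```xml
<!-- nutch-site.xml: http.content.limit is the maximum number of bytes
     downloaded per document; content beyond it is truncated.
     -1 removes the limit entirely. -->
<property>
  <name>http.content.limit</name>
  <value>-1</value>
</property>
```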

Re: Java Heap Space error

2014-03-21 Thread remi tassing
Hi, JAVA_HEAP_MAX value can be modified in the bin/nutch script Remi On Thu, Mar 20, 2014 at 11:11 PM, Vangelis karv karvouni...@hotmail.comwrote: I managed to crawl again but I have something else now: https://www.dropbox.com/s/853xf1evi8sb51v/error . Also, I found this : 2014-03-20
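Concretely, the change is a one-line edit in the bin/nutch script (the 4000m figure below is an example value, not a recommendation):

```shell
# bin/nutch reads JAVA_HEAP_MAX to size the JVM heap for every Nutch job;
# raise -Xmx if crawls die with Java heap space errors.
JAVA_HEAP_MAX=-Xmx4000m
```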

Re: Escaping URL during redirection

2012-09-10 Thread remi tassing
Sorry, I think it works. I was trying 'parsechecker' and it doesn't apply 'regexnormalizer' rules by default. So, case solved, thanks a lot! On Sunday, September 9, 2012, Sebastian Nagel wrote: Redirects are filtered and normalized. It works for 1.4/1.5 and should for trunk. One subtlety:

Re: Problem with corrupted index Input path does not exist:

2012-09-08 Thread remi tassing
deleting that specific segment directory [0] should fix the problem but it depends on what you're attempting to do. Remi [0]: /home/user/Apache Nutch/crawl/segments/20120908095131/ On Saturday, September 8, 2012, Alaak wrote: Hi, I needed to abort a crawl this morning and it seems my

Escaping URL during redirection

2012-09-08 Thread remi tassing
Hi guys, I'm not quite sure how to make Nutch follow the normalizer regular expressions during redirection. I see some URLs are not properly escaped. Any help? Remi

Re: Crawl HTTPS websites/Enable Plugin

2012-07-23 Thread remi tassing
.html Sent from the Nutch - User mailing list archive at Nabble.com. -- Remi Tassing

Re: How does nutch reflect with HTTP status not 200?

2012-07-22 Thread remi tassing
Hi, just in case there was no reply yet. Nutch does have some handling depending on the HTTP response code (e.g. 302 redirection ...). For more detail, check the source code HttpBase.java. Remi Nutch supports redirection On Tue, Jul 17, 2012 at 11:21 AM, IT_ailen

Re: Compilation of core classes

2012-06-30 Thread remi tassing
for the late response BTW! Remi On Sun, Jun 10, 2012 at 10:42 PM, Julien Nioche lists.digitalpeb...@gmail.com wrote: Hi Remi 'ant compile-core' is what you're after Julien On 10 June 2012 10:35, remi tassing tassingr...@gmail.com wrote: Hello guys, this is probably a basic Java/Ant

Re: Near Duplicate Detection in nutch /Solr

2012-06-23 Thread remi tassing
I'm very interested in this topic as well. Plz let the community know if/when you get smth cool implemented =) On Saturday, June 23, 2012, parnab kumar wrote: Hi, I have crawled and indexed around 2.5 million web pages . However , almost 30 % of the pages are near duplicates . Is there any

Re: Getting seed url

2012-06-11 Thread remi tassing
Segments have a field called 'outlinks', could this help? On Tuesday, June 12, 2012, Sebastian Nagel wrote: Hi Sandeep, tracking the seed(s) for a document could be done by a scoring filter. The seed URL must be passed: 0 into CrawlDatum's meta by injectedScore() (alternatively, use

Re: URL filtering and normalization

2012-06-11 Thread remi tassing
bad URLs are already in the db and still there. You'll need to update your db with the 'updatedb' command On Monday, June 11, 2012, Bai Shen wrote: However, I'm still seeing youtube urls in the fetch logs. I'm using the -noFilter and -noNorm options with generate. I'm also not using the

Re: disable filtering and normalization in the crawl-tool

2012-06-11 Thread remi tassing
Certainly, but you might need them to avoid crawling unnecessary pages On Monday, June 11, 2012, Matthias Paul wrote: Hi, wouldn't it be better performance-wise to disable filtering and normalization in the crawl-tool in the generate, update and invert link steps? Filtering and

Re: using less resources

2012-05-23 Thread remi tassing
I was wondering: how do you know the page was changed without actually fetching it? On Wednesday, May 23, 2012, wrote: Hello, As far as I understood nutch recrawls urls when their fetch time has past current time regardless if those urls were modified or not. Is there any initiative on

Re: Crawl sites with hashtags in url

2012-05-01 Thread remi tassing
Hi Roberto, If you're having an invalid URI error, then this might probably help you: http://lucene.472066.n3.nabble.com/Invalid-uri-td3742047.html Remi On Tue, May 1, 2012 at 7:25 PM, Roberto Gardenier r.garden...@simgroep.nlwrote: Hello, Im currently trying to crawl a site which uses

Re: solution for scanned pdf parsing

2012-04-24 Thread remi tassing
It could also be due to the filesize //Remi On Tuesday, April 24, 2012, nutchsolruser nutchsolru...@gmail.com wrote: I have some pdf files , data present in pdf is scanned articles and some unicode text. I am using tika as pdf parser. but parser fails for pdf's with images in it. is it

Re: Good workflow for a regular re-indexing job

2012-04-23 Thread remi tassing
Have you read this? http://wiki.apache.org/nutch/NutchTutorial/ You can put all commands in a shell script Remi On Monday, April 23, 2012, Ian Piper wrote: Hi all, I have set up a process for crawling a client's website using nutch and then creating a Solr index. I have run into a workflow

Re: exclude some urls from crawling

2012-04-13 Thread remi tassing
To exclude index.php and index.html just use: -index\.html -index\.php You can do the same for video and live-score. To ultimately make sure if a URL is blocked or not, try: echo URL | bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined Remi On Tuesday, April 10, 2012, alessio
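As a sanity check outside Nutch, the filter-file behaviour can be mimicked in a few lines of Python. This is a toy sketch under the assumption that rules are tried top to bottom, the first matching rule wins, '-' rejects, '+' accepts, and an unmatched URL is rejected; the `bin/nutch org.apache.nutch.net.URLFilterChecker` command from the post remains the authoritative check:

```python
import re

# Toy model of regex-urlfilter.txt: (sign, pattern) pairs tried in order.
rules = [
    ("-", re.compile(r"index\.html")),
    ("-", re.compile(r"index\.php")),
    ("-", re.compile(r"/video/")),
    ("+", re.compile(r".")),  # final catch-all: accept everything else
]

def accepts(url: str) -> bool:
    for sign, rx in rules:
        if rx.search(url):
            return sign == "+"
    return False  # no rule matched: reject

print(accepts("http://example.com/index.php"))    # False
print(accepts("http://example.com/about.html"))   # True
print(accepts("http://example.com/video/clip1"))  # False
```

The catch-all `+.` at the end matters: without it, every URL that escapes the '-' rules would still be rejected.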

Re: How to handle failures in nutch?

2012-04-10 Thread remi tassing
I don't think so! freegen will generate a new segment and you don't need to merge it with the others. Then you can (fetch and) parse the content from that new segment. Finally you just need to update your crawldb (with updatedb) Remi On Tue, Apr 10, 2012 at 6:01 PM, nutch.bu...@gmail.com

Re: Returning web page abstract with Solr

2012-04-04 Thread remi tassing
Are you looking for result highlighting? http://wiki.apache.org/solr/HighlightingParameters Remi On Wed, Apr 4, 2012 at 3:30 PM, smooth almonds sir.ramsel.ja...@gmail.comwrote: I've crawled flickr.com with Nutch successfully and am trying to return a highlighted abstract using Solr as the

Normalizer error: IndexOutOfBoundsException: No group 1

2012-04-02 Thread remi tassing
Hi all, I just found a weird error and it looks like a JDK bug but I'm not sure. Whenever I replace a URL-A that contains a number with a URL-B, I get an error: IndexOutOfBoundsException: No group 1. In my regex-normalize.xml, I have a rule with pattern http://google1.com/.+
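A likely cause of that error (an assumption based on the message, not a confirmed diagnosis of this post) is a substitution that references $1 while the pattern defines no capturing group; Python's re module raises the analogous error, and adding parentheses fixes it:

```python
import re

pattern = re.compile(r"http://google1\.com/.+")  # note: no capturing group

try:
    pattern.sub(r"\1", "http://google1.com/page")  # references group 1
except re.error as err:
    # Java reports this as IndexOutOfBoundsException: No group 1
    print("substitution failed:", err)

# Fix: wrap the part of the URL you want to keep in parentheses.
fixed = re.compile(r"http://google1\.com/(.+)")
print(fixed.sub(r"\1", "http://google1.com/page"))  # page
```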

Re: crawling a website

2012-04-02 Thread remi tassing
It depends on the structure of your site and you can modify regex-urlfilter.txt to reach your goal. From the examples you gave, you can do this: - ^http://ww.mywebsite.com/[^/]*$ (it will exclude http://ww.mywebsite.com/alpha, http://ww.mywebsite.com/beta, http://ww.mywebsite.com/gamma) -
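To see why that pattern only hits the top-level pages: `[^/]*$` forbids any further slash after the host, so deeper paths don't match. A quick Python check (with the dots escaped, which the original pattern should arguably also do):

```python
import re

rx = re.compile(r"^http://ww\.mywebsite\.com/[^/]*$")

print(bool(rx.match("http://ww.mywebsite.com/alpha")))        # True: caught by the '-' rule
print(bool(rx.match("http://ww.mywebsite.com/alpha/page1")))  # False: deeper pages survive
```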

Re: nutch says No URLs to fetch - check your seed list and URL filters when trying to index fmforums.com

2012-04-02 Thread remi tassing
It could be a million reasons: seed, filter, authentication...maybe the pages are already crawled... is there any clue in the log? Remi On Mon, Apr 2, 2012 at 5:37 PM, jepse j...@jepse.net wrote: Hey, i have the same problem. No urls to fetch.. For couple urls. Have no clou how to fix

Re: Normalizer error: IndexOutOfBoundsException: No group 1

2012-04-02 Thread remi tassing
. Sebastian On 04/02/2012 09:40 AM, remi tassing wrote: Hi all, I just found a weird error and it looks like a JDK bug but I'm not sure. Whenever replacing a URL-A, that contains a number, with a URL-B, then I get an error: IndexOutOfBoundsException: No group 1 In my regex-normalize.xml

Re: Re-indexing temporarily unavailable page

2012-03-28 Thread remi tassing
nice! On Wed, Mar 28, 2012 at 10:52 PM, dspathis dspat...@gmail.com wrote: I forgot to mention I'm using Nutch 1.4. For those interested, I solved my issue by modifying the protocol-http plugin, specifically the HttpResponse class. In the HttpResponse constructor, I changed if

Re: divide fetch process ?

2012-03-27 Thread remi tassing
I think that is exactly what Hadoop does! Start here: http://wiki.apache.org/nutch/NutchHadoopTutorial On Tue, Mar 27, 2012 at 6:19 AM, pepe3059 pepe3...@gmail.com wrote: Hello, i have some questions, sorry if i'm so noob Is there a way to divide fetch process between two or more computers

Re: db_unfetched large number, but crawling not fetching any longer

2012-03-27 Thread remi tassing
I'm not sure to totally understand what you meant. 1. In case you know exactly how the relative urls are translated into, you can use urlnormalizefilter to change them in what would make more 'sense'. 2. The 2nd option, if you don't want those relative links to be included, you can use the

Re: [ANNOUNCEMENT] Lewis John Mc Gibbney is a Nutch committer and PMC member

2012-03-27 Thread remi tassing
Try this: http://wiki.apache.org/solr/FAQ#My_search_returns_too_many_.2BAC8_too_little_.2BAC8_unexpected_results.2C_how_to_debug.3F Solr also has a debug mode where you can see result's score etc... On Mon, Mar 26, 2012 at 12:54 PM, Hangthunder jiajin@gmail.com wrote: Hi, Lewis, I got a

Re: Different number of parsed pages for crawls with same settings

2012-03-27 Thread remi tassing
This happened to me before for a very specific reason and I'm not sure if it's the same for you. Some of the websites I was trying to access were temporarily down. I would suggest you check the difference between the logs Remi On Tue, Mar 27, 2012 at 4:28 PM, Elisabeth Adler

Re: Out-of-the-box Nutch indexing url source to Solr

2012-03-25 Thread remi tassing
Hey, Try the command bin/nutch readseg -dump [1][2]. It reads a segment (or multiple segments) and outputs their content, including outlinks, html content, parsed content... I hope it helps! Remi [1]: http://www.marco.bianchi.name/myPortal/using-the-binnutch-readseg-command.aspx [2]:

Re: nutch crawling file system SOLVED

2012-03-11 Thread remi tassing
You're probably looking for the Highlighting feature http://wiki.apache.org/solr/HighlightingParameters Remi On Sun, Mar 11, 2012 at 6:10 PM, alessio crisantemi alessio.crisant...@gmail.com wrote: Thank you Lewis for your explanation: I supposed this fact and I post on mailing list my

Re: nutch crawling file system SOLVED

2012-03-11 Thread remi tassing
Using crawl-urlfilter (or regex-urlfilter, depending on which one you're using), you should be able to solve this. Unless you're not clear on what folders to exclude...? On Sunday, March 11, 2012, alessio crisantemi alessio.crisant...@gmail.com wrote: thank you Remi for your precious help. I try

Re: Crawling with Certs

2012-03-07 Thread remi tassing
in the AuthenticationSchemes (http://wiki.apache.org/nutch/HttpAuthenticationSchemes) that is not shown on the page? If you have a specific page that could help please send that. -- Chris On Wed, Mar 7, 2012 at 3:40 PM, remi tassing tassingr...@gmail.com wrote: Try googling for Nutch+httpclient Remi

Re: nutch craling file system

2012-03-04 Thread remi tassing
Plz try GOOGLing that first! If you don't find anything then try these: [1]http://wiki.apache.org/nutch/FAQ#How_do_I_index_my_local_file_system.3F [2]http://www.folge2.de/tp/search/1/crawling-the-local-filesystem-with-nutch [3]

Re: java.net.UnknownHostException during fetching

2012-03-04 Thread remi tassing
I had that same error for dead URLs or those that needed proxies to get access to Remi On Sun, Mar 4, 2012 at 1:19 PM, hadi md.anb...@gmail.com wrote: I have one link with many external link inside it,when the fetching process start many external link failed with:

Re: multiple small crawlers on single machine conflict at /tmp/hadoop-username/mapred

2012-03-01 Thread remi tassing
How did you define that property so it's different for each job? Remi On Friday, March 2, 2012, Jeremy Villalobos jeremyvillalo...@gmail.com wrote: That is what I was looking for, thank you. this property was added to: $NUTCH_DIR/runtime/local/conf/nutch-site.xml Jeremy On Thu, Mar 1,

Re: Only fetching initial seedlist

2012-03-01 Thread remi tassing
This question comes a lot, try searching the mailinglist archive On Friday, March 2, 2012, James Ford simon.fo...@gmail.com wrote: Hello, I am having a problem getting nutch to crawl and fetch the initial seedlist only. It seems like nutch tend to skip some urls? Or it does not parse some of

Re: IOExeption when crawling with nutch in Fetching process

2012-02-29 Thread remi tassing
Another possibility might be the tmp disk space [1]: The answer we found addressed the situation: you're most likely out of disk space in /tmp. Consider using another location, or possibly another partition for hadoop.tmp.dir (which can be set in nutch-site.xml) with plenty of room for large
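A sketch of the override being suggested (the path below is a placeholder; point it at any partition with enough free space):

```xml
<!-- nutch-site.xml: move Hadoop's scratch directory off a cramped /tmp -->
<property>
  <name>hadoop.tmp.dir</name>
  <value>/data/hadoop-tmp</value>
</property>
```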

Re: How to crowl AJAX populated pages

2012-02-28 Thread remi tassing
Same question here... I have similar issues where (redirection)links are given through JavaScript I hope I haven't hijacked your post as I see these issues very similar Remi On Tue, Feb 28, 2012 at 10:56 AM, Grijesh pintu.grij...@gmail.com wrote: I need to Crawl pages which were loaded using

Re: crawldb modifications

2012-02-28 Thread remi tassing
I think he meant to remove some specific URLs, not everything On Tue, Feb 28, 2012 at 1:51 PM, Markus Jelsma markus.jel...@openindex.iowrote: I may be missing something but rm -r crawl/crawldb works fine here. On Tuesday 28 February 2012 07:03:39 remi tassing wrote: What do in this case

Re: too few db_fetched

2012-02-28 Thread remi tassing
Hi Jose, We have this question very often and the short answer, with regard to 'stats' printout, is that everything is probably fine. For a more complete answer plz search in the mailing-list or Google. BTW, how did you change the heap size? I get some IOException when the TopN is 'too' high

Re: crawldb modifications

2012-02-27 Thread remi tassing
What I do in this case is erase the db, use the command mergesegs with the -filter option, and then updatedb. I would love to know if there is a simpler way Remi On Monday, February 27, 2012, Charles Thomas ctho...@wisc.edu wrote: Is there a way to clear out the various databases that Nutch uses

Re: Exception in thread main java.io.IOException: Job failed!

2012-02-23 Thread remi tassing
- LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222154459 On 22/02/2012 16:36, remi tassing wrote: Hey Daniel, You can find more output logs in the logs/hadoop files Remi On Wednesday, February 22, 2012, Daniel Bourrion

Re: Optimising the speed of Nutch.

2012-02-22 Thread remi tassing
Try decreasing the number of fetcher threads instead... On Wed, Feb 22, 2012 at 2:33 PM, Bharat Goyal bharat.go...@shiksha.comwrote: Went through the checklist and made some changes as in increased the no of fetcher threads from default 10 to 30, but I still see nutch eating up all the

Using jcifs for NTLM in HttpClient

2012-02-22 Thread remi tassing
Hey guys, I've been trying to figure out how to incorporate jcifs [1] into Nutch but I just need a hint here. I downloaded the jcifs class and updated the CLASSPATH. I was planning to modify http.java but so many things look different: In [1], there are several import org.apache.http.*

Re: http.redirect.max

2012-02-22 Thread remi tassing
Would you give Nutch-1.4 a try? Maybe this bug is already solved? Remi On Thursday, February 23, 2012, xuyuanme xuyua...@gmail.com wrote: Thanks for the information. But I found the wiki page http://wiki.apache.org/nutch/RedirectHandling still

Re: IOExeption when crawling with nutch in Fetching process

2012-02-19 Thread remi tassing
Hey Hadi, I had this error message several times, for different reasons but never because of disk space. I would suggest you run smaller crawls just to narrow down the issue. Start with Top 1, then 10, ... Remi On Sunday, February 19, 2012, Lewis John Mcgibbney lewis.mcgibb...@gmail.com

Re: ParseSegment taking a long time to finish

2012-02-19 Thread remi tassing
Hi, Could you also try the parsechecker tool on that last url? It's possible that the file has a problem or it's simply a bug. Remi On Sunday, February 19, 2012, Magnús Skúlason magg...@gmail.com wrote: Hi, According to my logs a really long time +2 hours elapses between parsing the last page in

Re: Some PDF contains is not readable when crawling with nutch

2012-02-19 Thread remi tassing
Hi, Could you post the PDF link? Remi On Saturday, February 18, 2012, hadi md.anb...@gmail.com wrote: I have problems with some pdfs; when i crawl them with nutch, some contents are not readable, i do not know if this problem is about their font or something else. how can i solve this problem? --

Re: fetch Aborting with 50 hung threads.

2012-02-18 Thread remi tassing
I had a similar issue before with Nutch-1.2 and 10 hung threads. It happened when I changed the code for HttpResponse.java. I tried reconnecting/authenticating after having an http 500 error code. After removing those specific changes, everything went back to normal. It's probably not the same

Re: Failure authenticating with NTLM

2012-02-18 Thread remi tassing
Hi Gouri, Did you see any HTTP error code in the stdout? I'm not sure if this will work but you can try this: http://hc.apache.org/httpcomponents-client-ga/ntlm.html Remi On Fri, Feb 17, 2012 at 1:27 PM, Gouri Deshpande gouri.sam...@gmail.comwrote: Hi, I am getting the error: Failure

URLNormalizer not working properly

2012-02-18 Thread remi tassing
Hi, I'm witnessing a weird problem. I configured regex-normalize.xml to escape whitespaces, curly braces...and it works while checking with URLNormalizerChecker: *echo URL non escaped | bin/nutch org.apache.nutch.net.URLNormalizerChecker* *output: escaped URL* But when I run crawl with Nutch, I
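The whitespace-escaping rule being described can be sketched as a plain substitution. A hypothetical version of the rule (pattern `\s`, substitution `%20`; in Nutch it would be configured in regex-normalize.xml, not written in code):

```python
import re

# Escape literal whitespace in URLs as %20, as the normalizer rule would.
escape_ws = re.compile(r"\s")
print(escape_ws.sub("%20", "http://example.com/my file.html"))
# http://example.com/my%20file.html
```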

Re: URLNormalizer not working properly

2012-02-18 Thread remi tassing
I had 18000 db_fetched, now only 54. Pretty dangerous command :-( On Saturday, February 18, 2012, Markus Jelsma markus.jel...@openindex.io wrote: Did you update the entire crawldb with that normalizer? Hi, I'm witnessing a weird problem. I configured regex-normalize.xml to escape

Re: URLNormalizer not working properly

2012-02-18 Thread remi tassing
Ok, it makes sense, thanks Markus! Remi On Saturday, February 18, 2012, Markus Jelsma markus.jel...@openindex.io wrote: That works just fine! I wonder why crawldb has to be updated first. All these URLs are in segments and similarly the regex-urlfilter works immediately without the need of

Re: Failed fetching

2012-02-15 Thread remi tassing
I just used protocol-http and it works! It's probably a configuration issue. You can download a clean version and start afresh Remi On Wed, Feb 15, 2012 at 3:46 AM, tiagorcs dasilva-ti...@mitsue.co.jpwrote: So do you suggest me to download Nutch from a different source? Maybe to reconfigure

tstamp vs. lastModified ...

2012-02-15 Thread remi tassing
Hello all, What does tstamp represent? I can see it shown in Solr results after indexing. I'm interested in showing the last-modified meta-data in Solr results but I'm not sure if Nutch retrieves this value. Thanks in advance for the help! Remi

Re: how are CSV/TXT files handled

2012-02-15 Thread remi tassing
=application/pdf creator=PScript5.dll Version 5.2 On Wed, Feb 8, 2012 at 2:04 PM, remi tassing tassingr...@gmail.com wrote: $ bin/nutch parsechecker http://avis.free.fr/livret_278_recettes.pdf fetching: http://avis.free.fr/livret_278_recettes.pdf Can't fetch URL successfully lewismc@lewismc-HP

Re: tstamp vs. lastModified ...

2012-02-15 Thread remi tassing
. On Wed, Feb 15, 2012 at 1:26 PM, remi tassing tassingr...@gmail.com wrote: Hello all, What does tstamp represent? I can we shown in Solr results after indexing. I'm interested in showing the last modified meta-data in Solr results but I'm not sure if Nutch does retrieve this value

Re: tstamp vs. lastModified ...

2012-02-15 Thread remi tassing
, February 15, 2012, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi Remi, On Wed, Feb 15, 2012 at 1:51 PM, remi tassing tassingr...@gmail.com wrote: Thanks for the clarification! nb For tstamp, I can actually see it in Solr results (even thought the format is weird) what

Re: tstamp vs. lastModified ...

2012-02-15 Thread remi tassing
, remi tassing tassingr...@gmail.com wrote: tstamp shows a string of digits like 20020123123212 This is OK: yyyy-mm-dd-hh-mm-ssZ. It is however hellishly old! Never heard of the plugin index-more and it's poorly documented. Well it's been included from 1.2 onwards so I'm very surprised

Re: tstamp vs. lastModified ...

2012-02-15 Thread remi tassing
it to type=date it should take it (and you can do Solr's date arithmetic on it. On Feb 15, 2012, at 11:01 AM, remi tassing wrote: Awesome! Pushing this to Solr gives me an error (solrindex): SEVERE: java.lang.NumberFormatException: For input string: 2012-02-08T14:40:09.416Z

Re: tstamp vs. lastModified ...

2012-02-15 Thread remi tassing
Z). From the error message it appears that perhaps the field into which this field is going in is set as long or int. If you set it to type=date it should take it (and you can do Solr's date arithmetic on it. On Feb 15, 2012, at 11:01 AM, remi tassing wrote: Awesome

Re: Build a pipeline using nutch

2012-02-15 Thread remi tassing
Hi, Just a related question: Does it make a big difference to fetch and parse directly rather than fetch all first, then parse? I was under the impression that they yield the same end result Remi On Wednesday, February 15, 2012, Markus Jelsma mar...@apache.org wrote: my questions/doubts are

Re: Failed fetching

2012-02-14 Thread remi tassing
I'm slowly migrating from Nutch-1.2 to 1.4 and it works with cygwin. I use protocol-httpclient but could try protocol-http if you want Remi On Friday, February 10, 2012, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: In all honesty this is strange. We can assure you that 1.4 DOES

Re: Understanding NutchConfigration properly

2012-02-12 Thread remi tassing
if they are really useless why keep them? Remi On Sunday, February 12, 2012, Julien Nioche lists.digitalpeb...@gmail.com wrote: i meant bothering to remove these files not open a jira Julien On Sunday, 12 February 2012, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: I'm in an

Re: how are CSV/TXT files handled

2012-02-08 Thread remi tassing
7, 2012 at 11:17 AM, Markus Jelsma mar...@apache.org wrote: Upgrade to 1.4. With the nutch parsechecker command I get the following error message: Error: Could not find or load main class parsechecker, this doesn't sound good! On Tue, Feb 7, 2012 at 9:58 AM, remi tassing tassingr

Re: how are CSV/TXT files handled

2012-02-07 Thread remi tassing
With the nutch parsechecker command I get the following error message: Error: Could not find or load main class parsechecker, this doesn't sound good! On Tue, Feb 7, 2012 at 9:58 AM, remi tassing tassingr...@gmail.com wrote: The point that made me start thinking is because I got this error

how are CSV/TXT files handled

2012-02-07 Thread remi tassing
Hey guys, I checked the mailing-list archive but couldn't get an answer on this. I think CSV and TXT don't need any kind of parsing, but how are they handled by default? Remi

how are CSV/TXT files handled

2012-02-06 Thread remi tassing
Hey guys, I checked the mailing-list archive but couldn't get an answer on this. I think CSV and TXT don't need any kind of parsing, but how are they handled by default? Remi

Re: how are CSV/TXT files handled

2012-02-06 Thread remi tassing
|tika)|index-(basic|anchor)|q... Remi On Tue, Feb 7, 2012 at 9:16 AM, remi tassing tassingr...@gmail.com wrote: Hey guys, I checked the mailing-list archive but couldn't get an answer on this. I think CSV and TXT don't need any kind of parsing, but how are they handled by default? Remi

Re: why nutch dosen't crawl Arabic sites well?

2012-02-01 Thread remi tassing
Try the following command. It'll export all the urls that were crawled. [1] http://wiki.apache.org/nutch/bin/nutch_readdb Remi On Wednesday, February 1, 2012, mina tahereganji...@gmail.com wrote: i have no error in my log, has nutch an error for crawl Arabic sites? help me. On 1/31/12, remi

Re: invalid uri with three dots

2012-02-01 Thread remi tassing
Problem solved! I replaced all whitespaces with %20 in the url before getting the content in HttpResponse.java (protocol-httpclient plugin). Dirty solution? Yes, but it works for me now. Remi On Thursday, January 26, 2012, remi tassing tassingr...@gmail.com wrote: Hey guys, any ideas on how

From Nutch 1.2 to 1.4

2012-01-31 Thread remi tassing
Hi, So I've finally decided to move to Nutch-1.4, it seems a lot faster. The issue I had with executing versions greater than 1.2 on cygwin is solved by the tip from Luis, thanks! Now I have a couple of questions: 1. Are the segments backward compatible? I tried updatedb but I get

Aborting with 10 hung threads -ver.2

2012-01-31 Thread remi tassing
Hi, I'm using Nutch-1.2 and having Aborting with 10 hung threads for some sites. I checked this thread http://www.mail-archive.com/nutch-user@lucene.apache.org/msg15889.html and the JIRA issue https://issues.apache.org/jira/browse/NUTCH-719 In Fetcher.java, I did the following change: -

all possible fields in Nutch Schema.xml

2012-01-31 Thread remi tassing
Hi, From the schema.xml shipped with Nutch, some fields like content, url... are already defined. I was wondering if there was an exhaustive list of possible fields we could include. Are those from this site all there is?

is it necessary to merge DBs before solrindex?

2012-01-31 Thread remi tassing
Hi, The solrindex command requires crawldb and linkdb as parameters. Now, I would like to know if for newly generated segments it's necessary to merge the corresponding crawldb and linkdb before invoking solrindex? Merging is kinda time consuming... Remi

Re: undo db_gone

2012-01-29 Thread remi tassing
I'm using Solr-3.4. I honestly didn't get that message, Markus. Remi On Sunday, January 29, 2012, Markus Jelsma markus.jel...@openindex.io wrote: In trunk you can use generate.restrict.status to generate records for that status. Hi, I understand when a url is classified as db_gone, Nutch

undo db_gone

2012-01-28 Thread remi tassing
Hi, I understand when a url is classified as db_gone, Nutch won't bother fetch it again. I have many urls in this situation that I would like to recrawl. Any idea how to fix it? Remi

Re: invalid uri with three dots

2012-01-26 Thread remi tassing
a comment - 30/Jun/09 14:46 Properly escape non-URI characters. HttpClient is not a browser and thus does not, can not and will never try to fix invalid input. On Wed, Jan 18, 2012 at 4:51 PM, remi tassing tassingr...@gmail.com wrote: I posted a question on this JIRA: https://issues.apache.org/jira

Re: Getting html pages through a Nutch crawl (for a dataset)

2012-01-23 Thread remi tassing
Samarawickrama smsa...@googlemail.com wrote: Hi, I tried the readdb comamnd, but I can't get the html pages with it. Thanks, Sameendra On Mon, Jan 23, 2012 at 12:14 PM, remi tassing tassingr...@gmail.com wrote: Hi Sameendra, read this page: http://wiki.apache.org/nutch/bin

Re: Getting html pages through a Nutch crawl (for a dataset)

2012-01-23 Thread remi tassing
, 2012 at 8:02 PM, remi tassing tassingr...@gmail.com wrote: Hi, in your output directory, you should see two files: 1. .part-0.crc 2. part-0 Open the second one with a text editor and you should be able to see the crawled urls. Perhaps if there is no html in there, you probably didn't

Re: Dump unfetched ,fetched,gone, URLS

2012-01-23 Thread remi tassing
This command dumps the fetched and unfetched but not gone urls: http://wiki.apache.org/nutch/bin/nutch_readseg Remi On Monday, January 23, 2012, Nutch Begineeer sachinyadav0...@gmail.com wrote: What is command to get list of all unfetched , gone, fetched urls. I am only able to get their count

Re: concurrent Nutch instances in parallel

2012-01-22 Thread remi tassing
Thanks Markus! I'll merge segments for now and try Hadoop when it gets more serious Remi On Sunday, January 22, 2012, Markus Jelsma markus.jel...@openindex.io wrote: It should work just fine but you should use Hadoop. Segment merging is quite expensive! Hi, Is it safe to run concurrent
