Re: Extracting documents from nutch segments

2012-01-23 Thread Julien Nioche
Have a look at Behemoth [https://github.com/jnioche/behemoth]. It can take Nutch segments as input, process docs with UIMA over Hadoop and generate vectors for Mahout. We know Mahout. But I think we still need the content of each document. We would like to annotate the documents retrieved by
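
For reference, a hedged sketch of the Behemoth step described above. The jar and class names are assumptions from memory rather than anything stated in this thread, so check the Behemoth README for the exact invocation:

    # Convert a Nutch segment into a Behemoth corpus on Hadoop
    # (jar/class names are assumptions -- verify against your Behemoth build)
    hadoop jar behemoth-io-*-job.jar \
      com.digitalpebble.behemoth.io.nutch.NutchSegmentConverterJob \
      crawl/segments/20120123123456 behemoth-corpus

The UIMA and Mahout modules of Behemoth can then be run over the resulting corpus.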

Re: java.net.MalformedURLException creating new Content in unit test

2012-01-23 Thread José Ignacio Ortiz de Galisteo
Hi all. Well, I have just found what the problem was, in case somebody has the same problem: in our case the configuration mime.types.file was tika.mimetype.xml (this is the default type). To solve it, just include the tika-mimetype.xml file in the classpath of the project. Instead of this we create
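
For reference, a minimal sketch of the fix described above, assuming a standard Nutch layout where the conf/ directory is on the classpath; adjust file names and paths to your own project:

    # Make sure the file named by mime.types.file is visible on the (job) classpath;
    # in a default install that means keeping it in Nutch's conf/ directory.
    cp tika-mimetypes.xml $NUTCH_HOME/conf/
    # The relevant property (defined in nutch-default.xml, override in conf/nutch-site.xml if needed):
    #   <name>mime.types.file</name>
    #   <value>tika-mimetypes.xml</value>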

Re: Support for x-robots-tag

2012-01-23 Thread Markus Jelsma
There is currently no built-in support for the x-robots-tag header. On Sunday 22 January 2012 01:01:26 Michael Lissner wrote: Hi, I'm doing some research on what technologies various crawlers support for crawl exclusion. Without installing and figuring out Nutch, I can't figure out if it

Re: Extracting documents from nutch segments

2012-01-23 Thread Adriana Farina
Thank you! I'll try out the solutions you all suggested. Thanks a lot to all of you! You're great! :) 2012/1/23 Julien Nioche lists.digitalpeb...@gmail.com Have a look at Behemoth [https://github.com/jnioche/behemoth]. It can take Nutch segments as input, process docs with UIMA over Hadoop

Re: Getting html pages through a Nutch crawl (for a dataset)

2012-01-23 Thread Sameendra Samarawickrama
Hi, I tried the readdb command, but I can't get the html pages with it. Thanks, Sameendra On Mon, Jan 23, 2012 at 12:14 PM, remi tassing tassingr...@gmail.com wrote: Hi Sameendra, read this page: http://wiki.apache.org/nutch/bin/nutch_readdb For instance the following command will read
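
For reference, a minimal sketch of what readdb produces (paths are hypothetical): it dumps CrawlDatum metadata such as url, status and fetch time, not the fetched HTML, which is why the pages do not show up in its output:

    # Dump the crawldb; the result is per-URL metadata, with no page content
    bin/nutch readdb crawl/crawldb -dump crawldb_dump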

Re: Getting html pages through a Nutch crawl (for a dataset)

2012-01-23 Thread remi tassing
Hi, in your output directory, you should see two files: 1. .part-0.crc 2. part-0 Open the second one with a text editor and you should be able to see the crawled urls. If there is no html in there, you probably didn't crawl any. Remi On Mon, Jan 23, 2012 at 4:08 PM, Sameendra

Re: Getting html pages through a Nutch crawl (for a dataset)

2012-01-23 Thread Sameendra Samarawickrama
Yes, it has a dump file which contains 'CrawlDatums'. I found some html content in it, but to get html pages out of it I think you will have to process it further, right? What about when my crawl contains several thousand web pages: will that file contain the contents of all the pages? Is this the way

Following .axd urls

2012-01-23 Thread Ian Piper
Hi all, I'd appreciate some guidance... can't seem to find much useful stuff on the web on this. I have set up a Nutch and Solr service that is crawling a client's site. They have a lot of pages that are accessed with urls like this:

Re: Following .axd urls

2012-01-23 Thread Lewis John Mcgibbney
Hi Ian, What fetching depth are you using? Lewis On Mon, Jan 23, 2012 at 7:46 AM, Ian Piper ianpi...@tellura.co.uk wrote: Hi all, I'd appreciate some guidance... can't seem to find much useful stuff on the web on this. I have set up a Nutch and Solr service that is crawling a client's

Re: Following .axd urls

2012-01-23 Thread Julien Nioche
Hi Ian. The problem I'm finding is that the crawler is apparently not visiting or indexing the content of these urls. The document at the far end of the link, at http://[domain]/medialibrary.axd?id=414405745, is actually a pdf. I am using the tika plugin which I thought would allow
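
One thing worth checking here, offered as an assumption rather than something established in the thread: the stock conf/regex-urlfilter.txt skips URLs containing query characters, which would exclude the medialibrary.axd?id=... links before they are ever fetched. A hedged sketch:

    # Default rule in conf/regex-urlfilter.txt that drops URLs with ?, =, etc.:
    #   -[?*!@=]
    # To let the .axd links with query strings through, accept them explicitly above that rule:
    +medialibrary\.axd\?id=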

Re: Following .axd urls

2012-01-23 Thread Julien Nioche
Having said that, if the URL filters are correct, the next step is to check that the parser actually returns the outlink. Google for ParserChecker and try it on the URL containing the link. On 23 January 2012 16:04, Julien Nioche lists.digitalpeb...@gmail.com wrote: Hi Ian The problem I'm
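
For reference, a minimal sketch of the check suggested above (the URL is a placeholder, not one from the thread):

    # Parse a single page from the command line and inspect the outlinks it returns
    bin/nutch parsechecker -dumpText 'http://www.example.com/page-with-the-axd-link'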

Re: Getting html pages through a Nutch crawl (for a dataset)

2012-01-23 Thread remi tassing
If you need the urls, then yes, you just need to further process that file. If you need the content of those html files, then I'm not sure how to do that. On Monday, January 23, 2012, Sameendra Samarawickrama smsa...@googlemail.com wrote: yes it has a dump file which contains 'CrawlDatums'. And
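
For what it's worth, a rough sketch of that further processing, assuming the dump in question is a SegmentReader (readseg -dump) dump using the 'Recno::' record markers of Nutch 1.x; the path and output names are hypothetical, so check a few records of your own dump first:

    # Split a segment dump into one file per record (very rough)
    awk '/^Recno::/ {if (out) close(out); out = "page-" (++n) ".txt"} out {print > out}' dump_dir/dump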

Re: Getting html pages through a Nutch crawl (for a dataset)

2012-01-23 Thread Markus Jelsma
It is in the big dump file output by the readseg command. I need the content. :( On Mon, Jan 23, 2012 at 9:47 PM, remi tassing tassingr...@gmail.com wrote: If you need the urls, then yes, you just need to further process that file. If you need the content of those html files, then
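
For reference, a minimal sketch of that readseg dump, keeping only the raw Content records (the segment path is hypothetical):

    # Dump only the Content part of a segment (raw HTML) into segdump/
    bin/nutch readseg -dump crawl/segments/20120123123456 segdump \
      -nofetch -nogenerate -noparse -noparsedata -noparsetext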

Re: Dump unfetched ,fetched,gone, URLS

2012-01-23 Thread remi tassing
This command dumps the fetched and unfetched but not gone urls: http://wiki.apache.org/nutch/bin/nutch_readseg Remi On Monday, January 23, 2012, Nutch Begineeer sachinyadav0...@gmail.com wrote: What is the command to get a list of all unfetched, gone, fetched urls. I am only able to get their count
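
For reference, a hedged sketch along those lines; the segment path is hypothetical and the exact wording of the Status lines can vary by version, so inspect a few records before scripting against them:

    # Dump everything except the page content and pull out url/status pairs
    bin/nutch readseg -dump crawl/segments/20120123123456 statusdump \
      -nocontent -noparse -noparsedata -noparsetext
    grep -E '^(URL::|Status:)' statusdump/dump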

Re: Dump unfetched ,fetched,gone, URLS

2012-01-23 Thread Markus Jelsma
That is the SegmentReader tool. You can use the crawldbscanner tool in Nutch 1.4 to get a dump of crawldb records by status. In Nutch trunk you can use the readdb tool as well to get a dump of records by status or regex pattern and write it as CSV, which is easier to use than the output of
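
A hedged sketch of the trunk readdb usage described above; the exact option spellings below are assumptions taken from the CrawlDbReader usage text of that era, so run bin/nutch readdb without arguments to confirm them on your version:

    # Dump only db_gone records from the crawldb, written as CSV
    bin/nutch readdb crawl/crawldb -dump gone_dump -format csv -status db_gone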