Handling URLs with non-UTF8 characters

2011-09-15 Thread Thomas B
I've run into a small issue with my deployment of Nutch. Some of the sites I crawl use characters such as æøå in their URLs, and these never seem to parse properly. Is there any way to get around this? I tried adding the UTF-values (as '\u00e5' and so on) in regex-normalize.xml, but I suppose they
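A workaround often suggested for this kind of problem is to percent-encode the non-ASCII characters as UTF-8 bytes before the URL reaches the fetcher. The sketch below only illustrates that encoding step, assuming the target server expects UTF-8 encoded paths. The class and method names (UrlEscaper, escapeNonAscii) are hypothetical and are not part of Nutch.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not a Nutch class: illustrates UTF-8 percent-encoding only.
final class UrlEscaper {

    // Percent-encode every non-ASCII code point as its UTF-8 bytes.
    // ASCII characters (including existing %XX escapes) are left untouched.
    static String escapeNonAscii(String url) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < url.length()) {
            int cp = url.codePointAt(i);
            if (cp < 0x80) {
                out.append((char) cp);
            } else {
                byte[] bytes = new String(Character.toChars(cp))
                        .getBytes(StandardCharsets.UTF_8);
                for (byte b : bytes) {
                    out.append(String.format("%%%02X", b & 0xFF));
                }
            }
            // Advance by one or two chars depending on whether cp is a surrogate pair.
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

For example, UrlEscaper.escapeNonAscii("http://example.org/blåbær") yields "http://example.org/bl%C3%A5b%C3%A6r" (example.org is a placeholder host). Whether a given server expects UTF-8 or a legacy encoding such as ISO-8859-1 has to be checked per site.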

Integrating Nutch-1.3 SVN version into another project.

2011-09-15 Thread Luis Cappa Banda
Hello. I've downloaded Nutch 1.3 via Subversion and modified some classes a little. My intention is to build Maven artifacts from this modified Nutch version and use them in another Maven project that depends on the modified version.

Re: Integrating Nutch-1.3 SVN version into another project.

2011-09-15 Thread lewis john mcgibbney
Hi Luis, There appear to be two things that stick out for me here. I'm not sure if you know that Solr is not shipped with Nutch; therefore, to index you would also need to include some kind of Solr image in your parent project. I mention this because although the log at 2011-09-15 16:57:09

Re: Not able to index url which is giving http 302

2011-09-15 Thread lewis john mcgibbney
This looks pretty tricky. I am not experienced with using http-client in general and we could do with getting a wiki page established to comment on the re-direct policies and scenarios as there is quite a bit of confusion within the community as to what some 'states' actually mean and how to

Re: Integrating Nutch-1.3 SVN version into another project.

2011-09-15 Thread Julien Nioche
[... ] We do not push any Nutch related stuff to the Sonatype Nexus Maven Repository so you can't therefore pull it and depend upon it in any way. We do, and it is synced with Central, see: - http://wiki.apache.org/nutch/NutchMavenSupport -

Re: Integrating Nutch-1.3 SVN version into another project.

2011-09-15 Thread lewis john mcgibbney
Excellent :0) Sorry Luis for my limited knowledge on the Maven side of things. On Thu, Sep 15, 2011 at 5:15 PM, Julien Nioche lists.digitalpeb...@gmail.com wrote: [... ] We do not push any Nutch related stuff to the Sonatype Nexus Maven Repository so you can't therefore pull it and

not crawling protected pdf

2011-09-15 Thread Marlen
Hi everybody, I'm running Nutch 1.2, and during the indexing process it fails to index protected PDFs. I don't know why; I searched the internet and found nothing... I checked jai_codec.jar and jai_core.jar and they are OK; I even updated them. Could anyone tell me what is going on or where to look for it..

Re: Integrating Nutch-1.3 SVN version into another project.

2011-09-15 Thread Luis Cappa Banda
Hello Lewis, Julien. Solr isn't the problem. I know that Nutch doesn't include it, so I've added it as a Maven dependency as well. Otherwise, I've got a Java project that compiles and deploys a Solr server without problems. The thing is that the parent project (which includes my version of Nutch,

Crawling and redirects to the same URL

2011-09-15 Thread Elisabeth Adler
Hi, I am having issues crawling an intranet site with an (imho) odd redirect mechanism. One part of the intranet website requires authentication, which Nutch can bypass by sending a special http.agent.name. This works fine. The issue I am facing is that the server sends a redirect (302) after

Machine readable vs. human readable URLs.

2011-09-15 Thread Chip Calhoun
Hi everyone, We'd like to use Nutch and Solr to replace an existing Verity search that's become a bit long in the tooth. In our Verity search, we have a hack which allows each document to have a machine-readable URL which is indexed (generally an xml document), and a human-readable URL which

Crawling search result pages

2011-09-15 Thread Arcadius Ahouansou
Hello. I am new to Nutch. I need to use Nutch to index data into Solr. Let's say I need to crawl some newspaper search pages and index any article regarding the word java. I understand that I would need to point Nutch to the search result page. 1) What I need from Nutch is not to crawl/index

Re: Crawling search result pages

2011-09-15 Thread Markus Jelsma
Hi, If you want to index every page on example.org with the term 'java' you need to crawl every page on example.org. Nutch will follow links and add the links to its CrawlDB so you can crawl newly discovered pages in the next crawl cycle (read the wiki tutorial on what a cycle is). If you

Nutch 1.3 Solrindex Failed on JPG (non multiValued field title)

2011-09-15 Thread Michael.Sulistijo
Hi, I'm fairly new to using Nutch. I have segments of 2,000 URLs, and I'm trying to index them to Solr, but it throws an error message like this: and I checked the logs/hadoop.log and get something like: So I thought that http://forum.kompas.com/avatars/bangthoyib.gif?dateline=1261039714 is the
