Hi All,
I am very excited now. :) Thanks a lot to everyone for inviting me.
I'm a software engineer and crawler team leader of my company in
Istanbul. I have been using Apache Nutch 2.X for 10 months.
My company works on Search Technologies. We have a huge Hadoop cluster
for crawling and
Congrats Talat! Well deserved!
Renato M.
2014-04-02 8:56 GMT+02:00 Talat Uyarer ta...@uyarer.com:
Hi All,
I am very excited now. :) Thanks a lot to everyone for inviting me.
I'm a software engineer and crawler team leader of my company in
Istanbul. I have been using Apache Nutch 2.X for
Congratulations mate! With the contributions of this great team, I can
see the brilliant future of Nutch 2.x from now :)
Alparslan
On 02-04-2014 09:57, Renato Marroquín Mogrovejo wrote:
Congrats Talat! Well deserved!
Renato M.
2014-04-02 8:56 GMT+02:00 Talat Uyarer ta...@uyarer.com:
Hi
Congratulations Talat and welcome on board!
Julien
On 2 April 2014 07:56, Talat Uyarer ta...@uyarer.com wrote:
Hi All,
I am very excited now. :) Thanks a lot to everyone for inviting me.
I'm a software engineer and crawler team leader of my company in
Istanbul. I have been using Apache
Hi all,
I have followed all the steps to set up Nutch (2.2.1) with HBase (0.90.4)
and Solr (4.7.1) as described in the book Web Crawling and Data Mining
with Apache Nutch. However, I am getting the following error:
InjectorJob: org.apache.gora.util.GoraException:
java.lang.RuntimeException:
I have indexed several sites successfully.
Now I wish to index a new site without updating any of the sites already
indexed.
I use Nutch 2.21, MySQL 5.3 and Solr 4.7.0. How would you recommend I go
about indexing only the new site?
If someone can give examples of command lines that would be
Hi Shane,
You could use the same scripts as before but just modify the
regex-urlfilter.txt to restrict the crawling scope.
BR, Remi
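
To illustrate the suggestion above, here is a minimal Python sketch of how rules in the style of regex-urlfilter.txt restrict the crawl scope. It assumes the usual convention that each rule starts with `+` (accept) or `-` (reject), that the first matching rule decides, and that a URL matching no rule is rejected. The host `newsite.example` and the helper names `parse_rules` and `accepts` are hypothetical; this emulates the filter for illustration and is not Nutch's own code.

```python
import re

# Hypothetical rules in the style of regex-urlfilter.txt.
# "newsite.example" is a placeholder, not a site from this thread.
RULES = r"""
# skip common binary and static resources
-\.(gif|jpg|png|css|js|zip|gz)$
# accept only the new site (and its subdomains)
+^https?://([a-z0-9-]+\.)*newsite\.example/
# reject everything else
-.
"""


def parse_rules(text):
    """Turn rule lines into (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # ignore blank lines and comments
        # Assumption: any prefix other than '+' is treated as a reject rule.
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules


def accepts(rules, url):
    """First matching rule decides; no match means the URL is rejected."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


if __name__ == "__main__":
    rules = parse_rules(RULES)
    for url in ("http://www.newsite.example/page.html",
                "http://othersite.example/",
                "http://newsite.example/logo.png"):
        print(url, "->", "accept" if accepts(rules, url) else "reject")
```

With rules like these, only pages under the new site pass the filter, so the same crawl scripts can be reused unchanged.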
On Thu, Apr 3, 2014 at 10:52 AM, Shane Wood sh...@cbm8bit.com wrote:
I have indexed several sites successfully.
Now I wish to index a new site and not update
Can you choose a custom regex-urlfilter.txt to save editing it each
time you wish to index a different site?
I am surprised you can't enter a URL when generating a fetch list, i.e.
/bin/nutch generate --only someurl.com --job 192833-292837
Then you fetch job 192833-292837, parse job
Hi All,
I am using Apache Nutch 1.7. I am able to crawl and index almost all
sites except wiki pages.
When I try to crawl wiki pages, Nutch reports that the fetch of
http://wiki.ibm.com/ failed with: java.net.UnknownHostException:
wiki.ibm.com.
Does it require any additional configuration for
Hi reddibabu,
java.net.UnknownHostException is a DNS resolution problem. When I tried to
open the website, it did not load.
Nutch has no wiki-specific configuration.
Talat
On 3 Apr 2014 07:54, reddibabu reddybabu...@gmail.com wrote:
Hi All,
I am using Apache Nutch 1.7. I am able to crawl
reddibabu,
I cannot resolve wiki.ibm.com so I'm guessing Nutch can't either. Is that
an internal DNS record?
On Wed, Apr 2, 2014 at 11:54 PM, reddibabu reddybabu...@gmail.com wrote:
Hi All,
I am using Apache Nutch 1.7. I am able to crawl and index almost all
sites except wiki pages.
If you use a local DNS server for resolution, you should list its nameservers
in resolv.conf on the servers where Nutch runs.
Make sure the Nutch server can resolve this site. If you only have a console,
you can use lynx to check.
Talat
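
As a complement to the lynx check, here is a minimal Python sketch that can be run on the machine where Nutch runs to see whether its resolver can look up the host. It uses only the standard library; the helper name `can_resolve` is hypothetical, and the result only reflects the resolver configuration of the machine it runs on.

```python
import socket


def can_resolve(hostname):
    """Return True if the local resolver can turn hostname into an address."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except (socket.gaierror, UnicodeError):
        # gaierror: the lookup failed; UnicodeError: the name is malformed
        return False


if __name__ == "__main__":
    host = "wiki.ibm.com"  # the host from the error message in this thread
    print(host, "resolves" if can_resolve(host) else "does not resolve")
```

If this reports that the host does not resolve, the fetch failure comes from name resolution on that server rather than from Nutch's configuration.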
On 3 Apr 2014 08:02, John Lafitte jlafi...@brandextract.com