Re: Injecting urls and define Inlink

2010-01-20 Thread MyD
Because I would like to associate two separately running crawls. The only way is to associate them through the URL. Is there something like CrawlDBReader that writes instead of reads? An example would be great. Thanks in advance. Cheers On Jan 20, 2010, at 8:36 AM, MilleBii

Nofollow links on nutch

2010-01-20 Thread axi
I want to obtain the inlinks of certain URLs and identify each inlink as nofollowed or followed. Is this possible with the current Nutch linkdb? How can I do that using the linkdb access class in Nutch to access that data? Thanks in advance, -- View this message in context:
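
The followed/nofollowed distinction asked about above comes from the rel attribute of each link on the source page. The following is a minimal standalone sketch of that classification, using only the Python standard library. It is not Nutch code and says nothing about what the linkdb actually stores; the names LinkRelParser and classify_links are hypothetical and exist only for this illustration.

```python
from html.parser import HTMLParser


class LinkRelParser(HTMLParser):
    """Collects (href, is_nofollow) pairs for every <a href=...> on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        href = attr_map.get("href")
        if href is None:
            return
        # rel is a space-separated token list; compare case-insensitively.
        rel_tokens = (attr_map.get("rel") or "").lower().split()
        self.links.append((href, "nofollow" in rel_tokens))


def classify_links(html):
    """Return a list of (href, is_nofollow) tuples in document order."""
    parser = LinkRelParser()
    parser.feed(html)
    parser.close()
    return parser.links
```

Calling classify_links on a fetched page returns one tuple per outlink, so a crawler that wanted to keep this flag would have to record it at parse time, when the source page's markup is still available.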

Alt text of images as anchor text

2010-01-20 Thread axi
After several tests, I have noticed that Nutch ignores the alt text of images inside <a href=...> tags. So this feature isn't implemented yet, right? Thanks in advance, -- View this message in context: http://old.nabble.com/Alt-text-of-images-as-anchor-text-tp27244358p27244358.html Sent from the Nutch

Re: Tried to run Crawl with depth of only 2 and getting IOException

2010-01-20 Thread Nutch Newbie
On Wed, Jan 20, 2010 at 7:10 PM, kraman kirthi.ra...@gmail.com wrote: kirth...@cerebrum [~/www/nutch]# ./bin/nutch crawl url -dir tinycrawl -depth 2 crawl started in: tinycrawl rootUrlDir = url threads = 10 depth = 2 Injector: starting Injector: crawlDb: tinycrawl/crawldb Injector:

Re: Injecting urls and define Inlink

2010-01-20 Thread Nutch Newbie
On Wed, Jan 20, 2010 at 8:04 AM, MyD myd.ro...@googlemail.com wrote: Because I would like to associate two separately running crawls. The only way is to associate them through the URL. Is there something like CrawlDBReader that writes instead of reads? An example would be great.

Re: Alt text of images as anchor text

2010-01-20 Thread Nutch Newbie
On Wed, Jan 20, 2010 at 4:16 PM, axi axi...@gmail.com wrote: After several tests, I have noticed that Nutch ignores the alt text of images inside <a href=...> tags. So this feature isn't implemented yet, right? What exactly do you want Nutch to do with the alt text: index it? tokenize it? make this field

Re: Alt text of images as anchor text

2010-01-20 Thread axi
If you use an image as a link, it is commonly understood that the alt text of that image is equivalent to the anchor text of a text link. Currently, if you put an image with alt text inside a link, the anchor text for that link is empty and the image alt text is not counted. Nutch Newbie wrote: On Wed, Jan 20, 2010 at 4:16
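
The behaviour described above, where an image's alt text stands in for the anchor text of the link that wraps it, can be sketched as follows. This is a standalone illustration using the Python standard library, not Nutch's parser; AnchorTextParser and anchor_texts are hypothetical names introduced only for this example.

```python
from html.parser import HTMLParser


class AnchorTextParser(HTMLParser):
    """Builds anchor text per link, counting <img alt=...> inside the link."""

    def __init__(self):
        super().__init__()
        self.anchors = []   # finished (href, anchor_text) pairs
        self._href = None   # href of the link currently open, if any
        self._parts = []    # text fragments collected for that link

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "a" and attr_map.get("href") is not None:
            self._flush()   # close any link left open by sloppy markup
            self._href = attr_map["href"]
        elif tag == "img" and self._href is not None:
            alt = attr_map.get("alt")
            if alt:
                self._parts.append(alt)

    def handle_data(self, data):
        if self._href is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a":
            self._flush()

    def _flush(self):
        if self._href is not None:
            # Join fragments and collapse runs of whitespace.
            text = " ".join(" ".join(self._parts).split())
            self.anchors.append((self._href, text))
        self._href = None
        self._parts = []


def anchor_texts(html):
    """Return (href, anchor_text) pairs, with alt text included."""
    parser = AnchorTextParser()
    parser.feed(html)
    parser.close()
    parser._flush()         # emit a link that was never closed
    return parser.anchors
```

With this approach a link that contains only an image yields the image's alt text as its anchor text, and a link without text or alt still yields an empty string, which matches the empty-anchor case reported in the thread.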

Re: Alt text of images as anchor text

2010-01-20 Thread Nutch Newbie
On Wed, Jan 20, 2010 at 8:11 PM, axi axi...@gmail.com wrote: If you use an image as a link, it is commonly understood that the alt text of that image is equivalent to the anchor text of a text link. Currently, if you put an image with alt text inside a link, the anchor text for that link is empty and no image alt text is

Re: Alt text of images as anchor text

2010-01-20 Thread axi
I'll try that, but the real anchor text is in On Wed, Jan 20, 2010 at 8:11 PM, axi axi...@gmail.com wrote: If you use an image as a link, it is commonly understood that the alt text of that image is equivalent to the anchor text of a text link. Currently, if you put an image with alt text inside a link, anchor

Re: [jira] Commented: (NUTCH-779) Mechanism for passing metadata from parse to crawldb

2010-01-20 Thread MilleBii
I'd like to use Julien's approach because I found the scoring filter complex to understand. My use case is the following: 1. During scoring after parsing, I want to tag the pages that interest me, say meta=HIT. 2. In the next step (to be created), I would like to prune the segment of NON-HIT content

[jira] Created: (NUTCH-780) Nutch crawler did not read configuration files

2010-01-20 Thread Vu Hoang (JIRA)
Nutch crawler did not read configuration files -- Key: NUTCH-780 URL: https://issues.apache.org/jira/browse/NUTCH-780 Project: Nutch Issue Type: Bug Components: ndfs Affects

[jira] Updated: (NUTCH-780) Nutch crawler did not read configuration files

2010-01-20 Thread Vu Hoang (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vu Hoang updated NUTCH-780: --- Description: Nutch searcher can read properties at the constructor ...

Re: [jira] Commented: (NUTCH-650) Hbase Integration

2010-01-20 Thread xiao yang
Update: the hadoop and hbase jar versions were not right. After updating the jars in the 'lib/' directory and rebuilding, it now throws: org.apache.hadoop.hbase.regionserver.NoSuchColumnFamilyException: org.apache.hadoop.hbase.regionserver.NoSuchColumnFamilyException: Column family mtdt: does not exist in

[jira] Commented: (NUTCH-650) Hbase Integration

2010-01-20 Thread Xiao Yang (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12803186#action_12803186 ] Xiao Yang commented on NUTCH-650: - Exception:

[jira] Commented: (NUTCH-780) Nutch crawler did not read configuration files

2010-01-20 Thread Vu Hoang (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12803193#action_12803193 ] Vu Hoang commented on NUTCH-780: add lines below into class org.apache.nutch.crawl.Crawl

[jira] Commented: (NUTCH-780) Nutch crawler did not read configuration files

2010-01-20 Thread Vu Hoang (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12803194#action_12803194 ] Vu Hoang commented on NUTCH-780: add lines below into class org.apache.nutch.crawl.Crawl

[jira] Updated: (NUTCH-780) Nutch crawler did not read configuration files

2010-01-20 Thread Vu Hoang (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vu Hoang updated NUTCH-780: --- Comment: was deleted (was: add lines below into class org.apache.nutch.crawl.Crawl

[jira] Issue Comment Edited: (NUTCH-780) Nutch crawler did not read configuration files

2010-01-20 Thread Vu Hoang (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12803193#action_12803193 ] Vu Hoang edited comment on NUTCH-780 at 1/21/10 6:32 AM: - add lines