Because I would like to associate two separately running crawls, and the only
way to associate them is through the URL.
Is there something like CrawlDBReader that, instead of reading, lets me
write into the crawl db? An example would be great. Thanks in advance.
Cheers
On Jan 20, 2010, at 8:36 AM, MilleBii
I want to obtain the inlinks of certain URLs and identify each of those inlinks as
nofollowed or followed.
Is this possible with the current Nutch linkdb?
How can I do that using the linkdb access class in Nutch to access that
data?
Thanks in advance,
After several tests, I have noticed that Nutch ignores the alt text of images
inside <a href=...> tags.
So this feature isn't implemented yet, right?
Thanks in advance,
--
View this message in context:
http://old.nabble.com/Alt-text-of-images-as-anchor-text-tp27244358p27244358.html
Sent from the Nutch
On Wed, Jan 20, 2010 at 7:10 PM, kraman kirthi.ra...@gmail.com wrote:
kirth...@cerebrum [~/www/nutch]# ./bin/nutch crawl url -dir tinycrawl -depth 2
crawl started in: tinycrawl
rootUrlDir = url
threads = 10
depth = 2
Injector: starting
Injector: crawlDb: tinycrawl/crawldb
Injector:
On Wed, Jan 20, 2010 at 8:04 AM, MyD myd.ro...@googlemail.com wrote:
Because I would like to associate two separately running crawls, and the only way to
associate them is through the URL.
Is there something like CrawlDBReader that, instead of reading, lets me
write into the crawl db? An example would be great.
On Wed, Jan 20, 2010 at 4:16 PM, axi axi...@gmail.com wrote:
After several tests, I have noticed that Nutch ignores the alt text of images
inside <a href=...> tags.
So this feature isn't implemented yet, right?
What exactly do you want Nutch to do with the alt text? Index it?
Tokenize it? Make this field
If you use an image as a link, it is commonly understood that the image's alt text is
equivalent to the anchor text of a text link. Currently, if you put an image with alt
text inside a link, the anchor text for that link is empty and the image's alt text
is not counted.
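The behaviour asked for above can be illustrated with a small, self-contained sketch. It is not Nutch code and does not use any Nutch API; the class name AnchorTextExtractor and the function extract_anchors are hypothetical names chosen for this example. It uses Python's standard html.parser module and assumes the rule described in the message: when a link contains no visible text, the alt text of the images inside it is used as the anchor text.

```python
from html.parser import HTMLParser


class AnchorTextExtractor(HTMLParser):
    """Collect (href, anchor text) pairs, falling back to image alt text."""

    def __init__(self):
        super().__init__()
        self.links = []    # finished (href, anchor text) pairs
        self._href = None  # href of the <a> currently open, if any
        self._text = []    # visible text seen inside the open <a>
        self._alts = []    # alt texts of images inside the open <a>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:
            self._href, self._text, self._alts = attrs["href"], [], []
        elif tag == "img" and self._href is not None:
            alt = (attrs.get("alt") or "").strip()
            if alt:
                self._alts.append(alt)

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Normalise whitespace in the visible link text.
            text = " ".join("".join(self._text).split())
            if not text:
                # Assumed rule: no visible text, so use the image alt text.
                text = " ".join(self._alts)
            self.links.append((self._href, text))
            self._href = None


def extract_anchors(html):
    """Return a list of (href, anchor text) pairs found in the HTML string."""
    parser = AnchorTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.links
```

For example, extract_anchors('<a href="/home"><img src="logo.png" alt="Home page"></a>') yields the pair ("/home", "Home page"), whereas a parser that only collects text nodes would report an empty anchor for that link.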
Nutch Newbie wrote:
On Wed, Jan 20, 2010 at 4:16
On Wed, Jan 20, 2010 at 8:11 PM, axi axi...@gmail.com wrote:
If you use an image as a link, it is commonly understood that the image's alt text is
equivalent to the anchor text of a text link. Currently, if you put an image with alt
text inside a link, the anchor text for that link is empty and the image's alt text
is
I'll try that,
but the real anchor text is in
On Wed, Jan 20, 2010 at 8:11 PM, axi axi...@gmail.com wrote:
If you use an image as a link, it is commonly understood that the image's alt text is
equivalent to the anchor text of a text link. Currently, if you put an image with alt
text inside a link, the anchor
I'd like to use Julien's approach because I found the scoring filter complex
to understand.
My use case is the following:
1. During scoring, after parsing, I want to tag the pages that interest me, say
with meta=HIT.
2. In the next step (to be created), I would like to prune the segment of
NON-HIT content
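The two steps above can be sketched in a few lines. This is only an illustration of the tag-then-prune idea, not Nutch code: plain Python dictionaries stand in for segment records, and the names tag_hits, prune_non_hits and is_interesting are hypothetical. In Nutch itself the tagging would happen in a scoring or parse plugin and the pruning in a separate job over the segment.

```python
def tag_hits(pages, is_interesting):
    """Step 1: mark every interesting page with meta=HIT in its metadata."""
    for page in pages:
        if is_interesting(page):
            page.setdefault("metadata", {})["meta"] = "HIT"
    return pages


def prune_non_hits(pages):
    """Step 2: keep only the pages that were tagged meta=HIT."""
    return [p for p in pages if p.get("metadata", {}).get("meta") == "HIT"]


if __name__ == "__main__":
    # Stand-in records; real segment entries would carry parsed content.
    segment = [
        {"url": "http://example.com/a", "text": "nutch crawl tips"},
        {"url": "http://example.com/b", "text": "unrelated page"},
    ]
    tagged = tag_hits(segment, lambda p: "nutch" in p["text"])
    for page in prune_non_hits(tagged):
        print(page["url"])
```

Keeping the tag in per-page metadata means the pruning step needs no knowledge of how a page was judged interesting; it only checks for the tag.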
Nutch crawler did not read configuration files
--
Key: NUTCH-780
URL: https://issues.apache.org/jira/browse/NUTCH-780
Project: Nutch
Issue Type: Bug
Components: ndfs
Affects
[
https://issues.apache.org/jira/browse/NUTCH-780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Vu Hoang updated NUTCH-780:
---
Description:
Nutch searcher can read properties at the constructor ...
Update: the Hadoop and HBase jar versions were not right. After updating the jars in
the 'lib/' directory and rebuilding, it now throws:
org.apache.hadoop.hbase.regionserver.NoSuchColumnFamilyException:
org.apache.hadoop.hbase.regionserver.NoSuchColumnFamilyException: Column
family mtdt: does not exist in
[
https://issues.apache.org/jira/browse/NUTCH-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12803186#action_12803186
]
Xiao Yang commented on NUTCH-650:
-
Exception:
[
https://issues.apache.org/jira/browse/NUTCH-780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12803193#action_12803193
]
Vu Hoang commented on NUTCH-780:
Add the lines below to class org.apache.nutch.crawl.Crawl
[
https://issues.apache.org/jira/browse/NUTCH-780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12803194#action_12803194
]
Vu Hoang commented on NUTCH-780:
Add the lines below to class org.apache.nutch.crawl.Crawl
[
https://issues.apache.org/jira/browse/NUTCH-780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Vu Hoang updated NUTCH-780:
---
Comment: was deleted
(was: add lines below into class org.apache.nutch.crawl.Crawl
[
https://issues.apache.org/jira/browse/NUTCH-780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12803193#action_12803193
]
Vu Hoang edited comment on NUTCH-780 at 1/21/10 6:32 AM:
-
add lines