Re: Injecting urls and define Inlink
Because I'd like to associate two separate running crawls, and the only way is to associate them through the URL. Is there anything like CrawlDbReader, but for writing into the crawldb instead of reading it? An example would be great. Thanks in advance. Cheers

On Jan 20, 2010, at 8:36 AM, MilleBii mille...@gmail.com wrote: Why don't you inject www.inlink.com instead?

2010/1/20, MyD myd.ro...@googlemail.com: Dear Nutch developers: Is there any way to inject urls and define an inlink for them? E.g. I inject the url www.example.com and the inlink should be www.inlink.com. Thanks in advance for your help. Cheers, Markus

-- -MilleBii-
Nofollow links on nutch
I want to obtain the inlinks of certain urls and identify those inlinks as nofollowed or followed. Is this possible with the current Nutch linkdb? How can I do that using the linkdb access class in Nutch and accessing that data? Thanks in advance,
-- View this message in context: http://old.nabble.com/Nofollow-links-on-nutch-tp27243419p27243419.html Sent from the Nutch - Dev mailing list archive at Nabble.com.
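As far as I can tell, the linkdb stores only the inlink URL and its anchor text, so nofollow status would have to be captured earlier, at parse time, from each link's rel attribute. A minimal, Nutch-independent sketch of that classification step (class and method names are hypothetical, and the regexes are only illustrative; a real implementation would hook into Nutch's HTML parser and walk the DOM instead):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NofollowScan {
    private static final Pattern A_TAG =
        Pattern.compile("<a\\s+[^>]*>", Pattern.CASE_INSENSITIVE);
    private static final Pattern REL_NOFOLLOW =
        Pattern.compile("rel\\s*=\\s*[\"']?[^\"'>]*nofollow", Pattern.CASE_INSENSITIVE);
    private static final Pattern HREF =
        Pattern.compile("href\\s*=\\s*[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);

    /** Map each href found in the HTML to true if its anchor carries rel="nofollow". */
    public static Map<String, Boolean> classifyLinks(String html) {
        Map<String, Boolean> out = new LinkedHashMap<>();
        Matcher m = A_TAG.matcher(html);
        while (m.find()) {
            String tag = m.group();
            Matcher h = HREF.matcher(tag);
            if (h.find()) {
                out.put(h.group(1), REL_NOFOLLOW.matcher(tag).find());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Boolean> links = classifyLinks(
            "<a href=\"http://a.example/\" rel=\"nofollow\">x</a>" +
            "<a href=\"http://b.example/\">y</a>");
        System.out.println(links); // {http://a.example/=true, http://b.example/=false}
    }
}
```

The nofollow flag produced this way would then need to be carried alongside the anchor into whatever store replaces or augments the linkdb, which is the part Nutch does not do out of the box.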
Alt text of images as anchor text
After several tests, I have noticed that Nutch ignores the alt text of images inside <a href=...> tags. So this feature isn't implemented yet, right? Thanks in advance,
-- View this message in context: http://old.nabble.com/Alt-text-of-images-as-anchor-text-tp27244358p27244358.html
Re: Tried to run Crawl with depth of only 2 and getting IOException
On Wed, Jan 20, 2010 at 7:10 PM, kraman kirthi.ra...@gmail.com wrote:

kirth...@cerebrum [~/www/nutch]# ./bin/nutch crawl url -dir tinycrawl -depth 2
crawl started in: tinycrawl
rootUrlDir = url
threads = 10
depth = 2
Injector: starting
Injector: crawlDb: tinycrawl/crawldb
Injector: urlDir: url
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: tinycrawl/segments/20100120130316
Generator: filtering: false
Generator: topN: 2147483647
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: tinycrawl/segments/20100120130316
Fetcher: threads: 10
fetching http://www.mywebsite.us/
fetch of http://www.mywebsite.us/ failed with: java.lang.RuntimeException: Agent name not configured! You need to fix nutch config file as per README.
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: tinycrawl/crawldb
CrawlDb update: segments: [tinycrawl/segments/20100120130316]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: tinycrawl/segments/20100120130323
Generator: filtering: false
Generator: topN: 2147483647
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: tinycrawl/segments/20100120130323
Fetcher: threads: 10
fetching http://www.mywebsite.us/
fetch of http://www.mywebsite.us/ failed with: java.lang.RuntimeException: Agent name not configured!
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: tinycrawl/crawldb
CrawlDb update: segments: [tinycrawl/segments/20100120130323]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
LinkDb: starting
LinkDb: linkdb: tinycrawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: tinycrawl/segments/20100120130323
LinkDb: adding segment: tinycrawl/segments/20100120130316
LinkDb: done
Indexer: starting
Indexer: linkdb: tinycrawl/linkdb
Indexer: adding segment: tinycrawl/segments/20100120130323
Indexer: adding segment: tinycrawl/segments/20100120130316
Optimizing index.
Indexer: done
Dedup: starting
Dedup: adding indexes in: tinycrawl/indexes
Exception in thread "main" java.io.IOException: Job failed!
 at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
 at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
 at org.apache.nutch.crawl.Crawl.main(Crawl.java:135)

The log file gives:

java.lang.ArrayIndexOutOfBoundsException: -1
 at org.apache.lucene.index.MultiReader.isDeleted(MultiReader.java:113)
 at org.apache.nutch.indexer.DeleteDuplicates$InputFormat$DDRecordReader.next(DeleteDuplicates.java:176)
 at org.apache.hadoop.mapred.MapTask$1.next(MapTask.java:157)
 at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:46)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:175)
 at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:126)

-- View this message in context: http://old.nabble.com/Tried-to-run-Crawl-with-depth-of-only-2-and-getting-IOException-tp27246959p27246959.html
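The repeated "Agent name not configured!" failures in the log above mean nothing was actually fetched, which is the likely root cause of the empty segments that Dedup later chokes on. The usual fix is to set http.agent.name in conf/nutch-site.xml before crawling; a minimal fragment, with a placeholder agent name of your choosing:

```xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <!-- "MyTestCrawler" is a placeholder; pick your own agent name -->
    <value>MyTestCrawler</value>
  </property>
</configuration>
```

Depending on the Nutch version, the README also suggests filling in the related http.agent.* properties (description, url, email).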
Re: Injecting urls and define Inlink
On Wed, Jan 20, 2010 at 8:04 AM, MyD myd.ro...@googlemail.com wrote: Because I like to associate two separate running crawls. The only way is to associate it through the URL. Is there any way like CrawlDBReader but instead of reading I would like to write into it. A example would be great. Thanks in advance. You need to write nutch plugin thread below talks about conceptually similar to your issue http://www.mail-archive.com/nutch-u...@lucene.apache.org/msg08124.html Cheers On Jan 20, 2010, at 8:36 AM, MilleBii mille...@gmail.com wrote: Why don't you injetct www.inlink.com instead ? 2010/1/20, MyD myd.ro...@googlemail.com: Dear Nutch developers: Is there any way to inject urls and define an inlink for them? E.g. I inject the url www.example.com and the inlink should be www.inlink.com Thanks in advance for your help. Cheers, Markus -- -MilleBii-
Re: Alt text of images as anchor text
On Wed, Jan 20, 2010 at 4:16 PM, axi axi...@gmail.com wrote: After several tests, I have noticed that Nutch ignores the alt text of images inside <a href=...> tags. So this feature isn't implemented yet, right?

What exactly do you want Nutch to do with the alt text? Index it? Tokenize it? Make the field available in queries, i.e. img_alt:my alt tags? Or something else?

Thanks in advance, -- View this message in context: http://old.nabble.com/Alt-text-of-images-as-anchor-text-tp27244358p27244358.html
Re: Alt text of images as anchor text
If you put an image as a link, it is commonly understood that the alt text of that image is equivalent to the anchor text of a text link. But currently, if you put an image with alt text inside a link, the anchor text for that link is empty and the image's alt text is not counted.

Nutch Newbie wrote: On Wed, Jan 20, 2010 at 4:16 PM, axi axi...@gmail.com wrote: After several tests, I have noticed that Nutch ignores the alt text of images inside <a href=...> tags. So this feature isn't implemented yet, right? What exactly do you want Nutch to do with the alt text? Index it? Tokenize it? Make the field available in queries, i.e. img_alt:my alt tags? Or something else?

-- View this message in context: http://old.nabble.com/Alt-text-of-images-as-anchor-text-tp27244358p27247820.html
Re: Alt text of images as anchor text
On Wed, Jan 20, 2010 at 8:11 PM, axi axi...@gmail.com wrote: If you put an image as a link, it is commonly understood that the alt text of that image is equivalent to the anchor text of a text link. But currently, if you put an image with alt text inside a link, the anchor text for that link is empty and the image's alt text is not counted.

Are you crawling images at all? Or does your URL filter still skip them? See http://svn.apache.org/repos/asf/lucene/nutch/trunk/conf/crawl-urlfilter.txt.template:

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

-- View this message in context: http://old.nabble.com/Alt-text-of-images-as-anchor-text-tp27244358p27247820.html
Re: Alt text of images as anchor text
I'll try that, but the real anchor text is in

On Wed, Jan 20, 2010 at 8:11 PM, axi axi...@gmail.com wrote: If you put an image as a link, it is commonly understood that the alt text of that image is equivalent to the anchor text of a text link. But currently, if you put an image with alt text inside a link, the anchor text for that link is empty and the image's alt text is not counted.

Are you crawling images at all? Or does your URL filter still skip them? See http://svn.apache.org/repos/asf/lucene/nutch/trunk/conf/crawl-urlfilter.txt.template:

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

-- View this message in context: http://old.nabble.com/Alt-text-of-images-as-anchor-text-tp27244358p27249488.html
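What the thread is asking for is a parser fallback: when an <a> element's visible text is empty, use the nested image's alt attribute as the anchor text. Outside of Nutch, the idea can be sketched in a few lines (the class and method names here are hypothetical, and the regex approach is only illustrative; Nutch's HTML parser walks a DOM rather than using regexes):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AnchorText {
    private static final Pattern IMG_ALT = Pattern.compile(
        "<img[^>]*\\balt\\s*=\\s*[\"']([^\"']*)[\"']", Pattern.CASE_INSENSITIVE);

    /** For a single <a ...>...</a> fragment, return the visible text,
     *  falling back to a nested img's alt attribute when the text is empty. */
    public static String anchorText(String aElement) {
        // Strip tags to get the visible text content.
        String text = aElement.replaceAll("<[^>]+>", "").trim();
        if (!text.isEmpty()) return text;
        // Fallback: look for an alt attribute on a nested <img>.
        Matcher m = IMG_ALT.matcher(aElement);
        return m.find() ? m.group(1) : "";
    }

    public static void main(String[] args) {
        System.out.println(anchorText(
            "<a href=\"/x\"><img src=\"l.png\" alt=\"Company logo\"/></a>"));
        // -> Company logo
    }
}
```

In Nutch itself this fallback would belong in the outlink-extraction code of the HTML parser plugin, so the alt text flows into the linkdb as a normal anchor.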
Re: [jira] Commented: (NUTCH-779) Mechanism for passing metadata from parse to crawldb
I'd like to use Julien's approach because I found the scoring filter complex to understand. My use case is the following:
1. During scoring after parsing, I want to tag interesting pages, say meta=HIT.
2. In the next step (to be created), I would like to prune the segment of NON-HIT content in order to optimize segment space (I use Nutch caching); I typically need to ditch 90% of segment data.
Also considering: 4. focusing recrawls on HIT pages and their outlinks.
Today I don't really know how one can retrieve this metadata. I have managed to avoid storing text content for NON-HIT pages, but it is a dirty trick.

2010/1/19 Andrzej Bialecki (JIRA) j...@apache.org

[ https://issues.apache.org/jira/browse/NUTCH-779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12802175#action_12802175 ]

Andrzej Bialecki commented on NUTCH-779: Personally I would use ScoringFilters because I'm familiar with the API, but the approach that you propose is certainly more user friendly, especially for novice users.

Mechanism for passing metadata from parse to crawldb
Key: NUTCH-779
URL: https://issues.apache.org/jira/browse/NUTCH-779
Project: Nutch
Issue Type: New Feature
Reporter: Julien Nioche
Attachments: NUTCH-779

The patch attached allows to pass parse metadata to the corresponding entry of the crawldb. Comments are welcome.

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. -- -MilleBii-
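Whichever mechanism ends up in Nutch, the pruning step in point 2 is conceptually just a filter over segment records keyed by a metadata flag. A toy, Nutch-free sketch of that filter, with segment records modeled as a map from URL to metadata (all names here are hypothetical, not Nutch's segment API):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SegmentPrune {
    /** Keep only records whose parse metadata carries meta=HIT. */
    public static Map<String, Map<String, String>> pruneNonHits(
            Map<String, Map<String, String>> segment) {
        Map<String, Map<String, String>> kept = new LinkedHashMap<>();
        for (Map.Entry<String, Map<String, String>> e : segment.entrySet()) {
            if ("HIT".equals(e.getValue().get("meta"))) {
                kept.put(e.getKey(), e.getValue());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, Map<String, String>> segment = new LinkedHashMap<>();
        segment.put("http://keep.example/", Map.of("meta", "HIT"));
        segment.put("http://drop.example/", Map.of("meta", "MISS"));
        System.out.println(pruneNonHits(segment).keySet()); // [http://keep.example/]
    }
}
```

A real implementation would run as a map-reduce job over the segment, emitting only HIT records into a new segment directory, which is exactly why the metadata needs to be retrievable at that stage.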
[jira] Created: (NUTCH-780) Nutch crawler did not read configuration files
Nutch crawler did not read configuration files
--
Key: NUTCH-780
URL: https://issues.apache.org/jira/browse/NUTCH-780
Project: Nutch
Issue Type: Bug
Components: ndfs
Affects Versions: 1.0.0
Reporter: Vu Hoang

The Nutch searcher can read properties in its constructor ...

{code:java|title=NutchSearcher.java|borderStyle=solid}
NutchBean bean = new NutchBean(getFilesystem().getConf(), fs);
... // put search engine code here
{code}

... but the Nutch crawler does not; it only reads data from its arguments.

{code:java|title=NutchCrawler.java|borderStyle=solid}
StringBuilder builder = new StringBuilder();
builder.append(domainlist + SPACE);
builder.append(ARGUMENT_CRAWL_DIR);
builder.append(domainlist + SUBFIX_CRAWLED + SPACE);
builder.append(ARGUMENT_CRAWL_THREADS);
builder.append(threads + SPACE);
builder.append(ARGUMENT_CRAWL_DEPTH);
builder.append(depth + SPACE);
builder.append(ARGUMENT_CRAWL_TOPN);
builder.append(topN + SPACE);
Crawl.main(builder.toString().split(SPACE));
{code}

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
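As an aside, the concatenate-then-split pattern in the NutchCrawler snippet is fragile: any value containing a space breaks the resulting argument vector. If the goal is just to invoke Crawl.main programmatically, building the array directly avoids that. A sketch, assuming the standard crawl options (-dir, -threads, -depth, -topN, as seen in bin/nutch crawl usage); the class name and the ".crawled" suffix are illustrative:

```java
import java.util.Arrays;

public class CrawlArgs {
    /** Build the argument vector for Crawl.main without string splitting. */
    public static String[] build(String urlDir, String crawlDir,
                                 int threads, int depth, int topN) {
        return new String[] {
            urlDir,
            "-dir", crawlDir,
            "-threads", String.valueOf(threads),
            "-depth", String.valueOf(depth),
            "-topN", String.valueOf(topN)
        };
    }

    public static void main(String[] args) {
        String[] a = build("urls", "urls.crawled", 10, 2, 50);
        System.out.println(Arrays.toString(a));
        // [urls, -dir, urls.crawled, -threads, 10, -depth, 2, -topN, 50]
    }
}
```

This also sidesteps the SPACE/SUBFIX constants entirely, since each token is its own array element.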
[jira] Updated: (NUTCH-780) Nutch crawler did not read configuration files
[ https://issues.apache.org/jira/browse/NUTCH-780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vu Hoang updated NUTCH-780:
---

Description:

Nutch searcher can read properties at the constructor ...

{code:java|title=NutchSearcher.java|borderStyle=solid}
NutchBean bean = new NutchBean(getFilesystem().getConf(), fs);
... // put search engine code here
{code}

... but Nutch crawler is not, it only reads data from arguments.

{code:java|title=NutchCrawler.java|borderStyle=solid}
StringBuilder builder = new StringBuilder();
builder.append(domainlist + SPACE);
builder.append(ARGUMENT_CRAWL_DIR);
builder.append(domainlist + SUBFIX_CRAWLED + SPACE);
builder.append(ARGUMENT_CRAWL_THREADS);
builder.append(threads + SPACE);
builder.append(ARGUMENT_CRAWL_DEPTH);
builder.append(depth + SPACE);
builder.append(ARGUMENT_CRAWL_TOPN);
builder.append(topN + SPACE);
Crawl.main(builder.toString().split(SPACE));
{code}

was:

Nutch searcher can read properties at the constructor ...

{code:java|title=NutchSearcher.java|borderStyle=solid}
NutchBean bean = new NutchBean(getFilesystem().getConf(), fs);
... // put search engine code here
{code}

... but Nutch crawler is not, it only reads data from arguments.

{code:java|title=NutchCrawler.java|borderStyle=solid}
StringBuilder builder = new StringBuilder();
builder.append(domainlist + SPACE);
builder.append(ARGUMENT_CRAWL_DIR);
builder.append(domainlist + SUBFIX_CRAWLED + SPACE);
builder.append(ARGUMENT_CRAWL_THREADS);
builder.append(threads + SPACE);
builder.append(ARGUMENT_CRAWL_DEPTH);
builder.append(depth + SPACE);
builder.append(ARGUMENT_CRAWL_TOPN);
builder.append(topN + SPACE);
Crawl.main(builder.toString().split(SPACE));
{code}

Nutch crawler did not read configuration files
--
Key: NUTCH-780
URL: https://issues.apache.org/jira/browse/NUTCH-780
Project: Nutch
Issue Type: Bug
Components: ndfs
Affects Versions: 1.0.0
Reporter: Vu Hoang

Nutch searcher can read properties at the constructor ...

{code:java|title=NutchSearcher.java|borderStyle=solid}
NutchBean bean = new NutchBean(getFilesystem().getConf(), fs);
... // put search engine code here
{code}

... but Nutch crawler is not, it only reads data from arguments.

{code:java|title=NutchCrawler.java|borderStyle=solid}
StringBuilder builder = new StringBuilder();
builder.append(domainlist + SPACE);
builder.append(ARGUMENT_CRAWL_DIR);
builder.append(domainlist + SUBFIX_CRAWLED + SPACE);
builder.append(ARGUMENT_CRAWL_THREADS);
builder.append(threads + SPACE);
builder.append(ARGUMENT_CRAWL_DEPTH);
builder.append(depth + SPACE);
builder.append(ARGUMENT_CRAWL_TOPN);
builder.append(topN + SPACE);
Crawl.main(builder.toString().split(SPACE));
{code}

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: [jira] Commented: (NUTCH-650) Hbase Integration
Update: the Hadoop and HBase jar versions were not right. After updating the jars in the 'lib/' directory and rebuilding, it now throws:

org.apache.hadoop.hbase.regionserver.NoSuchColumnFamilyException: org.apache.hadoop.hbase.regionserver.NoSuchColumnFamilyException: Column family mtdt: does not exist in region crawl,,1264048608430 in table
{NAME => 'crawl', FAMILIES => [
 {NAME => 'bas', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
 {NAME => 'cnt', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
 {NAME => 'cnttyp', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
 {NAME => 'fchi', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
 {NAME => 'fcht', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
 {NAME => 'hdrs', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
 {NAME => 'ilnk', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
 {NAME => 'modt', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
 {NAME => 'mtdt', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
 {NAME => 'olnk', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
 {NAME => 'prsstt', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
 {NAME => 'prtstt', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
 {NAME => 'prvfch', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
 {NAME => 'prvsig', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
 {NAME => 'repr', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
 {NAME => 'rtrs', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
 {NAME => 'scr', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
 {NAME => 'sig', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
 {NAME => 'stt', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
 {NAME => 'ttl', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
 {NAME => 'txt', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}
 at org.apache.hadoop.hbase.regionserver.HRegion.checkFamily(HRegion.java:2381)
 at org.apache.hadoop.hbase.regionserver.HRegion.put(HRegion.java:1241)
 at org.apache.hadoop.hbase.regionserver.HRegion.put(HRegion.java:1208)
 at org.apache.hadoop.hbase.regionserver.HRegionServer.put(HRegionServer.java:1834)
 at sun.reflect.GeneratedMethodAccessor12.invoke(Unknown Source)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:648)
 at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
 at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
 at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
 at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
 at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
 at org.apache.hadoop.hbase.RemoteExceptionHandler.decodeRemoteException(RemoteExceptionHandler.java:94)
 at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.getRegionServerWithRetries(HConnectionManager.java:995)
 at org.apache.hadoop.hbase.client.HConnectionManager$TableServers$2.doCall(HConnectionManager.java:1193)
 at org.apache.hadoop.hbase.client.HConnectionManager$TableServers$Batch.process(HConnectionManager.java:1115)
 at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.processBatchOfRows(HConnectionManager.java:1201)
 at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:605)
 at
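A NoSuchColumnFamilyException like this usually means the existing 'crawl' table was created by an older schema that lacks the mtdt family. Assuming an HBase 0.20-era shell, one way out is to add the missing family by hand, or, if the table contents are disposable, drop the table and let the job recreate it (commands are a sketch; adjust family names to your schema):

```
# In the HBase shell: add the missing column family to the existing table.
disable 'crawl'
alter 'crawl', {NAME => 'mtdt'}
enable 'crawl'

# Or, if the table contents are disposable, drop it and let the job recreate it:
# disable 'crawl'
# drop 'crawl'
```

Note that altering requires the table to be disabled first, and any other families the new schema expects would need the same treatment.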
[jira] Commented: (NUTCH-650) Hbase Integration
[ https://issues.apache.org/jira/browse/NUTCH-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12803186#action_12803186 ]

Xiao Yang commented on NUTCH-650:

Exception:

org.apache.hadoop.hbase.regionserver.NoSuchColumnFamilyException: org.apache.hadoop.hbase.regionserver.NoSuchColumnFamilyException: Column family mtdt: does not exist in region crawl,,1264048608430 in table
{NAME => 'crawl', FAMILIES => [
 {NAME => 'bas', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
 {NAME => 'cnt', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
 {NAME => 'cnttyp', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
 {NAME => 'fchi', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
 {NAME => 'fcht', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
 {NAME => 'hdrs', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
 {NAME => 'ilnk', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
 {NAME => 'modt', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
 {NAME => 'mtdt', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
 {NAME => 'olnk', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
 {NAME => 'prsstt', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
 {NAME => 'prtstt', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
 {NAME => 'prvfch', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
 {NAME => 'prvsig', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
 {NAME => 'repr', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
 {NAME => 'rtrs', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
 {NAME => 'scr', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
 {NAME => 'sig', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
 {NAME => 'stt', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
 {NAME => 'ttl', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
 {NAME => 'txt', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}
 at org.apache.hadoop.hbase.regionserver.HRegion.checkFamily(HRegion.java:2381)
 at org.apache.hadoop.hbase.regionserver.HRegion.put(HRegion.java:1241)
 at org.apache.hadoop.hbase.regionserver.HRegion.put(HRegion.java:1208)
 at org.apache.hadoop.hbase.regionserver.HRegionServer.put(HRegionServer.java:1834)
 at sun.reflect.GeneratedMethodAccessor12.invoke(Unknown Source)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:648)
 at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
 at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
 at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
 at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
 at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
 at org.apache.hadoop.hbase.RemoteExceptionHandler.decodeRemoteException(RemoteExceptionHandler.java:94)
 at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.getRegionServerWithRetries(HConnectionManager.java:995)
 at org.apache.hadoop.hbase.client.HConnectionManager$TableServers$2.doCall(HConnectionManager.java:1193)
 at org.apache.hadoop.hbase.client.HConnectionManager$TableServers$Batch.process(HConnectionManager.java:1115)
 at
[jira] Commented: (NUTCH-780) Nutch crawler did not read configuration files
[ https://issues.apache.org/jira/browse/NUTCH-780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12803193#action_12803193 ]

Vu Hoang commented on NUTCH-780:

Add the lines below to the class org.apache.nutch.crawl.Crawl:

{code:java|title=org/apache/nutch/crawl/Crawl.java|borderStyle=solid}
public static Configuration nutchConfig = null;

public static void setNutchConfig(Configuration config) {
  nutchConfig = config;
}
{code}

and re-configure the Nutch configuration inside the main method as below:

{code:java|title=org/apache/nutch/crawl/Crawl.java|borderStyle=solid}
Configuration conf = null;
if (nutchConfig != null) conf = nutchConfig;
else conf = NutchConfiguration.createCrawlConfiguration();
{code}

Nutch crawler did not read configuration files
--
Key: NUTCH-780
URL: https://issues.apache.org/jira/browse/NUTCH-780
Project: Nutch
Issue Type: Bug
Components: ndfs
Affects Versions: 1.0.0
Reporter: Vu Hoang

Nutch searcher can read properties at the constructor ...

{code:java|title=NutchSearcher.java|borderStyle=solid}
NutchBean bean = new NutchBean(getFilesystem().getConf(), fs);
... // put search engine code here
{code}

... but Nutch crawler is not, it only reads data from arguments.

{code:java|title=NutchCrawler.java|borderStyle=solid}
StringBuilder builder = new StringBuilder();
builder.append(domainlist + SPACE);
builder.append(ARGUMENT_CRAWL_DIR);
builder.append(domainlist + SUBFIX_CRAWLED + SPACE);
builder.append(ARGUMENT_CRAWL_THREADS);
builder.append(threads + SPACE);
builder.append(ARGUMENT_CRAWL_DEPTH);
builder.append(depth + SPACE);
builder.append(ARGUMENT_CRAWL_TOPN);
builder.append(topN + SPACE);
Crawl.main(builder.toString().split(SPACE));
{code}

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-780) Nutch crawler did not read configuration files
[ https://issues.apache.org/jira/browse/NUTCH-780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vu Hoang updated NUTCH-780: --- Comment: was deleted (was: add lines below into class org.apache.nutch.crawl.Crawl {code:java|title=org/apache/nutch/crawl/Crawl.java|borderStyle=solid} public static Configuration nutchConfig = null; public static void setNutchConfig(Configuration config) { nutchConfig = config; } {code} and re-configure nutch configuration inside of method main as below {code:java|title=org/apache/nutch/crawl/Crawl.java|borderStyle=solid} Configuration conf = null; if (nutchConfig != null) conf = nutchConfig; else conf = NutchConfiguration.createCrawlConfiguration(); {code}) Nutch crawler did not read configuration files -- Key: NUTCH-780 URL: https://issues.apache.org/jira/browse/NUTCH-780 Project: Nutch Issue Type: Bug Components: ndfs Affects Versions: 1.0.0 Reporter: Vu Hoang Nutch searcher can read properties at the constructor ... {code:java|title=NutchSearcher.java|borderStyle=solid} NutchBean bean = new NutchBean(getFilesystem().getConf(), fs); ... // put search engine code here {code} ... but Nutch crawler is not, it only reads data from arguments. {code:java|title=NutchCrawler.java|borderStyle=solid} StringBuilder builder = new StringBuilder(); builder.append(domainlist + SPACE); builder.append(ARGUMENT_CRAWL_DIR); builder.append(domainlist + SUBFIX_CRAWLED + SPACE); builder.append(ARGUMENT_CRAWL_THREADS); builder.append(threads + SPACE); builder.append(ARGUMENT_CRAWL_DEPTH); builder.append(depth + SPACE); builder.append(ARGUMENT_CRAWL_TOPN); builder.append(topN + SPACE); Crawl.main(builder.toString().split(SPACE)); {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Issue Comment Edited: (NUTCH-780) Nutch crawler did not read configuration files
[ https://issues.apache.org/jira/browse/NUTCH-780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12803193#action_12803193 ]

Vu Hoang edited comment on NUTCH-780 at 1/21/10 6:32 AM:

Add the lines below to the class org.apache.nutch.crawl.Crawl:

{code:java|title=org/apache/nutch/crawl/Crawl.java|borderStyle=solid}
public static Configuration nutchConfig = null;

public static void setNutchConfig(Configuration config) {
  nutchConfig = config;
}
{code}

and re-configure the Nutch configuration inside the main method as below:

{code:java|title=org/apache/nutch/crawl/Crawl.java|borderStyle=solid}
Configuration conf = null;
if (nutchConfig != null) conf = nutchConfig;
else conf = NutchConfiguration.createCrawlConfiguration();
{code}

I recommend that solution :)

was (Author: vushogerts):

Add the lines below to the class org.apache.nutch.crawl.Crawl:

{code:java|title=org/apache/nutch/crawl/Crawl.java|borderStyle=solid}
public static Configuration nutchConfig = null;

public static void setNutchConfig(Configuration config) {
  nutchConfig = config;
}
{code}

and re-configure the Nutch configuration inside the main method as below:

{code:java|title=org/apache/nutch/crawl/Crawl.java|borderStyle=solid}
Configuration conf = null;
if (nutchConfig != null) conf = nutchConfig;
else conf = NutchConfiguration.createCrawlConfiguration();
{code}

Nutch crawler did not read configuration files
--
Key: NUTCH-780
URL: https://issues.apache.org/jira/browse/NUTCH-780
Project: Nutch
Issue Type: Bug
Components: ndfs
Affects Versions: 1.0.0
Reporter: Vu Hoang

Nutch searcher can read properties at the constructor ...

{code:java|title=NutchSearcher.java|borderStyle=solid}
NutchBean bean = new NutchBean(getFilesystem().getConf(), fs);
... // put search engine code here
{code}

... but Nutch crawler is not, it only reads data from arguments.

{code:java|title=NutchCrawler.java|borderStyle=solid}
StringBuilder builder = new StringBuilder();
builder.append(domainlist + SPACE);
builder.append(ARGUMENT_CRAWL_DIR);
builder.append(domainlist + SUBFIX_CRAWLED + SPACE);
builder.append(ARGUMENT_CRAWL_THREADS);
builder.append(threads + SPACE);
builder.append(ARGUMENT_CRAWL_DEPTH);
builder.append(depth + SPACE);
builder.append(ARGUMENT_CRAWL_TOPN);
builder.append(topN + SPACE);
Crawl.main(builder.toString().split(SPACE));
{code}

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.