Re: Injecting urls and define Inlink
Because I'd like to associate two separate running crawls, and the only way is to associate them through the URL. Is there anything like CrawlDbReader, but for writing into the crawldb instead of reading it? An example would be great. Thanks in advance. Cheers

On Jan 20, 2010, at 8:36 AM, MilleBii mille...@gmail.com wrote: Why don't you inject www.inlink.com instead?

2010/1/20, MyD myd.ro...@googlemail.com: Dear Nutch developers: Is there any way to inject urls and define an inlink for them? E.g. I inject the url www.example.com and the inlink should be www.inlink.com. Thanks in advance for your help. Cheers, Markus

-- -MilleBii-
Nofollow links on nutch
I want to obtain the inlinks of certain urls and identify those inlinks as nofollowed or followed. Is this possible with the current Nutch linkdb? How can I do that using the linkdb access class in Nutch and accessing that data? Thanks in advance,
-- View this message in context: http://old.nabble.com/Nofollow-links-on-nutch-tp27243419p27243419.html Sent from the Nutch - Dev mailing list archive at Nabble.com.
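As far as I can tell, the linkdb stores only the inlink URL and its anchor text, so nofollow status would have to be captured earlier, at parse time, from each link's rel attribute. A minimal, Nutch-independent sketch of that classification step (class and method names are hypothetical, and the regexes are only illustrative; a real implementation would hook into Nutch's HTML parser and walk the DOM instead):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NofollowScan {
    private static final Pattern A_TAG =
        Pattern.compile("<a\\s+[^>]*>", Pattern.CASE_INSENSITIVE);
    private static final Pattern REL_NOFOLLOW =
        Pattern.compile("rel\\s*=\\s*[\"']?[^\"'>]*nofollow", Pattern.CASE_INSENSITIVE);
    private static final Pattern HREF =
        Pattern.compile("href\\s*=\\s*[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);

    /** Map each href found in the HTML to true if its anchor carries rel="nofollow". */
    public static Map<String, Boolean> classifyLinks(String html) {
        Map<String, Boolean> out = new LinkedHashMap<>();
        Matcher m = A_TAG.matcher(html);
        while (m.find()) {
            String tag = m.group();
            Matcher h = HREF.matcher(tag);
            if (h.find()) {
                out.put(h.group(1), REL_NOFOLLOW.matcher(tag).find());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Boolean> links = classifyLinks(
            "<a href=\"http://a.example/\" rel=\"nofollow\">x</a>" +
            "<a href=\"http://b.example/\">y</a>");
        System.out.println(links); // {http://a.example/=true, http://b.example/=false}
    }
}
```

The nofollow flag produced this way would then need to be carried alongside the anchor into whatever store replaces or augments the linkdb, which is the part Nutch does not do out of the box.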
Alt text of images as anchor text
After several tests, I have noticed that Nutch ignores the alt text of images inside <a href=...> tags. So this feature isn't implemented yet, right? Thanks in advance,
-- View this message in context: http://old.nabble.com/Alt-text-of-images-as-anchor-text-tp27244358p27244358.html
Re: Tried to run Crawl with depth of only 2 and getting IOException
On Wed, Jan 20, 2010 at 7:10 PM, kraman kirthi.ra...@gmail.com wrote:

kirth...@cerebrum [~/www/nutch]# ./bin/nutch crawl url -dir tinycrawl -depth 2
crawl started in: tinycrawl
rootUrlDir = url
threads = 10
depth = 2
Injector: starting
Injector: crawlDb: tinycrawl/crawldb
Injector: urlDir: url
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: tinycrawl/segments/20100120130316
Generator: filtering: false
Generator: topN: 2147483647
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: tinycrawl/segments/20100120130316
Fetcher: threads: 10
fetching http://www.mywebsite.us/
fetch of http://www.mywebsite.us/ failed with: java.lang.RuntimeException: Agent name not configured! You need to fix nutch config file as per README.
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: tinycrawl/crawldb
CrawlDb update: segments: [tinycrawl/segments/20100120130316]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: tinycrawl/segments/20100120130323
Generator: filtering: false
Generator: topN: 2147483647
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: tinycrawl/segments/20100120130323
Fetcher: threads: 10
fetching http://www.mywebsite.us/
fetch of http://www.mywebsite.us/ failed with: java.lang.RuntimeException: Agent name not configured!
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: tinycrawl/crawldb
CrawlDb update: segments: [tinycrawl/segments/20100120130323]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
LinkDb: starting
LinkDb: linkdb: tinycrawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: tinycrawl/segments/20100120130323
LinkDb: adding segment: tinycrawl/segments/20100120130316
LinkDb: done
Indexer: starting
Indexer: linkdb: tinycrawl/linkdb
Indexer: adding segment: tinycrawl/segments/20100120130323
Indexer: adding segment: tinycrawl/segments/20100120130316
Optimizing index.
Indexer: done
Dedup: starting
Dedup: adding indexes in: tinycrawl/indexes
Exception in thread "main" java.io.IOException: Job failed!
 at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
 at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
 at org.apache.nutch.crawl.Crawl.main(Crawl.java:135)

The log file gives:

java.lang.ArrayIndexOutOfBoundsException: -1
 at org.apache.lucene.index.MultiReader.isDeleted(MultiReader.java:113)
 at org.apache.nutch.indexer.DeleteDuplicates$InputFormat$DDRecordReader.next(DeleteDuplicates.java:176)
 at org.apache.hadoop.mapred.MapTask$1.next(MapTask.java:157)
 at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:46)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:175)
 at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:126)

-- View this message in context: http://old.nabble.com/Tried-to-run-Crawl-with-depth-of-only-2-and-getting-IOException-tp27246959p27246959.html
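The repeated "Agent name not configured!" failures in the log above mean nothing was actually fetched, which is the likely root cause of the empty segments that Dedup later chokes on. The usual fix is to set http.agent.name in conf/nutch-site.xml before crawling; a minimal fragment, with a placeholder agent name of your choosing:

```xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <!-- "MyTestCrawler" is a placeholder; pick your own agent name -->
    <value>MyTestCrawler</value>
  </property>
</configuration>
```

Depending on the Nutch version, the README also suggests filling in the related http.agent.* properties (description, url, email).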
Re: Injecting urls and define Inlink
On Wed, Jan 20, 2010 at 8:04 AM, MyD myd.ro...@googlemail.com wrote: Because I like to associate two separate running crawls. The only way is to associate it through the URL. Is there any way like CrawlDBReader but instead of reading I would like to write into it. A example would be great. Thanks in advance. You need to write nutch plugin thread below talks about conceptually similar to your issue http://www.mail-archive.com/nutch-u...@lucene.apache.org/msg08124.html Cheers On Jan 20, 2010, at 8:36 AM, MilleBii mille...@gmail.com wrote: Why don't you injetct www.inlink.com instead ? 2010/1/20, MyD myd.ro...@googlemail.com: Dear Nutch developers: Is there any way to inject urls and define an inlink for them? E.g. I inject the url www.example.com and the inlink should be www.inlink.com Thanks in advance for your help. Cheers, Markus -- -MilleBii-
Re: Alt text of images as anchor text
On Wed, Jan 20, 2010 at 4:16 PM, axi axi...@gmail.com wrote: After several tests, I have noticed that Nutch ignores the alt text of images inside <a href=...> tags. So this feature isn't implemented yet, right?

What exactly do you want Nutch to do with the alt text? Index it? Tokenize it? Make the field available in queries, i.e. img_alt:my alt tags? Or something else?

Thanks in advance, -- View this message in context: http://old.nabble.com/Alt-text-of-images-as-anchor-text-tp27244358p27244358.html
Re: Alt text of images as anchor text
If you put an image as a link, it is commonly understood that the alt text of that image is equivalent to the anchor text of a text link. But currently, if you put an image with alt text inside a link, the anchor text for that link is empty and the image's alt text is not counted.

Nutch Newbie wrote: On Wed, Jan 20, 2010 at 4:16 PM, axi axi...@gmail.com wrote: After several tests, I have noticed that Nutch ignores the alt text of images inside <a href=...> tags. So this feature isn't implemented yet, right? What exactly do you want Nutch to do with the alt text? Index it? Tokenize it? Make the field available in queries, i.e. img_alt:my alt tags? Or something else?

-- View this message in context: http://old.nabble.com/Alt-text-of-images-as-anchor-text-tp27244358p27247820.html
Re: Alt text of images as anchor text
On Wed, Jan 20, 2010 at 8:11 PM, axi axi...@gmail.com wrote: If you put an image as a link, it is commonly understood that the alt text of that image is equivalent to the anchor text of a text link. But currently, if you put an image with alt text inside a link, the anchor text for that link is empty and the image's alt text is not counted.

Are you crawling images at all? Or does your URL filter still skip them? See http://svn.apache.org/repos/asf/lucene/nutch/trunk/conf/crawl-urlfilter.txt.template:

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

-- View this message in context: http://old.nabble.com/Alt-text-of-images-as-anchor-text-tp27244358p27247820.html
Re: Alt text of images as anchor text
I'll try that, but the real anchor text is in

On Wed, Jan 20, 2010 at 8:11 PM, axi axi...@gmail.com wrote: If you put an image as a link, it is commonly understood that the alt text of that image is equivalent to the anchor text of a text link. But currently, if you put an image with alt text inside a link, the anchor text for that link is empty and the image's alt text is not counted.

Are you crawling images at all? Or does your URL filter still skip them? See http://svn.apache.org/repos/asf/lucene/nutch/trunk/conf/crawl-urlfilter.txt.template:

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

-- View this message in context: http://old.nabble.com/Alt-text-of-images-as-anchor-text-tp27244358p27249488.html
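What the thread is asking for is a parser fallback: when an <a> element's visible text is empty, use the nested image's alt attribute as the anchor text. Outside of Nutch, the idea can be sketched in a few lines (the class and method names here are hypothetical, and the regex approach is only illustrative; Nutch's HTML parser walks a DOM rather than using regexes):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AnchorText {
    private static final Pattern IMG_ALT = Pattern.compile(
        "<img[^>]*\\balt\\s*=\\s*[\"']([^\"']*)[\"']", Pattern.CASE_INSENSITIVE);

    /** For a single <a ...>...</a> fragment, return the visible text,
     *  falling back to a nested img's alt attribute when the text is empty. */
    public static String anchorText(String aElement) {
        // Strip tags to get the visible text content.
        String text = aElement.replaceAll("<[^>]+>", "").trim();
        if (!text.isEmpty()) return text;
        // Fallback: look for an alt attribute on a nested <img>.
        Matcher m = IMG_ALT.matcher(aElement);
        return m.find() ? m.group(1) : "";
    }

    public static void main(String[] args) {
        System.out.println(anchorText(
            "<a href=\"/x\"><img src=\"l.png\" alt=\"Company logo\"/></a>"));
        // -> Company logo
    }
}
```

In Nutch itself this fallback would belong in the outlink-extraction code of the HTML parser plugin, so the alt text flows into the linkdb as a normal anchor.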
Re: [jira] Commented: (NUTCH-779) Mechanism for passing metadata from parse to crawldb
I'd like to use Julien's approach because I found the scoring filter complex to understand. My use case is the following:
1. During scoring after parsing, I want to tag interesting pages, say meta=HIT.
2. In the next step (to be created), I would like to prune the segment of NON-HIT content in order to optimize segment space (I use Nutch caching); I typically need to ditch 90% of segment data.
Also considering: 4. focusing recrawls on HIT pages and their outlinks.
Today I don't really know how one can retrieve this metadata. I have managed to avoid storing text content for NON-HIT pages, but it is a dirty trick.

2010/1/19 Andrzej Bialecki (JIRA) j...@apache.org

[ https://issues.apache.org/jira/browse/NUTCH-779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12802175#action_12802175 ]

Andrzej Bialecki commented on NUTCH-779: Personally I would use ScoringFilters because I'm familiar with the API, but the approach that you propose is certainly more user friendly, especially for novice users.

Mechanism for passing metadata from parse to crawldb
Key: NUTCH-779
URL: https://issues.apache.org/jira/browse/NUTCH-779
Project: Nutch
Issue Type: New Feature
Reporter: Julien Nioche
Attachments: NUTCH-779

The patch attached allows to pass parse metadata to the corresponding entry of the crawldb. Comments are welcome.

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. -- -MilleBii-
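Whichever mechanism ends up in Nutch, the pruning step in point 2 is conceptually just a filter over segment records keyed by a metadata flag. A toy, Nutch-free sketch of that filter, with segment records modeled as a map from URL to metadata (all names here are hypothetical, not Nutch's segment API):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SegmentPrune {
    /** Keep only records whose parse metadata carries meta=HIT. */
    public static Map<String, Map<String, String>> pruneNonHits(
            Map<String, Map<String, String>> segment) {
        Map<String, Map<String, String>> kept = new LinkedHashMap<>();
        for (Map.Entry<String, Map<String, String>> e : segment.entrySet()) {
            if ("HIT".equals(e.getValue().get("meta"))) {
                kept.put(e.getKey(), e.getValue());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, Map<String, String>> segment = new LinkedHashMap<>();
        segment.put("http://keep.example/", Map.of("meta", "HIT"));
        segment.put("http://drop.example/", Map.of("meta", "MISS"));
        System.out.println(pruneNonHits(segment).keySet()); // [http://keep.example/]
    }
}
```

A real implementation would run as a map-reduce job over the segment, emitting only HIT records into a new segment directory, which is exactly why the metadata needs to be retrievable at that stage.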
[jira] Created: (NUTCH-780) Nutch crawler did not read configuration files
Nutch crawler did not read configuration files
--
Key: NUTCH-780
URL: https://issues.apache.org/jira/browse/NUTCH-780
Project: Nutch
Issue Type: Bug
Components: ndfs
Affects Versions: 1.0.0
Reporter: Vu Hoang

The Nutch searcher can read properties in its constructor ...

{code:java|title=NutchSearcher.java|borderStyle=solid}
NutchBean bean = new NutchBean(getFilesystem().getConf(), fs);
... // put search engine code here
{code}

... but the Nutch crawler does not; it only reads data from its arguments.

{code:java|title=NutchCrawler.java|borderStyle=solid}
StringBuilder builder = new StringBuilder();
builder.append(domainlist + SPACE);
builder.append(ARGUMENT_CRAWL_DIR);
builder.append(domainlist + SUBFIX_CRAWLED + SPACE);
builder.append(ARGUMENT_CRAWL_THREADS);
builder.append(threads + SPACE);
builder.append(ARGUMENT_CRAWL_DEPTH);
builder.append(depth + SPACE);
builder.append(ARGUMENT_CRAWL_TOPN);
builder.append(topN + SPACE);
Crawl.main(builder.toString().split(SPACE));
{code}

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
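As an aside, the concatenate-then-split pattern in the NutchCrawler snippet is fragile: any value containing a space breaks the resulting argument vector. If the goal is just to invoke Crawl.main programmatically, building the array directly avoids that. A sketch, assuming the standard crawl options (-dir, -threads, -depth, -topN, as seen in bin/nutch crawl usage); the class name and the ".crawled" suffix are illustrative:

```java
import java.util.Arrays;

public class CrawlArgs {
    /** Build the argument vector for Crawl.main without string splitting. */
    public static String[] build(String urlDir, String crawlDir,
                                 int threads, int depth, int topN) {
        return new String[] {
            urlDir,
            "-dir", crawlDir,
            "-threads", String.valueOf(threads),
            "-depth", String.valueOf(depth),
            "-topN", String.valueOf(topN)
        };
    }

    public static void main(String[] args) {
        String[] a = build("urls", "urls.crawled", 10, 2, 50);
        System.out.println(Arrays.toString(a));
        // [urls, -dir, urls.crawled, -threads, 10, -depth, 2, -topN, 50]
    }
}
```

This also sidesteps the SPACE/SUBFIX constants entirely, since each token is its own array element.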
[jira] Updated: (NUTCH-780) Nutch crawler did not read configuration files
[ https://issues.apache.org/jira/browse/NUTCH-780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vu Hoang updated NUTCH-780:
---

Description:

Nutch searcher can read properties at the constructor ...

{code:java|title=NutchSearcher.java|borderStyle=solid}
NutchBean bean = new NutchBean(getFilesystem().getConf(), fs);
... // put search engine code here
{code}

... but Nutch crawler is not, it only reads data from arguments.

{code:java|title=NutchCrawler.java|borderStyle=solid}
StringBuilder builder = new StringBuilder();
builder.append(domainlist + SPACE);
builder.append(ARGUMENT_CRAWL_DIR);
builder.append(domainlist + SUBFIX_CRAWLED + SPACE);
builder.append(ARGUMENT_CRAWL_THREADS);
builder.append(threads + SPACE);
builder.append(ARGUMENT_CRAWL_DEPTH);
builder.append(depth + SPACE);
builder.append(ARGUMENT_CRAWL_TOPN);
builder.append(topN + SPACE);
Crawl.main(builder.toString().split(SPACE));
{code}

was:

Nutch searcher can read properties at the constructor ...

{code:java|title=NutchSearcher.java|borderStyle=solid}
NutchBean bean = new NutchBean(getFilesystem().getConf(), fs);
... // put search engine code here
{code}

... but Nutch crawler is not, it only reads data from arguments.

{code:java|title=NutchCrawler.java|borderStyle=solid}
StringBuilder builder = new StringBuilder();
builder.append(domainlist + SPACE);
builder.append(ARGUMENT_CRAWL_DIR);
builder.append(domainlist + SUBFIX_CRAWLED + SPACE);
builder.append(ARGUMENT_CRAWL_THREADS);
builder.append(threads + SPACE);
builder.append(ARGUMENT_CRAWL_DEPTH);
builder.append(depth + SPACE);
builder.append(ARGUMENT_CRAWL_TOPN);
builder.append(topN + SPACE);
Crawl.main(builder.toString().split(SPACE));
{code}

Nutch crawler did not read configuration files
--
Key: NUTCH-780
URL: https://issues.apache.org/jira/browse/NUTCH-780
Project: Nutch
Issue Type: Bug
Components: ndfs
Affects Versions: 1.0.0
Reporter: Vu Hoang

Nutch searcher can read properties at the constructor ...

{code:java|title=NutchSearcher.java|borderStyle=solid}
NutchBean bean = new NutchBean(getFilesystem().getConf(), fs);
... // put search engine code here
{code}

... but Nutch crawler is not, it only reads data from arguments.

{code:java|title=NutchCrawler.java|borderStyle=solid}
StringBuilder builder = new StringBuilder();
builder.append(domainlist + SPACE);
builder.append(ARGUMENT_CRAWL_DIR);
builder.append(domainlist + SUBFIX_CRAWLED + SPACE);
builder.append(ARGUMENT_CRAWL_THREADS);
builder.append(threads + SPACE);
builder.append(ARGUMENT_CRAWL_DEPTH);
builder.append(depth + SPACE);
builder.append(ARGUMENT_CRAWL_TOPN);
builder.append(topN + SPACE);
Crawl.main(builder.toString().split(SPACE));
{code}

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: [jira] Commented: (NUTCH-650) Hbase Integration
Update: the Hadoop and HBase jar versions were not right. After updating the jars in the 'lib/' directory and rebuilding, it now throws:

org.apache.hadoop.hbase.regionserver.NoSuchColumnFamilyException: org.apache.hadoop.hbase.regionserver.NoSuchColumnFamilyException: Column family mtdt: does not exist in region crawl,,1264048608430 in table
{NAME => 'crawl', FAMILIES => [
 {NAME => 'bas', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
 {NAME => 'cnt', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
 {NAME => 'cnttyp', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
 {NAME => 'fchi', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
 {NAME => 'fcht', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
 {NAME => 'hdrs', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
 {NAME => 'ilnk', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
 {NAME => 'modt', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
 {NAME => 'mtdt', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
 {NAME => 'olnk', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
 {NAME => 'prsstt', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
 {NAME => 'prtstt', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
 {NAME => 'prvfch', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
 {NAME => 'prvsig', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
 {NAME => 'repr', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
 {NAME => 'rtrs', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
 {NAME => 'scr', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
 {NAME => 'sig', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
 {NAME => 'stt', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
 {NAME => 'ttl', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
 {NAME => 'txt', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}
 at org.apache.hadoop.hbase.regionserver.HRegion.checkFamily(HRegion.java:2381)
 at org.apache.hadoop.hbase.regionserver.HRegion.put(HRegion.java:1241)
 at org.apache.hadoop.hbase.regionserver.HRegion.put(HRegion.java:1208)
 at org.apache.hadoop.hbase.regionserver.HRegionServer.put(HRegionServer.java:1834)
 at sun.reflect.GeneratedMethodAccessor12.invoke(Unknown Source)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:648)
 at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
 at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
 at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
 at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
 at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
 at org.apache.hadoop.hbase.RemoteExceptionHandler.decodeRemoteException(RemoteExceptionHandler.java:94)
 at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.getRegionServerWithRetries(HConnectionManager.java:995)
 at org.apache.hadoop.hbase.client.HConnectionManager$TableServers$2.doCall(HConnectionManager.java:1193)
 at org.apache.hadoop.hbase.client.HConnectionManager$TableServers$Batch.process(HConnectionManager.java:1115)
 at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.processBatchOfRows(HConnectionManager.java:1201)
 at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:605)
 at
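A NoSuchColumnFamilyException like this usually means the existing 'crawl' table was created by an older schema that lacks the mtdt family. Assuming an HBase 0.20-era shell, one way out is to add the missing family by hand, or, if the table contents are disposable, drop the table and let the job recreate it (commands are a sketch; adjust family names to your schema):

```
# In the HBase shell: add the missing column family to the existing table.
disable 'crawl'
alter 'crawl', {NAME => 'mtdt'}
enable 'crawl'

# Or, if the table contents are disposable, drop it and let the job recreate it:
# disable 'crawl'
# drop 'crawl'
```

Note that altering requires the table to be disabled first, and any other families the new schema expects would need the same treatment.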
[jira] Commented: (NUTCH-650) Hbase Integration
[ https://issues.apache.org/jira/browse/NUTCH-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12803186#action_12803186 ]

Xiao Yang commented on NUTCH-650:

Exception:

org.apache.hadoop.hbase.regionserver.NoSuchColumnFamilyException: org.apache.hadoop.hbase.regionserver.NoSuchColumnFamilyException: Column family mtdt: does not exist in region crawl,,1264048608430 in table
{NAME => 'crawl', FAMILIES => [
 {NAME => 'bas', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
 {NAME => 'cnt', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
 {NAME => 'cnttyp', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
 {NAME => 'fchi', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
 {NAME => 'fcht', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
 {NAME => 'hdrs', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
 {NAME => 'ilnk', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
 {NAME => 'modt', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
 {NAME => 'mtdt', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
 {NAME => 'olnk', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
 {NAME => 'prsstt', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
 {NAME => 'prtstt', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
 {NAME => 'prvfch', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
 {NAME => 'prvsig', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
 {NAME => 'repr', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
 {NAME => 'rtrs', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
 {NAME => 'scr', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
 {NAME => 'sig', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
 {NAME => 'stt', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
 {NAME => 'ttl', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
 {NAME => 'txt', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}
 at org.apache.hadoop.hbase.regionserver.HRegion.checkFamily(HRegion.java:2381)
 at org.apache.hadoop.hbase.regionserver.HRegion.put(HRegion.java:1241)
 at org.apache.hadoop.hbase.regionserver.HRegion.put(HRegion.java:1208)
 at org.apache.hadoop.hbase.regionserver.HRegionServer.put(HRegionServer.java:1834)
 at sun.reflect.GeneratedMethodAccessor12.invoke(Unknown Source)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:648)
 at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
 at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
 at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
 at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
 at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
 at org.apache.hadoop.hbase.RemoteExceptionHandler.decodeRemoteException(RemoteExceptionHandler.java:94)
 at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.getRegionServerWithRetries(HConnectionManager.java:995)
 at org.apache.hadoop.hbase.client.HConnectionManager$TableServers$2.doCall(HConnectionManager.java:1193)
 at org.apache.hadoop.hbase.client.HConnectionManager$TableServers$Batch.process(HConnectionManager.java:1115)
 at
[jira] Commented: (NUTCH-780) Nutch crawler did not read configuration files
[ https://issues.apache.org/jira/browse/NUTCH-780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12803193#action_12803193 ]

Vu Hoang commented on NUTCH-780:

Add the lines below to the class org.apache.nutch.crawl.Crawl:

{code:java|title=org/apache/nutch/crawl/Crawl.java|borderStyle=solid}
public static Configuration nutchConfig = null;

public static void setNutchConfig(Configuration config) {
  nutchConfig = config;
}
{code}

and re-configure the Nutch configuration inside the main method as below:

{code:java|title=org/apache/nutch/crawl/Crawl.java|borderStyle=solid}
Configuration conf = null;
if (nutchConfig != null) conf = nutchConfig;
else conf = NutchConfiguration.createCrawlConfiguration();
{code}

Nutch crawler did not read configuration files
--
Key: NUTCH-780
URL: https://issues.apache.org/jira/browse/NUTCH-780
Project: Nutch
Issue Type: Bug
Components: ndfs
Affects Versions: 1.0.0
Reporter: Vu Hoang

Nutch searcher can read properties at the constructor ...

{code:java|title=NutchSearcher.java|borderStyle=solid}
NutchBean bean = new NutchBean(getFilesystem().getConf(), fs);
... // put search engine code here
{code}

... but Nutch crawler is not, it only reads data from arguments.

{code:java|title=NutchCrawler.java|borderStyle=solid}
StringBuilder builder = new StringBuilder();
builder.append(domainlist + SPACE);
builder.append(ARGUMENT_CRAWL_DIR);
builder.append(domainlist + SUBFIX_CRAWLED + SPACE);
builder.append(ARGUMENT_CRAWL_THREADS);
builder.append(threads + SPACE);
builder.append(ARGUMENT_CRAWL_DEPTH);
builder.append(depth + SPACE);
builder.append(ARGUMENT_CRAWL_TOPN);
builder.append(topN + SPACE);
Crawl.main(builder.toString().split(SPACE));
{code}

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-780) Nutch crawler did not read configuration files
[ https://issues.apache.org/jira/browse/NUTCH-780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vu Hoang updated NUTCH-780: --- Comment: was deleted (was: add lines below into class org.apache.nutch.crawl.Crawl {code:java|title=org/apache/nutch/crawl/Crawl.java|borderStyle=solid} public static Configuration nutchConfig = null; public static void setNutchConfig(Configuration config) { nutchConfig = config; } {code} and re-configure nutch configuration inside of method main as below {code:java|title=org/apache/nutch/crawl/Crawl.java|borderStyle=solid} Configuration conf = null; if (nutchConfig != null) conf = nutchConfig; else conf = NutchConfiguration.createCrawlConfiguration(); {code}) Nutch crawler did not read configuration files -- Key: NUTCH-780 URL: https://issues.apache.org/jira/browse/NUTCH-780 Project: Nutch Issue Type: Bug Components: ndfs Affects Versions: 1.0.0 Reporter: Vu Hoang Nutch searcher can read properties at the constructor ... {code:java|title=NutchSearcher.java|borderStyle=solid} NutchBean bean = new NutchBean(getFilesystem().getConf(), fs); ... // put search engine code here {code} ... but Nutch crawler is not, it only reads data from arguments. {code:java|title=NutchCrawler.java|borderStyle=solid} StringBuilder builder = new StringBuilder(); builder.append(domainlist + SPACE); builder.append(ARGUMENT_CRAWL_DIR); builder.append(domainlist + SUBFIX_CRAWLED + SPACE); builder.append(ARGUMENT_CRAWL_THREADS); builder.append(threads + SPACE); builder.append(ARGUMENT_CRAWL_DEPTH); builder.append(depth + SPACE); builder.append(ARGUMENT_CRAWL_TOPN); builder.append(topN + SPACE); Crawl.main(builder.toString().split(SPACE)); {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Issue Comment Edited: (NUTCH-780) Nutch crawler did not read configuration files
[ https://issues.apache.org/jira/browse/NUTCH-780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12803193#action_12803193 ]

Vu Hoang edited comment on NUTCH-780 at 1/21/10 6:32 AM:

Add the lines below to the class org.apache.nutch.crawl.Crawl:

{code:java|title=org/apache/nutch/crawl/Crawl.java|borderStyle=solid}
public static Configuration nutchConfig = null;

public static void setNutchConfig(Configuration config) {
  nutchConfig = config;
}
{code}

and re-configure the Nutch configuration inside the main method as below:

{code:java|title=org/apache/nutch/crawl/Crawl.java|borderStyle=solid}
Configuration conf = null;
if (nutchConfig != null) conf = nutchConfig;
else conf = NutchConfiguration.createCrawlConfiguration();
{code}

I recommend that solution :)

was (Author: vushogerts):

Add the lines below to the class org.apache.nutch.crawl.Crawl:

{code:java|title=org/apache/nutch/crawl/Crawl.java|borderStyle=solid}
public static Configuration nutchConfig = null;

public static void setNutchConfig(Configuration config) {
  nutchConfig = config;
}
{code}

and re-configure the Nutch configuration inside the main method as below:

{code:java|title=org/apache/nutch/crawl/Crawl.java|borderStyle=solid}
Configuration conf = null;
if (nutchConfig != null) conf = nutchConfig;
else conf = NutchConfiguration.createCrawlConfiguration();
{code}

Nutch crawler did not read configuration files
--
Key: NUTCH-780
URL: https://issues.apache.org/jira/browse/NUTCH-780
Project: Nutch
Issue Type: Bug
Components: ndfs
Affects Versions: 1.0.0
Reporter: Vu Hoang

Nutch searcher can read properties at the constructor ...

{code:java|title=NutchSearcher.java|borderStyle=solid}
NutchBean bean = new NutchBean(getFilesystem().getConf(), fs);
... // put search engine code here
{code}

... but Nutch crawler is not, it only reads data from arguments.

{code:java|title=NutchCrawler.java|borderStyle=solid}
StringBuilder builder = new StringBuilder();
builder.append(domainlist + SPACE);
builder.append(ARGUMENT_CRAWL_DIR);
builder.append(domainlist + SUBFIX_CRAWLED + SPACE);
builder.append(ARGUMENT_CRAWL_THREADS);
builder.append(threads + SPACE);
builder.append(ARGUMENT_CRAWL_DEPTH);
builder.append(depth + SPACE);
builder.append(ARGUMENT_CRAWL_TOPN);
builder.append(topN + SPACE);
Crawl.main(builder.toString().split(SPACE));
{code}

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.