Hello guys,

I have the following scenario.

urls.txt:

    http://localhost214:8080/

Injecting the seed:

    vickey@vickey:~/development/crawler/apache-nutch-2.3.1/runtime/local/bin$ ./nutch inject seedlocal/ -crawlId 1
    InjectorJob: starting at 2017-01-03 22:37:01
    InjectorJob: Injecting urlDir: seedlocal
    InjectorJob: Using class org.apache.gora.hbase.store.HBaseStore as the Gora storage class.
    InjectorJob: total number of urls rejected by filters: 0
    InjectorJob: total number of urls injected after normalization and filtering: 2
    Injector: finished at 2017-01-03 22:37:06, elapsed: 00:00:04

    hbase(main):021:0> scan "1_webpage"
    ROW                      COLUMN+CELL
     localhost213:http:8080  column=f:fi, timestamp=1483463225679, value=\x00'\x8D\x00
     localhost213:http:8080  column=f:ts, timestamp=1483463225679, value=\x00\x00\x01YeL`\x15
     localhost213:http:8080  column=mk:_injmrk_, timestamp=1483463225679, value=y
     localhost213:http:8080  column=mk:dist, timestamp=1483463225679, value=0
     localhost213:http:8080  column=mtdt:_csh_, timestamp=1483463225679, value=?\x80\x00\x00
     localhost213:http:8080  column=s:s, timestamp=1483463225679, value=?\x80\x00\x00
     localhost214:http:8080  column=f:fi, timestamp=1483463225827, value=\x00'\x8D\x00
     localhost214:http:8080  column=f:ts, timestamp=1483463225827, value=\x00\x00\x01YeL`\x15
     localhost214:http:8080  column=mk:_injmrk_, timestamp=1483463225827, value=y
     localhost214:http:8080  column=mk:dist, timestamp=1483463225827, value=0
     localhost214:http:8080  column=mtdt:_csh_, timestamp=1483463225827, value=?\x80\x00\x00
     localhost214:http:8080  column=s:s, timestamp=1483463225827, value=?\x80\x00\x00
    2 row(s) in 0.0360 seconds

I then deleted the 1_webpage table:

    hbase(main):022:0> disable "1_webpage"
    0 row(s) in 1.6340 seconds

    hbase(main):023:0> drop "1_webpage"
    0 row(s) in 0.2340 seconds

    hbase(main):024:0> scan "1_webpage"
    ROW                      COLUMN+CELL
    ERROR: Unknown table 1_webpage!

Next, I injected the same seed URL again:

    vickey@vickey:~/development/crawler/apache-nutch-2.3.1/runtime/local/bin$ ./nutch inject seedlocal/ -crawlId 1
    InjectorJob: starting at 2017-01-03 22:47:19
    InjectorJob: Injecting urlDir: seedlocal
    InjectorJob: Using class org.apache.gora.hbase.store.HBaseStore as the Gora storage class.
    InjectorJob: total number of urls rejected by filters: 0
    InjectorJob: total number of urls injected after normalization and filtering: 2
    Injector: finished at 2017-01-03 22:47:24, elapsed: 00:00:04

    hbase(main):025:0> scan "1_webpage"
    ROW                      COLUMN+CELL
     localhost213:http:8080  column=f:fi, timestamp=1483463843514, value=\x00'\x8D\x00
     localhost213:http:8080  column=f:ts, timestamp=1483463843514, value=\x00\x00\x01YeU\xCDv
     localhost213:http:8080  column=mk:_injmrk_, timestamp=1483463843514, value=y
     localhost213:http:8080  column=mk:dist, timestamp=1483463843514, value=0
     localhost213:http:8080  column=mtdt:_csh_, timestamp=1483463843514, value=?\x80\x00\x00
     localhost213:http:8080  column=s:s, timestamp=1483463843514, value=?\x80\x00\x00
     localhost214:http:8080  column=f:fi, timestamp=1483463843666, value=\x00'\x8D\x00
     localhost214:http:8080  column=f:ts, timestamp=1483463843666, value=\x00\x00\x01YeU\xCDv
     localhost214:http:8080  column=mk:_injmrk_, timestamp=1483463843666, value=y
     localhost214:http:8080  column=mk:dist, timestamp=1483463843666, value=0
     localhost214:http:8080  column=mtdt:_csh_, timestamp=1483463843666, value=?\x80\x00\x00
     localhost214:http:8080  column=s:s, timestamp=1483463843666, value=?\x80\x00\x00
    2 row(s) in 0.0460 seconds

Shouldn't deleting the 1_webpage table from HBase clear all the entries?

Please note that the seed URL entry is *http://localhost214:8080*. I expected to see its entry in 1_webpage, but the scan shows the other one as well. Why do I see the *http://localhost213:8080* entry? I guess it is coming from the file system. I would like to ask here before I start digging into the code.

Thanks,
Vicky

--
View this message in context: http://lucene.472066.n3.nabble.com/Seed-URL-ingestor-behavior-tp4312095.html
Sent from the Nutch - User mailing list archive at Nabble.com.
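
P.S. One way to check the file-system guess would be to list every file in the seed directory together with the URLs it contains, since the inject command is given the directory seedlocal/ rather than urls.txt itself. The snippet below is only a minimal sketch of that check. It assumes the injector picks up every regular file directly under the directory, and that blank lines and lines starting with '#' are not treated as URLs; the helper name list_seed_urls is hypothetical and is not part of Nutch.

```python
from pathlib import Path


def list_seed_urls(seed_dir):
    """Map each regular file directly under seed_dir to the URL lines it holds.

    Hypothetical helper for inspecting a seed directory; not a Nutch API.
    """
    found = {}
    for path in sorted(Path(seed_dir).iterdir()):
        # Only look at plain files; subdirectories are ignored in this sketch.
        if not path.is_file():
            continue
        urls = []
        text = path.read_text(encoding="utf-8", errors="replace")
        for line in text.splitlines():
            line = line.strip()
            # Assumption: blank lines and '#' comment lines are not seed URLs.
            if line and not line.startswith("#"):
                urls.append(line)
        found[path.name] = urls
    return found
```

Calling list_seed_urls("seedlocal") from the runtime/local/bin directory and printing the result would show whether any file other than urls.txt contributes a URL, which would account for an injected count of 2.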

