Re: error crawling

Lewis John Mcgibbney Mon, 20 May 2013 09:34:38 -0700

Please search the mailing list archives for the HBase logging message. There
was a conversation about it fairly recently.


Please see my other response for the rest.
hth
Lewis

On Monday, May 20, 2013, Christopher Gross <[email protected]> wrote:
> Ok, so the crawlId isn't like the directories used in the 1.x versions of
> nutch.
>
> Well, changing that line makes that part work.  I still get the "Skipping
> <url>; different batch id (null)" error.
>
> I'm not sure if this line from the hadoop.log file relates:
> INFO  store.HBaseStore - Keyclass and nameclass match but mismatching table names  mappingfile schema is 'webpage' vs actual schema 'crawl_webpage' , assuming they are the same.
>
> Any ideas for that one?
>
> -- Chris
>
>
> On Fri, May 17, 2013 at 4:32 PM, Tejas Patil <[email protected]> wrote:
>
>> The exception speaks about the problem:
>>
>> java.lang.RuntimeException: java.lang.IllegalArgumentException: Illegal first character <46> at 0. User-space table names can only start with 'word characters': i.e. [a-zA-Z_0-9]: ./crawl/_webpage
>>
>> The crawlId is used as a prefix of the table name, so it must consist of
>> word characters ([a-zA-Z_0-9]). The one you passed contains a dot and
>> slashes:
>> $ ./bin/nutch inject urls/ -crawlId ./crawl/
>>
>> Try this:
>> $ ./bin/nutch inject urls/ -crawlId crawl
>>
>>
>>
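For anyone hitting the same exception, here is a minimal sketch (not Nutch or HBase code) of a pre-flight check on the crawlId before calling inject. It assumes, based only on the names seen in this thread (./crawl/_webpage and crawl_webpage), that the table name is the crawlId, an underscore, and the schema name from the mapping file. The class and method names are hypothetical, and the word-character rule is a conservative reading of the error message, possibly stricter than what HBase actually accepts.

```java
import java.util.regex.Pattern;

public class CrawlIdCheck {
    // Assumption: only the word characters named in the error message are allowed.
    private static final Pattern WORD_CHARS = Pattern.compile("[a-zA-Z_0-9]+");

    /** Returns true if the crawlId is non-null and made up only of word characters. */
    public static boolean isSafeCrawlId(String crawlId) {
        return crawlId != null && WORD_CHARS.matcher(crawlId).matches();
    }

    /** Builds the table name the way it appears in this thread: crawlId + "_" + schema name. */
    public static String tableName(String crawlId, String schemaName) {
        return crawlId + "_" + schemaName;
    }

    public static void main(String[] args) {
        // The two crawlId values tried in this thread.
        for (String id : new String[] {"./crawl/", "crawl"}) {
            System.out.println(id + " -> " + tableName(id, "webpage")
                    + " safe=" + isSafeCrawlId(id));
        }
    }
}
```

With the two values from this thread, "./crawl/" is rejected and "crawl" is accepted, giving the table name crawl_webpage that shows up in the hadoop.log line.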
>> On Fri, May 17, 2013 at 12:47 PM, <[email protected]> wrote:
>>
>> > What if you do bin/nutch inject urls/ ?
>> >
>> >
>> >
>> >
>> >
>> >
>> > -----Original Message-----
>> > From: Christopher Gross <[email protected]>
>> > To: user <[email protected]>
>> > Sent: Fri, May 17, 2013 11:26 am
>> > Subject: error crawling
>> >
>> >
>> > I'm having trouble getting my nutch working.  I had it on another
>> > server and it was working fine.  I migrated it to a new server, and
>> > I've been getting nothing but problems.  My old script wasn't working
>> > right (getting a lot of "skipping" on the parser saying that the crawl
>> > id was null [a separate point of frustration]), so now I'm trying the
>> > 'newer' crawl script.  This one is worse, since I can't even get the
>> > inject to work.
>> >
>> > urls contains a "seed.txt" file that worked previously and contains a
>> > bunch of urls.  crawl is empty.
>> >
>> > from my $NUTCH_HOME directory:
>> >
>> > $ ./bin/nutch inject urls/ -crawlId ./crawl/
>> > InjectorJob: starting
>> > InjectorJob: urlDir: urls
>> > InjectorJob: org.apache.gora.util.GoraException: java.lang.RuntimeException: java.lang.IllegalArgumentException: Illegal first character <46> at 0. User-space table names can only start with 'word characters': i.e. [a-zA-Z_0-9]: ./crawl/_webpage
>> >         at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:167)
>> >         at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:135)
>> >         at org.apache.nutch.storage.StorageUtils.createWebStore(StorageUtils.java:75)
>> >         at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:214)
>> >         at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:228)
>> >         at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:248)
>> >         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>> >         at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:258)
>> > Caused by: java.lang.RuntimeException: java.lang.IllegalArgumentException: Illegal first character <46> at 0. User-space table names can only start with 'word characters': i.e. [a-zA-Z_0-9]: ./crawl/_webpage
>> >         at org.apache.gora.hbase.store.HBaseStore.initialize(HBaseStore.java:125)
>> >         at org.apache.gora.store.DataStoreFactory.initializeDataStore(DataStoreFactory.java:102)
>> >         at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:161)
>> >         ... 7 more
>> > Caused by: java.lang.IllegalArgumentException: Illegal first character <46> at 0. User-space table names can only start with 'word characters': i.e. [a-zA-Z_0-9]: ./crawl/_webpage
>> >         at org.apache.hadoop.hbase.HTableDescriptor.

-- 
*Lewis*
