I still have a problem when I try to update the index using the custom analyzer.
If I comment out the
truncate => 1
line and rerun, I get the following error:
'body' assigned conflicting FieldType
LUCY_Schema_Spec_Field_IMP at cfcore/Lucy/Plan/Schema.c line 124
at /home/hwu/perl5/lib/perl5/x86_64-linux-gnu-thread-multi/Lucy.pm
line 118.
Lucy::Index::Indexer::new('Lucy::Index::Indexer', 'index',
'/home/hwu/data/lucy/mitbbs.index', 'schema',
'Lucy::Plan::Schema=SCALAR(0x211c758)', 'create', 1) called at
mitbbs_index.pl line 26
*** Error in `perl': corrupted double-linked list: 0x00000000021113a0 ***
If I switch the analyzer to Lucy::Analysis::StandardTokenizer, it works fine
and a new seg_2 is created:
my $tokenizer = Lucy::Analysis::StandardTokenizer->new;
my $raw_type  = Lucy::Plan::FullTextType->new(
    analyzer => $tokenizer,
);
So I guess I must be missing something in the custom Chinese analyzer.
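One thing worth checking: the "conflicting FieldType" error appears only when the Indexer opens an existing index, which suggests the FieldType reloaded from the stored schema doesn't compare equal to the freshly built one. For a Perl subclass of Lucy::Analysis::Analyzer that comparison involves the analyzer itself, so here is a hypothetical sketch of the shape such a subclass might need (ChineseAnalyzer's real internals aren't shown, and the exact override set may vary by Lucy version):

```perl
package ChineseAnalyzer;
use base qw( Lucy::Analysis::Analyzer );

sub new {
    my $class = shift;
    return $class->SUPER::new(@_);
}

# Placeholder: the actual Chinese segmentation logic goes here.
sub transform {
    my ( $self, $inversion ) = @_;
    # ... tokenize the text and append tokens to $inversion ...
    return $inversion;
}

# When Lucy reopens an index it reloads the stored schema and compares
# it to the schema passed to the Indexer. If a custom analyzer can't be
# compared against its reloaded counterpart, the stored 'body' FieldType
# looks different from the new one and spec_field() reports a conflict.
sub equals {
    my ( $self, $other ) = @_;
    return $other->isa(__PACKAGE__) ? 1 : 0;
}

1;
```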
------------------my script--------------------
#!/usr/local/bin/perl
use strict;
use warnings;

# TODO: update existing docs instead of recreating the index every time
use DBI;
use File::Spec::Functions qw( catfile );
use Lucy::Plan::Schema;
use Lucy::Plan::FullTextType;
use Lucy::Index::Indexer;
use ChineseAnalyzer;

my $path_to_index = '/home/hwu/data/lucy/mitbbs.index';

# Create a schema with a single full-text field run through the
# custom Chinese analyzer.
my $schema   = Lucy::Plan::Schema->new;
my $chinese  = ChineseAnalyzer->new();
my $raw_type = Lucy::Plan::FullTextType->new(
    analyzer => $chinese,
);
$schema->spec_field( name => 'body', type => $raw_type );

# Create an Indexer object.
my $indexer = Lucy::Index::Indexer->new(
    index    => $path_to_index,
    schema   => $schema,
    create   => 1,
    truncate => 1,
);

# Pull the posts to index out of SQLite. Note that DBI->connect takes
# the attribute hash as its fourth argument, after username and password.
my $driver   = "SQLite";
my $database = "/home/hwu/data/mitbbs.db";
my $dsn      = "DBI:$driver:dbname=$database";
my $dbh = DBI->connect( $dsn, "", "", { RaiseError => 1 } ) or die $DBI::errstr;

my $stmt = qq(SELECT id, text FROM post WHERE id >= 100 AND id < 200;);
#my $stmt = qq(SELECT id, text FROM post WHERE id < 100;);
my $sth = $dbh->prepare($stmt);
my $rv  = $sth->execute() or die $DBI::errstr;

while ( my @row = $sth->fetchrow_array() ) {
    print "id = " . $row[0] . "\n";
    print $row[1];
    my $doc = { body => $row[1] };
    $indexer->add_doc($doc);
}

$indexer->commit;
print "Finished.\n";
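For the TODO at the top (update instead of recreating every time), one sketch is to store each post's id in the index and delete the old version before re-adding it via Indexer's delete_by_term. This assumes the schema also declares an 'id' field, which the script above does not yet do:

```perl
# Assumed addition to the schema: a string-typed 'id' field.
#     my $id_type = Lucy::Plan::StringType->new;
#     $schema->spec_field( name => 'id', type => $id_type );

# Open the index without truncating; create => 1 only creates it
# if it doesn't exist yet.
my $indexer = Lucy::Index::Indexer->new(
    index  => $path_to_index,
    schema => $schema,
    create => 1,
);

while ( my @row = $sth->fetchrow_array() ) {
    my ( $id, $text ) = @row;
    # Remove any previous version of this post, then add the new one.
    $indexer->delete_by_term( field => 'id', term => "$id" );
    $indexer->add_doc( { id => "$id", body => $text } );
}
$indexer->commit;
```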
On Sat, Feb 18, 2017 at 6:46 AM, Nick Wellnhofer <[email protected]>
wrote:
> On 18/02/2017 07:22, Hao Wu wrote:
>
>> Thanks. Got it working.
>>
>
> Lucy's StandardTokenizer breaks up the text at the word boundaries defined
> in Unicode Standard Annex #29. Then we treat every Alphabetic character
> that doesn't have a Word_Break property as a single term. These are
> characters that match \p{Ideographic}, \p{Script: Hiragana}, or
> \p{Line_Break: Complex_Context}. This should work for Chinese but as Peter
> mentioned, we don't support n-grams.
>
> If you're using QueryParser, you're likely to run into problems, though.
> QueryParser will turn a sequence of Chinese characters into a PhraseQuery
> which is obviously wrong. A quick hack is to insert a space after every
> Chinese character before passing a query string to QueryParser:
>
> $query_string =~ s/\p{Ideographic}/$& /g;
>
> Nick
>
>
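Nick's one-liner can be tried without Lucy at all. A small standalone version, using a capture group instead of $& (which carries a regex performance penalty on older perls):

```perl
use strict;
use warnings;
use utf8;
binmode STDOUT, ':encoding(UTF-8)';

# Insert a space after every ideographic character so QueryParser sees
# individual terms instead of turning the run into one PhraseQuery.
my $query_string = "搜索引擎 perl";
$query_string =~ s/(\p{Ideographic})/$1 /g;

print "$query_string\n";    # prints "搜 索 引 擎  perl"
```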