Re: [lucy-user] Chinese support?

2017-02-20 Thread Hao Wu
Hi Peter,

Thanks for spending time on the script.

I cleaned it up a bit, so there are no dependencies now.

https://gist.github.com/swuecho/1b960ae17a1f47466be006fd14e3b7ff

It still does not work.




On Mon, Feb 20, 2017 at 9:03 PM, Peter Karman  wrote:

> Hao Wu wrote on 2/20/17 10:18 PM:
>
>> Hi Peter,
>>
>> Thanks for the reply.
>>
>> That could be a problem. But probably not in my case.
>>
>> I removed the old index.
>>
>> I ran the program with 'ChineseAnalyzer' and truncate => 0 twice; the
>> second time gives me this error:
>>
>> 'body' assigned conflicting FieldType
>> LUCY_Schema_Spec_Field_IMP at cfcore/Lucy/Plan/Schema.c line 124
>> at /home/hwu/perl5/lib/perl5/x86_64-linux-gnu-thread-multi/Lucy.pm
>> line 118.
>> Lucy::Index::Indexer::new('Lucy::Index::Indexer', 'index',
>> '/home/hwu/data/lucy/mitbbs.index', 'schema',
>> 'Lucy::Plan::Schema=SCALAR(0x1c56798)', 'create', 1) called at
>> mitbbs_index.pl
>>  line 26
>>
>> Running the program with 'ChineseAnalyzer' and truncate => 1 twice gives
>> no error, but I want to update the index, not rebuild it.
>>
>> Running the program with 'StandardTokenizer' works fine with truncate set
>> to either 0 or 1.
>>
>> So this makes me think I must be missing something in the
>> 'ChineseAnalyzer' I have.
>>
>>
>
> This is not your fault, I don't think. This seems like a bug.
>
> Here's a smaller gist demonstrating the problem:
>
> https://gist.github.com/karpet/d8fe12085246b8419f9e4ab44930c1cc
>
> With the 2 files in the gist, I get this result:
>
> [karpet@pekmac:~/tmp/chinese-analyzer]$ perl indexer.pl test-index
> Building prefix dict from the default dictionary ...
> Loading model from cache /var/folders/r3/yk7hmbb9125fnsdf9bqs6lrmgp/T/jieba.cache
> Loading model cost 0.553 seconds.
> Prefix dict has been built succesfully.
> Finished.
>
> [karpet@pekmac:~/tmp/chinese-analyzer]$ perl indexer.pl test-index
> 'body' assigned conflicting FieldType
> LUCY_Schema_Spec_Field_IMP at cfcore/Lucy/Plan/Schema.c line 124
> at /usr/local/perl/5.24.0/lib/site_perl/5.24.0/darwin-2level/Lucy.pm
> line 118.
> Lucy::Index::Indexer::new("Lucy::Index::Indexer", "index",
> "test-index", "schema", Lucy::Plan::Schema=SCALAR(0x7f9b0b004a18),
> "create", 1) called at indexer.pl line 23
> Segmentation fault: 11
>
>
>
> I would expect the code to work as you wrote it, so maybe someone else can
> spot what's going wrong.
>
> Here's what the schema_1.json file looks like after the initial index
> creation:
>
> {
>   "_class": "Lucy::Plan::Schema",
>   "analyzers": [
>     null,
>     {
>       "_class": "ChineseAnalyzer"
>     }
>   ],
>   "fields": {
>     "body": {
>       "analyzer": "1",
>       "type": "fulltext"
>     }
>   }
> }
>
>
> --
> Peter Karman  .  https://peknet.com/  .  https://keybase.io/peterkarman
>


Re: [lucy-user] Chinese support?

2017-02-20 Thread Hao Wu
I still have a problem when I try to update the index using the custom analyzer.

If I comment out the
   truncate => 1

line and rerun, I get the following error:


'body' assigned conflicting FieldType
LUCY_Schema_Spec_Field_IMP at cfcore/Lucy/Plan/Schema.c line 124
at /home/hwu/perl5/lib/perl5/x86_64-linux-gnu-thread-multi/Lucy.pm
line 118.
Lucy::Index::Indexer::new('Lucy::Index::Indexer', 'index',
'/home/hwu/data/lucy/mitbbs.index', 'schema',
'Lucy::Plan::Schema=SCALAR(0x211c758)', 'create', 1) called at
mitbbs_index.pl line 26
*** Error in `perl': corrupted double-linked list: 0x021113a0 ***

If I switch the analyzer to Lucy::Analysis::StandardTokenizer, it works fine;
a new seg_2 is created.

my $tokenizer = Lucy::Analysis::StandardTokenizer->new;
my $raw_type  = Lucy::Plan::FullTextType->new(
    analyzer => $tokenizer,
);

So I guess I must be missing something in the custom ChineseAnalyzer.
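
Looking at Peter's schema_1.json dump, the analyzer is serialized by class
name, and on a second run the Indexer reloads that file and compares the
stored FieldType with the one spec'd in the script. So one guess is that my
analyzer has to round-trip through that serialization and compare equal
afterward. A minimal skeleton of that idea, assuming the analyzer subclasses
Lucy::Analysis::Analyzer and that transform/dump/load/equals can be
overridden the way Lucy's built-in objects handle them; the method bodies
below are my assumption, not verified code:

package ChineseAnalyzer;
use strict;
use warnings;
use base qw( Lucy::Analysis::Analyzer );
use Scalar::Util qw( blessed );

sub new {
    my ( $class, %args ) = @_;
    return $class->SUPER::new(%args);
}

# The actual tokenization (e.g. calling out to jieba) is elided here;
# the point is the serialization round-trip, not the segmentation.
sub transform {
    my ( $self, $inversion ) = @_;
    ...;
    return $inversion;
}

# Written into schema_N.json when the index is created.
sub dump {
    my $self = shift;
    return { _class => ref($self) };
}

# Called when the Indexer reloads schema_N.json on a later run.
sub load {
    my ( $either, $dump ) = @_;
    return $either->new;
}

# Without an equals that treats two ChineseAnalyzers as interchangeable,
# the reloaded FieldType may never match the freshly spec'd one, which is
# exactly the 'conflicting FieldType' symptom.
sub equals {
    my ( $self, $other ) = @_;
    return blessed($other) && $other->isa(__PACKAGE__) ? 1 : 0;
}

1;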



-- my script:

#!/usr/local/bin/perl
use strict;
use warnings;

# TODO: update existing docs instead of recreating the index every time
use DBI;
use File::Spec::Functions qw( catfile );

use Lucy::Plan::Schema;
use Lucy::Plan::FullTextType;
use Lucy::Index::Indexer;

use ChineseAnalyzer;

my $path_to_index = '/home/hwu/data/lucy/mitbbs.index';

# Create the schema with a single full-text 'body' field.
my $schema = Lucy::Plan::Schema->new;

my $chinese = ChineseAnalyzer->new();

my $raw_type = Lucy::Plan::FullTextType->new(
    analyzer => $chinese,
);

$schema->spec_field( name => 'body', type => $raw_type );

# Create an Indexer object.
my $indexer = Lucy::Index::Indexer->new(
    index    => $path_to_index,
    schema   => $schema,
    create   => 1,
    truncate => 1,
);

# Pull the posts out of SQLite.
my $driver   = "SQLite";
my $database = "/home/hwu/data/mitbbs.db";
my $dsn      = "DBI:$driver:dbname=$database";
my $dbh      = DBI->connect( $dsn, "", "", { RaiseError => 1 } )
    or die $DBI::errstr;

my $stmt = qq(SELECT id, text from post where id >= 100 and id < 200;);
#my $stmt = qq(SELECT id, text from post where id < 100;);
my $sth = $dbh->prepare($stmt);
my $rv  = $sth->execute() or die $DBI::errstr;

while ( my @row = $sth->fetchrow_array() ) {
    print "id = " . $row[0] . "\n";
    print $row[1];
    my $doc = { body => $row[1] };
    $indexer->add_doc($doc);
}

$indexer->commit;

print "Finished.\n";
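
For the TODO at the top, one possible shape for updating instead of
recreating, assuming the analyzer issue gets fixed and the index is built
with an 'id' field from the start (an untested sketch, not working code):

use Lucy::Plan::StringType;

# Index the post id as an exact-match field so a post can be replaced.
my $id_type = Lucy::Plan::StringType->new( stored => 1 );
$schema->spec_field( name => 'id', type => $id_type );

# Open without truncate so existing segments are kept; create => 1
# still lets the first run bootstrap the index.
my $updater = Lucy::Index::Indexer->new(
    index  => $path_to_index,
    schema => $schema,
    create => 1,
);

while ( my @row = $sth->fetchrow_array() ) {
    # Remove any previously indexed copy of this post, then re-add it.
    $updater->delete_by_term( field => 'id', term => "$row[0]" );
    $updater->add_doc( { id => "$row[0]", body => $row[1] } );
}
$updater->commit;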

On Sat, Feb 18, 2017 at 6:46 AM, Nick Wellnhofer 
wrote:

> On 18/02/2017 07:22, Hao Wu wrote:
>
>> Thanks. Got it working.
>>
>
> Lucy's StandardTokenizer breaks up the text at the word boundaries defined
> in Unicode Standard Annex #29. Then we treat every Alphabetic character
> that doesn't have a Word_Break property as a single term. These are
> characters that match \p{Ideographic}, \p{Script: Hiragana}, or
> \p{Line_Break: Complex_Context}. This should work for Chinese but as Peter
> mentioned, we don't support n-grams.
>
> If you're using QueryParser, you're likely to run into problems, though.
> QueryParser will turn a sequence of Chinese characters into a PhraseQuery
> which is obviously wrong. A quick hack is to insert a space after every
> Chinese character before passing a query string to QueryParser:
>
> $query_string =~ s/\p{Ideographic}/$& /g;
>
> Nick
>
>
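
Putting Nick's workaround together with the stock search classes, a query
script might look roughly like this (a sketch modeled on the Lucy tutorial's
IndexSearcher/QueryParser usage; the index path and 'body' field match the
indexing script above):

use strict;
use warnings;
use Lucy::Search::IndexSearcher;
use Lucy::Search::QueryParser;

my $searcher = Lucy::Search::IndexSearcher->new(
    index => '/home/hwu/data/lucy/mitbbs.index',
);
my $qparser = Lucy::Search::QueryParser->new(
    schema => $searcher->get_schema,
);

my $query_string = shift @ARGV;

# Nick's hack: spacing out ideographs keeps QueryParser from turning a
# run of Chinese characters into a single (wrong) PhraseQuery.
$query_string =~ s/\p{Ideographic}/$& /g;

my $hits = $searcher->hits( query => $qparser->parse($query_string) );
while ( my $hit = $hits->next ) {
    print "$hit->{body}\n";
}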


[lucy-user] Chinese support?

2017-02-17 Thread Hao Wu
Hi all,

I use the StandardTokenizer. Searching by an English word works, but
Chinese gives me strange results.

my $tokenizer = Lucy::Analysis::StandardTokenizer->new;
my $raw_type  = Lucy::Plan::FullTextType->new(
    analyzer => $tokenizer,
);

Also, I was going to use the EasyAnalyzer (
https://metacpan.org/pod/distribution/Lucy/lib/Lucy/Analysis/EasyAnalyzer.pod
), but Chinese is not supported.

What is the simplest way to use Lucy with Chinese documents? Thanks.

Best,

Hao Wu