Looking at the StandardAnalyzer/Tokenizer (which use JFlex internally), it
appears that the grammar used by the parser doesn't consider "?" and "!" as
punctuation!

Grrrrrrr.


-----Original Message-----
From: user-boun...@lists.neo4j.org [mailto:user-boun...@lists.neo4j.org] On
Behalf Of Rick Bullotta
Sent: Wednesday, September 15, 2010 2:44 PM
To: 'Neo4j user discussions'
Subject: Re: [Neo4j] Bug: LuceneFullTextQueryIndex service ignoring last
word/term

Well, I have it implemented, and it is cleaning up the content, but the
standard Lucene analyzer still isn't working correctly.  Random words are
completely ignored even with no special markup in the content, sometimes
words are combined, punctuation is never removed, etc.  Something is really
wrong, IMO.  Does anyone know of a way to dump out what the Lucene tokenizer
is generating when it splits the text into tokens/words?

-----Original Message-----
From: user-boun...@lists.neo4j.org [mailto:user-boun...@lists.neo4j.org] On
Behalf Of Rick Bullotta
Sent: Wednesday, September 15, 2010 2:23 PM
To: 'Neo4j user discussions'
Subject: Re: [Neo4j] Bug: LuceneFullTextQueryIndex service ignoring last
word/term

Actually, I ended up coming up with a workaround that involved using
HTMLStripReader/HTMLStripCharFilter for "pre-parsing" the text before
passing it into the neo .index(node,key,value) method.  Works great, though
there's a little extra string allocation involved.  It won't be invoked
often, so it isn't a big concern.
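
For readers who want to try this kind of pre-parsing without pulling in
Solr, here is a minimal sketch in plain Java of the normalization suggested
further down the thread (strip markup, lowercase, drop non-word characters).
It is not the HTMLStripCharFilter approach described above: the regex-based
tag stripping is a swapped-in simplification that does not handle entities,
comments, or scripts, and TextCleaner and clean are hypothetical names, not
part of Neo4j, Lucene, or Solr.  Punctuation is replaced with a space rather
than deleted, so adjacent words are not fused together.

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Neo4j, Lucene, or Solr.
public class TextCleaner {
    // Naive tag matcher: anything between '<' and the next '>'.
    private static final Pattern TAGS = Pattern.compile("<[^>]*>");
    // Anything that is not a letter, a decimal digit, or whitespace.
    private static final Pattern NON_WORD = Pattern.compile("[^\\p{L}\\p{Nd}\\s]+");
    // Runs of whitespace, collapsed to a single space at the end.
    private static final Pattern SPACES = Pattern.compile("\\s+");

    public static String clean(String html) {
        // 1. Replace markup with a space so words on either side stay separate.
        String text = TAGS.matcher(html).replaceAll(" ");
        // 2. Lowercase with a fixed locale so results do not depend on the JVM default.
        text = text.toLowerCase(Locale.ROOT);
        // 3. Replace punctuation such as '.', '?' and '!' with a space.
        text = NON_WORD.matcher(text).replaceAll(" ");
        // 4. Collapse whitespace and trim the ends.
        return SPACES.matcher(text).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        // prints "cheese is good really"
        System.out.println(clean("<p>Cheese is good. Really?!</p>"));
    }
}
```

The cleaned string would then be what gets passed to
.index(node,key,value), and it could also be stored on the node so the same
value is available for de-indexing later.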


-----Original Message-----
From: user-boun...@lists.neo4j.org [mailto:user-boun...@lists.neo4j.org] On
Behalf Of Toby Matejovsky
Sent: Wednesday, September 15, 2010 12:57 PM
To: Neo4j user discussions
Subject: Re: [Neo4j] Bug: LuceneFullTextQueryIndex service ignoring last
word/term

This is probably what you just found, but for others:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory

--
Toby Matejovsky


On Wed, Sep 15, 2010 at 12:49 PM,
<rick.bullo...@burningskysoftware.com> wrote:

>   Removing HTML markup is not a trivial task, but luckily, the Apache
>   Solr team has already created additional analyzers for Lucene that do
>   what I need (the analysis package in Solr has a lot of really good
>   stuff in it).
>
>   I will still need some help from the Neo team to understand how to use a
>   specific analyzer instead of the default one...
>
>   Thanks,
>
>   Rick
>
>   -------- Original Message --------
>   Subject: Re: [Neo4j] Bug: LuceneFullTextQueryIndex service ignoring
>   last word/term
>   From: Morten Barklund <mor...@barklund.dk>
>   Date: Wed, September 15, 2010 12:29 pm
>   To: Neo4j user discussions <u...@lists.neo4j.org>
>
>   Hi
>
>   I might be overly simplistic here, but why not lowercase the text,
>   remove HTML markup, then remove all non-word-or-space characters, store
>   this as the stripped version of the text on the node (for de-indexing),
>   and index this?
>
>   /Barklund
>
>   On Wed, Sep 15, 2010 at 18:07,
>   <rick.bullo...@burningskysoftware.com> wrote:
>   > Actually, it seems like a deeper bug/design flaw in Lucene's
>   > analyzer/tokenizer. The actual text is HTML text, with <p> and </p>
>   > wrappers. Lucene somewhat randomly seems to treat the last two words
>   > as a single token, and in other cases to ignore the last word
>   > altogether. The dot character screws it up even more, because even if
>   > it tokenizes with the dot character, you can't query with it (or at
>   > least nothing gets returned).
>   >
>   > Hmmm. I really don't want to have to write a tokenizer/analyzer if I
>   > can avoid it. Seems like a LOT of work.
>   >
>   > Do you have any example code of a custom tokenizer/analyzer we could
>   > start from?
>   >
>   > Thanks,
>   >
>   > Rick
>   >
>   > -------- Original Message --------
>   > Subject: Re: [Neo4j] Bug: LuceneFullTextQueryIndex service ignoring
>   > last word/term
>   > From: Mattias Persson <matt...@neotechnology.com>
>   > Date: Wed, September 15, 2010 11:47 am
>   > To: Neo4j user discussions <u...@lists.neo4j.org>
>   >
>   > Couldn't it be that sentences end with a dot... so "Cheese is good."
>   > will index the words: ["Cheese", "is", "good."]? Observe that the last
>   > word isn't "good", it's "good." with a dot. I know that has messed up
>   > some searches for me at least. You could perhaps override the
>   > implementation and instantiate an Analyzer/Tokenizer which gets rid of
>   > such punctuation characters?
>   >
>   > 2010/9/15 <rick.bullo...@burningskysoftware.com>
>   > > Using neo4j-index-1.1 and lucene-core-2.9.2, by the way.
>   > >
>   > > -------- Original Message --------
>   > > Subject: Re: [Neo4j] Bug: LuceneFullTextQueryIndex service ignoring
>   > > last word/term
>   > > From: Mattias Persson <matt...@neotechnology.com>
>   > > Date: Wed, September 15, 2010 10:37 am
>   > > To: Neo4j user discussions <u...@lists.neo4j.org>
>   > >
>   > > That sounds weird. Look at the
>   > > TestLuceneFulltextIndexService#testSimpleFulltext
>   > > method; it queries for the last word and it seems to work.
>   > > Could you provide more info on this?
>   > >
>   > > 2010/9/15 <rick.bullo...@burningskysoftware.com>
>   > > > I've noticed that when indexing full text, the last term/word is
>   > > > always ignored. This is a major issue, but I'm not sure if it is in
>   > > > the index utils or in Lucene itself.
>   > > >
>   > > > Any thoughts?
>   > > >
>   > > > Thanks,
>   > > >
>   > > > Rick
>   > > > _______________________________________________
>   > > > Neo4j mailing list
>   > > > u...@lists.neo4j.org
>   > > > https://lists.neo4j.org/mailman/listinfo/user
>   > > >
>   > > --
>   > > Mattias Persson, [matt...@neotechnology.com]
>   > > Hacker, Neo Technology
>   > > www.neotechnology.com
>   > --
>   > Mattias Persson, [matt...@neotechnology.com]
>   > Hacker, Neo Technology
>   > www.neotechnology.com
>   --
>   Morten Barklund
_______________________________________________
Neo4j mailing list
User@lists.neo4j.org
https://lists.neo4j.org/mailman/listinfo/user
