I think the binary section recognizer is probably your best bet.
If you write an analyzer that ignores terms that consist only of
hexadecimal digits, or that contain embedded digits, you will probably
reduce the pollution quite a bit. It is trivial to write and not
too expensive to check.
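As a minimal sketch of that rule (not the actual Lucene analyzer; the class
name, regexes, and method are illustrative assumptions), a predicate could
flag terms that are all hexadecimal digits, or that have a digit sandwiched
between letters as is typical of base64 runs:

```java
import java.util.regex.Pattern;

// Sketch of the suggested filtering rule; names and patterns are assumptions.
public class BinaryTermFilter {
    // Term consists entirely of hexadecimal digits, e.g. "deadbeef", "0F3A".
    private static final Pattern HEX_ONLY =
        Pattern.compile("[0-9a-fA-F]+");
    // Term has a digit embedded between letters, e.g. the "G8x" in "aGVsbG8x".
    private static final Pattern EMBEDDED_DIGIT =
        Pattern.compile("\\p{Alpha}\\p{Digit}+\\p{Alpha}");

    /** True if the term looks like binary/base64 noise and should be skipped. */
    public static boolean looksBinary(String term) {
        return HEX_ONLY.matcher(term).matches()
            || EMBEDDED_DIGIT.matcher(term).find();
    }
}
```

In a real analyzer this check would live in a TokenFilter that drops matching
terms before they reach the index; the predicate above only shows the per-term
decision.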
Hi All,
We are experiencing OOMs when binary data contained in text files
(e.g., a base64 section of a text file) is indexed. We have extensive
recognition of file types, but have encountered binary sections inside
otherwise normal text files.
We are using the default value of 128 for te
[
https://issues.apache.org/jira/browse/LUCENE-1019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Kyle Maxwell closed LUCENE-1019.
Resolution: Invalid
Lucene Fields: (was: [Patch Available, New])
Ok, I'm satisfied with D
[
https://issues.apache.org/jira/browse/LUCENE-1044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Michael McCandless updated LUCENE-1044:
---
Attachment: LUCENE-1044.take3.patch
Attached another rev of the patch.
I changed th
[
https://issues.apache.org/jira/browse/LUCENE-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12540518
]
Karl Wettin commented on LUCENE-1016:
-
I think this is interesting:
http://www.nabble.com/How-to-generate-TermF
On 1 Nov 2007 at 17:18, Grant Ingersoll (JIRA) wrote:
http://people.apache.org/maven-snapshot-repository/org/apache/lucene/
love++
-