Mohammad Norouzi wrote:
I registered in Nabble, but to post a message you have to subscribe to the
lucene
mailing list, and if you subscribe to the mailing list your inbox will fill
up with messages. This is very bad!!!
You're using gmail aren't you? Why don't you set up a filter to handle
mail from th
Grant Ingersoll wrote:
I like the mailing list approach much better. With a good set of
rules and folders in place (which takes about 15 minutes to set up),
one can easily manage large volumes of mail w/o batting an eye,
whereas forums require large amounts of navigation, IMO.
Glad I'm not th
karl wettin wrote:
The way I see it (and probably many others do too) mailing lists are superior
in many ways, especially when following multiple forums.
It's true. Any forum that I need to subscribe to I find an RSS feed
for so that I can get mail messages. Forums are a pain in the neck
once you'
Daniel Noll wrote:
The only screenshots I can see look like plain text to me, and I'm
currently working on something which needs to convert Word to HTML,
which is why I ask.
wvWare, which I mentioned earlier, can convert word to HTML and does a
pretty good job of maintaining formatting. abiwor
John Haxby wrote:
Sami Siren wrote:
There's also antiword [1] which can convert your .doc to plain text or
PS, not sure how good it is.
antiword isn't very good. I use wvWare
(http://wvware.sourceforge.net/) directly, but you may find that using
abiword is better for you (abi
Sami Siren wrote:
There's also antiword [1] which can convert your .doc to plain text or
PS, not sure how good it is.
antiword isn't very good. I use wvWare (http://wvware.sourceforge.net/)
directly, but you may find that using abiword is better for you (abiword
is an editor, but it also do
Chris Hostetter wrote:
i'm no crypto expert, but i imagine it would probably take the same
amount of statistical guess work to reconstruct meaningful info from
either approach (hashing the individual words compared to eliminating the
positions) so i would think the trade off of supporting phrase
MC Moisei wrote:
Is there an easy way to clear locks?
If I redeploy my war file while indexing happens to be in progress, the
lock is not cleared. I know I can tell the JVM to run the finalizers
before it exits, but in this case the JVM is not exiting since it's
a hot deploy.
I'd do this by ha
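A minimal sketch of the manual workaround that was common at the time. This is an assumption on my part: it presumes the lock lives as a `write.lock` file in the index directory (later Lucene versions do this; older releases put hashed lock file names in `java.io.tmpdir`) and that no other process can possibly be writing to the index when you call it, e.g. at webapp startup:

```java
import java.io.File;

public class LockCleaner {
    /**
     * Delete a stale Lucene lock file left behind by a hot redeploy.
     * Only safe when no other process can be writing to the index.
     * Returns true if a lock file was found and removed.
     */
    public static boolean clearStaleLock(File indexDir) {
        File lock = new File(indexDir, "write.lock");
        return lock.exists() && lock.delete();
    }

    public static void main(String[] args) {
        File indexDir = new File(args[0]);
        if (clearStaleLock(indexDir)) {
            System.out.println("Removed stale write.lock in " + indexDir);
        }
    }
}
```

You would typically invoke this from a `ServletContextListener` before opening any `IndexWriter`.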
maureen tanuwidjaja wrote:
Oh, is it? I didn't know about that... so does that mean I can't use this Mobile HDD?
Damien McCarthy <[EMAIL PROTECTED]> wrote: FAT32 imposes a lower file size
limit than NTFS. Attempts to create
files greater than 4GB on FAT32 will throw the error you are seeing.
No
Hello All,
In LIA, Erik and Otis mention using the openoffice.org API for
converting from various formats to something that can be used for indexing.
Does anyone have any examples of doing this that they'd be willing to share?
jch
-
Nadav Har'El wrote:
On Tue, Jan 16, 2007, Rollo du Pre wrote about "Re: Websphere and Dark Matter":
I was hoping it would, yes. Does Websphere not release memory back to
the OS when it no longer needs it? I'm concerned that if the memory
spiked for some reason (indexing a large document) th
Rollo du Pre wrote:
We have a scenario where a web search app using Lucene causes
Websphere 5.1 allocated memory to grow but not shrink. JProfiler shows
the heap shrinks back ok, leaving the JVM with over 1GB allocated
but only 400MB in use. Websphere does not perform a level 2
garbage
Larry Taylor wrote:
What we need to do is to be able to store a bit mask specifying various
filter flags for a document in the index and then search this field by
specifying another bit mask with desired filters, returning documents
that have any of the specified flags set. In other words, we are
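One common way to get this behaviour in Lucene is to index each set bit as its own token in a "flags" field and, at query time, OR the desired flag tokens together (e.g. as SHOULD `TermQuery` clauses in a `BooleanQuery`). The sketch below shows only the bit-expansion logic in plain Java; the field name "flags" and the `flagN` token scheme are my own illustrative choices, not from the original thread:

```java
import java.util.ArrayList;
import java.util.List;

public class FlagTokens {
    /** Expand a bit mask into one token per set bit, e.g. 0b101 -> ["flag0", "flag2"]. */
    public static List<String> tokens(int mask) {
        List<String> out = new ArrayList<String>();
        for (int bit = 0; bit < 32; bit++) {
            if ((mask & (1 << bit)) != 0) out.add("flag" + bit);
        }
        return out;
    }

    /** The match semantics the OR-query reproduces: the document and the
     *  query share at least one flag bit. */
    public static boolean matches(int docMask, int queryMask) {
        return (docMask & queryMask) != 0;
    }
}
```

At index time you would add `tokens(docMask)` to the flags field; an OR query over `tokens(queryMask)` then matches exactly the documents for which `matches` is true.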
Rajiv Roopan wrote:
Hello, I'm currently running a site which allows users to post. Lately
posts
have been getting out of hand. I was wondering if anyone knows of an open
source spam filter that I can add to my project to scan the posts
(which are
just plain text) for spam?
spamassassin shoul
Volodymyr Bychkoviak wrote:
The user has an input (a JavaScript calendar) on a page where he can choose
a date to include in the search. Search resolution is day resolution.
If the user enters the same date at different times of day he will get
different results (because the calendar will also set the current hour
John Haxby wrote:
I ran across the problem with DateTools not using UTC when I tried to
use an index created in California from the UK: I was looking for
documents with a particular date stamp but I found documents with a
date stamp from the wrong day. Even more interesting and bizarre
Volodymyr Bychkoviak wrote:
I'm using DateTools with Resolution.DAY.
I know that dates internally are converted to GMT.
Converting dates "2006-10-01 00:00" and "2006-10-01 15:00" from
"Etc/GMT-2" timezone will give us
"20060930" and "20061001" respectively.
But these dates are identical with
/gif
application/msword
the indenting indicates nesting. A message isn't just a bodypart
followed by attachments; it has structure like a file system, something
which escapes most mail readers. Sigh.
John Haxby wrote:
lude wrote:
You also mentioned indexing each bodypart ("
lude wrote:
You also mentioned indexing each bodypart ("attachment") separately.
Why?
To my mind, there is no use case where it makes sense to search a
particular bodypart
I will give you the use case:
[snip]
3.) The result list would show this:
1. mail-1 'subject'
'Abstract of the messa
lude wrote:
Hi John,
thanks for the detailed answer.
You wrote:
If you're indexing a
multipart/alternative bodypart then index all the MIME headers, but only
index the content of the *first* bodypart.
Does this mean you index just the first file-attachment?
What do you advise, if you have to
lude wrote:
does anybody have an idea what the best design approach is for realizing
the following:
The goal is to index emails and their corresponding file attachments.
One email could contain for example:
I put a fair amount of thought into this when I was doing the design for
our mail server -
Andrzej Bialecki wrote:
Just for the record - I've been using javamail POP and IMAP providers
in the past, and they were prone to hanging with some servers, and
resource intensive. I've been also using Outlook (proper, not Outlook
Express - this is AFAIK impossible to work with) via a Java-COM
Suba Suresh wrote:
Anyone know of good free email libraries I can use for lucene indexing
for Windows Outlook Express and Unix emails??
javamail. Not sure how you get hold of the messages from Outlook
Express, but getting hold of the MIME message in most Unix-based message
stores is relativel
Rajan, Renuka wrote:
I am trying to match accented characters with non-accented characters in French/Spanish and other Western European languages. The use case is that the users may type letters without accents in error and we still want to be able to retrieve valid matches. The one idea, albeit
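Lucene ships `ISOLatin1AccentFilter` for exactly this. A standalone equivalent (an assumption: the thread does not show this code) can be written with `java.text.Normalizer` (Java 6+), which decomposes accented characters and lets you strip the combining marks; the same folding must be applied at both index and query time so "café" and "cafe" meet in the middle:

```java
import java.text.Normalizer;

public class AccentFolder {
    /** Strip combining accent marks: "café" -> "cafe", "Müller" -> "Muller". */
    public static String fold(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }
}
```

In practice you would wrap this in a `TokenFilter` placed in the analyzer chain used for both indexing and query parsing.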
Michael J. Prichard wrote:
We are actually grabbing emails by becoming part of the SMTP stream.
This part is figured out and we have archived over 600k emails into a
mysql database. The problem is that since we currently store the
blobs in the DB this databases are getting large and searching
ng for "lucene.apache.org" to
look for mail sent to the lucene lists, they search for me variously as
"jch", "john haxby" and "haxby"; they even, occasionally, search for
complete mail addresses. They all work.
The RFC2047 syntax in the example above gives
George Washington wrote:
Is it possible to reconstruct a complete source document from the data
stored in the index, even if the fields are only indexed but not
stored? Because if the answer is "yes" there is no point in
encrypting, unless the index itself can be encrypted. Is it feasible
to e
Andrzej Bialecki wrote:
None of you mentioned yet the aspect that 4k is the memory page size
on IA32 hardware. This in itself would favor any operations using
multiples of this size, and penalize operations using amounts below
this size.
For normal I/O it will rarely make any difference at al
Otis Gospodnetic wrote:
I'm somewhat familiar with ext3 vs. ReiserFS stuff, but that's not really what
I'm after (finding a better/faster FS). What I'm wondering is about different
block sizes on a single (ext3) FS.
If I understand block sizes correctly, they represent a chunk of data that th
petite_abeille wrote:
I would love to see this. I presently have a somewhat unwieldy
conversion table [1] that I would love to get rid of :))
[snip]
[1] http://dev.alt.textdrive.com/browser/lu/LUStringBasicLatin.txt
I've attached the perl script -- feed
http://www.unicode.org/Public/4.1.0/u
arnaudbuffet wrote:
if I try to index a text file encoded in Western 1252 for example with the Turkish text
"düzenlediğimiz kampanyamıza" the lucene index will contain re-encoded data with
�k��
ISOLatin1AccentFilter.removeAccents() converts that string to
"duzenlediğimiz kampanyamıza"
arnaudbuffet wrote:
For text files, data could be in different languages and therefore in
different encodings. If data are in Turkish for example, all special
characters and accents are not recognized in my lucene index. Is there a
way to resolve this problem? How do I work with the encoding?
I've been looking
Erik Hatcher wrote:
2. How do I search for negative numbers in a range. For example
field:[-3 TO
2] ?
I don't mind hacking code such that my numbers are indexed as
+0001 and
-0001 and then I can override the query parser to change my
query to
[-003 TO +002]. However.. "+"
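Range queries in that era compared terms lexicographically, so numbers had to be encoded so that string order matches numeric order. One sketch of the padding idea mentioned above (the bias value and field width are my own illustrative choices, assuming values stay within a known bounded range):

```java
public class SortableInt {
    // Assumes values lie in [-10000, 89999]; pick the bias for your data.
    private static final int BIAS = 10000;

    /** Encode so that lexicographic string order matches numeric order. */
    public static String encode(int v) {
        return String.format("%05d", v + BIAS);
    }
}
```

With this scheme `-3` becomes `"09997"` and `2` becomes `"10002"`, so a term range query over the encoded strings returns the same documents as a numeric range would.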
Aigner, Thomas wrote:
I did a man on top and sure enough there is a PPID field on
Linux (f then B) for the parent process. And yes, they always have the same
parent. Thanks for your help as I'm obviously still a noob on
Unix.
Nope, that doesn't tell you they're different thre
Kan Deng wrote:
1. Performance.
Since all the cached disk data resides outside the JVM
heap space, access from Java objects to that cached data
cannot be very efficient.
True, but you need to compare the relative speeds. If data has to be
pulled from a file, then you're talking se
John Cherouvim wrote:
I'm having some problems indexing my UTF-8 html pages. I am running
lucene on Linux and I cannot understand why the generated index
depends on the locale of my operating system.
If I do set | grep LANG I get: LANG=el_GR which is Greek. If I set
this to en_US the inde
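The usual cause of locale-dependent indexing is reading files through `FileReader` (or a bare `InputStreamReader`), which uses the platform default encoding derived from `LANG`. The fix is to name the charset explicitly; a minimal sketch (the helper name is mine, not from the thread):

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;

public class Utf8Reading {
    /** Read a file as UTF-8 regardless of the platform default locale. */
    public static String readUtf8(File f) throws IOException {
        BufferedReader in = new BufferedReader(
                new InputStreamReader(new FileInputStream(f), "UTF-8"));
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = != -1) sb.append((char) c);
        in.close();
        return sb.toString();
    }
}
```

Feeding the resulting `String` (or the reader itself) to Lucene then produces the same index on any machine, whatever `LANG` is set to.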
Martin Rode wrote:
hello all,
lucene is already pretty fast, but i was wondering if you guys have
experience with using gcj (on linux). how much faster is it for
indexing? personally i have best performance with java-ibm, at least
under linux.
it would be interesting to hear how your exper
Erik Hatcher wrote:
On Jun 17, 2005, at 5:54 PM, [EMAIL PROTECTED] wrote:
Please do not reply to a post on the list if your message actually
isn't a
reply. Post a new message instead.
Sorry about that.. wasn't intentional.. clicked reply to get the reply
address and then forgot to change
Chris Collins wrote:
Ok that part isn't surprising. However only about 1% of 30% of the merge was
spent in the OS.flush call (not very IO bound at all with this controller).
On Linux, at least, measuring the time taken in OS.flush is not a good
way to determine if you're I/O bound -- all tha
Paul Libbrecht wrote:
On 18 May 05, at 11:51, John Haxby wrote:
I haven't tried this, but under Linux (at least), you can specify the
"nolock" parameter to make file locking happen locally. Of course,
this will make it impossible to use NFS to share the index among
several ma
Otis Gospodnetic wrote:
I haven't used Lucene with NFS. My understanding is that the problem
is with lock files when they reside on the NFS server. Yes, you can
change the location of lock files with a system property, but if you
are using NFS to make the index accessible from multiple machines,
Otis Gospodnetic wrote:
Somebody asked about this today, and I just found this through Simpy:
http://www.unine.ch/info/clef/
Scroll half-way through the page, look on the right side: 1,000 most
frequent words for several languages.
Hmm. I'm not sure how valuable that is. For English "los" a
Roy Klein wrote:
Here's the scenario that I can't guarantee won't happen:
There might be 3 transactions in a very short time span (for example, 1
second), here's what they are:
1) update doc1 (DEL doc1, ADD doc1)
2) update doc2 (DEL doc2, ADD doc2)
3) delete doc1
If I process these in order, then a
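The hazard in that scenario is reordering: the then-common optimisation of batching all deletes before all adds gives a different final state whenever a delete follows an add of the same document. This small simulation (a hypothetical model using a `Set` as the "index", not Lucene API) makes the difference concrete:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class OpOrdering {
    static class Op {
        final boolean add; final String id;
        Op(boolean add, String id) { this.add = add; this.id = id; }
    }

    /** Apply operations strictly in arrival order: the correct final state. */
    public static Set<String> inOrder(List<Op> ops) {
        Set<String> index = new HashSet<String>();
        for (Op op : ops) { if (op.add) index.add(op.id); else index.remove(op.id); }
        return index;
    }

    /** Batch all deletes first, then all adds: wrong whenever a delete
     *  follows an add of the same document, as in the doc1 scenario above. */
    public static Set<String> batched(List<Op> ops) {
        Set<String> index = new HashSet<String>();
        for (Op op : ops) if (!op.add) index.remove(op.id);
        for (Op op : ops) if (op.add) index.add(op.id);
        return index;
    }
}
```

For the sequence add doc1, add doc2, delete doc1, `inOrder` ends with only doc2 present, while `batched` incorrectly resurrects doc1; preserving arrival order (or collapsing operations per document before batching) avoids this.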
Maher Martin wrote:
* The user's access rights would be read from Active Directory (i.e.
Windows group membership, etc.)
* On the submission of a query to Lucene - the user / group access
rights would be appended as required search criteria and Lucene would
filter out all results that the user should
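The "append access rights as required criteria" idea can be sketched at the query-string level. This is an illustrative assumption only: the field name `acl` and the string form are mine, and in real code you would build a `BooleanQuery` (or a cached `Filter`) rather than concatenate strings, to avoid query-syntax injection:

```java
import java.util.List;

public class SecureQuery {
    /**
     * Append a required access-control clause: a document matches only if
     * its "acl" field contains at least one of the user's groups.
     */
    public static String restrict(String userQuery, List<String> groups) {
        StringBuilder acl = new StringBuilder();
        for (String g : groups) {
            if (acl.length() > 0) acl.append(" OR ");
            acl.append("acl:").append(g);
        }
        return "(" + userQuery + ") AND (" + acl + ")";
    }
}
```

Because the clause is ANDed in, no result can reach the user unless one of his groups appears in the document's `acl` field.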