RE: Index Rows as Documents? Help me design a solution

2006-07-26 Thread Mike Streeton
of NetSearch -Original Message- From: Doron Cohen [mailto:[EMAIL PROTECTED] Sent: 25 July 2006 22:23 To: java-user@lucene.apache.org Subject: Re: Index Rows as Documents? Help me design a solution Few comments - (from first posting in this thread) The indexing was taking much more than minutes

Re: Index Rows as Documents? Help me design a solution

2006-07-26 Thread Namit Yadav
the multi threading and distribution of the parts of the log to each writer. Mike www.ardentia.com the home of NetSearch -Original Message- From: Doron Cohen [mailto:[EMAIL PROTECTED] Sent: 25 July 2006 22:23 To: java-user@lucene.apache.org Subject: Re: Index Rows as Documents? Help me

Re: Index Rows as Documents? Help me design a solution

2006-07-26 Thread Jeremy Bensley
- From: Doron Cohen [mailto:[EMAIL PROTECTED] Sent: 25 July 2006 22:23 To: java-user@lucene.apache.org Subject: Re: Index Rows as Documents? Help me design a solution Few comments - (from first posting in this thread) The indexing was taking much more than minutes for a 1 MB log file

Re: Index Rows as Documents? Help me design a solution

2006-07-26 Thread Erick Erickson
It feels to me like your major problem might be file IO with all those files. There's no need to split the files up first and then index the files. Just read through the log and index each row. The code fragment you posted should allow you to get the line back from the line field of each
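A minimal sketch of the "read through the log and index each row" idea, in plain Java with no Lucene dependency. Everything here is an assumption made for illustration: the comma-separated line layout (msisdn, message ID, then the rest of the row), the class name RowIndexSketch, and the two in-memory maps standing in for a real index. It only shows that one pass over the log, one record per line, is enough to answer both lookups without splitting the log into files first.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for an index: one entry per log row.
final class RowIndexSketch {
    // msisdn -> message IDs sent from that number, in log order
    private final Map<String, List<String>> idsByMsisdn = new HashMap<>();
    // message ID -> the full original log line
    private final Map<String, String> lineByMessageId = new HashMap<>();

    // Reads the log once, line by line; no intermediate files are created.
    void indexLog(Reader log) {
        try (BufferedReader reader = new BufferedReader(log)) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Assumed layout: msisdn,messageId,remaining fields
                String[] parts = line.split(",", 3);
                if (parts.length < 2) {
                    continue; // skip blank or malformed rows
                }
                String msisdn = parts[0].trim();
                String messageId = parts[1].trim();
                idsByMsisdn.computeIfAbsent(msisdn, k -> new ArrayList<>()).add(messageId);
                lineByMessageId.put(messageId, line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // All message IDs recorded for a phone number (empty list if none).
    List<String> messageIdsFor(String msisdn) {
        return idsByMsisdn.getOrDefault(msisdn, new ArrayList<>());
    }

    // The stored log line for a message ID, or null if unknown.
    String lineFor(String messageId) {
        return lineByMessageId.get(messageId);
    }
}
```

With a real index the two maps would be replaced by one document per row, with the msisdn and message ID as searchable keyword fields and the line as a stored field.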

Re: Index Rows as Documents? Help me design a solution

2006-07-26 Thread Doron Cohen
A document per row seems correct to me too. If searches will be by msisdn / messageid, and if, as it seems, these are keywords, not free text that needs to be analyzed, they both should have Field.Index.UN_TOKENIZED. Also, since no search is to be done by the line content, the line should have

Re: Index Rows as Documents? Help me design a solution

2006-07-25 Thread Daniel Naber
On Tuesday 25 July 2006 04:05, Namit Yadav wrote: 1. List the SMSIDs of all the SMSes that a phone number has sent (each SMS message will have a globally unique ID). 2. List SomeData1, SomeData2, SomeData3 and SomeData4 for a given SMSID. How can I do this efficiently? Short answer: use a

Re: Index Rows as Documents? Help me design a solution

2006-07-25 Thread Erick Erickson
Indexing 1 MB of logs shouldn't take minutes, so you're probably right. A problem I've seen is opening/indexing/closing your index writer too often. You should do something like... (really bad pseudo code here) IndexWriter IW = new IndexWriter(); for (lots and lots and lots of records) {
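The pseudo code above is cut off in this listing, so what follows is not a reconstruction of it. It is a rough plain-Java sketch of the general shape being described: open one writer, add every record inside the loop, close once at the end. InMemoryWriter is a hypothetical stand-in invented for this example; it is not Lucene's IndexWriter and shares none of its API.

```java
import java.util.ArrayList;
import java.util.List;

final class SingleWriterSketch {
    // Hypothetical stand-in for an index writer (NOT the Lucene class).
    static final class InMemoryWriter implements AutoCloseable {
        private final List<String> records = new ArrayList<>();
        private boolean closed;

        void add(String record) {
            if (closed) {
                throw new IllegalStateException("writer is closed");
            }
            records.add(record);
        }

        int size() {
            return records.size();
        }

        boolean isClosed() {
            return closed;
        }

        @Override
        public void close() {
            closed = true;
        }
    }

    // Opens the writer once, adds all records, and closes it once.
    static InMemoryWriter indexAll(Iterable<String> records) {
        InMemoryWriter writer = new InMemoryWriter(); // opened once, outside the loop
        try {
            for (String record : records) {
                writer.add(record); // no open/close per record
            }
        } finally {
            writer.close(); // closed once, after the last record
        }
        return writer;
    }
}
```

The point of the shape is that the per-record work is only the add call; any expensive open, flush, or close step happens a fixed number of times regardless of how many records there are.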

Re: Index Rows as Documents? Help me design a solution

2006-07-25 Thread Erick Erickson
The code looks good, *assuming* that the IndexWriter you pass in isn't closed/opened between files (this would be a problem if you have lots of files to index). I've had the IndexWriter.optimize method take a long time to complete, so I typically don't do this until I'm entirely done...

Re: Index Rows as Documents? Help me design a solution

2006-07-25 Thread Doron Cohen
Few comments - (from first posting in this thread) The indexing was taking much more than minutes for a 1 MB log file. ... I would expect to be able to index at least a of GB of logs within 1 or 2 minutes. 1-2 minutes per GB would be 30-60 GB/Hour, which for a single machine/jvm is a lot -
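The rate conversion quoted above is simple arithmetic, spelled out here as a small Java sketch. The class and method names are made up for the example, and 1 GB is assumed to be 1024 MB.

```java
// Sketch: convert an indexing rate given in minutes per GB.
final class Throughput {
    // GB indexed per hour at the given rate.
    static double gbPerHour(double minutesPerGb) {
        if (minutesPerGb <= 0) {
            throw new IllegalArgumentException("minutesPerGb must be positive");
        }
        return 60.0 / minutesPerGb;
    }

    // Seconds needed for one MB at the same rate (assumes 1 GB = 1024 MB).
    static double secondsPerMb(double minutesPerGb) {
        if (minutesPerGb <= 0) {
            throw new IllegalArgumentException("minutesPerGb must be positive");
        }
        return minutesPerGb * 60.0 / 1024.0;
    }
}
```

Throughput.gbPerHour(1) gives 60 and Throughput.gbPerHour(2) gives 30, matching the 30-60 GB/hour figure. At those rates a single MB would take roughly a tenth of a second or less, which gives a sense of how far "much more than minutes for a 1 MB log file" is from that target.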

Index Rows as Documents? Help me design a solution

2006-07-24 Thread Namit Yadav
My question might be very easy for you Lucene experts. But after going through the Lucene documentation / example, I haven't been able to figure out how to solve this problem. I'll be really grateful if someone can help me get a starting point here. Our application tracks SMSes sent from a