RE: Lucene Software/Hardware Setup Question

2010-10-27 Thread Toke Eskildsen
On Tue, 2010-10-26 at 23:17 +0200, Kovnatsky, Eugene wrote:
> Thanks Toke. Very descriptive. A few more questions about your SSD
> drive(s)
>  - what is its current size 

4 * 64GB Samsung MCCOE64G5MPP-0VA00 drives. They were pretty cool two
years ago and still work very well for search-servers (random writes are
not good, but we don't need that for searching):
http://www.tomshardware.com/reviews/flash-ssd-hard-drive,2000-19.html

>  - do you project any growth in your index size

Yes. Hopefully an internal project for maintaining digital objects will
change gears during this year. This will result in objects with a lot of
metadata and some full texts. Depending on the economy, the number of
objects will range from 100,000+ to a few million. I would guesstimate
that this would mean a doubling of the index size, due to the richness
of the new objects.

Further out, the projections are unreliable. As a technician I hope for
a serious jump in size within a year or two, but I have been hoping for
that for the last two years. Politics does not move as fast as technology.

>  - if yes then how do you plan to correlate that with your hardware
> needs

A doubling of the index size makes the existing 256GB/machine a tight
fit. I seem to remember that there are two free slots in our servers, so
adding 2 new consumer-class SSDs is the obvious upgrade. We're switching
to a more memory- and CPU-efficient way of handling sorting and
faceting, so we should not need to boost CPU and RAM.

Regards,
Toke Eskildsen


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



adding documents to an existing index

2010-10-27 Thread Yakob
Hello all,
I would like to ask how to add new documents to an existing Lucene
index. Which class should I use to achieve this?

Thanks

-- 
http://jacobian.web.id




Re: adding documents to an existing index

2010-10-27 Thread 蒋明原
IndexWriter writer = new IndexWriter(path, analyzer, false);

The third parameter is what you want.
Then you can call

writer.addDocument(doc);

Enjoy.



Re: adding documents to an existing index

2010-10-27 Thread Yakob
I searched for this constructor and found that it has already been
deprecated.
http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/index/IndexWriter.html#IndexWriter(org.apache.lucene.store.Directory, org.apache.lucene.analysis.Analyzer, boolean)

I am using Lucene 3.0 now. Can I really use this constructor, or
should I just try it first? By the way, I would appreciate it if you
gave me a code sample. Thanks. :-)


-- 
http://jacobian.web.id




Re: adding documents to an existing index

2010-10-27 Thread 蒋明原
You are too lazy. Download the Lucene source code, take a glance, and you
will find demos.



Re: adding documents to an existing index

2010-10-27 Thread Yakob
Well, thanks anyway.



-- 
http://jacobian.web.id




Re: adding documents to an existing index

2010-10-27 Thread Seth Rosen
Yakob,
Here is a snippet of an example of IndexWriter from the lucene source that
you might find helpful.


IndexWriter writer = new IndexWriter(FSDirectory.open(INDEX_DIR),
    new StandardAnalyzer(Version.LUCENE_CURRENT), true,
    IndexWriter.MaxFieldLength.LIMITED);
System.out.println("Indexing to directory '" + INDEX_DIR + "'...");
indexDocs(writer, docDir);
System.out.println("Optimizing...");
writer.optimize();
writer.close();


Seth Rosen
www.architexa.com
Understand & Document Code In Seconds
s...@architexa.com 





Re: adding documents to an existing index

2010-10-27 Thread Yakob
On 10/27/10, Seth Rosen  wrote:
> Yakob,
> Here is a snippet of an example of IndexWriter from the lucene source that
> you might find helpful.
>
>
>> IndexWriter writer = new IndexWriter(FSDirectory.open(INDEX_DIR), new
>> StandardAnalyzer(Version.LUCENE_CURRENT), true,
>> IndexWriter.MaxFieldLength.LIMITED);

// Shouldn't the code above change true to false, so that the index stays
// open? I mean, surely the index shouldn't be closed.
>
> System.out.println("Indexing to directory '" +INDEX_DIR+ "'...");
>
// What confuses me is how to pass a new directory path containing new
// documents to a Lucene class, so that Lucene will index those new
// documents and add them to the existing index.
> indexDocs(writer, docDir);
>
> System.out.println("Optimizing...");
>
> writer.optimize();
>
> writer.close();
>
>
> Seth Rosen
> www.architexa.com
> Understand & Document Code In Seconds
> s...@architexa.com 
>
>
>


-- 
http://jacobian.web.id




Re: adding documents to an existing index

2010-10-27 Thread Seth Rosen
Yakob, the boolean in the constructor should be true if you want to create a
NEW index in INDEX_DIR, and false to append to an existing one, as seen here
[1].

As for adding a directory to an index you will need to validate the
directory, then loop through it recursively and add each doc to the writer
you created.

static void indexDocs(IndexWriter writer, File file)
    throws IOException {
  // do not try to index files that cannot be read
  if (file.canRead()) {
    if (file.isDirectory()) {
      String[] files = file.list();
      // an IO error could occur
      if (files != null) {
        for (int i = 0; i < files.length; i++) {
          indexDocs(writer, new File(file, files[i]));
        }
      }
    } else {
      System.out.println("adding " + file);
      try {
        writer.addDocument(FileDocument.Document(file));
      }
      // at least on windows, some temporary files raise this exception
      // with an "access denied" message
      // checking if the file can be read doesn't help
      catch (FileNotFoundException fnfe) {
        ;
      }
    }
  }
}

[1] http://lucene.apache.org/java/3_0_0/api/core/org/apache/lucene/index/IndexWriter.html#IndexWriter(org.apache.lucene.store.Directory, org.apache.lucene.analysis.Analyzer, boolean, org.apache.lucene.index.IndexWriter.MaxFieldLength)

Seth Rosen
www.architexa.com
Understand & Document Code In Seconds
s...@architexa.com 





Re: Does a IndexSearcher call incRef on the underlying reader?

2010-10-27 Thread Michael McCandless
On Wed, Oct 27, 2010 at 1:01 PM, Pulkit Singhal  wrote:

> 1st of all, great book.

Thank you!

> @Question3: It sounds like an IndexReader always starts with a count of zero
> but that should not be a cause of worry because the value only gets acted
> upon in a call to decRef() ... am I right?

Actually, refCount of a new IndexReader starts at 1.  Then the caller
must call close (which under the hood calls decRef) to drop it to 0.

> @Question4: It seems to me that based on your explanation so far, the
> IndexReader will end up closing after the very 1st search. That doesn't
> sound too efficient given that keeping it alive and kicking is something
> that is highly desirable ... no? Am I missing something or does that
> responsibility fall elsewhere?

Actually, no -- SearcherManager also holds a ref.  So when there are
no queries in flight, the refCount will be 1.  It's only when the
searcher is swapped out for a new one that we decRef the old one and
its refCount drops to 0 (once all in-flight queries finish).
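
The counting described above can be sketched with a toy class. This is only an illustration under stated assumptions: CountedReader is a hypothetical stand-in, not Lucene's IndexReader, and the "manager" is represented simply by the reference the creator holds.

```java
import java.util.concurrent.atomic.AtomicInteger;

class RefCountSketch {
    /** Hypothetical stand-in for a reference-counted reader (not a Lucene class). */
    static class CountedReader {
        // A new reader starts at 1: the reference held by whoever created it.
        private final AtomicInteger refCount = new AtomicInteger(1);
        private volatile boolean closed = false;

        void incRef() {
            refCount.incrementAndGet();
        }

        void decRef() {
            // The resources are released only when the last reference is dropped.
            if (refCount.decrementAndGet() == 0) {
                closed = true;
            }
        }

        int getRefCount() {
            return refCount.get();
        }

        boolean isClosed() {
            return closed;
        }
    }

    public static void main(String[] args) {
        CountedReader reader = new CountedReader(); // 1: held by the manager
        reader.incRef();                            // 2: a query starts
        reader.decRef();                            // 1: the query finishes, reader stays open
        System.out.println(reader.getRefCount() + " closed=" + reader.isClosed());

        reader.incRef();                            // 2: another query is in flight
        reader.decRef();                            // 1: the manager swaps the reader out
        reader.decRef();                            // 0: the last in-flight query finishes
        System.out.println(reader.getRefCount() + " closed=" + reader.isClosed());
    }
}
```

With no queries in flight the count rests at 1, so the reader survives between searches; it only reaches 0 after both the manager and the last query have let go.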

> I hope I haven't hijacked my own thread?

I don't think so!

Mike




Michigan Information Retrieval Enthusiasts Group Quarterly Meetup - November 13, 2010

2010-10-27 Thread Provalov, Ivan
Cengage Learning is organizing a second quarterly meetup in Michigan 
(web-conference and dial-in are available) for the IR Enthusiasts.  Please RSVP 
at http://www.meetup.com/Michigan-Information-Retrieval-Enthusiasts-Group

Presentations:

1. Search Assist Dictionary Based on Corpus Terms Collocation by Drew 
Kozsewnik, Cengage Learning
The presentation will be a brief overview of what search assist is followed by 
a technical discussion about the algorithms that make it work.  The technical 
part will detail phrase extraction (building the dictionaries), runtime 
(indexing and retrieving relevant phrases based on a partial query), and phrase 
correlation (The "Kevin Bacon" feature which returns phrases that often occur 
nearby recommended phrases).
2. Carrot2 Clustering Engine by Stanislaw Osinski and Dawid Weiss, Carrot 
Search (Poland)
Overview of Carrot2 - an Open Source Search Results Clustering Engine.  It can 
automatically organize small collections of documents, e.g. search results, 
into thematic categories.

Schedule:

1. 9-10am breakfast (uRefer, Ann Arbor)
2. 10-11am Presentations (uRefer, Ann Arbor, Web-conferencing)
3. 11am-12pm discussion

Remote Dial-in:
866-394-9514
6087393

Remote Web-Conferencing:
http://www.conferenceplus.com/
866-394-9514
6087393

Location:
uRefer, 924 N. Main St., Suite 3, Ann Arbor, MI

Thank you,

Ivan Provalov
Information Architect
Cengage Learning


Text categorization / classification

2010-10-27 Thread Maria Vazquez
I need to auto-categorize a large number of documents. They are basically news 
articles from major news sources (nytimes, npr, abcnews, etc).
I'd like to categorize them automatically. Any suggestions?
Lucene in Action suggests using a set of documents to build category vectors 
and then comparing each document to each of those vectors and get the closest 
one.
The approach seems pretty simple (from other papers I read on text 
categorization) but maybe you guys know of something out there that already 
does this using Lucene/Solr.
Thanks!
Maria
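
The category-vector idea mentioned above can be sketched in plain Java. This is a minimal illustration, assuming raw term counts and cosine similarity; the class name, the tokenizer, and the tiny training sets are all made up for the example, and it does not use Lucene or Solr.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class CentroidClassifierSketch {
    // Turn text into a term-frequency vector (lowercased, split on non-letters).
    static Map<String, Double> vectorize(String text) {
        Map<String, Double> vec = new HashMap<>();
        for (String token : text.toLowerCase().split("[^a-z]+")) {
            if (!token.isEmpty()) {
                vec.merge(token, 1.0, Double::sum);
            }
        }
        return vec;
    }

    // Build one category vector by summing the vectors of its training documents.
    static Map<String, Double> categoryVector(List<String> trainingDocs) {
        Map<String, Double> sum = new HashMap<>();
        for (String doc : trainingDocs) {
            vectorize(doc).forEach((term, tf) -> sum.merge(term, tf, Double::sum));
        }
        return sum;
    }

    // Cosine similarity between two sparse vectors; 0 if either is empty.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
            normA += e.getValue() * e.getValue();
        }
        for (double v : b.values()) {
            normB += v * v;
        }
        return (normA == 0 || normB == 0) ? 0 : dot / Math.sqrt(normA * normB);
    }

    // Compare the document to every category vector and return the closest one.
    static String classify(String text, Map<String, Map<String, Double>> categories) {
        Map<String, Double> vec = vectorize(text);
        String best = null;
        double bestScore = -1;
        for (Map.Entry<String, Map<String, Double>> e : categories.entrySet()) {
            double score = cosine(vec, e.getValue());
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, Map<String, Double>> categories = new HashMap<>();
        categories.put("sports", categoryVector(Arrays.asList(
                "the team won the match",
                "the coach praised the players after the game")));
        categories.put("politics", categoryVector(Arrays.asList(
                "the senate passed the bill",
                "voters elected a new president")));
        System.out.println(classify("players scored late in the game", categories));
    }
}
```

A real setup would likely weight terms (for example with tf-idf) and drop stop words, since common words such as "the" dominate raw counts.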




Email Indexing

2010-10-27 Thread Hasan Diwan
I'd like to provide myself with a searchable index of email. I'm
familiar with the Javamail library, so will use this to fetch the
mail. Anyone out there done any indexing of email? On Sourceforge,
there's zoe[1], which hasn't had a release since 2004, and a couple of
other projects. I'm also seeing something about sphinx, which reads
like another indexing platform(?). Any advice regarding this is
appreciated.
-- 
Sent from my mobile device
Envoyait de mon telephone mobil




Re: Text categorization / classification

2010-10-27 Thread Lance Norskog
There are tools for this in the Mahout project. These are oriented
toward large-scale work.

http://mahout.apache.org

There is a big learning curve and you have to learn Hadoop somewhat.

The book 'Programming Collective Intelligence' includes a suite of Python
tools for small-scale experiments.




-- 
Lance Norskog
goks...@gmail.com




Re: Email Indexing

2010-10-27 Thread Troy Wical
On Oct 27, 2010, at 3:57 PM, Hasan Diwan wrote:

> I'd like to provide myself with a searchable index of email. I'm
> familiar with the Javamail library, so will use this to fetch the
> mail. Anyone out there done any indexing of email? On Sourceforge,
> there's zoe[1], which hasn't had a release since 2004, and a couple of
> other projects. I'm also seeing something about sphinx, which reads
> like another indexing platform(?). Any advice regarding this is
> appreciated.

Depends on what you're trying to index, I suppose. Maildir or mbox? For some time 
now, off and on, I have been working to index an ezmlm mailing list archive. In 
the end, I went with Swish-E and have made quite a bit of progress. I am short 
of my complete goal though. The issue is that the search results do not return 
results that contain the subject, and there is currently no excerpt or phrase 
highlighting. My problem is that the flat text email files I am working with 
have no xml or anything else to help the indexer create fields from. I've not 
yet figured out how to convert the emails to xml.
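
Flat text mail usually does carry some structure: header lines such as "Subject:" come first, then a blank line, then the body. A minimal sketch of pulling those out as named fields, using only the Java standard library, could look like this. The class name and the "body" key are made up for the example, repeated headers simply overwrite each other, and MIME parts and encoded headers are not handled.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class MailFieldsSketch {
    // Split a raw message into lowercased header fields plus a "body" entry.
    static Map<String, String> parse(String raw) {
        Map<String, String> fields = new LinkedHashMap<>();
        String[] lines = raw.split("\r?\n", -1);
        String lastName = null;
        int i = 0;
        // Headers run until the first empty line.
        for (; i < lines.length && !lines[i].isEmpty(); i++) {
            String line = lines[i];
            if (line.startsWith("From ")) {
                continue; // mbox envelope line ("From addr date"), not a header
            }
            if ((line.startsWith(" ") || line.startsWith("\t")) && lastName != null) {
                // Folded header: continuation of the previous field.
                fields.put(lastName, fields.get(lastName) + " " + line.trim());
                continue;
            }
            int colon = line.indexOf(':');
            if (colon > 0) {
                lastName = line.substring(0, colon).trim().toLowerCase();
                fields.put(lastName, line.substring(colon + 1).trim());
            }
        }
        // Everything after the blank line is the body.
        StringBuilder body = new StringBuilder();
        for (i++; i < lines.length; i++) {
            body.append(lines[i]).append('\n');
        }
        fields.put("body", body.toString().trim());
        return fields;
    }

    public static void main(String[] args) {
        String raw = "From troy@example.org Wed Oct 27 18:16:00 2010\n"
                + "From: Troy <troy@example.org>\n"
                + "Subject: Re: Email\n"
                + " Indexing\n"
                + "\n"
                + "Hello list,\n"
                + "second line\n";
        Map<String, String> fields = parse(raw);
        System.out.println(fields.get("subject")); // Re: Email Indexing
        System.out.println(fields.get("from"));    // Troy <troy@example.org>
    }
}
```

Each entry of the returned map could then be handed to whichever indexer is in use as a separate field, which is what makes the subject searchable and displayable on its own.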

Other than that, though, it's functional, and very fast. That being said, I'm 
sure Sphinx or Lucene could do the same thing, and I would love to hear from 
anyone out there who is using Lucene to index a list of emails in mbox format.

You can see my Swish-E implementation, in all its unfinished glory, at 
http://type2.com/search
It covers roughly 200,000 emails over the past 15 years.

Peace, Troy



Re: Text categorization / classification

2010-10-27 Thread mvazq...@ova.st
Thanks a lot!
I was reading about Mahout today.
I'll try that out.
Thanks again
Maria

Sent from my iPhone






Fuzzy Phrase Search

2010-10-27 Thread Andrew Scott
Hi Guys,

I am wondering how I can go about doing a Fuzzy Phrase search using
Lucene.NET 2.9.2 - I've tried looking around everywhere but there don't
really seem to be any resources related to this anywhere.

I found this stackoverflow
link-
but hoping that you might be able to help more ?

Thanks,

Andy


Re: Lucene index update

2010-10-27 Thread Pulkit Singhal
Looks interesting. What is the merit in having a second index in order to
keep the document id the same? Perhaps I have misunderstood. Just want to
understand your motivation here.

On Wed, Oct 20, 2010 at 2:57 PM, Nilesh Vijaywargiay  wrote:

> I've written a blog regarding a work around for updating index in Lucene
> using parallel reader. It's explained with results and pictures.
>
> It would be great if you have a look at it. The link:
> http://the10minutes.blogspot.com/2010/10/lucene-index-update.html
>
> Thanks
> Nilesh
>


Re: Lucene index update

2010-10-27 Thread Nilesh Vijaywargiay
Pulkit,
ParallelReader takes the union of all fields for a given id. Thus if I want
to add or modify a field of a document which has id 2 in index1, I
need to create a document with id 2 in index2 with the fields I want to
add/modify. ParallelReader then treats them as fields of a single
document.
Now if I call doc.getFields() for that document, it lists fields
from both index1 and index2.



Re: Lucene index update

2010-10-27 Thread Pulkit Singhal
But why do you feel the need to have a parallel reader that combines result
sets across two indices based on docId?



Re: Lucene index update

2010-10-27 Thread Nilesh Vijaywargiay
One major reason is to update a field, or rather shadow a field.
I have a field named "testField" in index1 and I want to update that field.
When I update, I want only the new value to be reflected, not the value in
the old field.
Now ParallelReader starts from the latest index, i.e. index2, and searches for
'testField'. It gets a hit in index2 itself and doesn't go any further. Thus
I am shadowing the old value of 'testField' with the new value. Does that
make sense?
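
The shadowing described above can be modelled with plain maps. This is a toy sketch, not the ParallelReader API: each map stands for the fields one index holds for the same document id, and the lookup assumes, as in the description, that index2 is consulted before index1 and the first index that has the field wins.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class FieldShadowSketch {
    // Look a field up across several per-index field maps for one document.
    // The maps are consulted in order; the first one that has the field wins.
    static String lookup(String field, List<Map<String, String>> indexesInPriorityOrder) {
        for (Map<String, String> docFields : indexesInPriorityOrder) {
            if (docFields.containsKey(field)) {
                return docFields.get(field);
            }
        }
        return null; // no index has this field
    }

    public static void main(String[] args) {
        // Fields of document 2 as stored in the original index.
        Map<String, String> index1 = new HashMap<>();
        index1.put("title", "Lucene");
        index1.put("testField", "old value");

        // Fields of document 2 in the newer index: only what was added or changed.
        Map<String, String> index2 = new HashMap<>();
        index2.put("testField", "new value");

        List<Map<String, String>> readers = Arrays.asList(index2, index1);
        System.out.println(lookup("testField", readers)); // new value
        System.out.println(lookup("title", readers));     // Lucene
    }
}
```

The union of fields comes from falling through to index1 for anything index2 does not have; the old "testField" value is still stored, just never reached.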



Re: Email Indexing

2010-10-27 Thread Hasan Diwan
On 27 October 2010 18:16, Troy Wical  wrote:
> Depends on what you're trying to index, I suppose. Maildir or mbox? For some 
> time now, off and on, I have been working to index an ezmlm mailing list 
> archive. In the end, I went with Swish-E and have made quite a bit of 
> progress. I am short of my complete goal though. The issue is that the search 
> results do not return results that contain the subject, and there is 
> currently no excerpt or phrase highlighting. My problem is the flat text 
> email files I am working with have no xml or anything to help the indexer 
> create fields from. I've not yet figured out how to convert the emails to xml.

Neither Maildir nor mbox -- IMAP/POP doesn't care. Basically, I want to
build the index based on the contents of (my) gmail box. I can
retrieve the messages using IMAP, I just need to figure out the
structure of the index.

Converting email to XML? Email me off-list and I'll provide you with
some help (as email => XML has little to do with lucene).
-- 
Sent from my mobile device
Envoyait de mon telephone mobil




Re: Email Indexing

2010-10-27 Thread Lance Norskog

Tika has some mailbox file parsing that includes metadata parsing.
For POP/IMAP email servers I don't know any tools.

Hasan Diwan wrote:
> On 27 October 2010 18:16, Troy Wical wrote:
>> Depends on what you're trying to index, I suppose. Maildir or mbox? For some
>> time now, off and on, I have been working to index an ezmlm mailing list
>> archive. In the end, I went with Swish-E and have made quite a bit of
>> progress. I am short of my complete goal though. The issue is that the search
>> results do not return results that contain the subject, and there is
>> currently no excerpt or phrase highlighting. My problem is the flat text
>> email files I am working with have no xml or anything to help the indexer
>> create fields from. I've not yet figured out how to convert the emails to xml.
>
> Neither Maildir nor mbox -- IMAP/POP doesn't care. Basically, I want to
> build the index based on the contents of (my) gmail box. I can
> retrieve the messages using IMAP, just need to figure out the
> structure of the index.
>
> Converting email to XML? Email me off-list and I'll provide you with
> some help (as email => XML has little to do with lucene).

