Re: Not entire document being indexed?

2005-02-25 Thread Andrzej Bialecki
[EMAIL PROTECTED] wrote:
Does anyone else have any idea why the whole document wouldn't be indexed, as 
described below?

Or perhaps someone can enlighten me on how to use Luke to find out whether 
the whole document was indexed.
I have not used Luke in this capacity before, so I'm not sure what to do or 
look for.
Well, you could try the Reconstruct & Edit function - this will 
give you an idea of which tokens ended up in the index, and which was the 
last one. In Luke 0.6, if the field is stored, you will see two tabs: 
one shows the stored content, the other shows the tokenized content, where 
tokens are separated by commas. If the field was un-stored, the 
only tab you will get is the reconstructed content. In either case, 
just scroll down and check what the last tokens are.

You could also look for some distinctive terms that occur only 
at the end of that document, and check whether they are present in the index.
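
One way to pick such probe terms is to work from the extracted plain text itself. The sketch below is only an illustration: the class and method names (TailTerms, tailOnlyTerms) are made up for this example, and the tokenizer is a crude stand-in (lower-case, split on non-alphanumerics) that will not match every Lucene analyzer. It lists terms that appear in the last N tokens of the text and nowhere earlier.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Lucene or Luke.
class TailTerms {
    // Crude tokenizer (an assumption): lower-case the text and split on
    // anything that is not a letter or digit.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Terms found in the last tailSize tokens that never occur before them.
    static Set<String> tailOnlyTerms(String text, int tailSize) {
        List<String> tokens = tokenize(text);
        int split = Math.max(0, tokens.size() - tailSize);
        Set<String> head = new HashSet<>(tokens.subList(0, split));
        Set<String> result = new LinkedHashSet<>();
        for (String t : tokens.subList(split, tokens.size())) {
            if (!head.contains(t)) {
                result.add(t);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "intro alpha beta. middle alpha gamma. closing beta zeta omega";
        // prints [closing, zeta, omega]
        System.out.println(tailOnlyTerms(text, 4));
    }
}
```

Searching the index for any of the returned terms (or looking them up in Luke) should then tell you whether the end of the document made it in.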

There are really only a few reasons why this might be happening:
* your extractor has a bug, or
* the max token limit (writer.maxFieldLength) is set too low, or
* the indexing process doesn't close the IndexWriter properly.
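
To illustrate the second point, here is a toy model of a per-field token cap. It does not use Lucene at all; it assumes the limit simply cuts the token stream after the first N whitespace-separated tokens, and the names (FieldLimitDemo, indexedTokens) are invented for the example. Everything past the cap never reaches the index, which matches the symptom of the last paragraphs being unsearchable.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only; an assumption about how a token cap behaves,
// not the actual IndexWriter code.
class FieldLimitDemo {
    // Keep at most maxFieldLength whitespace-separated tokens.
    static List<String> indexedTokens(String text, int maxFieldLength) {
        List<String> kept = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (kept.size() >= maxFieldLength) {
                break; // tokens past the cap are silently dropped
            }
            if (!t.isEmpty()) {
                kept.add(t);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Build a synthetic 25,000-token document: w0 w1 ... w24999
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 25_000; i++) {
            sb.append('w').append(i).append(' ');
        }
        List<String> kept = indexedTokens(sb.toString(), 20_000);
        System.out.println("tokens kept: " + kept.size());                   // 20000
        System.out.println("last token kept: " + kept.get(kept.size() - 1)); // w19999
        System.out.println("w24999 indexed: " + kept.contains("w24999"));    // false
    }
}
```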
--
Best regards,
Andrzej Bialecki
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Not entire document being indexed?

2005-02-25 Thread [EMAIL PROTECTED]
Thanks Andrzej and Pasha for your prompt replies and suggestions.
I will try everything you have suggested and report back on the findings!
regards
-pedja

Pasha Bizhan said the following on 2/25/2005 6:32 PM:
Hi, 

whole document was indexed or not.
Luke can help you answer the question: does my index contain the
correct data?
Do the following steps:
- run Luke
- open the index
- find the specified document (document tab)
- click the Reconstruct & Edit button
- select the field and look at the original stored content of this field,
reconstructed from the index
Does this reconstructed content contain your last 2-3 paragraphs?
Also, 230 KB is not the same as 20,000 terms. Try setting writer.maxFieldLength to
250,000.
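
To make the size argument concrete, here is a back-of-the-envelope sketch. The average of 6 bytes per token (about five letters plus a space) is purely an assumption, as are the names FieldLengthEstimate, estimateTokens and countTokens; for a real document you would count the tokens of the extracted text instead.

```java
// Hypothetical helper for sizing the limit; not part of Lucene.
class FieldLengthEstimate {
    // Rough token estimate from a byte size and an assumed average token width.
    static long estimateTokens(long sizeInBytes, double avgBytesPerToken) {
        return (long) Math.ceil(sizeInBytes / avgBytesPerToken);
    }

    // Count for an actual text, assuming tokens are whitespace-separated.
    static int countTokens(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    public static void main(String[] args) {
        long bytes = 230L * 1024; // about 230 KB of plain text
        long estimate = estimateTokens(bytes, 6.0);
        System.out.println("estimated tokens: " + estimate);               // 39254
        System.out.println("exceeds 20,000: " + (estimate > 20_000));      // true
        System.out.println("fits in 250,000: " + (estimate <= 250_000));   // true
    }
}
```

Under that assumption the document has roughly twice as many tokens as a 20,000 limit allows, while 250,000 leaves room even if every byte were its own token.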
Pasha Bizhan
http://lucenedotnet.com


Re: Not entire document being indexed?

2005-02-24 Thread [EMAIL PROTECTED]
Hi Otis
Thanks for the reply, what exactly should I be looking for with Luke?
What would setting the max value to maxInteger do? Is this some 
arbitrary value or...?

-pedja
Otis Gospodnetic said the following on 2/24/2005 2:24 PM:
Use Luke to peek in your index and find out what really got indexed.
You could also try the extreme case and set that max value to the max
Integer.
Otis
--- [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:
 

Hi everyone
I'm having a bizarre problem with a few of the documents here that do
not seem to get indexed entirely.
I use textmining WordExtractor to convert M$ Word to plain text and
then index that text.
For example, one document is about 230KB in size when converted to
plain text. When it is indexed and later searched for a phrase in the
last 2-3 paragraphs, the search returns no hits, yet searching for
anything above those paragraphs works just fine. WordExtractor does
convert the entire document to text, I've checked that.

I've tried increasing the number of terms per field from the default
10,000 to 20,000 with writer.maxFieldLength, but that didn't make any
difference; I still can't find phrases from the last 2-3 paragraphs.

Any ideas as to why this could be happening and how I could rectify
it?
thanks,
-pedja
