IndexWriter and Directory create param

2003-06-10 Thread Leslie Hughes
Hi,

I'm doing something like:-

Directory dir = FSDirectory.getDirectory(myindex, true);
IndexWriter writer = new IndexWriter(dir, myAnalyser, true);

which gives me a nice clean index. But what if the create params are
different? If I open a directory with create=false then create a writer on
it with create = true will this give problems? Maybe I should do something
like

boolean flag = true; // or false, as the situation requires
Directory dir = FSDirectory.getDirectory(myindex, flag);
IndexWriter writer = new IndexWriter(dir, myAnalyser, false);


Whilst I'm on the subject, there doesn't appear to be a standard way of
creating a Directory, FSDir has a getDirectory whilst RAMDir uses a
constructor - shouldn't there be a standard method on the Directory
interface (like there is with close)? Or maybe a configurable
DirectoryFactory?


Ideas?

Bye

Les




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: IndexReader thread safety

2003-06-10 Thread Otis Gospodnetic
FAQ?
Yes :)

Otis

--- Eric Jain [EMAIL PROTECTED] wrote:
 Several threads can share a single IndexReader instance. Correct?
 
 --
 Eric Jain
 
 
 


__
Do you Yahoo!?
Yahoo! Calendar - Free online calendar with sync to Outlook(TM).
http://calendar.yahoo.com




Re: IndexWriter and Directory create param

2003-06-10 Thread Otis Gospodnetic
Hello Les,

 Directory dir = FSDirectory.getDirectory(myindex, true);
 IndexWriter writer = new IndexWriter(dir, myAnalyser, true);
 
 which gives me a nice clean index. But what if the create params are
 different? If I open a directory with create=false then create a
 writer on it with create = true will this give problems?

If I understand you correctly, then the answer is: no, this should not
cause problems.  You could easily try that, no?

 Maybe I should do something like
 
 boolean flag = true; // or false, as the situation requires
 Directory dir = FSDirectory.getDirectory(myindex, flag);
 IndexWriter writer = new IndexWriter(dir, myAnalyser, false);

I've seen people use code like that.
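
One way to avoid hard-coding the flag is to derive it from whether an index already exists on disk. The following is only a minimal sketch using the plain JDK. It assumes the Lucene 1.x on-disk layout, in which an index directory contains a file named "segments"; the class and method names (CreateFlag, shouldCreate) are hypothetical. If your Lucene version offers IndexReader.indexExists, that is probably the better check.

```java
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper: picks the value to pass as the "create" flag. */
final class CreateFlag {
    private CreateFlag() {}

    /**
     * Returns true when no index appears to exist in indexDir, i.e. when
     * passing create=true would not wipe an existing index.
     * Assumption: a Lucene 1.x index directory contains a "segments" file.
     */
    static boolean shouldCreate(Path indexDir) {
        return !Files.isRegularFile(indexDir.resolve("segments"));
    }
}
```

The result would then be used as the last argument of the calls shown above, for example new IndexWriter(dir, myAnalyser, create).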

 Whilst I'm on the subject, there doesn't appear to be a standard way
 of creating a Directory, FSDir has a getDirectory whilst RAMDir uses a
 constructor - shouldn't there be a standard method on the Directory
 interface (like there is with close)? Or maybe a configurable
 DirectoryFactory?

Perhaps.  Directory is an abstract class.  One could add an abstract
open(...) method, maybe.  I don't have a need for it...
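
For illustration, a configurable factory along the lines Les suggests might look roughly like the sketch below. It is not part of Lucene: Dir is a stand-in for Lucene's Directory so that the example compiles on its own, and DirFactory, register, open and withDefaults are hypothetical names. It uses lambdas, so it needs a current JDK rather than the Java of 2003.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiFunction;

/** Stand-in for Lucene's Directory; hypothetical, for illustration only. */
interface Dir {
    String describe();
}

/** Hypothetical factory: maps a configured name to a way of opening a Dir. */
final class DirFactory {
    // Each opener takes (location, create) and returns a Dir.
    private final Map<String, BiFunction<String, Boolean, Dir>> openers = new HashMap<>();

    void register(String kind, BiFunction<String, Boolean, Dir> opener) {
        openers.put(kind, opener);
    }

    Dir open(String kind, String location, boolean create) {
        BiFunction<String, Boolean, Dir> opener = openers.get(kind);
        if (opener == null) {
            throw new IllegalArgumentException("Unknown directory kind: " + kind);
        }
        return opener.apply(location, create);
    }

    /** Factory preloaded with two illustrative kinds. */
    static DirFactory withDefaults() {
        DirFactory f = new DirFactory();
        // In real code these openers would wrap FSDirectory.getDirectory(path, create)
        // and the RAMDirectory constructor respectively.
        f.register("fs", (loc, create) -> () -> "fs:" + loc + (create ? " (new)" : ""));
        f.register("ram", (loc, create) -> () -> "ram");
        return f;
    }
}
```

The kind string would come from the config file, so callers never need to know whether a given implementation is built through a static method or a constructor.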

Otis





Re: Duplicating a document in the index.

2003-06-10 Thread Otis Gospodnetic
You could add it twice.
You could probably also get it out of the index (e.g. via search), and
re-add it.
You can have multiple instances of the same document in the index.

Otis

--- Victor Hadianto [EMAIL PROTECTED] wrote:
 Hi list,
 
 Is there an easy way for duplicating a document in the index? Or can
 someone point me to the right direction for looking?
 
 Thanks,
 -- 
 Victor Hadianto
 
 NUIX Pty Ltd
 Level 8, 143 York Street, Sydney 2000
 Phone: (02) 9283 9010
 Fax:   (02) 9283 9020
 
 This message is intended only for the named recipient. If you are not
 the
 intended recipient you are notified that disclosing, copying,
 distributing
 or taking any action in reliance on the contents of this message or
 attachment is strictly prohibited.
 
 





Re: IndexReader thread safety

2003-06-10 Thread Eric Jain
 Several threads can share a single IndexReader instance. Correct?
 FAQ?
 Yes :)

No.
Where?

--
Eric Jain





Re: High Capacity (Distributed) Crawler

2003-06-10 Thread Otis Gospodnetic
Leo,

 The first beta is done (without NIO). It needs, however, further
 testing. Unfortunately, I could not find enough servers which I may
 hit.

Nice.  Pretty much any site is a candidate, as long as you are nice to
it.
You could, for instance, hit all dmoz URLs.  Or you could extract a set
of links from Yahoo.  Or you could try finding that small and large set
of URLs that Google provided a while ago for their Google Challenge.

 I wanted to commit the robot as a part of egothor (it will use it in
 PULL mode), but we have nice weather here, so I lost any motivation
 to play with the PC ;-).

Yes, I hear some places in central Europe are having temperatures of
36-38 C.  Hot!
We are not that lucky in NYC this year :(  Lots of rain and cloudy
weather, which is atypical.

 What interface do you need for Lucene? Will you use PUSH (=the robot
 will modify Lucene's index) or PULL (=the engine will get deltas from
 the robot) mode? Tell me what you need and I will try to do my
 best.

I'd imagine one would want to use it in the PUSH mode (e.g. the crawler
fetches a web page and adds it to the searchable index).
How does PULL mode work?  I've never heard of web crawlers being used
in the PULL mode.  What exactly does that mean, could you please
describe it?

Thanks,
Otis


 Otis Gospodnetic wrote:
 
 Leo,
 
 Have you started this project?  Where is it hosted?
 It would be nice to see a few alternative implementations of a
 robust
 and scalable java web crawler with the ability to index whatever it
 fetches.
 
 Thanks,
 Otis
 
 --- Leo Galambos [EMAIL PROTECTED] wrote:
   
 
 Hi.
 
 I would like to write a High Capacity (Distributed) Crawler (HCDC),
 because LARM does not offer many options which are required by
 web/http crawling IMHO. Here is my list:
 
 1. I would like to manage the decision of what will be gathered
    first - this would be based on pageRank, number of errors,
    connection speed etc.
 2. pure Java solution without any DBMS/JDBC
 3. better configuration in case of an error
 4. NIO style as it is suggested by the LARM specification
 5. egothor's filters for automatic processing of various data formats
 6. management of Expires HTTP-meta headers, heuristic rules which
    will describe how fast a page can expire (.php often expires
    faster than .html)
 7. reindexing without any data exports from a full-text index
 8. open protocol between the crawler and a full-text engine
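
Item 1 of the list above (deciding what is gathered first) can be sketched as a priority queue over crawl candidates. This is only an illustration, not Leo's design: the class name CrawlFrontier, its fields, and the tie-break order (higher pageRank first, then fewer errors, then faster connection) are all assumptions made for the example.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

/** Hypothetical crawl frontier that hands out the most promising URL first. */
final class CrawlFrontier {
    private static final class Candidate {
        final String url;
        final double pageRank;
        final int errors;
        final double kbPerSec;

        Candidate(String url, double pageRank, int errors, double kbPerSec) {
            this.url = url;
            this.pageRank = pageRank;
            this.errors = errors;
            this.kbPerSec = kbPerSec;
        }
    }

    // Assumed ordering: higher pageRank, then fewer errors, then faster host.
    private static final Comparator<Candidate> ORDER =
        Comparator.comparingDouble((Candidate c) -> c.pageRank).reversed()
            .thenComparingInt(c -> c.errors)
            .thenComparing(Comparator.comparingDouble((Candidate c) -> c.kbPerSec).reversed());

    private final PriorityQueue<Candidate> queue = new PriorityQueue<>(ORDER);

    void add(String url, double pageRank, int errors, double kbPerSec) {
        queue.add(new Candidate(url, pageRank, errors, kbPerSec));
    }

    /** Returns the next URL to fetch, or null when the frontier is empty. */
    String next() {
        Candidate c = queue.poll();
        return c == null ? null : c.url;
    }
}
```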
 
 If anyone wants to join (or just extend the wish list), let me
 know,
 please.
 
 -g-
 
 

 





Re: IndexReader thread safety

2003-06-10 Thread Otis Gospodnetic
jGuru?
I think there is one about IndexSearcher.  Not quite the same as
IndexReader, but close :)

Otis

--- Eric Jain [EMAIL PROTECTED] wrote:
  Several threads can share a single IndexReader instance. Correct?
  FAQ?
  Yes :)
 
 No.
 Where?
 
 --
 Eric Jain
 
 
 





RE: IndexWriter and Directory create param

2003-06-10 Thread Otis Gospodnetic
I'd say, if a common public method for all Directory implementations
would help you, try it out by modifying the sources locally, and if you
are happy with it, submit a patch.
I've always just used String references to directories where my indices
were, so I never needed this common method.

Otis

--- Leslie Hughes [EMAIL PROTECTED] wrote:
 Hi Otis, Thanks for the reply.  
 
 My DirectoryImpl is configurable in a config file so I dynamically
 instantiate whatever's listed there. Because of the private
 constructor in FSDir and the lack of getDirectory on the interface,
 I'm having to do :-
 
  try {
    // Make an FSDir if it's one of those
    return (Directory) Class.forName(myDefaultDirectoryImpl)
        .getMethod("getDirectory", new Class[] {String.class, boolean.class})
        .invoke(null, new Object[] {myDefaultIndex, Boolean.FALSE});
  } catch (Exception e) {
    // Not an FSDir-style class, or the reflective call failed
  }
 
 Which I think is rather funky :-) but some would say not very clean.
 Anyway, adding a getDirectory to the Directory class would be rather
 neat: then I could use the above for all dirs - this doesn't work with
 RAMDir or DBDir at the moment of course - and wrap the whole lot into
 a DirectoryFactory+config.xml file.
 
 On the other point, I've decided to go with creating the new index
 via the writer - no real reason, just couldn't see why not :-)
 
 
 Bye
 
 Les
 
 
 
 
 
 





Re: IndexReader thread safety

2003-06-10 Thread Eric Jain
 jGuru?

Found it: http://www.jguru.com/faq/view.jsp?EID=492393

Thanks a lot, day saved.

--
Eric Jain





Re: Duplicating a document in the index.

2003-06-10 Thread Victor Hadianto
On Tue, 10 Jun 2003 05:03 pm, Otis Gospodnetic wrote:
 You could add it twice.
 You could probably also get it out of the index (e.g. via search), and
 re-add it.
 You can have multiple instances of the same document in the index.

Hmm, what if the field data is not available anymore? Is there a way to
duplicate fields in the index?


 Otis
victor


 





Re: High Capacity (Distributed) Crawler

2003-06-10 Thread Leo Galambos
Otis Gospodnetic wrote:

What interface do you need for Lucene? Will you use PUSH (=the robot
will modify Lucene's index) or PULL (=the engine will get deltas from
the robot) mode? Tell me what you need and I will try to do my
best.
   

I'd imagine one would want to use it in the PUSH mode (e.g. the crawler
fetches a web page and adds it to the searchable index).
How does PULL mode work?  I've never heard of web crawlers being used
in the PULL mode.  What exactly does that mean, could you please
describe it?
 

It is a long story, so I will assume that everything runs on a single
box - it is the simplest case.
[x] will denote points where I guess Lucene may have problems with a
fast implementation.

Crawler: The crawler stores the meta and body of all documents. If you
want to retrieve a document's meta or body (knowing its URI), it costs
O(1) (2 seeks and 2 read requests in auxiliary data structures). After
this retrieval you also get a direct handle to the meta and body - from
then on retrieval is still O(1), but needs no extra seeks in any
structures. The handle is persistent and is tied to the URI. The meta
and body are updated as soon as the crawler fetches a fresh copy.
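
The crawler-side store described here can be sketched in memory as follows. This is an illustration under assumptions, not Leo's implementation: DocStore and its methods are hypothetical names, handles are modelled as list positions, and the on-disk seeks and reads of the real structure are replaced by a hash map and an array list.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory model of the crawler's document store. */
final class DocStore {
    static final class Record {
        final String uri;
        String meta;
        String body;

        Record(String uri, String meta, String body) {
            this.uri = uri;
            this.meta = meta;
            this.body = body;
        }
    }

    private final Map<String, Integer> handleByUri = new HashMap<>();
    private final List<Record> records = new ArrayList<>(); // position == handle

    /** Stores a new document or refreshes an existing one; returns its handle. */
    int put(String uri, String meta, String body) {
        Integer h = handleByUri.get(uri);
        if (h == null) {
            records.add(new Record(uri, meta, body));
            h = records.size() - 1;
            handleByUri.put(uri, h);
        } else {
            Record r = records.get(h); // same URI keeps the same handle
            r.meta = meta;
            r.body = body;
        }
        return h;
    }

    /** Lookup by URI; returns -1 if the URI is unknown. */
    int handleOf(String uri) {
        Integer h = handleByUri.get(uri);
        return h == null ? -1 : h;
    }

    /** Direct access by handle, with no lookup by URI. */
    Record get(int handle) {
        return records.get(handle);
    }

    /** Highest handle handed out so far, or -1 if the store is empty. */
    int lastHandle() {
        return records.size() - 1;
    }
}
```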

Engine: The engine stores the handle for each document. Moreover, it
knows the last (highest) handle which is stored in the main index. So
the trick is this:
1) Build up an auxiliary index from new documents. The new documents
are those whose handle is greater than the last handle known to the
engine, so you can iterate over them easily - this process can run in
a separate thread.
2) Check for changes. You read the meta stored in the index and test
whether they are obsolete (note: you already have the handle, so this
is very fast). If so, you mark the respective document as deleted and
its new version (if any) is appended to another index - the index of
changes. The insertion into that index runs in a separate thread, so
the main thread is not blocked. BTW: [x] The documents which are not
modified may still have their ranks (depthrank, pagerank,
frequencyrank etc.) updated in this round.

[x] The two auxiliary indices are then merged with the main index.
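
One PULL round as described above can be sketched like this. The model is deliberately simplified and every name in it (PullRound, Delta, compute, merge) is hypothetical: the crawler is represented as a list in which position h holds the current meta for handle h, and the engine as a map from indexed handle to the meta it saw at indexing time. Threads, deletions of vanished documents and rank updates are left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of one PULL round between engine and crawler. */
final class PullRound {
    /** Handles to index for the first time, and handles to re-index. */
    static final class Delta {
        final List<Integer> added = new ArrayList<>();
        final List<Integer> changed = new ArrayList<>();
    }

    static Delta compute(List<String> crawlerMeta,
                         Map<Integer, String> indexedMeta,
                         int lastHandle) {
        Delta d = new Delta();
        // Step 1: every handle above lastHandle is a new document.
        for (int h = lastHandle + 1; h < crawlerMeta.size(); h++) {
            d.added.add(h);
        }
        // Step 2: an indexed document is obsolete if its meta has changed.
        for (Map.Entry<Integer, String> e : indexedMeta.entrySet()) {
            int h = e.getKey();
            if (h < crawlerMeta.size() && !e.getValue().equals(crawlerMeta.get(h))) {
                d.changed.add(h);
            }
        }
        return d;
    }

    /** Merge step: folds both auxiliary sets into the engine's view. */
    static int merge(Delta d, List<String> crawlerMeta,
                     Map<Integer, String> indexedMeta, int lastHandle) {
        for (int h : d.changed) {
            indexedMeta.put(h, crawlerMeta.get(h)); // replaces the old version
        }
        for (int h : d.added) {
            indexedMeta.put(h, crawlerMeta.get(h));
            lastHandle = Math.max(lastHandle, h);
        }
        return lastHandle; // new highest handle known to the engine
    }
}
```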

Obviously, the weak point is the test of whether anything has changed.
This can be easily solved with the index dynamization I use. Unlike
Lucene, I order barrels (segments in your terminology) by their size. I
do not want to describe all the details - I hate long e-mails ;-), but
the dynamization guarantees that:
a) the query time is never more than 8x worse than with a fully
optimized index (if you buy 8x faster HW, you overcome this easily)
b) the documents which are often modified are stored in small barrels
of the main index. It means that updating them is fast.
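
The message does not spell out the dynamization scheme, so the sketch below substitutes the classic logarithmic merge method to show the general idea of size-ordered barrels: new documents land in a small barrel, and barrels are merged only when the newest one has grown to the size of its neighbour. The class name Barrels is hypothetical, and the 8x bound quoted above is Leo's claim about his own scheme, not something this sketch demonstrates.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of size-ordered barrels (logarithmic merging). */
final class Barrels {
    // Barrel sizes in documents, largest first, newest (smallest) last.
    private final List<Integer> sizes = new ArrayList<>();

    /** Adds a freshly built barrel and merges while sizes are out of order. */
    void addBarrel(int docs) {
        sizes.add(docs);
        while (sizes.size() >= 2) {
            int n = sizes.size();
            int last = sizes.get(n - 1);
            int prev = sizes.get(n - 2);
            if (last < prev) {
                break; // still strictly decreasing, nothing to merge
            }
            sizes.remove(n - 1);           // drop the newest barrel ...
            sizes.set(n - 2, prev + last); // ... and fold it into its neighbour
        }
    }

    /** Snapshot of the current barrel sizes, largest first. */
    List<Integer> sizes() {
        return new ArrayList<>(sizes);
    }
}
```

With single-document barrels, recently added or re-added documents sit in the small barrels at the end of the list, so frequent updates touch only small merges while the large barrels are rewritten rarely.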

So, I process only the small barrels once a day, and the larger ones
less often. If we say that 5M docs are updated daily, PULL mode can
handle this load in a few minutes. Unfortunately, the slowest point is
the HTML parser, which may run for a few hours :-(.

If you want to update another 10^10 crap pages once a month, it can be
done too, but that is outside my first assumption above ;-).

-g-
