Re: Words Indexing strategies

2010-02-22 Thread Alejandro Tejada
Some time ago, I posted a message asking for volunteers to create a Wikipedia CD/DVD. Since then, I have been working on this project and have made some advances, which will be published as soon as they work as expected. These are the steps that I am following to process the XML databases:

Re: Words Indexing strategies

2010-02-13 Thread Alejandro Tejada
. :-) Have a nice weekend! Alejandro

Re: Words Indexing strategies

2010-02-12 Thread Bernard Devlin
On Fri, Feb 12, 2010 at 3:24 AM, Alejandro Tejada capellan2...@gmail.com wrote: I have a dll named: dbsqlite.dll (452 K) in my Rev Studio installation. If an experienced database developer could lend a hand, I would be really grateful. Hi Alejandro Ok, since you don't know anything about

Re: Words Indexing strategies

2010-02-12 Thread Richard Gaskin
Bernard Devlin wrote: However, it looks to me like the existing indexes don't contain enough information for you to calculate frequency of occurrence (a measure of relevance). Once again, MetaCard to the rescue! :) Raney included this little gem in MC's Examples stack, and using repeat for

Re: Words Indexing strategies

2010-02-11 Thread Bernard Devlin
On Wed, Feb 10, 2010 at 10:30 PM, Alejandro Tejada capellan2...@gmail.com wrote: Yes, each one of these 28 text files will be compressed in gz format. When users look for a word, or many words, only these file(s) are decompressed and searched. Like Brian, I was going to suggest existing search

Re: Words Indexing strategies

2010-02-11 Thread Alejandro Tejada
Hi Bernard, on Thu, 11 Feb 2010 09:13:46 + Bernard Devlin wrote: Like Brian, I was going to suggest existing search technologies like Lucene. Why re-invent the wheel? I understand you not wanting to ship Java and get the user to install it. However there may be other pre-existing

Re: Words Indexing strategies

2010-02-10 Thread Alejandro Tejada
Hi Bernard, on Wed, 10 Feb 2010 07:20:36 + Bernard Devlin wrote: Can I just clarify your problem? You want to be able to search for phrases (partial sentences, possibly with boolean logic) inside the text stored in the xml nodes of the article, once the article is found in the index?

Re: Words Indexing strategies

2010-02-10 Thread Bernard Devlin
On Wed, Feb 10, 2010 at 2:56 PM, Alejandro Tejada capellan2...@gmail.com wrote: No, it's not a search inside the displayed article. It's a global search, within a general index created using all words from all articles of Wikipedia. (I do not believe that it's necessary to load this full

Re: Words Indexing strategies

2010-02-10 Thread Brian Yennie
Alejandro, The first step for this would likely include creating an inverted index. This means you store something like: monkey:1,34,3827,21314 Where the word being indexed is monkey and the numbers that follow are article IDs. Using this information it is pretty trivial to implement AND /
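A minimal sketch of the inverted index described above, written in Python rather than Revolution purely for illustration. The function names (`build_inverted_index`, `format_entry`, `search_all`) and the sample articles are hypothetical, and splitting on whitespace is an assumed stand-in for real tokenizing. Only the AND case is shown, as an intersection of article-ID sets.

```python
from collections import defaultdict


def build_inverted_index(articles):
    """Map each lowercase word to the sorted list of article IDs containing it."""
    index = defaultdict(set)
    for article_id, text in articles.items():
        # Assumption: naive whitespace tokenizing, no punctuation handling.
        for word in text.lower().split():
            index[word].add(article_id)
    return {word: sorted(ids) for word, ids in index.items()}


def format_entry(word, ids):
    """Render one index line in the word:id,id,... form used in the thread."""
    return word + ":" + ",".join(str(i) for i in ids)


def search_all(index, words):
    """AND search: return IDs of articles that contain every given word."""
    id_sets = [set(index.get(w.lower(), ())) for w in words]
    if not id_sets:
        return []
    return sorted(set.intersection(*id_sets))


# Hypothetical sample data, not taken from the Wikipedia dump.
articles = {
    1: "The monkey eats fruit",
    34: "A monkey and a baboon",
    3827: "Baboon troop",
}
index = build_inverted_index(articles)
monkey_line = format_entry("monkey", index["monkey"])
both = search_all(index, ["monkey", "baboon"])
```

Here `monkey_line` holds the stored form of one entry and `both` holds the IDs of articles that mention both words.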

Re: Words Indexing strategies

2010-02-10 Thread Richard Gaskin
The ambitious Alejandro Tejada wrote: It's a global search, within a general index created using all words from all articles of Wikipedia. (I do not believe that it's necessary to load this full index in memory, instead just open specific parts of this index when users start searching) For

Re: Words Indexing strategies

2010-02-10 Thread Alejandro Tejada
Many thanks for replying to this question. :-) on Wed, 10 Feb 2010 15:30:23 + Bernard Devlin wrote: OK, so that's why you mention the different files for each letter of the alphabet. Yes, each one of these 28 text files will be compressed in gz format. When users look for a word, or many
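The per-letter lookup could be sketched roughly as follows, again in Python for illustration only. The file naming scheme (`m.txt.gz` for words starting with "m"), the one-entry-per-line `word:id,id,...` layout, and the helper names `index_path` and `lookup` are all assumptions, not details given in the thread. Only the file for the searched word's first letter is opened and decompressed.

```python
import gzip
import os
import tempfile


def index_path(index_dir, word):
    # Hypothetical naming scheme: one gz file per first letter, e.g. "m.txt.gz".
    return os.path.join(index_dir, word[0].lower() + ".txt.gz")


def lookup(index_dir, word):
    """Return the article IDs stored for word, reading only its letter file."""
    word = word.lower()
    if not word:
        return []
    path = index_path(index_dir, word)
    if not os.path.exists(path):
        return []
    prefix = word + ":"
    # gzip.open in text mode decompresses the file as it is read line by line.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.startswith(prefix):
                ids = line[len(prefix):].strip()
                return [int(n) for n in ids.split(",") if n]
    return []


# Small demonstration with a throwaway directory and made-up entries.
with tempfile.TemporaryDirectory() as demo_dir:
    with gzip.open(os.path.join(demo_dir, "m.txt.gz"), "wt", encoding="utf-8") as out:
        out.write("mango:7,9\n")
        out.write("monkey:1,34,3827,21314\n")
    found = lookup(demo_dir, "Monkey")
    missing = lookup(demo_dir, "baboon")
```

Scanning a letter file line by line is the simplest option; a sorted file with an offset table would be faster, but that goes beyond what the thread describes.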

Re: Words Indexing strategies

2010-02-10 Thread Brian Yennie
Yes, this is correct and should work fine, but how could I write in the word index a range of articles where a word appears consecutively: baboon:1934,2345,2346,2347,2348,2349,2350,2351,2352,2567,3578 If this were your format, you could compact to something like: baboon:1934,2345-2352,2567,3578
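The compaction suggested above can be sketched as a pair of helpers, shown in Python for illustration. The names `compress_ids` and `expand_ids` are hypothetical, and treating a run of just two consecutive IDs as a range is an assumption; the thread only shows the longer run.

```python
def compress_ids(ids):
    """Collapse runs of consecutive article IDs into start-end ranges."""
    ids = sorted(set(ids))
    parts = []
    i = 0
    while i < len(ids):
        start = end = ids[i]
        # Extend the current run while the next ID is exactly one higher.
        while i + 1 < len(ids) and ids[i + 1] == end + 1:
            i += 1
            end = ids[i]
        parts.append(str(start) if start == end else f"{start}-{end}")
        i += 1
    return ",".join(parts)


def expand_ids(text):
    """Reverse compress_ids: turn '1934,2345-2352' back into a list of IDs."""
    ids = []
    for part in text.split(","):
        if "-" in part:
            low, high = part.split("-")
            ids.extend(range(int(low), int(high) + 1))
        elif part:
            ids.append(int(part))
    return ids


baboon = [1934, 2345, 2346, 2347, 2348, 2349, 2350, 2351, 2352, 2567, 3578]
compact = compress_ids(baboon)
restored = expand_ids(compact)
```

With the baboon entry from the message, `compact` is the shortened form and `restored` is the original list of article IDs again.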

Re: Words Indexing strategies

2010-02-09 Thread Bernard Devlin
On Tue, Feb 9, 2010 at 12:00 AM, Alejandro Tejada capellan2...@gmail.com wrote: Now, I am looking for advice to create an index structure for searching specific words inside article's text. I have been unable to implement a fast search algorithm, using multiple words, similar to Wikipedia's

Words Indexing strategies

2010-02-08 Thread Alejandro Tejada
Hi all, Some time ago, I posted a message asking for volunteers to create a Wikipedia CD/DVD. Since then, I have been working on this project and have made some advances, which will be published as soon as they work as expected. Now, I need advice about possible strategies to create a fast and