Some time ago, I posted a message asking for
volunteers to create a Wikipedia CD/DVD.
Since then, I have been working on this project
and have made some progress, which will be
published as soon as it works as expected.
These are the steps that I am following to
process the XML databases:
. :-)
Have a nice weekend!
Alejandro
--
View this message in context:
http://n4.nabble.com/Words-Indexing-strategies-tp1473753p1554526.html
Sent from the Revolution - User mailing list archive at Nabble.com.
On Fri, Feb 12, 2010 at 3:24 AM, Alejandro Tejada
capellan2...@gmail.com wrote:
I have a DLL named dbsqlite.dll (452 K) in my Rev Studio installation.
If an experienced database developer could lend a hand, I would be
really grateful.
Hi Alejandro
Ok, since you don't know anything about
Bernard Devlin wrote:
However, it looks to me like the existing indexes don't contain enough
information for you to calculate frequency of occurrence (a measure of
relevance).
Once again, MetaCard to the rescue! :)
Raney included this little gem in MC's Examples stack, and using repeat for
On Wed, Feb 10, 2010 at 10:30 PM, Alejandro Tejada
capellan2...@gmail.com wrote:
Yes, each one of these 28 text files will be compressed
in gz format. When users look for a word, or many words,
only these file(s) are decompressed and searched.
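The per-letter scheme above can be sketched in Python. This is a minimal, hedged sketch, not the author's actual Revolution code: the file names (`a.txt.gz` … `z.txt.gz`) and the `word:id,id,...` line format are assumptions for illustration. The point is that a lookup decompresses only the one file matching the word's first letter.

```python
# Sketch of the per-letter gzip index: one compressed text file per
# initial letter; a search opens only the file for the query word's
# first letter. File layout and naming are illustrative assumptions.
import gzip
import os

def write_letter_file(letter, entries, folder="index"):
    """entries: dict word -> list of article IDs, all starting with `letter`."""
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, f"{letter}.txt.gz")
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for word, ids in sorted(entries.items()):
            f.write(word + ":" + ",".join(map(str, ids)) + "\n")

def lookup(word, folder="index"):
    """Decompress only the file for the word's first letter and scan it."""
    path = os.path.join(folder, f"{word[0]}.txt.gz")
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            key, _, ids = line.rstrip("\n").partition(":")
            if key == word:
                return [int(i) for i in ids.split(",")]
    return []

write_letter_file("m", {"monkey": [1, 34, 3827], "moon": [7]})
print(lookup("monkey"))  # [1, 34, 3827]
```

For a multi-word query, you would repeat `lookup` once per word (decompressing at most one file per distinct first letter) and then combine the resulting ID lists.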
Like Brian, I was going to suggest existing search
Hi Bernard,
on Thu, 11 Feb 2010 09:13:46 +
Bernard Devlin wrote:
Like Brian, I was going to suggest existing search technologies like
Lucene. Why re-invent the wheel? I understand your not wanting to
ship Java and make the user install it. However, there may be other
pre-existing
Hi Bernard,
on Wed, 10 Feb 2010 07:20:36 +
Bernard Devlin wrote:
Can I just clarify your problem? You want to be able to search for
phrases (partial sentences, possibly with boolean logic) inside the
text stored in the xml nodes of the article, once the article is found
in the index?
On Wed, Feb 10, 2010 at 2:56 PM, Alejandro Tejada
capellan2...@gmail.com wrote:
No, it's not a search inside the displayed article.
It's a global search, within a general index created
using all words from all articles of Wikipedia.
(I do not believe that it's necessary to load this full
index in memory, instead just open specific parts
of this index when users start searching)
Alejandro,
The first step for this would likely include creating an inverted index. This
means you store something like:
monkey:1,34,3827,21314
Where the word being indexed is monkey and the numbers that follow are
article IDs. Using this information it is pretty trivial to implement AND /
OR searches.
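The inverted-index idea above can be sketched in a few lines of Python. This is a hedged illustration only; the article texts and function names are invented. AND becomes set intersection over the posting lists, OR becomes set union.

```python
# Sketch of an inverted index in the "monkey:1,34,3827" style described
# above: each word maps to the set of article IDs containing it.
from collections import defaultdict

def build_index(articles):
    """Map each word to the set of article IDs that contain it."""
    index = defaultdict(set)
    for article_id, text in articles.items():
        for word in text.lower().split():
            index[word].add(article_id)
    return index

def search_and(index, *words):
    """Articles containing ALL the words: intersection of posting lists."""
    sets = [index.get(w, set()) for w in words]
    return sorted(set.intersection(*sets)) if sets else []

def search_or(index, *words):
    """Articles containing ANY of the words: union of posting lists."""
    return sorted(set().union(*(index.get(w, set()) for w in words)))

articles = {1: "the monkey ate a banana",
            34: "monkey business in the jungle",
            2: "a banana a day"}
index = build_index(articles)
print(search_and(index, "monkey", "banana"))  # [1]
print(search_or(index, "monkey", "banana"))   # [1, 2, 34]
```

Writing each posting set out as `word:1,34,3827` then gives exactly the on-disk line format quoted above.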
The ambitious Alejandro Tejada wrote:
It's a global search, within a general index created
using all words from all articles of Wikipedia.
(I do not believe that it's necessary to load this full
index in memory, instead just open specific parts
of this index when users start searching)
For
Many thanks for replying to this question. :-)
on Wed, 10 Feb 2010 15:30:23 +
Bernard Devlin wrote:
OK, so that's why you mention the different files for each letter of
the alphabet.
Yes, each one of these 28 text files will be compressed
in gz format. When users look for a word, or many
Yes, this is correct and should work fine, but how could I write in the
word index a range of articles in which a word appears consecutively:
baboon:1934,2345,2346,2347,2348,2349,2350,2351,2352,2567,3578
If this were your format, you could compact to something like:
baboon:1934,2345-2352,2567,3578
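The compaction shown in the baboon example can be sketched as a small Python pair of helpers. This is an illustrative sketch, not code from the thread; the function names are invented. Runs of consecutive IDs collapse to `start-end`, and the search side expands them back.

```python
# Sketch of range compaction for posting lists: collapse runs of
# consecutive article IDs into "start-end" spans, and expand on read.
def compact_ids(ids):
    """[1934, 2345, 2346, ..., 2352, 2567] -> '1934,2345-2352,2567'."""
    ids = sorted(ids)
    parts, start, prev = [], ids[0], ids[0]
    for n in ids[1:]:
        if n == prev + 1:
            prev = n
            continue
        parts.append(f"{start}-{prev}" if start != prev else str(start))
        start = prev = n
    parts.append(f"{start}-{prev}" if start != prev else str(start))
    return ",".join(parts)

def expand_ids(spec):
    """Inverse: '2345-2352' -> [2345, 2346, ..., 2352]."""
    out = []
    for part in spec.split(","):
        if "-" in part:
            lo, hi = map(int, part.split("-"))
            out.extend(range(lo, hi + 1))
        else:
            out.append(int(part))
    return out

ids = [1934, 2345, 2346, 2347, 2348, 2349, 2350, 2351, 2352, 2567, 3578]
print("baboon:" + compact_ids(ids))  # baboon:1934,2345-2352,2567,3578
```

Since the index lines are plain text, this compaction also helps the gz compression: long runs shrink before the compressor even sees them.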
On Tue, Feb 9, 2010 at 12:00 AM, Alejandro Tejada
capellan2...@gmail.com wrote:
Now, I am looking for advice on creating an index structure for searching
specific words inside the articles' text. I have been unable to implement
a fast search algorithm, using multiple words, similar to Wikipedia's
Hi all,
Some time ago, I posted a message asking for
volunteers to create a Wikipedia CD/DVD.
Since then, I have been working on this project
and have made some progress, which will be
published as soon as it works as expected.
Now, I need advice about possible strategies
to create a fast and