Hey Oliver Keys from Wikipedia.org. I am glad you are interested in the wikitimelines.net concept, and I look forward to engaging the Wikipedia community in a discussion about its potential. We developed wikitimelines.net strictly for resume enhancement purposes. :)
You asked/said:

>> I think it would be pretty awesome, yep :).
>> How exactly would it work; we'd link through to your setup?
>> Your software would be hosted here?
>> How it's formatted will alter who needs to approve what (if it's not
>> something that involves developer work or approval, for example, it's a
>> community matter), but however we do it I would be very interested to see
>> more feedback and reactions from the community: we try to include them in
>> any technical changes.

**************************************************************************************

First off, how did you get a job at Wikipedia? Do you have to know somebody? Does bribery work? <snicker, just kidding> How many servers does Wikipedia have now? Are you in the Fort Lauderdale, Florida Wikipedia server center?

**************************************************************************************

I believe a timeline of a Wikipedia article could be placed in a Wikipedia article using the following notation:

<script src="http://wikitimelines.net/H/P/HP40C1A2X.js" type="text/javascript"></script>
<div id="HP40C1A2X"></div>

Of course the magic part is not quite that easy; read on.

I think you/we would only need one fast server for this ($300 a month, Shawn?). Just give it a quad processor, 16 gigs of memory and a big pipe. I could probably serve up a million timelines a day off a server like that, with about 30 instances of my database running concurrently. It would be a very busy server (so we would probably have to pack it in ice <ha ha>). You could use Carbonite to back it up, because in the end, the timelines back end would consist of millions of little databases. Where the server is physically located and who actually owns/maintains it makes no difference to us, as long as we/I have access to it.

The back end for timelines was designed for:

1) Portability
2) Storage optimization (no indexes needed, almost)
3) Ease of replication and backup
4) Ease of multi-user usage (no large databases or records to lock)
5) Heavy demand
6) Speed, speed and speed (I have it on a cheap, slow server; that is why it appears a bit slow right now)

The heart of the timelines is a rather complex date parsing system and a presentation layer (10,000 lines of code, 3,000+ man-hours of work). Here is a step-by-step listing of its (back-end) algorithm traversing its task of greatest friction. :) (In these steps I use the word "I" instead of "we" for clarity. I have 2 to 3 people working on this project at any one time, so using "we" would be more accurate.) A rough sketch of the request flow and the date handling follows the step list below.

1) A timeline is requested.
2) I check if the timeline's directory exists (I have all 7 million Wikipedia titles and have assigned them all unique IDs).
3) If the directory does exist (and therefore the timeline has already gone through initial processing), I send the timeline JavaScript via AJAX, then I send the timeline data as XML or JSON.
4) If the timeline doesn't exist, I create a directory for it.
5) In this directory I create the following tables:

Create Table epochs (Id c(9), Selected l)
Create Table pics (Id c(9), Caption m, bigpic m, startdate T, Date T, Current c(1), modified T, added T, Height N(4,0), Width N(4,0), Link m)
Create Table mess (Id c(9), Name c(35), email c(35), website c(35), Date T, Active c(1), Mess m)
Create Table sen (sen c(9), numdates N(2), Para c(9), Start N(5), End N(5), startd T, Endd T, First c(1), Current c(1), Deleted T, added T, Color c(6), tsen N(4))
Create Table para (Id c(9), Fixed m, dates m, marked m, Current c(1), added T, First c(1), Deleted T)
Create Table decorator (Id c(9), startdate T, enddate T, Color c(6), opacity N(3), startlabel m, Current c(1), Type c(1), Deleted T, modified T, added T)
Create Table allmags (Id c(9), Start T, End T, unit c(1), mag N(4), Current c(1), Order N(4,0), band N(3), Deleted T, modified T, added T)
Create Table tljsdb (Id c(9), band N(2), Prop c(1), Value m, Current c(1), Deleted T, modified T, added T)
Create Table global (Date d, Height N(10), Width N(10), tlheight N(10), Current c(1), gotpics l, modified T, added T, picsavail l, rtotal N(10), rcount N(10), lasttime T)

6) Then I pull the Wikipedia article from your (en.Wikipedia.org) web servers.
7) I then pull out and clean up each individual paragraph from the article and stick each one into a database.
8) I then mark all of the (suspected) dates in the paragraph and save that into another field in the database (very complicated).
9) I then do sentence disambiguation, which is a lot more complex than we had originally thought it would be; mine is nearly 100% accurate. http://en.wikipedia.org/wiki/Sentence_boundary_disambiguation
10) I then send that into a comprehensive date disambiguation algorithm to see if the (suspected) dates are really dates.
11) If they are, I first detect "continuous dates" (dates that denote a continuum). Example: "He was prime minister from January 20, 1874 to August 2, 1880."
12) I then grab single dates. Example: "He became prime minister on January 20, 1874."
13) I then grab "widow dates" like "He became prime minister in January of that year," because in order for this to make any sense (in natural language) a year is almost always in the previous sentence or paragraph.
14) I then parse out all pictures from the article and disambiguate all of the potential dates from the pictures' captions, using the same algorithm I used on the paragraphs. These dates are used for picture placement on the timeline. Users can turn pictures on or off and can change the dates on the pictures to adjust where they are placed on the timeline.
15) I then construct the JavaScript for the new timeline.
16) I send the timeline's JavaScript to the client browser.
17) I then construct the XML data for the timeline.
18) I then send the XML to the browser (actually, the timeline's JavaScript requests the XML as it executes).

Of course, this whole thing is much more involved than this; I just wanted to give you an overview. If the timeline already exists (in my databases), it only takes seconds to construct and display it. If the timeline does not exist, it can take up to 60 seconds to traverse steps 1 to 18 above (depending on the size of the article). Luckily this is a one-shot deal, as it only happens once for each Wikipedia article. Each timeline is only "born" once.

I hope this helps! We have a rather long list of improvements for the website and the back end. We just put the website up as a beta to get as much feedback as quickly as possible.
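To make steps 1-4 and 15-18 a bit more concrete, here is a minimal sketch of the cached-vs-new decision as it might look in Node-style JavaScript. The directory layout, file names and the buildTimeline helper are just illustrative stand-ins, not the actual back-end code (which, as I said, is not nearly this simple):

// Rough sketch only: the paths and the buildTimeline helper are made up for illustration.
var fs = require('fs');
var path = require('path');

var TIMELINE_ROOT = '/data/timelines'; // assumed layout, e.g. /data/timelines/H/P/HP40C1A2X

// buildTimeline stands in for steps 5 through 15 (tables, article fetch,
// date disambiguation, picture parsing, JavaScript construction).
function getTimelineJs(articleId, buildTimeline) {
  var dir = path.join(TIMELINE_ROOT, articleId.charAt(0), articleId.charAt(1), articleId);
  var jsFile = path.join(dir, 'timeline.js');

  if (fs.existsSync(jsFile)) {
    // The timeline was already "born": serving it again is just a file read.
    return fs.readFileSync(jsFile, 'utf8');
  }

  // One-shot build: this branch runs only the first time an article is requested.
  fs.mkdirSync(dir, { recursive: true });
  var js = buildTimeline(articleId);
  fs.writeFileSync(jsFile, js, 'utf8');
  return js;
}

On the article side, the <script src="..."> tag above requests exactly this JavaScript, and the returned script draws the timeline into the matching <div> and then asks for its XML/JSON data via AJAX as it executes.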
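And since the date handling is the heart of the thing, here is a toy illustration of the three cases in steps 11 through 13 (continuous, single and "widow" dates). The real parser is thousands of lines and handles far more formats; these patterns, and the simple "borrow the year from the previous sentence" rule, are only a simplification to show the idea:

// Toy illustration of steps 11-13; the patterns below are a big simplification.
var MONTH = '(?:January|February|March|April|May|June|July|August|September|October|November|December)';
var FULL_DATE = MONTH + '\\s+\\d{1,2},\\s+\\d{4}';
var CONTINUOUS = new RegExp('(' + FULL_DATE + ')\\s+to\\s+(' + FULL_DATE + ')');
var SINGLE = new RegExp('(' + FULL_DATE + ')');
var WIDOW = new RegExp('in\\s+(' + MONTH + ')\\s+of\\s+that\\s+year');

function classifyDates(sentence, previousYear) {
  var m;
  // Order matters: look for a continuum before settling for a single date.
  if ((m = CONTINUOUS.exec(sentence))) {
    // "He was prime minister from January 20, 1874 to August 2, 1880."
    return { type: 'continuous', start: m[1], end: m[2] };
  }
  if ((m = SINGLE.exec(sentence))) {
    // "He became prime minister on January 20, 1874."
    return { type: 'single', date: m[1] };
  }
  if ((m = WIDOW.exec(sentence)) && previousYear) {
    // "He became prime minister in January of that year." (year taken from earlier text)
    return { type: 'widow', date: m[1] + ' ' + previousYear };
  }
  return { type: 'none' };
}

In the real pipeline, of course, it is the marked-up paragraphs from step 8, not raw sentences like these, that feed the disambiguation.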
So it is BACK TO WORK! lol

Thanks

Jeff Roehl
[email protected]
(818) 912-7530
