It's obviously going to depend on your configuration, time and
hardware budget, but I think the basic "grab the stream to timestamped
flat files and post-process later" approach has a lot going for it.
Especially on a Linux server, scripting languages are very good at that
kind of post-processing, and the tweets come in at such a high rate
that you may well want to use something other than a conventional
RDBMS like MySQL or PostgreSQL for your data analysis and management
anyhow. So why stuff things into MySQL, only to have to pull them out
again for a MapReduce?
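
For the capture step, something roughly like this works as a sketch
(untested; it assumes the stream arrives one JSON object per line on
stdin, e.g. piped in from curl hitting whatever streaming endpoint you
use, and the hourly file naming is just one way to do it):

  #!/usr/bin/perl
  # Rotate the incoming stream into hourly, timestamped flat files.
  use strict;
  use warnings;
  use POSIX qw(strftime);

  my ($fh, $current);
  while (my $line = <STDIN>) {
      my $stamp = strftime("%Y%m%d%H", localtime);    # one file per hour
      if (!defined $current or $stamp ne $current) {
          close $fh if $fh;
          open $fh, '>>', "tweets-$stamp.json" or die "open: $!";
          $current = $stamp;
      }
      print {$fh} $line;                              # one tweet object per line
  }
  close $fh if $fh;

Old files can then be filtered, loaded, or deleted on whatever schedule
suits you, without touching the capture process.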

My original design called for a process to sit on the streaming API
and dump the tweets into PostgreSQL one at a time, but I ended up just
collecting them into hourly JSON files, converting them to CSV with a
simple Perl script, then putting them into PostgreSQL with the
blazingly fast "COPY" command. It doesn't take very long to build a
sizable tweet database that way. And I can do filtering at the JSON or
CSV level before anything even goes into PostgreSQL.
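
The Perl conversion itself is only a few lines. A rough sketch
(untested; the column choices are just examples, and the field names
assume the usual status-object layout with id_str, created_at,
user.screen_name and text):

  #!/usr/bin/perl
  # Flatten one hourly JSON file (one tweet per line on stdin) into CSV on stdout.
  use strict;
  use warnings;
  use JSON;
  use Text::CSV;

  binmode STDOUT, ':utf8';
  my $json = JSON->new->utf8;
  my $csv  = Text::CSV->new({ binary => 1, eol => "\n" });

  while (my $line = <STDIN>) {
      my $tweet = eval { $json->decode($line) } or next;  # skip malformed lines
      next unless $tweet->{id_str};                       # skip delete notices etc.
      $csv->print(\*STDOUT, [
          $tweet->{id_str},
          $tweet->{created_at},
          $tweet->{user}{screen_name},
          $tweet->{text},
      ]);
  }

After that, a single "\copy tweets from 'tweets-XXXX.csv' with csv" in
psql (assuming a matching tweets table) loads the whole hour in one
shot, which is where the COPY speedup comes from.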

On Jan 16, 10:13 am, GeorgeMedia <georgeme...@gmail.com> wrote:
> Just looking for thoughts on this.
>
> I am consuming the gardenhose via a php app on my web server. So far
> so good. The script simply creates a new file every X amount of time
> and starts feeding the stream into it so I get a continuous stream of
> fresh data and I can delete old data via cron. I plan to access the
> stream (files) with separate processes for further json parsing and
> data mining.
>
> But then that got me to thinking about simply feeding the data into a
> MySQL database for easier data manipulation and indexing. Would the
> constant INSERT queries put more load on the server than a process
> just dumping the data into a perpetually open file [ via PHP
> fputs() ]?
>
> What about simply running the php process and accessing the "stream"
> directly? Only grabbing a snapshot of the data when a process needs
> it? I'm not really concerned with historical data as my web based app
> is more focused on trends at a given moment. Just wondering out loud
> if simply letting the process run in the background grabbing data
> would eventually fill up any caches or system memory.