On 02/09/2011 02:31 AM, Steve Pinkham wrote:
> On 02/08/2011 08:08 PM, Andres Riancho wrote:
>> Steve,
>>> noSQL servers are usually fast because they are in-memory systems.
>>> sqlite can be used in that mode also if you like.
>> mongodb is not an in-memory db!
> In practice, it is. It stores all indexes in memory and uses memory
> mapped files. It will automatically consume all available memory (which
> is a good thing or bad thing depending on what else you want to use the
> server for).
>
> http://www.mongodb.org/display/DOCS/Indexing+Advice+and+FAQ#IndexingAdviceandFAQ-MakesureyourindexescanfitinRAM.
>
> http://www.mongodb.org/display/DOCS/Caching
>
Hi all,
I have to disagree with calling MongoDB an in-memory database. There
are true in-memory databases, e.g. H2, and MongoDB is not one of them.
Those databases keep *all* data in memory, which is a different matter
from using memory for indices (which are orders of magnitude smaller
than the data) and for caching (which I would guess all daemon-mode
databases try to do as well as possible).
I also disagree that they are "usually fast because they are in-memory
systems". They are usually fast because they basically let the 'C' in
Brewer's CAP theorem suffer, that is to say they do not enforce
consistency across all nodes. This allows for better partition
tolerance and availability. They often employ "eventual consistency".
An example of an eventually consistent system is the Internet's DNS.
Individual nodes (DNS servers) may give stale information about a
hostname, but eventually updates reach all nodes and the system becomes
consistent again. Why is this important? As the DNS example shows, such
a system can be built without any locks on readers or writers. Since it
is OK for a reader to get 'stale' information, a writer can create a
new version of a data record first, then update the pointer. Neither
readers nor writers have to wait.
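To illustrate the idea (this is just a conceptual sketch in JavaScript,
not code from any particular database; the single shared reference is
an assumption of the sketch):

  // Hypothetical copy-on-write store: readers never block, and a
  // writer publishes a new version with a single pointer swap.
  var current = { version: 1, value: "old data" };

  function read() {
      // May return a slightly stale version, but never a torn one.
      return current;
  }

  function write(newValue) {
      // Build the new version off to the side...
      var next = { version: current.version + 1, value: newValue };
      // ...then "update the pointer" in one step.
      current = next;
  }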
Some databases, such as CouchDB, have gone further, using MVCC with a
built-in, git-like revision scheme to handle simultaneous modification
of data on several nodes. CouchDB also uses an append-only file format
to further eliminate locking at the storage level (and to ensure that
file corruption cannot occur).
Having said all this, I concur that using e.g. MongoDB for w3af is
probably not necessary; it sounds strange that sqlite would be unable
to handle the somewhat modest amounts of data we're talking about. I
can also see why concerns arise about switching to a daemon-mode
database at all. That depends entirely on the purpose of w3af: if the
purpose is to be a good scanner which is easy to use and install, a
daemon database is a bad choice. If the purpose is to be the best,
regardless of ease of installation and use, then I wouldn't blink
before switching to a daemon database if that gave any advantage.
Two more comments I disagree with:
"It's useful in distributed, massively parallel systems, but offers no real
benefit for single user databases."
and
"noSQL is just the new term for key-value stores."
It is true that noSQL is useful for distributed, massively parallel
systems, but there are also advantages to using it for data which fits
the dynamic (schemaless) model. Having no schema enforced by the
database does not mean that the database is just a disk-based hash
table with blobs for values. I would instead say that noSQL is more
like a new generation of object databases, but now with generic APIs
(json/bson/http) and wide language support. Certain kinds of data fit
very well into these models.
I have written a proxy which saves HTTP traffic into a MongoDB
(http://martin.swende.se/hg/#hatkit_proxy-t1/) and a framework to
analyse traffic from this database
(http://martin.swende.se/hg#hatkit_fiddler-t1). HTTP traffic is very
non-uniform: some requests are basically "GET / HTTP/1.1" while others
contain forms or json and lots and lots of headers. Using MongoDB, it
is possible to represent the data more at an object level, e.g.:
{ request:
    { method: "GET",
      headers: { "Content-Length": 1233, Host: "foobar.com", Foo: "bar" },
      parameters: { gaz: "onk" }
    },
  response: { ... }
}
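A record like that can be stored and read back directly from the mongo
shell; here is a minimal sketch, where the collection name 'traffic' is
just an assumption of mine:

  // Store one request/response pair as a single document.
  db.traffic.save({
      request: {
          method: "GET",
          headers: { "Content-Length": 1233, Host: "foobar.com", Foo: "bar" },
          parameters: { gaz: "onk" }
      },
      response: {}
  });

  // Fetch all GET requests back out.
  db.traffic.find({ "request.method": "GET" });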
MongoDB has very powerful querying facilities
(http://www.mongodb.org/display/DOCS/Advanced+Queries). Since the
object is stored with this structure in the database, it is possible to
reach into objects
(http://www.mongodb.org/display/DOCS/Dot+Notation+(Reaching+into+Objects))
and perform queries such as:
"give me response.body where request.parameters.filename exists", or
"give me request.body.parameters where
request.body.parameters.__viewstate does not exist"
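In the mongo shell, those two queries could look roughly like this
(again assuming a collection named 'traffic', and following the layout
of the example document above):

  // response.body where request.parameters.filename exists
  db.traffic.find(
      { "request.parameters.filename": { $exists: true } },
      { "response.body": 1 }
  );

  // request parameters where no __viewstate parameter is present
  db.traffic.find(
      { "request.parameters.__viewstate": { $exists: false } },
      { "request.parameters": 1 }
  );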
Also, MongoDB has very powerful aggregation mechanisms
(http://www.mongodb.org/display/DOCS/Aggregation), where queries like
the following can be used:
"organized by request.headers.host, give me all unique parameter
names", or "organized by request.url.path, give me all unique response
header keys". To generate these, you write JavaScript map/reduce
functions which are executed inside the database.
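For example, the first aggregation could be expressed with mapReduce
roughly as follows (collection and field names are assumed from the
example document above, and the output options vary between MongoDB
versions):

  // Map: emit each document's parameter names, keyed by host.
  var map = function() {
      if (!this.request || !this.request.parameters) return;
      var names = [];
      for (var n in this.request.parameters) { names.push(n); }
      emit(this.request.headers.Host, { names: names });
  };

  // Reduce: merge the name lists into one unique set per host.
  // Output has the same shape as the input, so re-reduce is safe.
  var reduce = function(key, values) {
      var seen = {};
      values.forEach(function(v) {
          v.names.forEach(function(n) { seen[n] = true; });
      });
      var unique = [];
      for (var n in seen) { unique.push(n); }
      return { names: unique };
  };

  db.traffic.mapReduce(map, reduce, { out: { inline: 1 } });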
Another reason, besides being very dynamic, why HTTP traffic is not
particularly suited for SQL is that it is pretty much non-relational.
Relational databases are good for relational data, e.g. where you have
Employees, who have Roles, and are sorted into different Offices etc
etc, where all the data is heavily related to other data. HTTP traffic
is basically requests and responses, with few relations to anything
else.
Oh, and one more thing about indices. When I started with the
hatkit_fiddler, I decided to wait with adding indices to see where they
were needed. So far, I haven't felt the need to add any indices at all;
it's fast enough anyway...
So, I believe that some really cool things could come from a switch to
MongoDB. But I am not convinced that performance should be a driving
reason for such a switch.
Regards,
/Martin Holst Swende