Hi Andreas,
many thanks for your thorough investigation. Just to point out that
BioMart itself does not do any caching; however, the database backend's
caching does kick in for identical queries, hence the difference in
response time when you repeat the same query. From this we can infer
that it is not internet slowness: the time is spent at the source
locations (machines) where the different databases are hosted. So you
would benefit from hosting these locally, but the gain will be on the
DB end, not the network. The final decision is yours, as you will want
to find the right balance between maintenance effort and benefits :)
For source data, the Ensembl databases are fairly straightforward to
dump from their FTP site. For the others, please contact the relevant
maintainers listed in this table of contents:
http://database.oxfordjournals.org/content/2011.toc
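If it helps, here is a rough sketch of what dumping and loading one
Ensembl mart could look like (the release number, paths and MySQL
credentials below are examples only; please check
ftp://ftp.ensembl.org/pub/ for the current layout):

RELEASE=66                      # example release, adjust to the current one
MART=ensembl_mart_${RELEASE}
# 1) Mirror the dumps for this mart (a schema .sql.gz plus one .txt.gz per table)
wget -r -np -nH --cut-dirs=3 \
  "ftp://ftp.ensembl.org/pub/release-${RELEASE}/mysql/${MART}/"
# 2) Create the database and load the schema
mysql -u root -p -e "CREATE DATABASE ${MART}"
gunzip -c ${MART}/${MART}.sql.gz | mysql -u root -p ${MART}
# 3) Bulk-load the table dumps (file names match table names)
gunzip ${MART}/*.txt.gz
mysqlimport -u root -p --local ${MART} ${MART}/*.txt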
Best,
Syed
On 15/04/2012 17:59, andreas H wrote:
Hi Syed,
using the web services API I get more or less the same behaviour:
$bash-prompt > time wget
'http://www.biomart.org/biomart/martservice?query=<?xml version="1.0"
encoding="UTF-8"?><!DOCTYPE Query><Query virtualSchemaName = "default"
formatter = "TSV" header = "0" uniqueRows = "0" count = ""
datasetConfigVersion = "0.8"><Dataset name="hsapiens_gene_ensembl"
config="gene_ensembl_ap"><Filter name="hgnc_symbol"
value="RAN"/><Attribute name="go_id"/></Dataset></Query>' -O result.txt
--17:37:45--
http://www.biomart.org/biomart/martservice?query=%3C?xml%20version=%221.0%22%20encoding=%22UTF-8%22?%3E%3C!DOCTYPE%20Query%3E%3CQuery%20virtualSchemaName%20=%20%22default%22%20formatter%20=%20%22TSV%22%20header%20=%20%220%22%20uniqueRows%20=%20%220%22%20count%20=%20%22%22%20datasetConfigVersion%20=%20%220.8%22%3E%3CDataset%20name=%22hsapiens_gene_ensembl%22%20config=%22gene_ensembl_ap%22%3E%3CFilter%20name=%22hgnc_symbol%22%20value=%22RAN%22/%3E%3CAttribute%20name=%22go_id%22/%3E%3C/Dataset%3E%3C/Query%3E
=> `result.txt'
Resolving www.biomart.org... 206.108.121.49
Connecting to www.biomart.org|206.108.121.49|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/plain]
[ <=> ] 1,815 --.--K/s
17:38:03 (48.08 MB/s) - `result.txt' saved [1815]
real 0m17.942s
user 0m0.000s
sys 0m0.030s
Watching wget's output, the longest time is spent here:
Resolving www.biomart.org... 206.108.121.49
Connecting to www.biomart.org|206.108.121.49|:80... connected.
HTTP request sent, awaiting response...
Doing it a second time results in only a short wait (half a second to
two seconds).
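To see where the time actually goes (DNS lookup, TCP connect, or the
server producing the first byte), curl's timing write-outs can split it
up; a sketch, with QUERY_XML standing in for the same XML query as
above:

# time_starttransfer minus time_connect is roughly the time the server
# spends processing the query before the first byte of the response.
curl -s -o /dev/null -G \
  --data-urlencode "query=${QUERY_XML}" \
  -w 'dns: %{time_namelookup}s connect: %{time_connect}s first byte: %{time_starttransfer}s total: %{time_total}s\n' \
  'http://www.biomart.org/biomart/martservice'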
I have tried this from two machines belonging to two different
organisations in London; I assume the speed of their connections is
quite good.
These may be useful:
1) Running a query (say, for RAS) from organisation A takes 17 seconds.
After a few minutes, running the same query from organisation B takes
less than 2 seconds. Running a new query (say, for LOX) from B takes 17
seconds; after a few minutes, doing the same from A takes 2 seconds. So
I guess some cache *at biomart* short-circuits a lot of the
query/processing time. Which I think means that the network speed is OK
but the processing/querying is not (though even half a second could be
considered too much by some standards).
2) Replacing the attribute 'go_id' with 'ensembl_gene_id' gives
half-second queries even the first time; replacing it with
'ensembl_peptide_id' takes 3 seconds the first time and half a second
afterwards.
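For reference, here is roughly how I timed the different attributes (a
sketch; the XML is the same query as above with only the Attribute
element changed, and wget percent-encodes the URL itself):

for attr in go_id ensembl_gene_id ensembl_peptide_id; do
  xml='<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE Query><Query virtualSchemaName="default" formatter="TSV" header="0" uniqueRows="0" count="" datasetConfigVersion="0.8"><Dataset name="hsapiens_gene_ensembl" config="gene_ensembl_ap"><Filter name="hgnc_symbol" value="RAN"/><Attribute name="'${attr}'"/></Dataset></Query>'
  echo "== ${attr} =="
  time wget -q "http://www.biomart.org/biomart/martservice?query=${xml}" -O "${attr}.txt"
done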
If you can point me to the URL of an online biomart query form, I will
try that too.
Thanks,
Andreas
On 13/04/2012 23:01, Syed Haider wrote:
Hi Andreas,
I wonder if you see the same response time when you connect to the
website directly or use the web service API. I am just trying to
separate the R-specific from the BioMart-specific response time, and I
am assuming that your institutional internet connection is fast enough.
I suggest the above because, if the response time can be improved, it
could save you a lot of the time and hassle of managing these databases
locally.
Best,
Syed
On 12/04/2012 15:27, andreas H wrote:
Hi Syed,
thanks for your reply,
Speed and bandwidth are the reasons. For example, **unless I am doing
something wrong**, right now, using R and biomaRt with default
settings:
require(biomaRt)
a_mart <- useMart(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")
# Time a single GO-term lookup for one HGNC symbol
system.time({
  getBM(attributes = c("go_id"), filters = "hgnc_symbol",
        values = c("KRAS"), mart = a_mart)
})
#    user  system elapsed
#   0.007   0.001  23.218
which is too slow. Admittedly, the second time this runs *within the
same R session*, it drops to 3 seconds, but that is not a likely usage
scenario.
Thanks,
Andreas
On 12/04/2012 14:59, Syed Haider wrote:
Hi Andreas,
Before getting into the details of database updates, I wonder why you
would prefer to have the databases downloaded locally as opposed to
using them wherever they are hosted. BioMart is a system that enables
data integration by means of federation, so it rather defeats the
purpose if one downloads everything locally. Moreover, if you hold the
data locally, you take on the much bigger problem of updating each of
the sources you downloaded, as all of them have their own major/minor
release cycles etc.
I might be missing something important w.r.t. your requirements, hence
I ask :)
Best,
Syed
On 12/04/2012 14:10, andreas H wrote:
Hello there,
After days of searching around, I can't find a simple step-by-step
guide on how to install BioMart locally, including its databases. Could
you point me to such a document, if what I want to do is possible at
all?
I have also read the BioMart 0.8 install manual; I have managed to set
up a webserver and I can make it read data from a local source, but I
can't find out how to automatically download and install these sources
locally.
In case you can help further, here is what I am after:
1) I would like to know how to select a database and transfer it to my
own, local MySQL server, without having to go to each vendor's website,
download their files and install them into my local DB myself. I mean,
is there a click-and-install feature in *mart-configurator* or
something? I visualise it as: I give my local MySQL password and the
databases are installed there automatically.
2) I would like to install only a small subset of databases locally -
the ones that I use a lot.
3) Ideally, I would like to update the locally installed BioMart
databases with a few clicks, every few weeks/months.
My aim right now is to convert between Ensembl, UniProt and HUGO
protein IDs and to get the GO terms for each protein.
Thanking you in advance,
andreas