That is a really old paper that basically pre-dates all of the recent
important work in neural networks.
You should look for work on Rectified Linear Units (ReLU), dropout
regularization, parameter servers (Downpour SGD) and deep learning.
Map-reduce as you have used it will not produce
Have you considered implementing this using something like Spark? That could
be much easier than raw map-reduce.
On Wed, Jan 14, 2015 at 10:06 PM, unmesha sreeveni unmeshab...@gmail.com
wrote:
In a KNN-like algorithm we need to load the model data into a cache for
predicting the records.
Here is the
The block size controls lots of things in Hadoop.
It affects read parallelism, scalability, block allocation and other
aspects of operations either directly or indirectly.
On Sun, May 12, 2013 at 10:38 AM, shashwat shriparv
dwivedishash...@gmail.com wrote:
The block size is for allocation
Another interesting alternative is the EWAH implementation of java bitsets
that allow efficient compressed bitsets with very fast OR operations.
https://github.com/lemire/javaewah
See also https://code.google.com/p/sparsebitmap/ by the same authors.
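To illustrate the kind of OR operation EWAH speeds up, here is a minimal sketch with the JDK's java.util.BitSet (uncompressed; EWAH performs the same union over compressed words, which is what makes it fast at scale):

```java
import java.util.BitSet;

public class BitsetOrDemo {
    // Union of two record-id sets; or() works word-at-a-time, which is why
    // bitset ORs are so fast (EWAH does the same over compressed words).
    static BitSet union(BitSet a, BitSet b) {
        BitSet result = (BitSet) a.clone(); // copy so the inputs are untouched
        result.or(b);
        return result;
    }

    public static void main(String[] args) {
        BitSet a = new BitSet();
        a.set(3);
        a.set(10);
        BitSet b = new BitSet();
        b.set(10);
        b.set(42);
        System.out.println(union(a, b)); // {3, 10, 42}
    }
}
```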
On Sun, May 12, 2013 at 1:11 PM, Bertrand
This sounds (with no real evidence) like you are a bit light on memory for
that number of cores. That could cause you to be spilling map outputs
early and very much slowing things down.
On Fri, May 10, 2013 at 11:30 PM, David Parks davidpark...@yahoo.com wrote:
We’ve got a cluster of 10x
I think that you just said what the OP said.
Your two cases reduce to the same single case that they had.
Whether this matters is another question, but it seems like it could in
cases where splits != blocks, especially if a split starts near the end of
a block which could give an illusion of
I think that having more than 6 drives is better.
More memory never hurts. If you have too little, you may have to run with
fewer slots than optimal.
10GbE networking is good. If not, having more than two 1GbE ports is good, at
least on distributions that can deal with them properly.
On Mon,
It is rarely practical to do exhaustive comparisons on datasets of this
size.
The usual method is to heuristically prune the Cartesian product set and
only examine pairs that have a high likelihood of being near.
This can be done in many ways. Your suggestion of doing a map-side join is
a
It may or may not help you in your current distress, but MapR's
distribution could handle this pretty easily.
One method is direct distcp between clusters, but you could also use MapR's
mirroring capabilities to migrate data.
You can also carry a MapR cluster, change the IP addresses and relight
matter because once it gets going, it moves data much faster.
On Apr 14, 2013 6:15 AM, Ted Dunning tdunn...@maprtech.com wrote:
On Sun, Apr 14, 2013 at 10:33 AM, Mathias Herberts
mathias.herbe...@gmail.com wrote:
This is absolutely true. Distcp dominates cp for large copies. On the
other hand cp dominates distcp for convenience.
In my own experience, I love cp when copying relatively small amounts of
data
Lance,
Never say never.
Linux programs can read from the right kind of Hadoop cluster without using
FUSE.
On Fri, Apr 12, 2013 at 10:15 AM, Lance Norskog goks...@gmail.com wrote:
Shell 'cp' only works if you use 'fuse', which makes the HDFS file system
visible as a Unix mounted file
This isn't a very Hadoop question.
A Bloom filter is a very low-level data structure that doesn't really have
any correlate in SQL. It allows you to find duplicates quickly and
probabilistically. In return for a small probability of a false positive,
it uses less memory.
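As a toy illustration of the idea (not any particular library's API; the class name and hash functions here are invented for the sketch, and real filters use k independent, better-distributed hashes):

```java
import java.util.BitSet;

// Minimal Bloom filter sketch: membership tests may give false positives,
// never false negatives, in exchange for a small fixed memory footprint.
public class TinyBloom {
    private final BitSet bits;
    private final int size;

    public TinyBloom(int size) {
        this.size = size;
        this.bits = new BitSet(size);
    }

    // Two cheap hash functions, invented for this sketch.
    private int h1(String s) { return Math.floorMod(s.hashCode(), size); }
    private int h2(String s) { return Math.floorMod(31 * s.hashCode() + 17, size); }

    public void add(String s) {
        bits.set(h1(s));
        bits.set(h2(s));
    }

    // false means definitely absent; true means probably present.
    public boolean mightContain(String s) {
        return bits.get(h1(s)) && bits.get(h2(s));
    }

    public static void main(String[] args) {
        TinyBloom seen = new TinyBloom(1 << 16);
        seen.add("record-42");
        System.out.println(seen.mightContain("record-42")); // true: no false negatives
    }
}
```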
On Fri, Mar 29, 2013 at
Putting each document into a separate file is not likely to be a great
thing to do.
On the other hand, putting them all into one file may not be what you want
either.
It is probably best to find a middle ground and create files each with many
documents and each a few gigabytes in size.
On Fri,
The EMR distributions have special versions of the s3 file system. They
might be helpful here.
Of course, you likely aren't running those if you are seeing 5MB/s.
An extreme alternative would be to light up an EMR cluster, copy to it,
then to S3.
On Thu, Mar 28, 2013 at 4:54 AM, Himanish
Another Ted piping in.
For Hadoop use, it is dangerous to use anything but a static class for your
mapper and reducer functions, since you may accidentally think that you can
access a closed-over variable from the parent. A static class cannot reference
those values, so you know that you haven't made
Also, Canonical just announced that MapR is available in the Partner repos.
On Thu, Mar 28, 2013 at 7:22 AM, Nitin Pawar nitinpawar...@gmail.com wrote:
apache bigtop has builds done for ubuntu
you can check them at jenkins mentioned on bigtop.apache.org
On Thu, Mar 28, 2013 at 11:37 AM,
And, of course, due credit should be given here. The advanced clustering
algorithms in Crunch were lifted from the new stuff in Mahout pretty much
step for step.
The Mahout group would have loved to have contributions from the Cloudera
guys instead of re-implementation, but you can't legislate
Spark would be an excellent choice for the iterative sort of k-means.
It could be good for sketch-based algorithms as well, but the difference
would be much less pronounced.
On Wed, Mar 27, 2013 at 3:39 PM, Charles Earl charlesce...@me.com wrote:
I would think also that starting with centers
Try file:///fs4/outdir
Symbolic links can also help.
Note that this file system has to be visible with the same path on all
hosts. You may also be bandwidth limited by whatever is serving that file
system.
There are cases where you won't be limited by the file system. MapR, for
instance, has
Chaining the jobs is a fantastically inefficient solution. If you use Pig
or Cascading, the optimizer will glue all of your map functions into a
single mapper. The result is something like:
(mapper1 - mapper2 - mapper3) = reducer
Here the parentheses indicate that all of the map functions
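As a rough illustration of what that fusing amounts to, here is a sketch in plain Java with invented mapper names; the real optimizers in Pig and Cascading do this at the job-plan level rather than with lambdas:

```java
import java.util.function.Function;

public class MapperChain {
    // Three 'map functions' that would naively be three map-reduce passes;
    // the names are invented for illustration.
    static Function<String, String> fused() {
        Function<String, String> mapper1 = String::trim;
        Function<String, String> mapper2 = String::toLowerCase;
        Function<String, String> mapper3 = s -> s.replace(' ', '_');
        // The optimizer composes them so the data is scanned only once.
        return mapper1.andThen(mapper2).andThen(mapper3);
    }

    public static void main(String[] args) {
        System.out.println(fused().apply("  Hello World  ")); // hello_world
    }
}
```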
The MapR videos on programming and map-reduce are all general videos.
The videos that cover capabilities like NFS, snapshots and mirrors are all
MapR specific since ordinary Hadoop distributions like Cloudera,
Hortonworks and Apache can't support those capabilities.
The videos that cover MapR
Yeah... you can make this work.
First, if your setup is relatively small, then you won't need Hadoop.
Second, having lots of kinds of actions is a very reasonable thing to have.
My own suggestion is that you analyze these each for their predictive
power independently and then combine them at
The delay due to replication is rarely a large problem in traditional
map-reduce programs since many writes are occurring at once. The real
problem comes because you are consuming 3x the total disk bandwidth so that
the theoretical maximum equilibrium write bandwidth is limited to the
lesser of
wrote:
We have seen in several of our Hadoop clusters that LVM degrades
performance of our M/R jobs, and I remembered a message where
Ted Dunning was explaining something about this, and since
that time, we don't use LVM for Hadoop data directories.
About RAID volumes, the best performance
All of these suggestions tend to founder on the problem of key management.
What you need to do is
1) define your threats.
2) define your architecture including key management.
3) demonstrate how the architecture defends against the threat environment.
I haven't seen more than a cursory
Works with a real-time version of Hadoop such as MapR.
But you are right that HDFS and MapReduce were never intended for real-time
use.
On Fri, Feb 1, 2013 at 1:40 AM, Mohammad Tariq donta...@gmail.com wrote:
How are you going to store videos in HDFS? By 'playing video on the browser' I
assume
We have tested both machines in our labs at MapR and both work well. Both
run pretty hot so you need to keep a good eye on that.
The R720 will have higher wattage per unit of storage due to the smaller
number of drives per chassis. That may be a good match for ordinary Hadoop
due to the lower
Are you asking about change management for configurations and such? If so,
there are good tools out there for managing that including puppet, chef and
ansible.
Or are you asking about something else? Both Cloudera and MapR have tools
that help with centralized configuration management of
Incremental backups are nice to avoid copying all your data again.
You can code these at the application layer if you have nice partitioning
and keep track correctly.
You can also use platform level capabilities such as provided for by the
MapR distribution.
On Fri, Jan 25, 2013 at 3:23 PM,
Also, you may have to adjust your algorithms.
For instance, the conventional standard algorithm for SVD is a Lanczos
iterative algorithm. Iteration in Hadoop is death because of job
invocation time ... what you wind up with is an algorithm that will handle
big data but with a slow-down factor
And I am pretty sure it does not have a separate partition for root.
Please help me understand what you meant and what other precautions I should
take.
Thanks,
Regards,
Ouch Whisper
01010101010
On Jan 18, 2013 11:11 PM, Ted Dunning tdunn...@maprtech.com wrote:
Where do you find 40gb disks
Answer B sounds pathologically bad to me.
A or C are the only viable options.
Neither B nor D work. B fails because it would be extremely hard to get
the right records to the right components and because it pollutes data
input with configuration data. D fails because statics don't work in
The colon is a reserved character in a URI according to RFC 3986[1].
You should be able to percent encode those colons as %3A.
[1] http://tools.ietf.org/html/rfc3986
On Wed, Dec 26, 2012 at 1:00 PM, Mohit Anchlia mohitanch...@gmail.com wrote:
It looks like hadoop fs -put command doesn't like
The technical term for this is copying. You may have heard of it.
It is a subject of such long technical standing that many do not consider
it worthy of detailed documentation.
Distcp effects a similar process and can be modified to combine the input
files into a single file.
)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.tools.DistCp.main(DistCp.java:937)
On Sat, Dec 22, 2012 at 11:24 AM, Ted Dunning tdunn...@maprtech.com wrote:
The technical term for this is copying. You may have heard of it.
It is a subject of such long technical
You can write a script to parse the Hadoop job list and send an alert.
The trick of putting a retry into your workflow system is a nice one. If
your program won't allow multiple copies to run at the same time, then if
you re-invoke the program every, say, hour, then 5 retries implies that the
Also, I think that Oozie allows for timeouts in job submission. That might
answer your need.
On Sat, Dec 22, 2012 at 2:08 PM, Ted Dunning tdunn...@maprtech.com wrote:
You can write a script to parse the Hadoop job list and send an alert.
The trick of putting a retry into your workflow
On Thu, Dec 20, 2012 at 7:38 AM, Michael Segel michael_se...@hotmail.com wrote:
While Ted ignores that the world is going to end before X-Mas, he does hit
the crux of the matter head on.
If you don't have a place to put it, the cost of setting it up would kill
you, not to mention that you can
Yes it does make sense, depending on how much compute each byte of data
will require on average. With ordinary Hadoop, it is reasonable to have
half a dozen 2TB drives. With specialized versions of Hadoop considerably
more can be supported.
From what you say, it sounds like you are suggesting
Also, the moderators don't seem to read anything that goes by.
On Wed, Nov 28, 2012 at 4:12 AM, sathyavageeswaran
sat...@morisonmenon.com wrote:
In this group once anyone subscribes there is no exit route.
-Original Message-
From: Tony Burton [mailto:tbur...@sportingindex.com]
Sent:
On Sat, Nov 24, 2012 at 5:19 AM, Bart Verwilst li...@verwilst.be wrote:
... I'm not sure that i understand your comment about repeating values in
fmsswitchvalues, since they are different from the ones in fmssession?
I was just pointing out that there were fields in the fmssession record
It sounds like you could benefit from reading the basic papers on
map-reduce in general. Hadoop is a reasonable facsimile of the original
Google systems.
Try looking at this: http://research.google.com/archive/mapreduce.html
On Mon, Nov 19, 2012 at 7:14 AM, Kartashov, Andy
Conventional enterprise backup systems are rarely scaled for Hadoop needs.
Both bandwidth and size are typically lacking.
My employer, MapR, offers a Hadoop-derived distribution that includes both
point-in-time snapshots and remote mirrors. Contact me offline for more info.
Sent from my
Create cannot be idempotent because of the problem of watches and
sequential files.
Similarly, mkdirs, rename and delete cannot generally be idempotent. In
particular applications, you might find it is OK to treat them as such, but
there are definitely applications where they are not idempotent.
On Sun, Oct 28, 2012 at 9:15 PM, David Parks davidpark...@yahoo.com wrote:
I need a unique permanent ID assigned to each new item encountered, which has
a constraint that it is in the range of, let’s say for simple discussion,
one to one million.
Having such a limited range may require that you
, I can better understand the problem.
2012/10/29 Ted Dunning tdunn...@maprtech.com
Create cannot be idempotent because of the problem of watches and
sequential files.
Similarly, mkdirs, rename and delete cannot generally be idempotent. In
particular applications, you might find it is OK
This is better asked on the Zookeeper lists.
The first answer is that global atomic operations are a generally bad idea.
The second answer is that if you can batch these operations up then you can
cut the evilness of global atomicity by a substantial factor.
Are you sure you need a global
Unification in a parallel cluster is a difficult problem. Writing very
large scale unification programs is an even harder problem.
What problem are you trying to solve?
One option would be that you need to evaluate a conventionally-sized
rulebase against many inputs. Map-reduce should be
If you are going to mention commercial distros, you should include MapR as
well. Hadoop compatible, very scalable and handles very large numbers of
files in a Posix-ish environment.
On Mon, Oct 15, 2012 at 1:35 PM, Brian Bockelman bbock...@cse.unl.edu wrote:
Hi,
We use HDFS to process data
Uhhh... Alexey, did you really mean that you are running 100 megabit per
second network links?
That is going to make Hadoop run *really* slowly.
Also, putting RAID under any DFS, be it Hadoop or MapR, is not a good recipe
for performance. Not that it matters if you only have 10 megabytes per
Harsh,
Thanks for the plug. Rajesh has been talking to us.
On Fri, Oct 12, 2012 at 8:36 AM, Harsh J ha...@cloudera.com wrote:
Hi Rajesh,
Please head over to the Apache Mahout project. See
https://cwiki.apache.org/MAHOUT/logistic-regression.html
Apache Mahout is homed at
It depends on your distribution. Some distributions are more efficient at
driving spindles than others.
Ratios as high as 2 spindles per core are sometimes quite reasonable.
On Fri, Oct 12, 2012 at 10:46 AM, Patai Sangbutsarakum
silvianhad...@gmail.com wrote:
I have read around about the
I think that this rule of thumb is to prevent people configuring 2 disk
clusters with 16 cores or 48 disk machines with 4 cores. Both
configurations could make sense in narrow applications, but both would most
probably be sub-optimal.
Within narrow bands, I doubt you will see huge changes. I
by Hadoop.
Hi Lance,
I'm curious if you've gotten that to work with a decent-sized (e.g.
250 node) cluster? Even a trivial cluster seems to crush SolrCloud
from a few months ago at least...
Thanks,
--tim
- Original Message -
| From: Ted Dunning tdunn...@maprtech.com
I prefer to create indexes in the reducer personally.
Also you can avoid the copies if you use an advanced hadoop-derived distro.
Email me off list for details.
Sent from my iPhone
On Oct 9, 2012, at 7:47 PM, Mark Kerzner mark.kerz...@shmsoft.com wrote:
Hi,
if I create a Lucene index in
The answer is really the same. Your problem is just using a goofy
representation for negative numbers (after all, negative numbers are a
relatively new concept in accounting).
You still need to use the account number as the key and the date as a sort
key. Many financial institutions also
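A sketch of that key arrangement in plain Java (the record fields and values are invented for illustration; a real Hadoop job would express this as a composite WritableComparable with a grouping comparator):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class SecondarySortSketch {
    // Hypothetical transaction record: account is the grouping key, date the sort key.
    record Txn(String account, String date, long cents) {}

    // Orders by account first, then by date within each account, which is
    // exactly what a composite key with a secondary sort achieves in Hadoop.
    static List<Txn> byAccountThenDate(List<Txn> txns) {
        List<Txn> out = new ArrayList<>(txns);
        out.sort(Comparator.comparing(Txn::account).thenComparing(Txn::date));
        return out;
    }

    public static void main(String[] args) {
        List<Txn> sorted = byAccountThenDate(List.of(
                new Txn("acct-2", "2012-10-02", -500),
                new Txn("acct-1", "2012-10-03", 1200),
                new Txn("acct-1", "2012-10-01", -300)));
        for (Txn t : sorted) {
            System.out.println(t.account() + " " + t.date());
        }
        // acct-1 2012-10-01
        // acct-1 2012-10-03
        // acct-2 2012-10-02
    }
}
```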
On Tue, Oct 2, 2012 at 7:05 PM, Hank Cohen hank.co...@altior.com wrote:
There is an important difference between real time and real fast
Real time means that system response must meet a fixed schedule.
Real fast just means sooner is better.
Good thought, but real-time can also include a
Why are you changing the TTL on DNS if you aren't moving the name? If you
are just changing the config to a new name, then caching won't matter.
On Wed, Sep 26, 2012 at 1:46 PM, Patai Sangbutsarakum
silvianhad...@gmail.com wrote:
Hi Hadoopers,
My production Hadoop 0.20.2 cluster has been
at the beginning) this
list of N nearest points for every point in the file. Where N is a
parameter given to the job. Let's say 10 points. That's it.
No calculation after-wards, only querying that list.
Thank you
On Thu, Aug 30, 2012 at 11:05 PM, Ted Dunning tdunn...@maprtech.com wrote:
I
?
and calculating the distance as I go
On Tue, Aug 28, 2012 at 11:07 PM, Ted Dunning tdunn...@maprtech.com wrote:
I don't mean that.
I mean that a k-means clustering with pretty large clusters is a useful
auxiliary data structure for finding nearest neighbors. The basic outline
is that you
That was a stupid joke. It wasn't real advice.
Have you sent email to the specific email address listed?
On Thu, Aug 30, 2012 at 12:35 AM, sathyavageeswaran sat...@morisonmenon.com
wrote:
I have tried every trick to get self unsubscribed. Yesterday I got a mail
saying you can't unsubscribe
...@morisonmenon.com
wrote:
Of course have sent emails to all permutations and combinations of emails
listed with appropriate subject matter.
*From:* Ted Dunning [mailto:tdunn...@maprtech.com]
*Sent:* 30 August 2012 10:12
*To:* user@hadoop.apache.org
*Cc:* Dan Yi; Jay
*Subject:* Re
*From:* Ted Dunning [mailto:tdunn...@maprtech.com]
*Sent:* 30 August 2012 10:28
*To:* user@hadoop.apache.org
*Cc:* Dan Yi; Jay
*Subject:* Re: How to unsubscribe (was Re: unsubscribe)
Can you say which addresses you sent emails to?
The merging of mailing
On Tue, Aug 28, 2012 at 9:48 AM, dexter morgan dextermorga...@gmail.com wrote:
I understand your solution ( i think) , didn't think of that, in that
particular way.
I think that lets say i have 1M data-points, and running knn , that the
k=1M and n=10 (each point is a cluster that requires up
points?
join of a file with itself.
Thanks
On Tue, Aug 28, 2012 at 6:32 PM, Ted Dunning tdunn...@maprtech.com wrote:
On Tue, Aug 28, 2012 at 9:48 AM, dexter morgan
dextermorga...@gmail.com wrote:
I understand your solution ( i think) , didn't think of that, in that
particular way.
I
Mahout is getting some very fast knn code in version 0.8.
The basic work flow is that you would first do a large-scale clustering of
the data. Then you would make a second pass using the clustering to
facilitate fast search for nearby points.
The clustering will require two map-reduce jobs, one
Mongo has the best out of box experience of anything, but can be limited in
terms of how far it will scale.
HBase is a bit tricky to manage if you don't have expertise in managing
Hadoop.
Neither is a great idea if your data objects can be as large as 10MB.
On Wed, May 23, 2012 at 8:30 AM,
No. 2.0.0 will not have the same level of HA as MapR. Specifically, the
JobTracker hasn't been addressed and the NameNode issues have only been
partially addressed.
On May 22, 2012, at 8:08 AM, Martinus Martinus martinus...@gmail.com wrote:
Hi Todd,
Thanks for your answer. Is that will
MapR provides this out of the box in a completely Hadoop compatible
environment.
Doing this with straight Hadoop involves a fair bit of baling wire.
On Tue, Jan 3, 2012 at 1:10 PM, alo alt wget.n...@googlemail.com wrote:
Hi Mac,
hdfs has at the moment no solution for a complete backup- and
Joey is speaking precisely, but in an intentionally very limited way.
Apache HDFS, the file system that comes with Apache Hadoop, does not
support NFS.
On the other hand, maprfs which is a part of the commercial MapR
distribution which is based on Apache Hadoop does support NFS natively and
HDFS is a filesystem that is designed to support map-reduce computation.
As such, the semantics differ from what SVN or GIT would want to have.
HBase provides versioned values. That might suffice for your needs.
On Mon, Nov 21, 2011 at 9:58 AM, Stuti Awasthi stutiawas...@hcl.com wrote:
Do we
How big is that?
On Mon, Nov 21, 2011 at 9:26 PM, Stuti Awasthi stutiawas...@hcl.com wrote:
Hi Ted,
Well in my case document size can be big, which is not good to keep in
Hbase. So I rule out this option.
Thanks
*From:* Ted Dunning [mailto:tdunn
are going bigger than MBs then it is
not good to use Hbase for storage.
Any Comments
*From:* Ted Dunning [mailto:tdunn...@maprtech.com]
*Sent:* Tuesday, November 22, 2011 11:43 AM
*To:* hdfs-user@hadoop.apache.org
*Subject:* Re: Version control of files present in HDFS
, not 12GB. So
about 1-in-72 such failures risks data loss, rather than 1-in-12. Which is
still unacceptable, so use 3x replication! :-)
--Matt
On Mon, Nov 7, 2011 at 4:53 PM, Ted Dunning tdunn...@maprtech.com wrote:
3x replication has two effects. One is reliability. This is probably
more
By snapshots, I mean that you can freeze a copy of a portion of the
file system for later use as a backup or reference. By mirror, I mean that
a snapshot can be transported to another location in the same cluster or to
another cluster and the mirrored image will be updated atomically to the
for this usage, however.
On Tue, Nov 8, 2011 at 7:32 AM, Rita rmorgan...@gmail.com wrote:
That's a good point. What if HDFS is used as an archive? We don't really use
it for mapreduce, more for archival purposes.
On Mon, Nov 7, 2011 at 7:53 PM, Ted Dunning tdunn...@maprtech.com wrote:
3x
Depending on which distribution and what your data center power limits are,
you may save a lot of money by going with machines that have 12 x 2 or 3 TB
drives. With suitable engineering margins and 3x replication you can have
5 TB net data per node and 20 nodes per rack. If you want to go all
There is no way to do this for standard Apache Hadoop.
But other, otherwise Hadoop compatible, systems such as MapR do support this
operation.
Rather than push commercial systems on this mailing list, I would simply
recommend anybody who is curious to email me.
On Sat, Aug 27, 2011 at 12:07 PM,
HDFS is not a normal file system. Instead, it is highly optimized for running
map-reduce. As such, it uses replicated storage but imposes a write-once
model on files.
This probably makes it unsuitable as primary storage for VMs.
What you need is either a conventional networked storage device or if
It is also worth using dd to verify your raw disk speeds.
Also, expressing disk transfer rates in bytes per second makes it a bit
easier for most of the disk people I know to figure out what is large or
small.
Each of these disks should do about 100MB/s when driven well. Hadoop
does OK,
To pile on, thousands or millions of documents are well within the range
that is well addressed by Lucene.
Solr may be an even better option than bare Lucene since it handles lots of
the boilerplate problems like document parsing and index update scheduling.
On Tue, May 31, 2011 at 11:56 AM,
itr.nextToken() is inside the if.
On Tue, May 24, 2011 at 7:29 AM, maryanne.dellasa...@gdc4s.com wrote:
while (itr.hasMoreTokens()) {
    if (count == 5) {
        word.set(itr.nextToken());
        output.collect(word, one);
    }
    count++;
}
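If the intent was to emit the token at position five, the token has to be consumed on every pass through the loop, not only when the count matches; a self-contained sketch of the corrected logic (class and method names invented for illustration):

```java
import java.util.StringTokenizer;

public class FifthTokenDemo {
    // Returns the token at index 5 (counting from zero), or null if the
    // line has too few tokens.
    static String tokenAtIndexFive(String line) {
        StringTokenizer itr = new StringTokenizer(line);
        int count = 0;
        while (itr.hasMoreTokens()) {
            // Consume on every iteration; calling nextToken() only when
            // count == 5 is what made the original loop spin forever.
            String token = itr.nextToken();
            if (count == 5) {
                return token;
            }
            count++;
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(tokenAtIndexFive("a b c d e f g")); // f
    }
}
```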
ZK started as a sub-project of Hadoop.
On Thu, May 19, 2011 at 7:27 AM, M. C. Srivas mcsri...@gmail.com wrote:
Interesting to note that Cassandra and ZK are now considered Hadoop
projects.
They were independent of Hadoop before the recent update.
On Thu, May 19, 2011 at 4:18 AM, Steve
Try using the Apache Mahout code that solves exactly this problem.
Mahout has a distributed row-wise matrix that is read one row at a time.
Dot products with the vector are computed and the results are collected.
This capability is used extensively in the large scale SVD's in Mahout.
On Tue,
How is it that 36 processes are not expected if you have configured 48 + 12
= 60 slots available on the machine?
On Wed, May 11, 2011 at 11:11 AM, Adi adi.pan...@gmail.com wrote:
By our calculations hadoop should not exceed 70% of memory.
Allocated per node - 48 map slots (24 GB) , 12 reduce
On Sat, Apr 30, 2011 at 12:18 AM, elton sky eltonsky9...@gmail.com wrote:
I got 2 questions:
1. I am wondering how hadoop MR performs when it runs compute intensive
applications, e.g. Monte carlo method compute PI. There's a example in
0.21,
QuasiMonteCarlo, but that example doesn't use
Check out S4 http://s4.io/
On Fri, Apr 29, 2011 at 10:13 PM, Luiz Fernando Figueiredo
luiz.figueir...@auctorita.com.br wrote:
Hi guys.
Hadoop is well known to process large amounts of data but we think that we
can do much more with it. Our goal is to try to serve pseudo-streaming near of
Cooccurrence analysis is commonly used in recommendations. These produce
large intermediates.
Come on over to the Mahout project if you would like to talk to a bunch of
people who work on these problems.
On Fri, Apr 29, 2011 at 9:31 PM, elton sky eltonsky9...@gmail.com wrote:
Thank you for
I would recommend taking this question to the Mahout mailing list.
The short answer is that matrix multiplication by a column vector is pretty
easy. Each mapper reads the vector in the configure method and then does a
dot product for each row of the input matrix. Results are reassembled into
a
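The per-row work each mapper does can be sketched as a plain dot product (dense arrays here purely for illustration; Mahout uses its own vector types, and the vector would be loaded once in the mapper's configure/setup method):

```java
public class RowDotProduct {
    // What each mapper does per input row: dot the row with the shared
    // vector that was loaded once at mapper startup.
    static double dot(double[] row, double[] vector) {
        double sum = 0.0;
        for (int i = 0; i < row.length; i++) {
            sum += row[i] * vector[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        double[] vector = {1.0, 2.0, 3.0};                    // broadcast to every mapper
        double[][] matrix = {{1, 0, 0}, {0, 1, 0}, {1, 1, 1}}; // one row per map input
        for (double[] row : matrix) {
            System.out.println(dot(row, vector)); // 1.0, 2.0, 6.0
        }
    }
}
```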
Turing completion isn't the central question here, really. The truth
is, map-reduce programs have considerable pressure to be written in a
scalable fashion, which limits them to fairly simple behaviors that
result in pretty linear dependence of run-time on input size for a
given program.
The cool
Sounds like this paper might help you:
Predicting Multiple Performance Metrics for Queries: Better Decisions
Enabled by Machine Learning by Ganapathi, Archana, Harumi Kuno,
Umeshwar Dayal, Janet Wiener, Armando Fox, Michael Jordan, David
Patterson
http://radlab.cs.berkeley.edu/publication/187
HBase is very good at this kind of thing.
Depending on your aggregation needs, OpenTSDB might be interesting since it
stores and queries against large amounts of time-ordered data similar to what
you want to do.
It isn't clear to me whether your data is primarily about current state or
about
nothing architecture.
This may be more database terminology that could be addressed by hbase,
but I think it is good background for the questions of memory mapping
files in hadoop.
Kevin
-Original Message-
From: Ted Dunning [mailto:tdunn...@maprtech.com]
Sent: Tuesday, April 12, 2011
, Jason Rutherglen
jason.rutherg...@gmail.com wrote:
Then one could MMap the blocks pertaining to the HDFS file and piece
them together. Lucene's MMapDirectory implementation does just this
to avoid an obscure JVM bug.
On Mon, Apr 11, 2011 at 9:09 PM, Ted Dunning tdunn...@maprtech.com
wrote
Blocks live where they land when first created. They can be moved due to
node failure or rebalancing, but it is typically pretty expensive to do
this. It certainly is slower than just reading the file.
If you really, really want mmap to work, then you need to set up some native
code that builds
Actually, it doesn't become trivial. It just becomes total fail or total
win instead of almost always being partial win. It doesn't meet Benson's
need.
On Tue, Apr 12, 2011 at 11:09 AM, Jason Rutherglen
jason.rutherg...@gmail.com wrote:
To get around the chunks or blocks problem, I've been
Benson is actually a pretty sophisticated guy who knows a lot about mmap.
I engaged with him yesterday on this since I know him from Apache.
On Tue, Apr 12, 2011 at 7:16 PM, M. C. Srivas mcsri...@gmail.com wrote:
I am not sure if you realize, but HDFS is not VM integrated.
Depending on the function that you want to use, it sounds like you want to
use a self join to compute transposed cooccurrence.
That is, it sounds like you want to find all the sets that share elements
with X. If you have a binary matrix A that represents your set membership
with one row per set
Also, it only provides access to a local chunk of a file which isn't very
useful.
On Mon, Apr 11, 2011 at 5:32 PM, Edward Capriolo edlinuxg...@gmail.com wrote:
On Mon, Apr 11, 2011 at 7:05 PM, Jason Rutherglen
jason.rutherg...@gmail.com wrote:
Yes you can; however, it will require customization